CN108846076A

CN108846076A - Massive multi-source data ETL method and system supporting interface adaptation

Info

Publication number: CN108846076A
Application number: CN201810588231.7A
Authority: CN
Inventors: 史玉良; 王新军; 张晖; 管永明; 吕梁; 刘智勇
Original assignee: DAREWAY SOFTWARE Co Ltd
Current assignee: DAREWAY SOFTWARE Co Ltd
Priority date: 2018-06-08
Filing date: 2018-06-08
Publication date: 2018-11-20

Abstract

The invention discloses a massive multi-source data ETL method and system supporting interface adaptation. The method includes: a data extraction step, which sets the basic information of the data source and the target database, adaptively matches a corresponding ETL tool to each data source, and configures the parameters of the ETL tool; a data conversion step, which performs ETL job control, execution, and scheduling management, buffers and manages the extracted data, and cleans and converts the data; a data loading step, which checks the quality of the converted data objects, outputs them according to the table structure defined by the data model, and loads the verified data into the target database as updates; and a data monitoring step, which monitors and manages the ETL job execution process, job resource usage, and system running status. A suitable ETL tool is matched adaptively, massive data is extracted and converted, and ETL jobs are executed efficiently and managed in an orderly way.

Description

Massive multi-source data ETL method and system supporting interface adaptation

Technical field

The present invention relates to the field of ETL management, and in particular to a massive multi-source data ETL method and system supporting interface adaptation.

Background technique

Industry has accumulated massive amounts of data, and the volume, variety, and rate of change of that data are all growing sharply. Big data is nevertheless not yet fully used, and the great value it contains remains to be mined. Big data is often multi-source and heterogeneous: it comes from different, dispersed business systems and includes structured, semi-structured, and unstructured data, which makes it hard to extract and convert into the data required. In a big data environment, data is characterized by large volume, diversity, and frequent interaction. As the collected data keeps growing, the data processing flow becomes increasingly complex and faces the problem of transferring massive multi-source data efficiently between different databases.

Traditional ETL tools are expensive, depend heavily on specific business scenarios, and use a centralized architecture: design, execution, and management are all concentrated on one server, which places high demands on hardware. Under the traditional ETL management model, the ETL tool is generally chosen manually according to the attributes of the source and target databases, and the ETL task flow, parameter settings, and task start-up are also handled manually. This manual process is complex, consumes a great deal of labor and time, and cannot meet the needs of ETL jobs on massive multi-source data. It is therefore necessary to explore an apparatus that can execute ETL (extract, transform, load) jobs more economically and efficiently in a big data environment.

Summary of the invention

The object of the invention is to solve the above problems by proposing a massive multi-source data ETL method and system supporting interface adaptation. For massive multi-source data from different, dispersed systems, a suitable ETL tool is selected adaptively by means of an interface adapter and an ETL tool engine, and big data processing technologies such as HDFS, MapReduce, and Spark are used to implement ETL job scheduling management and efficient execution, as well as centralized storage, processing, and conversion of massive complex data.

To achieve the above object, the present invention adopts the following technical solution:

As a first aspect of the present invention, a massive multi-source data ETL method supporting interface adaptation is provided;

The massive multi-source data ETL method supporting interface adaptation includes:

A data extraction step, which sets the basic information of the data source and the target database, adaptively matches a corresponding ETL tool to each data source, and configures the parameters of the ETL tool; the different data sources are extracted through a database interface, a log file interface, or a streaming data interface;

A data conversion step, which performs ETL job control, execution, and scheduling management on the MapReduce and Spark computing frameworks, buffers and manages the extracted data using HDFS, Hive, or HBase, and cleans and converts the data;

A data loading step, which checks the quality of the converted data objects, outputs them according to the table structure defined by the data model, and loads the verified data into the target database as updates;

A monitoring management step, which monitors and manages the ETL job execution process, job resource usage, and system running status.

As a further improvement of the present invention, the data extraction step includes:

A sub-step of setting the data source and target database, which sets the basic information of the data source and the target database, including: the database type, the connection type between the data source and the target database, the database IP, the database name, the port, the user name, and the password;

A sub-step of adaptively matching the ETL tool, which adaptively matches a corresponding ETL tool to each data source.

In the sub-step of adaptively matching the ETL tool, if the data source or target database holds database data, then the ETL tool Sqoop is matched when either side is the non-relational store HDFS, and the ETL tool Kettle is matched otherwise; if the data source is a log file, the ETL tool Flume is matched; if the data source is streaming data, the ETL tool Kafka is matched.

A sub-step of configuring ETL tool parameters, which sets the task parameters once ETL tool matching is complete.
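The matching rule of the adaptive matching sub-step above can be sketched as a small decision function. This is a minimal illustration, not the patented adaptation rule engine: the names `SourceKind` and `match_etl_tool` are hypothetical, and only the four branches stated in the text are modeled.

```python
from enum import Enum


class SourceKind(Enum):
    """Hypothetical classification of the data being extracted."""
    DATABASE = "database"
    LOG_FILE = "log_file"
    STREAM = "stream"


def match_etl_tool(kind: SourceKind, source_type: str = "", target_type: str = "") -> str:
    """Return the ETL tool name that the rule in the text assigns to a source."""
    if kind is SourceKind.LOG_FILE:
        return "Flume"
    if kind is SourceKind.STREAM:
        return "Kafka"
    # Database data: Sqoop when either side is HDFS, Kettle otherwise.
    if "hdfs" in (source_type.lower(), target_type.lower()):
        return "Sqoop"
    return "Kettle"
```

For example, `match_etl_tool(SourceKind.DATABASE, "MySQL", "HDFS")` selects Sqoop, while a MySQL-to-Oracle transfer selects Kettle.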

As a further improvement of the present invention, the data source includes: database data, pictures, audio files, video files, log files, or streaming data. Database data includes relational databases and non-relational databases; relational databases include Oracle, MySQL, and SQL Server; non-relational databases include HDFS, MongoDB, and HBase. Log files include log data of various types and formats from console, RPC (Thrift-RPC), text (file), tail (UNIX tail), syslog (the syslog logging system, supporting two modes, TCP and UDP), exec (command execution), and so on.

As a further improvement of the present invention, the target database is used for data sharing, report queries, and system applications.

As a further improvement of the present invention, the ETL tool includes Sqoop, Kettle, Flume, or Kafka. Sqoop is an open-source tool for transferring data between Hadoop and traditional databases (Oracle, MySQL, etc.); Kettle is an open-source ETL tool that performs data extraction with a workflow at its core; Flume is a system provided by Cloudera for collecting, aggregating, and transporting massive volumes of logs; Kafka is a high-throughput open-source stream processing platform.

As a further improvement of the present invention, the data conversion step includes:

A job flow design sub-step, which designs the job control flow according to the actual business logic, including the extraction mode and the ETL task flow.

A job scheduling management sub-step, which includes the job scheduling strategy, job dependency control, job priority configuration, and job scheduling control. The job scheduling strategy includes time-triggered, event-triggered, and real-time processing modes; job dependency control means defining the dependencies between jobs according to the actual business logic; job priority configuration means setting job priorities according to the actual business logic and system resource usage; job scheduling control means setting a resource warning threshold for job scheduling and pausing low-priority jobs when resource usage exceeds the threshold.

A job execution sub-step, which is responsible for executing ETL jobs.

In the job execution sub-step, Sqoop starts a map-only MapReduce job and reads data row by row according to the data split value; Kettle creates a Transformation and a Job and, once the task parameters of each stage are set, starts the workflow to extract data; Flume collects log data through its source component, caches it in the channel component, and sends it to the destination through the sink component; Kafka collects streaming data, which is then split into a series of batch jobs and processed in real time by resilient distributed datasets in Spark.

A distributed caching sub-step, which buffers the extracted data: HDFS is responsible for the underlying data storage; Hive is responsible for filtering, summarizing, querying, and analyzing the data; HBase is responsible for maintaining data changes and stores the data that is read frequently during the data conversion computation;

A business rule formulation sub-step, which formulates the business rules for data cleansing and conversion according to the actual business rules;

A data processing sub-step, which cleans and converts the data according to the formulated business rules. Data cleansing covers gap filling, correction, and cleaning of the data; data conversion covers inconsistency conversion, data granularity conversion, and standard conversion.

Inconsistency conversion: for example, the same user is coded A01 in system A and B01 in system B; after such data is extracted, it is converted to a single unified code;

Data granularity conversion: for example, the information stored for user M in system A is very detailed, while that stored in system B is relatively simple, so the granularity differs; after extraction, the data must be aggregated to a common granularity;

Standard conversion: for example, because business rules differ between business system A and system B, the same business data follows different standards in the two systems, and the standards must be unified after extraction.
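The three kinds of conversion described above can be illustrated with a minimal sketch. The mapping tables, field names, and values below (`CODE_MAP`, `GENDER_STANDARD`, `user`, `amount`) are hypothetical examples chosen for illustration; the patent does not specify any concrete mapping.

```python
from collections import defaultdict

# Hypothetical mapping: (system, local user code) -> one unified code.
CODE_MAP = {("A", "A01"): "U0001", ("B", "B01"): "U0001"}

# Hypothetical per-system encodings mapped onto one agreed standard.
GENDER_STANDARD = {
    "A": {"M": "male", "F": "female"},
    "B": {"1": "male", "2": "female"},
}


def unify_code(system: str, code: str) -> str:
    """Inconsistency conversion: map each system's code to the unified code."""
    return CODE_MAP[(system, code)]


def aggregate_by_user(records: list[dict]) -> dict[str, float]:
    """Granularity conversion: roll detailed rows up to one total per user."""
    totals: dict[str, float] = defaultdict(float)
    for record in records:
        totals[record["user"]] += record["amount"]
    return dict(totals)


def to_standard_gender(system: str, value: str) -> str:
    """Standard conversion: translate a system-specific value to the standard."""
    return GENDER_STANDARD[system][value]
```

With these tables, codes A01 from system A and B01 from system B both resolve to the same unified code, and detailed rows for one user collapse into a single aggregated figure.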

As a further improvement of the present invention, the data loading step includes:

A data quality check sub-step, which checks the quality of the converted data objects, verifies data anomalies caused by problems such as network interruption, and checks whether the quality of the converted data meets the standard of the target database;

A data update and load sub-step, which loads the verified data into the target database and, according to the predefined data model, updates the target data table by means of timestamps, log tables, full-table comparison, or full-table delete-and-insert.
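Of the update modes listed above, the timestamp mode can be sketched as follows. This is an illustrative sketch under stated assumptions: the target table layout (`target` with `id`, `name`, `updated_at`), the function name `load_incremental`, and the use of an in-memory SQLite database as a stand-in for the target database are all hypothetical.

```python
import sqlite3


def load_incremental(conn: sqlite3.Connection, rows: list[dict], last_loaded_ts: int) -> int:
    """Upsert only rows changed after last_loaded_ts; return the new watermark."""
    fresh = [row for row in rows if row["updated_at"] > last_loaded_ts]
    conn.executemany(
        "INSERT OR REPLACE INTO target (id, name, updated_at) "
        "VALUES (:id, :name, :updated_at)",
        fresh,
    )
    conn.commit()
    # If nothing was newer, the watermark stays where it was.
    return max((row["updated_at"] for row in fresh), default=last_loaded_ts)


# Stand-in target database with one previously loaded row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)")
conn.execute("INSERT INTO target VALUES (1, 'old', 100)")

extracted = [
    {"id": 1, "name": "new", "updated_at": 150},    # changed since last load
    {"id": 2, "name": "added", "updated_at": 160},  # new row
    {"id": 3, "name": "stale", "updated_at": 90},   # older than the watermark, skipped
]
watermark = load_incremental(conn, extracted, last_loaded_ts=100)
```

The returned watermark would be stored and passed as `last_loaded_ts` on the next run, so each run touches only rows modified since the previous one.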

As a further improvement of the present invention, the monitoring management step includes:

A job monitoring management sub-step, which monitors the execution process and resource usage of ETL jobs;

The ETL job execution process monitoring sub-step monitors information including job execution time, job progress, timeouts, job interruptions, and job backlog. Job execution time is monitored, timeout reminders are set, and timeout problems are analyzed manually; job execution logs are monitored, and when a job is interrupted, its execution is re-triggered according to the defined interruption recovery mechanism; job progress is monitored, and when jobs back up, they are queued by priority and high-priority jobs are executed first.

The ETL job resource monitoring sub-step monitors the usage of job resources; if the resource load exceeds the threshold, the load is adjusted by pausing or stopping some low-priority jobs, which are executed again once the load falls below the threshold;

A system monitoring management sub-step, which monitors machine hardware information and cluster running status, and manages the metadata and the database, log file, and streaming data interfaces.
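The timeout reminder and interruption recovery behavior of the job execution monitoring sub-step above can be sketched as a small wrapper. This is only an illustration under assumptions: the function name `run_with_recovery`, the retry budget, and the use of `ConnectionError` to represent a job interruption are hypothetical, and the patent leaves the actual recovery mechanism unspecified.

```python
import time


def run_with_recovery(job, timeout_s: float, max_retries: int = 2, on_timeout=print):
    """Run job(); re-trigger it after an interruption and report slow runs.

    Returns (result, attempt_number). ConnectionError stands in for a
    job interruption such as a dropped network link (an assumption).
    """
    for attempt in range(1, max_retries + 2):
        started = time.monotonic()
        try:
            result = job()
        except ConnectionError:
            if attempt > max_retries:
                raise  # recovery budget exhausted, surface the failure
            continue  # interruption: re-trigger the job
        elapsed = time.monotonic() - started
        if elapsed > timeout_s:
            # Timeout reminder only; analysis is left to an operator.
            on_timeout(f"job exceeded {timeout_s}s (took {elapsed:.1f}s)")
        return result, attempt
```

The reminder is raised after the job finishes rather than by cancelling it, which matches the text's description of a timeout reminder followed by manual analysis.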

As a second aspect of the present invention, a massive multi-source data ETL system supporting interface adaptation is provided;

The massive multi-source data ETL system supporting interface adaptation includes:

A data extraction module, which sets the basic information of the data source and the target database, adaptively matches a corresponding ETL tool to each data source, and configures the parameters of the ETL tool; the different data sources are extracted through a database interface, a log file interface, or a streaming data interface;

A data conversion module, which performs ETL job execution and scheduling management on the MapReduce and Spark computing frameworks, buffers and manages the extracted data using HDFS, Hive, or HBase, and cleans and converts the data;

A data loading module, which checks the quality of the converted data objects, outputs them according to the table structure defined by the data model, and loads the verified data into the target database as updates;

A monitoring management module, which monitors and manages the ETL job execution process, job resource usage, and system running status.

As a further improvement of the present invention, the data source includes: database data, pictures, audio files, video files, log files, or streaming data. Database data includes relational databases and non-relational databases; relational databases include Oracle, MySQL, SQL Server, etc.; non-relational databases include HDFS, MongoDB, HBase, etc. The log files include log data of various types and formats from console, RPC (Thrift-RPC), text (file), tail (UNIX tail), syslog (the syslog logging system, supporting two modes, TCP and UDP), exec (command execution), and so on.

As a further improvement of the present invention, the target database is used for data sharing, report queries, and system applications.

As a further improvement of the present invention, the data extraction module includes:

An interface adapter, which provides data interfaces for the different data types (database data, log files, and streaming data) and formulates the ETL tool adaptation rules; the interface adapter further includes an adaptation rule engine;

The data interfaces include a database interface, a log file interface, and a streaming data interface. Database data is extracted from relational or non-relational databases through the database interface; log files are extracted through the log file interface; streaming data is extracted through the streaming data interface.

The adaptation rule engine is used to set the basic information of the data source and the target database, including: the database type, the connection type between the data source and the target database, the database IP, the database name, the port, the user name, and the password. The adaptation rules include the rules for determining the types of the data source and target database, and the adaptation rules for the different ETL tools.

An ETL tool engine, which integrates and manages the ETL tools, including Sqoop, Flume, Kettle, and Kafka, and adaptively matches a suitable ETL tool to heterogeneous data from different data sources, such as database data, log files, and streaming data.

Sqoop is an open-source tool for transferring data between Hadoop and traditional databases;

Kettle is an open-source ETL tool that performs data extraction with a workflow at its core;

Flume is a system provided by Cloudera for collecting, aggregating, and transporting massive volumes of logs;

Kafka is a high-throughput open-source stream processing platform.

In the ETL tool engine, the ETL adaptation engine selects an ETL tool for each data source according to the rules formulated in the adaptation rule engine of the interface adapter: the rules for determining the data source and target database types, and the adaptation rules for the different ETL tools.

As a further improvement of the present invention, the data conversion module includes:

A job scheduling management engine, which uses the master node of the distributed cluster as the scheduling management engine and includes a job management unit and a task scheduling unit;

The job management unit covers ETL job design, job configuration, and job monitoring. ETL job design means designing the job control flow according to the actual business logic, including job dependencies, whether extraction is incremental, and the extraction frequency; job configuration means configuring parameters such as job priority and job execution mode; job monitoring means monitoring and managing the execution status and resource usage of ETL jobs;

The task scheduling unit covers the job scheduling strategy and job scheduling control. The job scheduling strategy includes time-triggered, event-triggered, and real-time processing modes; job scheduling control means setting a resource warning threshold for job scheduling and pausing low-priority jobs when resource usage exceeds the threshold.

A job execution engine, which is responsible for executing ETL jobs; it uses the slave nodes of the distributed cluster as the execution engine, implementing offline processing of ETL jobs on the MapReduce computing framework and real-time processing of ETL jobs on the Spark computing framework.

In the job execution engine, Sqoop starts a map-only MapReduce job and reads data row by row according to the data split value; Kettle creates a Transformation and a Job and, once the task parameters of each stage are set, starts the workflow to extract data; Flume collects log data through its source component, caches it in the channel component, and sends it to the destination through the sink component; Kafka collects streaming data, which is then split into a series of batch jobs and processed in real time by resilient distributed datasets in Spark.

A distributed caching submodule, which buffers and manages the extracted data: HDFS is responsible for the underlying data storage; Hive is responsible for filtering, summarizing, querying, or analyzing the data; HBase is responsible for maintaining data changes and stores the data that is read frequently during the data conversion computation;

A business rule engine, which, according to the actual business rules, formulates cleansing rules for incomplete data, erroneous data, and dirty data, and formulates conversion rules for heterogeneous data from different business systems;

A data processing submodule, which includes a data cleansing unit and a data conversion unit and cleans and converts the data according to the actual business rules;

The data cleansing unit performs gap filling, correction, and cleaning of the data. Gap filling completes the missing and mismatched information in incomplete data; data correction amends erroneous data according to the specific business; data cleaning designs cleaning rules for dirty data and determines its correctness;

The data conversion unit converts the massive heterogeneous data from different business systems into the data required by the target database. Inconsistency conversion unifies data of the same type from different business systems; data granularity conversion aggregates the extracted business system data according to the granularity of the target database; data standard conversion converts the extracted data into the standard data required by the target database according to a pre-established standard data model.
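The three operations of the data cleansing unit described above (gap filling, correction, cleaning) can be sketched as one pass over the extracted records. The function name `cleanse` and its parameters are hypothetical, and the rules themselves would come from the business rule engine; here they are passed in as plain Python values for illustration.

```python
def cleanse(records, defaults, corrections, is_valid):
    """Apply gap filling, correction, and cleaning to a list of dict records.

    defaults:    field -> value used when the field is missing or None
    corrections: (field, wrong_value) -> corrected value
    is_valid:    predicate deciding whether a row is kept (dirty rows are dropped)
    """
    cleaned = []
    for record in records:
        # Gap filling: start from defaults, overlay the values actually present.
        present = {k: v for k, v in record.items() if v is not None}
        row = {**defaults, **present}
        # Correction: replace known-wrong values according to business rules.
        row = {k: corrections.get((k, v), v) for k, v in row.items()}
        # Cleaning: keep only rows that pass the validity rule.
        if is_valid(row):
            cleaned.append(row)
    return cleaned
```

A call such as `cleanse(rows, {"city": "unknown"}, {("gender", "m"): "M"}, lambda r: r["id"] > 0)` fills missing cities, normalizes a lowercase gender code, and drops rows with a non-positive id.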

As a further improvement of the present invention, the data loading module includes:

A quality check submodule, which checks the quality of the converted data objects, verifies data anomalies caused by problems such as network interruption, and checks whether the quality of the converted data meets the standard of the target database;

A data update and load submodule, which loads the verified data into the target database and, according to the predefined data model, updates the target data table by means of timestamps, log tables, full-table comparison, or full-table delete-and-insert.
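A quality check of the kind performed before loading might look like the sketch below. The specific checks are assumptions chosen for illustration: a row-count comparison as a proxy for data lost to a network interruption, and a per-column type check as a proxy for the target database standard. The name `check_quality` and the `schema` format are hypothetical.

```python
def check_quality(rows, schema, expected_count):
    """Return a list of problems; an empty list means the batch may be loaded.

    schema maps each required column name to the Python type expected for it.
    """
    problems = []
    # A count mismatch suggests rows were lost, e.g. to a network interruption.
    if len(rows) != expected_count:
        problems.append(f"row count {len(rows)} != expected {expected_count}")
    for i, row in enumerate(rows):
        for column, column_type in schema.items():
            if row.get(column) is None:
                problems.append(f"row {i}: missing {column}")
            elif not isinstance(row[column], column_type):
                problems.append(f"row {i}: {column} is not {column_type.__name__}")
    return problems
```

Only batches for which the returned list is empty would be passed on to the update and load submodule.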

As a further improvement of the present invention, the monitoring management module includes:

A job monitoring submodule, which monitors the execution process of ETL jobs, including job execution time, timeouts, job interruptions, job backlog, job progress, and job resource usage;

A load balancing submodule, which assesses the resource load according to the execution status of ETL jobs and adjusts the load when it exceeds the threshold by pausing or stopping some low-priority jobs, which are executed again once the load falls below the threshold;

A metadata management submodule: metadata, as data describing data attribute information, stores the data source definitions, conversion rule definitions, and execution process definitions; this submodule manages the metadata involved in ETL jobs;

An interface management submodule, which performs interface definition, protocol adaptation, and data encapsulation for the database interface, log file interface, and streaming data interface, and supports the HTTP, FTP, REST, and web service interface protocols.

A running status monitoring submodule, which monitors and collects machine hardware information, resource information, load information, cluster component status, and cluster running information, and perceives and analyzes the running status of the cluster in real time.
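The load adjustment performed by the load balancing submodule above can be sketched as choosing which running jobs to pause. This is a minimal sketch under assumptions: each job reports a numeric load share, a lower priority number means less important, and pausing stops as soon as the total is no longer above the threshold. The function name `jobs_to_pause` is hypothetical.

```python
def jobs_to_pause(running, threshold):
    """Pick low-priority jobs to pause until total load is within the threshold.

    running: list of (name, priority, load) tuples; lower priority pauses first.
    Returns the names of the jobs to pause, in the order they would be paused.
    """
    total = sum(load for _, _, load in running)
    paused = []
    for name, _, load in sorted(running, key=lambda job: job[1]):
        if total <= threshold:
            break  # load is back within the threshold
        paused.append(name)
        total -= load
    return paused
```

The paused jobs would be resumed later, once monitoring shows the load has fallen back below the threshold.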

The beneficial effects of the invention are as follows：

1. For massive heterogeneous data (structured, semi-structured, and unstructured) from different, dispersed systems, a suitable ETL tool is matched adaptively by means of the interface adapter and the ETL tool engine.

2. The high-performance cloud ETL management apparatus performs ETL job scheduling management and execution on a distributed cluster with the MapReduce and Spark frameworks, achieving orderly management and efficient execution of complex big data ETL jobs.

3. Massive data is cached on the distributed file system HDFS, and data cleansing and conversion are carried out on the MapReduce and Spark frameworks, achieving efficient extraction, centralized storage, and processing of data from different dispersed systems, which benefits data sharing and further research.

Brief description of the drawings

The accompanying drawings, which form a part of this application, are provided for further understanding of the application; the illustrative embodiments of the application and their descriptions are used to explain the application and do not unduly limit it.

Fig. 1 is a flow chart of the method of the present invention;

Fig. 2 is a functional module connection diagram of the present invention;

Fig. 3 is a flow chart of cloud ETL tool matching;

Fig. 4 is a flow chart of ETL job execution and scheduling management;

Fig. 5 is a specific application embodiment of the present invention.

Detailed description of the embodiments

It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the application. Unless otherwise indicated, all technical and scientific terms used herein have the same meanings as commonly understood by a person of ordinary skill in the technical field to which this application belongs.

It should be noted that the terms used herein are only for describing specific embodiments and are not intended to limit the illustrative embodiments according to this application. As used herein, unless the context clearly indicates otherwise, the singular form is also intended to include the plural form. In addition, it should be understood that when the terms "comprising" and/or "including" are used in this specification, they indicate the presence of features, steps, operations, devices, components, and/or combinations thereof.

Referring to Fig. 1, which is a flow chart of the massive multi-source data ETL method supporting interface adaptation of the present invention, the method includes the following steps:

Process 101: Set the data source and target database. The basic information of the data source and the target database is set, including: database type, connection type, database IP, database name, port, user name, and password.

Process 102: Adaptively match the ETL tool. An ETL tool such as Sqoop, Kettle, Flume, or Kafka is matched adaptively to each kind of data source, such as database data, log files, or streaming data.

Process 103: Configure ETL tool parameters. After ETL tool matching is complete, initial parameter values such as task parameters are set.

Process 104: Job flow design. The ETL job control flow is designed according to the actual business logic, including the data extraction mode and the specific ETL workflow corresponding to each ETL tool.

Process 105: Job scheduling management. The job scheduling strategy, job dependencies, job priorities, job scheduling, and so on are configured and managed.

Process 106: Job execution, which is responsible for executing the specific ETL jobs.

Process 107: ETL job monitoring management. The execution process and resource usage of ETL jobs are monitored.

Process 108: Distributed caching. The extracted data is buffered: HDFS is responsible for the underlying data storage, Hive is responsible for data management, and HBase is responsible for maintaining data changes and stores the data that is read frequently during the data conversion computation.

Process 109: Business rule formulation. The business rules for data cleansing and conversion are formulated according to the actual business logic.

Process 110: Data processing. The data is cleaned and converted according to the formulated business rules: data cleansing covers gap filling, correction, and cleaning of the data, and data conversion covers inconsistency conversion, data granularity conversion, and standard conversion.

Process 111: Data quality check. The quality of the converted data is checked, data anomalies caused by problems such as network interruption are verified, and the converted data is checked against the standard of the target database.

Process 112: Data update and load. The verified data is loaded into the target database, and the target data table is updated according to the predefined data model by means such as timestamps, log tables, full-table comparison, or full-table delete-and-insert.

Process 113: System monitoring management. Information such as machine hardware information and cluster running status is monitored, and the metadata, database interface, log file interface, streaming data interface, and so on are managed.

Referring to Fig. 2, the massive multi-source data ETL system supporting interface adaptation of the present invention includes a data extraction module, a data conversion module, a data loading module, and a monitoring management module.

The data extraction module consists of an interface adapter and an ETL tool engine. The interface adapter includes the data interfaces and the adaptation rule engine; it specifies the data source and target database, designs the extraction rules, and completes interface adaptation. The ETL tool engine integrates and manages ETL tools such as Sqoop, Flume, Kettle, and Kafka, and adaptively matches the corresponding ETL tool to different data according to the interface adaptation rules.

The data conversion module consists of a job scheduling management engine, a job execution engine, a distributed cache area, a business rule engine, and a data processing submodule.

The job scheduling management engine is responsible for ETL job management and task scheduling.

The job execution engine is responsible for executing the specific ETL jobs; it implements offline processing of ETL jobs on the MapReduce computing framework and real-time processing of ETL jobs on the Spark computing framework.

The distributed cache area buffers the extracted data: HDFS is responsible for the underlying data storage, Hive is responsible for data management, and HBase is responsible for maintaining data changes and stores the data that is read frequently during the data conversion computation;

The business rule engine formulates the business rules for data cleansing and conversion according to the actual business logic;

The data processing submodule cleans and converts the data according to the actual business rules;

The data loading module consists of a quality check submodule and a load-and-update submodule; it checks the converted data and loads it into the target database as updates.

The monitoring management module includes a job monitoring submodule, a load balancing submodule, a metadata management submodule, an interface management submodule, and a running status monitoring submodule.

The job monitoring submodule monitors the execution process of ETL jobs and handles abnormal job conditions;

The load balancing submodule assesses the resource load according to job execution status and adjusts the load through load migration, keeping resource utilization continuously maximized;

The metadata management submodule manages metadata such as data source definitions, conversion rule definitions, and execution process definitions;

The interface management submodule performs interface definition, protocol adaptation, and data encapsulation for interfaces such as the database interface and the log file interface;

Referring to Fig. 3, which is a flow chart of cloud ETL tool matching of the present invention, the flow includes the following steps:

Process 301: The user specifies the data source and target database and sets the database information, including database type, connection type, database IP, database name, port, user name, password, and so on.

Process 302: The ETL tool is matched adaptively according to the types of the data source and target database. For database data, go to process 303; for log file data, go to process 306; for streaming data, go to process 307.

Process 303: For database data, if either the data source or the target database is HDFS, go to process 304; for other types, go to process 305.

Process 304：For the data pick-up between HDFS and other databases, ETL tool Sqoop is adaptively matched.

Process 305：For between the relational datas such as Oracle, MySQL, and with the non-relationals data such as MongoDB The data pick-up in library adaptively matches ETL tool Kettle.

Process 306：For log file data, ETL tool Flume is matched, acquisition comes from console (console), RPC (Thrift-RPC), (syslog log system supports TCP and UDP etc. 2 by text (file), tail (UNIX tail), syslog Kind of mode), the daily record datas of the various types of exec (order executes) etc. and format.

Process 307：For flow data, ETL tool Kafka is matched, the flow data high to requirement of real-time is acquired number According to.

Process 308：ETL tool matching is completed.
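As a non-limiting illustration, the decision logic of the matching flow (processes 302 to 307) can be sketched in Python as follows. The names `SourceKind` and `match_etl_tool`, and the use of plain strings to identify the database type, are hypothetical choices made for this sketch and are not part of the disclosed system.

```python
from enum import Enum


class SourceKind(Enum):
    """Kinds of data distinguished in process 302 (hypothetical names)."""
    DATABASE = "database"
    LOG_FILE = "log_file"
    STREAM = "stream"


def match_etl_tool(kind, source_type="", target_type=""):
    """Return the name of the ETL tool selected by the matching flow."""
    if kind is SourceKind.LOG_FILE:
        return "Flume"   # process 306: log file data
    if kind is SourceKind.STREAM:
        return "Kafka"   # process 307: streaming data
    # Process 303: database data. Sqoop when either side is HDFS, Kettle otherwise.
    if "hdfs" in (source_type.lower(), target_type.lower()):
        return "Sqoop"   # process 304
    return "Kettle"      # process 305


if __name__ == "__main__":
    print(match_etl_tool(SourceKind.DATABASE, "Oracle", "HDFS"))
    print(match_etl_tool(SourceKind.DATABASE, "Oracle", "MySQL"))
```

In this sketch the database type is compared case-insensitively, which is an assumption; an actual adaptation engine would draw the type from the database information set in process 301.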

Fig. 4 is the ETL job execution and scheduling flow chart of the present invention, which includes the following steps:

Process 401: According to the goals and tasks of data extraction, the actual business logic is sorted out.

Process 402: Create the ETL job control flow. The execution flow of the ETL job is designed according to the actual business logic, including job dependencies, whether extraction is incremental, extraction frequency, etc. Kettle is configured quickly through a visual interface, whereas ETL tools such as Sqoop, Flume, and Kafka require commands to be written and executed.

Process 403: Set the ETL job scheduling strategy, including time triggering, event triggering, and real-time processing.

Process 404: Set the ETL job priority. Priorities are configured for ETL jobs according to the actual business logic, so that high-priority jobs are scheduled first when resources are insufficient.

Process 405: Set the ETL job scheduling control strategy. A resource warning threshold for job scheduling is set, and when resource usage exceeds the threshold, low-priority ETL jobs are paused.

Process 406: Select the job execution mode, including local execution, remote execution, cluster execution, etc.

Process 407: Execute the job. ETL job execution is started according to the job configuration parameters. Sqoop starts a map-only MapReduce job and reads data row by row according to the data split value; Kettle creates a transformation (Transformation) and a job (Job) and, after the task parameters of each step are set, starts the workflow to extract data; Flume collects log data through its source component, buffers it in the channel component, and sends it to the destination through the sink component; Kafka collects streaming data, which is then decomposed into a series of batch jobs and processed in real time through the resilient distributed datasets in Spark.

Process 408: Monitor the ETL job execution state. When memory usage exceeds the threshold, some jobs are paused; they are executed again once resource usage drops below the threshold, thereby achieving load balancing.

Process 409: Check whether execution is complete; if all jobs have completed, the flow ends.
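The priority and threshold handling of processes 404, 405, and 408 can be sketched as follows. This is a minimal illustration under stated assumptions: the class `EtlJob`, the function `rebalance`, the convention that a larger number means a higher priority, and the modelling of resource usage as a single fraction per job are all hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass


@dataclass
class EtlJob:
    name: str
    priority: int          # assumed convention: larger number = higher priority
    resource_cost: float   # assumed: fraction of cluster resources the job needs
    state: str = "pending"


def rebalance(jobs, threshold):
    """Run jobs in priority order; pause the remaining lower-priority jobs
    as soon as starting one more would push usage past the threshold."""
    usage = 0.0
    blocked = False
    active = [job for job in jobs if job.state != "done"]
    for job in sorted(active, key=lambda j: j.priority, reverse=True):
        if not blocked and usage + job.resource_cost <= threshold:
            job.state = "running"
            usage += job.resource_cost
        else:
            blocked = True
            job.state = "paused"
    return usage


if __name__ == "__main__":
    jobs = [EtlJob("load_profiles", 3, 0.5),
            EtlJob("load_statistics", 2, 0.4),
            EtlJob("load_logs", 1, 0.3)]
    rebalance(jobs, threshold=0.8)
    print([(j.name, j.state) for j in jobs])
    jobs[0].state = "done"            # the high-priority job finishes
    rebalance(jobs, threshold=0.8)    # paused jobs resume once usage allows
    print([(j.name, j.state) for j in jobs])
```

Calling `rebalance` again after a job completes corresponds to re-executing paused jobs once resource usage falls below the threshold.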

Fig. 5 shows an embodiment of the present invention, which includes modules such as the data source, the data extraction module, the data conversion module, and the data cloud platform.

The power information big data cloud platform extracts multiple types of data, such as user profile data, collected data, statistical query data, management data, communication data, and monitoring data, from peripheral systems dispersed in various locations, such as the power information acquisition system, the sales service application system, the production management platform, the electricity-water-gas-heat multi-in-one system, and the front-end communication platform. The data are stored and analyzed in a unified manner to provide data support for in-depth research on power information big data, and streaming data such as server monitoring information is collected in real time to ensure stable operation of the platform.

Data source: includes the database data of the peripheral systems and the streaming data of the platform servers. The databases of peripheral systems such as the power information acquisition system are mostly Oracle databases and the HDFS distributed file system, and contain structured basic data as well as semi-structured and unstructured data such as documents, pictures, audio, and video. Because HDFS offers distributed storage and high access efficiency, most power information big data in these peripheral systems is stored in HDFS, covering most of the structured data, such as basic profile data, collected data, statistical query data, and monitoring data, together with semi-structured and unstructured data such as documents, pictures, audio, and video. The Oracle database stores part of the basic profile data and the less frequently used statistical analysis data. Given the differences in access efficiency and in stored data types between HDFS and the Oracle database, most of the structured data and all of the semi-structured and unstructured data are extracted from HDFS, while the basic profile data and statistical analysis data that are not present in HDFS are extracted from the Oracle database. The details are shown in Table 1:

Table 1: Data source information of the power information big data cloud platform

The data extraction module consists of the database interface, the adaptation engine, and the ETL tool engine.

The database interface provides database interfaces for the Oracle databases and HDFS of peripheral systems such as the power information acquisition system, and provides a streaming data interface for the streaming data extracted from the platform servers;

The adaptation engine provides ETL tool adaptation rules for power information big data, including decision rules for the data source type, decision rules for the database type, and matching rules for the ETL tools.

The ETL tool engine matches the corresponding ETL tool to different types of data from different data sources according to the adaptation rules. It first determines the type of the data source; for real-time collection of platform server streaming data, Kafka is matched;

For data from a peripheral system database, the database type is further determined. For Oracle databases in peripheral systems such as the power information acquisition system, the sales service application system, the production management platform, and the electricity-water-gas-heat multi-in-one system, the ETL tool Kettle is matched; for HDFS in peripheral systems such as the power information acquisition system, the electricity-water-gas-heat multi-in-one system, and the front-end communication platform, the ETL tool Sqoop is matched. After ETL tool matching is complete, initial parameter values such as task parameters, database connection, user name, password, and permissions are set.

The data conversion module is implemented on a distributed cluster and consists of the job control engine, the job scheduling engine, the job execution engine, and the data processing and loading module.

The job control engine in the data conversion module designs the ETL job control flow according to the actual business logic of the power information big data cloud platform. The flow design approach differs between ETL tools, as shown in Table 2:

Table 2: ETL job control flow settings

The job scheduling engine in the data conversion module performs job priority configuration, job scheduling, and monitoring for power information big data extraction tasks. The scheduling strategies are shown in Table 3:

Table 3: ETL job scheduling strategies

The job execution engine in the data conversion module executes the ETL extraction jobs according to the established ETL job control flow for power information big data. Sqoop starts a map-only MapReduce job, splits the data according to the data split parameter value, and assigns the resulting ranges to different map tasks; each map task reads data row by row from HDFS according to the metadata information stored in the data container. Data extraction based on Kettle is implemented mainly by creating a Transformation and a Job. A Transformation adds each step to the main window through visual design and connects the steps, producing a file with the extension .ktr. A Job executes the designed transformation tasks based on a workflow model and has the file extension .kjb. Streaming data batch jobs are processed with Spark Streaming: the streaming data is first decomposed at fixed time intervals into a series of short batch jobs, each batch job is then converted into resilient distributed datasets (RDDs) in Spark, and the data is processed through RDD operations, including map, filter, join, and group by.
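The micro-batch idea described for streaming data can be illustrated without a Spark cluster. The following plain-Python sketch is an assumption-laden stand-in rather than Spark Streaming code: it splits timestamped events into fixed-interval batches and applies map, filter, and group-by steps to each batch. The function names, the event layout, and the choice of server CPU load as the monitored value are hypothetical.

```python
from collections import defaultdict


def micro_batches(events, interval):
    """Split (timestamp, payload) events into consecutive windows of `interval` seconds."""
    windows = defaultdict(list)
    for timestamp, payload in events:
        windows[int(timestamp // interval)].append(payload)
    return [windows[key] for key in sorted(windows)]


def process_batch(batch):
    """Apply map, filter and group-by to one batch: mean load per server."""
    parsed = [(server, float(load)) for server, load in batch]             # map
    valid = [(s, value) for s, value in parsed if 0.0 <= value <= 100.0]   # filter
    grouped = defaultdict(list)                                            # group by
    for server, value in valid:
        grouped[server].append(value)
    return {server: sum(vals) / len(vals) for server, vals in grouped.items()}


if __name__ == "__main__":
    events = [
        (0.5, ("srv1", "40")),
        (1.2, ("srv2", "55")),
        (1.9, ("srv1", "60")),
        (2.1, ("srv1", "250")),   # out-of-range reading, dropped by the filter
        (2.4, ("srv2", "45")),
    ]
    for batch in micro_batches(events, interval=2.0):
        print(process_batch(batch))
```

In an actual deployment each batch would be an RDD and the three steps would be RDD transformations executed on the cluster; the sketch only mirrors their order.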

The data processing and loading module is implemented on the distributed computing frameworks MapReduce and Spark and includes a data cache module, a data cleansing module, a data conversion module, and a data loading module.

The data cache module in the data processing and loading module consists of HDFS, Hive, and HBase and caches the extracted power information big data: HDFS is responsible for data storage, Hive for data management, and HBase for maintaining data changes.

The data cleansing module in the data processing and loading module performs gap filling, correction, and cleaning of power information big data. Gap filling completes the missing and mismatched information in incomplete data; data correction amends, according to the specific business, erroneous data that was written to the database without logical validation when it was received; data cleaning designs cleaning rules for dirty data and determines its correctness.
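The three cleansing operations can be sketched as a small pipeline. The field names (`meter_id`, `region`, `kwh`), the default values, the rule that a negative reading is replaced by its absolute value, and the upper bound `MAX_KWH` are hypothetical business rules chosen only to make the example concrete; the disclosure leaves the actual rules to the specific business.

```python
MAX_KWH = 10_000.0  # assumed plausibility bound for a single reading


def fill_gaps(record, defaults):
    """Gap filling: complete missing (None) fields from reference defaults."""
    present = {key: value for key, value in record.items() if value is not None}
    return {**defaults, **present}


def correct(record):
    """Correction: repair a value stored without validation (assumed rule)."""
    fixed = dict(record)
    kwh = fixed.get("kwh")
    if kwh is not None and kwh < 0:
        fixed["kwh"] = abs(kwh)
    return fixed


def is_clean(record):
    """Cleaning rule: a record needs a meter id and a plausible reading."""
    kwh = record.get("kwh")
    return bool(record.get("meter_id")) and kwh is not None and 0 <= kwh <= MAX_KWH


def cleanse(records, defaults):
    """Fill gaps, correct, then drop records that fail the cleaning rule."""
    result = []
    for record in records:
        candidate = correct(fill_gaps(record, defaults))
        if is_clean(candidate):
            result.append(candidate)
    return result


if __name__ == "__main__":
    raw = [
        {"meter_id": "M1", "region": None, "kwh": 12.5},
        {"meter_id": "M2", "region": "north", "kwh": -3.0},
        {"meter_id": "", "region": "south", "kwh": 5.0},
    ]
    for row in cleanse(raw, {"region": "unknown"}):
        print(row)
```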

The data conversion module in the data processing and loading module performs inconsistency conversion, data granularity conversion, and standard conversion of power information big data. Inconsistency conversion integrates the data according to the characteristics of power information big data and the business rules, unifying data of the same type from different business systems; data granularity conversion aggregates business system data according to the granularity of the target data warehouse; data standard conversion converts the extracted data into the standard data required by the platform, according to the standard data model defined by the power information big data cloud platform. Data conversion tasks are implemented on MapReduce and Spark: offline computation over massive power consumption data is implemented on the MapReduce framework, and real-time computation over big data is implemented on the Spark parallel computing architecture.
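The three conversions can be illustrated on a toy data set. In this sketch, which rests entirely on assumed names and rules, inconsistency conversion unifies energy units, granularity conversion aggregates hourly readings to daily totals, and standard conversion emits rows in an assumed target layout (`meter_id`, `date`, `energy_kwh`). None of these field names or unit factors come from the disclosure.

```python
from collections import defaultdict

UNIT_TO_KWH = {"kWh": 1.0, "Wh": 0.001, "MWh": 1000.0}  # assumed source units


def unify(record):
    """Inconsistency conversion: express every reading in kWh."""
    return {
        "meter_id": record["meter_id"],
        "day": record["timestamp"][:10],   # ISO timestamp, keep YYYY-MM-DD
        "kwh": record["value"] * UNIT_TO_KWH[record["unit"]],
    }


def aggregate_daily(records):
    """Granularity conversion: sum readings per meter and day."""
    totals = defaultdict(float)
    for record in records:
        totals[(record["meter_id"], record["day"])] += record["kwh"]
    return dict(totals)


def to_standard_rows(totals):
    """Standard conversion: emit rows in the assumed target table layout."""
    return [
        {"meter_id": meter, "date": day, "energy_kwh": round(value, 3)}
        for (meter, day), value in sorted(totals.items())
    ]


if __name__ == "__main__":
    readings = [
        {"meter_id": "M1", "timestamp": "2018-06-08T01:00", "value": 1500, "unit": "Wh"},
        {"meter_id": "M1", "timestamp": "2018-06-08T02:00", "value": 2.0, "unit": "kWh"},
        {"meter_id": "M2", "timestamp": "2018-06-08T01:00", "value": 0.001, "unit": "MWh"},
    ]
    unified = [unify(r) for r in readings]
    for row in to_standard_rows(aggregate_daily(unified)):
        print(row)
```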

The data loading module in the data processing and loading module loads the cleaned and converted data into the power information big data cloud platform.

The data cloud platform is the target database for data loading. It uses the HDFS distributed file system to store, in a unified manner, the different types of massive power information data extracted from dispersed data sources, divided by data type into a production database and an analysis database.

The power information big data cloud platform uses the cloud ETL management device to integrate and effectively manage ETL tools such as Sqoop, Kettle, and Kafka. It adaptively matches the corresponding ETL tool to the Oracle databases and HDFS of dispersed peripheral systems, such as the power information acquisition system and the front-end communication platform, and to the streaming data of the servers. Based on the distributed computing framework MapReduce and the parallel processing framework Spark, it completes the extraction and conversion of massive structured, semi-structured, and unstructured data, achieves efficient execution and orderly management of ETL jobs, and achieves centralized storage and processing of massive power information big data. This facilitates data sharing and in-depth research based on the massive data in the power information big data cloud platform.

The foregoing is merely a preferred embodiment of the present application and is not intended to limit the application. For those skilled in the art, various modifications and variations of the application are possible. Any modification, equivalent replacement, or improvement made within the spirit and principles of the application shall fall within its scope of protection.

Claims

1. A massive multi-source ETL processing method supporting interface adaptation, characterized by comprising:

a data extraction step: setting the basic information of the data source and the target database, adaptively matching the corresponding ETL tool to different data sources, and setting the parameters of the ETL tool; extracting from the different data sources through a database interface, a log file interface, or a streaming data interface;

a data conversion step: performing ETL job control execution and scheduling based on the MapReduce and Spark computing frameworks, buffering and managing the extracted data based on HDFS, Hive, or HBase, and cleaning and converting the data;

a data loading step: performing a quality check on the converted data objects, outputting them according to the table structure defined by the data model, and updating and loading the data that passes the check into the target database;

a monitoring management step: monitoring and managing the ETL job execution process, job resource usage, and system operation.

2. The massive multi-source ETL processing method supporting interface adaptation according to claim 1, characterized in that the data extraction step comprises:

a data source and target database setting sub-step: setting the basic information of the data source and the target database, including: database type, connection type between the data source and the target database, database IP, database name, port, user name, and password;

an adaptive ETL tool matching sub-step: adaptively matching the corresponding ETL tool to different data sources;

in the adaptive ETL tool matching sub-step, if the data source or the target database holds database data, the ETL tool Sqoop is adaptively matched when either side is the non-relational store HDFS, and the ETL tool Kettle is adaptively matched otherwise; if the data source is a log file, the ETL tool Flume is adaptively matched; if the data source is streaming data, the ETL tool Kafka is adaptively matched.

3. The massive multi-source ETL processing method supporting interface adaptation according to claim 1, characterized in that the data conversion step comprises:

a job flow design sub-step: designing the job control flow according to the actual business logic, including the extraction mode and the ETL business flow;

a job scheduling management sub-step, including job scheduling strategy, job dependency control, job priority configuration, and job scheduling control, wherein the job scheduling strategy includes time triggering, event triggering, and real-time processing; job dependency control means defining the dependencies between jobs according to the actual business logic; job priority configuration means defining job priorities according to the actual business logic and system resource usage; job scheduling control means setting a resource warning threshold for job scheduling and pausing low-priority jobs when resource usage exceeds the threshold;

a job execution sub-step: responsible for executing ETL jobs;

in the job execution sub-step, Sqoop starts a map-only MapReduce job and reads data row by row according to the data split value; Kettle creates a transformation (Transformation) and a job (Job) and, after the task parameters of each step are set, starts the workflow to extract data; Flume collects log data through its source component, buffers it in the channel component, and sends it to the destination through the sink component; Kafka collects streaming data, which is then decomposed into a series of batch jobs and processed in real time through the resilient distributed datasets in Spark;

a distributed cache sub-step: buffering the extracted data, wherein HDFS is responsible for storing the underlying data, Hive is responsible for filtering, summarizing, querying, and analyzing the data, and HBase is responsible for maintaining data changes and for storing the data that is read frequently during data conversion computation;

a data processing sub-step: cleaning and converting the data according to the defined business rules, wherein data cleansing performs gap filling, correction, and cleaning of the data, and data conversion performs inconsistency conversion, data granularity conversion, and standard conversion of the data.

4. The massive multi-source ETL processing method supporting interface adaptation according to claim 1, characterized in that the data loading step comprises:

a data quality check sub-step: performing a quality check on the converted data objects, verifying data anomalies caused by network interruptions, and checking whether the quality of the converted data meets the standard of the target database;

a data update and loading sub-step: loading the data that passes the check into the target database and, according to the predefined data model, updating the target data tables by means of timestamps, log tables, full-table comparison, or full-table deletion or insertion.

5. The massive multi-source ETL processing method supporting interface adaptation according to claim 1, characterized in that the monitoring management step comprises:

an ETL job execution process monitoring sub-step: monitoring information including job execution time, job progress, timeouts, job interruptions, and job backlog; the job execution time is monitored, timeout reminders are set, and job timeout problems are analyzed by manual judgment; the job execution log information is monitored, and when a job is interrupted, job execution is re-triggered according to the defined interruption recovery mechanism; job progress is monitored, and when jobs back up, they are queued by job priority so that higher-priority jobs are executed first;

an ETL job resource monitoring sub-step: monitoring the usage of job resources; if the resource load exceeds the threshold, the load is adjusted by pausing or stopping some low-priority jobs, which are executed again after the load falls below the threshold;

a system monitoring management sub-step: monitoring machine hardware information and cluster running state information, and managing metadata and the database interface, log file interface, or streaming data interface.

6. A massive multi-source ETL processing system supporting interface adaptation, characterized by comprising:

a data extraction module, which sets the basic information of the data source and the target database, adaptively matches the corresponding ETL tool to different data sources, and sets the parameters of the ETL tool; and which extracts from the different data sources through a database interface, a log file interface, or a streaming data interface;

a data conversion module, which performs ETL job execution and scheduling based on the MapReduce and Spark computing frameworks, buffers and manages the extracted data based on HDFS, Hive, or HBase, and cleans and converts the data;

a data loading module, which performs a quality check on the converted data objects, outputs them according to the table structure defined by the data model, and updates and loads the data that passes the check into the target database;

a monitoring management module, which monitors and manages the ETL job execution process, job resource usage, and system operation.

7. The massive multi-source ETL processing system supporting interface adaptation according to claim 6, characterized in that

the data extraction module comprises:

a data interface, including a database interface, a log file interface, and a streaming data interface, wherein database data is extracted from relational or non-relational databases through the database interface; log files are extracted through the log file interface; and streaming data is extracted through the streaming data interface;

an adaptation rule engine, used to set the basic information of the data source and the target database, including: database type, connection type between the data source and the target database, database IP, database name, port, user name, and password; the adaptation rules include decision rules for the data source and target database types and adaptation rules for the different ETL tools;

an ETL tool engine, which integrates and manages ETL tools, including Sqoop, Flume, Kettle, and Kafka, and adaptively matches a suitable ETL tool to heterogeneous data, namely database data, log files, and streaming data, from different data sources;

in the ETL tool engine, the ETL adaptation engine selects an ETL tool for each data source according to the decision rules for the data source and the target database and the adaptation rules for the different ETL tools, as defined in the adaptation rule engine of the interface adapter.

8. The massive multi-source ETL processing system supporting interface adaptation according to claim 6, characterized in that

the data conversion module comprises:

a job scheduling management engine, which uses the master node of the distributed cluster as the scheduling engine and includes a job management unit and a task scheduling unit;

the job management unit covers ETL job design, job configuration, and job monitoring, wherein ETL job design means designing the job control flow, namely job dependencies, whether extraction is incremental, and extraction frequency, according to the actual business logic; job configuration means configuring the job priority and job execution mode parameters; job monitoring means monitoring and managing the execution state and resource usage of ETL jobs;

the task scheduling unit covers job scheduling strategy and job scheduling control, wherein the job scheduling strategy includes time triggering, event triggering, and real-time processing; job scheduling control means setting a resource warning threshold for job scheduling and pausing low-priority jobs when resource usage exceeds the threshold;

a job execution engine, responsible for executing ETL jobs, which uses the slave nodes of the distributed cluster as the execution engine, implements offline processing of ETL jobs based on the MapReduce computing framework, and implements real-time processing of ETL jobs based on the Spark computing framework;

in the job execution engine, Sqoop starts a map-only MapReduce job and reads data row by row according to the data split value; Kettle creates a transformation (Transformation) and a job (Job) and, after the task parameters of each step are set, starts the workflow to extract data; Flume collects log data through its source component, buffers it in the channel component, and sends it to the destination through the sink component; Kafka collects streaming data, which is then decomposed into a series of batch jobs and processed in real time through the resilient distributed datasets in Spark;

a distributed cache submodule, which buffers and manages the extracted data, wherein HDFS is responsible for storing the underlying data; Hive is responsible for filtering, summarizing, querying, or analyzing the data; and HBase is responsible for maintaining data changes and for storing the data that is read frequently during data conversion computation;

a business rule engine, which defines cleaning rules for incomplete data, erroneous data, and dirty data according to the actual business rules, and defines conversion rules for heterogeneous data from different business systems;

a data processing submodule, including a data cleansing unit and a data conversion unit, which cleans and converts the data according to the actual business rules;

the data cleansing unit performs gap filling, correction, and cleaning of the data, wherein gap filling completes the missing and mismatched information in incomplete data; data correction amends erroneous data according to the specific business; and data cleaning designs cleaning rules for dirty data and determines its correctness;

the data conversion unit converts the massive heterogeneous data from different business systems into the data required by the target database, wherein inconsistency conversion unifies data of the same type from different business systems; data granularity conversion aggregates the extracted business system data according to the granularity of the target database; and data standard conversion converts the extracted data into the standard data required by the target database according to the pre-established standard data model.

9. The massive multi-source ETL processing system supporting interface adaptation according to claim 6, characterized in that

the data loading module comprises:

a quality check submodule, which performs a quality check on the converted data objects, verifies data anomalies caused by network interruptions, and checks whether the quality of the converted data meets the standard of the target database;

a data update and loading submodule, which loads the data that passes the check into the target database and, according to the predefined data model, updates the target data tables by means of timestamps, log tables, full-table comparison, or full-table deletion or insertion.

10. The massive multi-source ETL processing system supporting interface adaptation according to claim 6, characterized in that

the monitoring management module comprises:

a job monitoring submodule, which monitors the execution process of ETL jobs, including job execution time, timeouts, job interruptions, job backlog, job progress, and job resource usage;

a load balancing submodule, which assesses the resource load according to ETL job execution status and, when the load exceeds the threshold, adjusts the load by pausing or stopping some low-priority jobs, which are executed again after the load falls below the threshold;

a metadata management submodule, wherein metadata, as data describing data attribute information, stores the data source definitions, conversion rule definitions, and execution process definitions, and the submodule manages the metadata involved in ETL jobs;

an interface management submodule, which performs interface definition, protocol adaptation, and data encapsulation for the database interface, log file interface, and streaming data interface, and supports the http, ftp, rest, and webservice interface protocols;

an operation monitoring submodule, which monitors and collects machine hardware information, resource information, load information, cluster component states, and cluster running information, and perceives and analyzes the operating condition of the cluster in real time.