CN105701117B

CN105701117B - ETL dispatching method and device

Info

Publication number: CN105701117B
Application number: CN201410707712.7A
Authority: CN
Inventors: 周斌彦
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Technologies Co Ltd
Priority date: 2014-11-27
Filing date: 2014-11-27
Publication date: 2019-06-21
Anticipated expiration: 2034-11-27
Also published as: CN105701117A

Abstract

Embodiments of the present invention provide an ETL dispatching method and device. The method comprises: first, determining the first data warehouse corresponding to the task execution rule of each stage, the first data warehouse being either the source data warehouse or the destination data warehouse among the data warehouses of that stage; second, establishing a task duplication table according to the logical relation between the source data warehouse and the destination data warehouse and according to the first data warehouse, and establishing a task distribution table according to the distribution mode used by the server corresponding to the second data warehouse; and finally, scheduling the task of each stage according to the task duplication table and the task distribution table. Because the stages of the system no longer need multiple independent ETL devices, a single ETL device schedules the task of every stage by means of the two tables, which improves the efficiency of managing the ETL device and reduces maintenance complexity.

Description

ETL dispatching method and device

Technical field

Embodiments of the present invention relate to communication technologies, and in particular to an extract-transform-load (Extract-Transform-Load, ETL) dispatching method and device.

Background

With the development of big data technology, distributed data storage systems are becoming increasingly common, and big data applications generally need to build data warehouses for different applications on top of multiple different data storage systems. ETL describes the process of extracting data from a source data warehouse, transforming it, and loading it into a destination data warehouse. An ETL device, also called an ETL tool, is usually responsible for the scheduling control of the programs the system runs and for the allocation of resources.

The servers corresponding to these data warehouses are generally deployed in a distributed manner, but the deployment modes differ. The main deployment modes at present are the shared-nothing (Shared Nothing) architecture and the shared-disk (Shared Disk) architecture. In the shared-nothing architecture, each node (server) corresponding to a data warehouse has its own central processing unit (Central Processing Unit, CPU), memory and disk resources, and data is distributed across the nodes according to a rule. In the shared-disk architecture, each node corresponding to a data warehouse has its own CPU and memory, but the nodes share disk space and data is stored in a unified manner. In the prior art, a massively parallel processing (Massively Parallel Processing, MPP) system includes multiple data warehouses. Because the servers corresponding to the data warehouses are deployed in different modes, each stage is given its own ETL device to distribute and schedule tasks.

However, the prior art suffers from inefficient management of these scattered ETL devices and from complex maintenance.

Summary of the invention

The present invention provides an ETL dispatching method and device, so as to improve the efficiency of managing the ETL device and reduce maintenance complexity.

In a first aspect, an embodiment of the present invention provides an ETL dispatching method, comprising: determining the first data warehouse corresponding to the task execution rule of each stage, the first data warehouse being the source data warehouse or the destination data warehouse among the data warehouses of each stage; establishing a task duplication table according to the logical relation between the source data warehouse and the destination data warehouse and according to the first data warehouse, the task duplication table comprising an entry for the source data warehouse and an entry for the destination data warehouse; establishing a task distribution table according to the distribution mode used by the server corresponding to a second data warehouse, the second data warehouse being the source data warehouse or the destination data warehouse among the data warehouses of each stage, and the task distribution table comprising the distribution mode used by the server corresponding to each second data warehouse; and scheduling the task of each stage according to the task duplication table and the task distribution table.

With reference to the first aspect, in a first possible implementation of the first aspect, the task duplication table further comprises a first parameter and a second parameter; the first parameter indicates that the first data warehouse is the source data warehouse of the stage; the second parameter indicates that the first data warehouse is the destination data warehouse of the stage.

With reference to the first possible implementation of the first aspect, in a second possible implementation of the first aspect, establishing the task duplication table according to the logical relation between the source data warehouse and the destination data warehouse and according to the first data warehouse specifically comprises: determining the entry for the source data warehouse and the entry for the destination data warehouse according to the logical relation between the source data warehouse and the destination data warehouse; determining the first parameter and the second parameter according to the first data warehouse; and establishing the task duplication table according to the entry for the source data warehouse, the entry for the destination data warehouse, the first parameter and the second parameter.

With reference to the first aspect, or the first or second possible implementation of the first aspect, in a third possible implementation of the first aspect, the distribution mode comprises a shared-nothing distribution mode and a shared-disk distribution mode.

With reference to the third possible implementation of the first aspect, in a fourth possible implementation of the first aspect, scheduling the task of each stage according to the task duplication table and the task distribution table specifically comprises: scheduling the task of each stage between the source data warehouse and the destination data warehouse of that stage according to the determined distribution mode.

In a second aspect, an embodiment of the present invention provides an ETL dispatching device, comprising: a determining module, configured to determine the first data warehouse corresponding to the task execution rule of each stage, the first data warehouse being the source data warehouse or the destination data warehouse among the data warehouses of each stage; an establishing module, configured to establish a task duplication table according to the logical relation between the source data warehouse and the destination data warehouse and according to the first data warehouse, the task duplication table comprising an entry for the source data warehouse and an entry for the destination data warehouse; the establishing module being further configured to establish a task distribution table according to the distribution mode used by the server corresponding to a second data warehouse, the second data warehouse being the source data warehouse or the destination data warehouse among the data warehouses of each stage, and the task distribution table comprising the distribution mode used by the server corresponding to each second data warehouse; and a scheduling module, configured to schedule the task of each stage according to the task duplication table and the task distribution table.

With reference to the second aspect, in a first possible implementation of the second aspect, the task duplication table further comprises a first parameter and a second parameter; the first parameter indicates that the first data warehouse is the source data warehouse of the stage; the second parameter indicates that the first data warehouse is the destination data warehouse of the stage.

With reference to the first possible implementation of the second aspect, in a second possible implementation of the second aspect, the establishing module is specifically configured to: determine the entry for the source data warehouse and the entry for the destination data warehouse according to the logical relation between the source data warehouse and the destination data warehouse; determine the first parameter and the second parameter according to the first data warehouse; and establish the task duplication table according to the entry for the source data warehouse, the entry for the destination data warehouse, the first parameter and the second parameter.

With reference to the second aspect, or the first or second possible implementation of the second aspect, in a third possible implementation of the second aspect, the distribution mode comprises a shared-nothing distribution mode and a shared-disk distribution mode.

With reference to the third possible implementation of the second aspect, in a fourth possible implementation of the second aspect, the scheduling module is specifically configured to: schedule the task of each stage between the source data warehouse and the destination data warehouse of that stage according to the determined distribution mode.

Embodiments of the present invention provide an ETL dispatching method and device. The method comprises: determining the first data warehouse corresponding to the task execution rule of each stage, the first data warehouse being the source data warehouse or the destination data warehouse among the data warehouses of each stage; establishing a task duplication table according to the logical relation between the source data warehouse and the destination data warehouse and according to the first data warehouse, the task duplication table comprising an entry for the source data warehouse and an entry for the destination data warehouse; establishing a task distribution table according to the distribution mode used by the server corresponding to a second data warehouse, the second data warehouse being the source data warehouse or the destination data warehouse among the data warehouses of each stage, and the task distribution table comprising the distribution mode used by the server corresponding to each second data warehouse; and scheduling the task of each stage according to the task duplication table and the task distribution table. Because the stages of the system no longer need multiple independent ETL devices and a single ETL device schedules the task of each stage by means of the task duplication table and the task distribution table, the efficiency of managing the ETL device in the system is improved and maintenance complexity is reduced.

Brief description of the drawings

Fig. 1 is a flowchart of an ETL dispatching method provided by an embodiment of the present invention;

Fig. 2 is a schematic structural diagram of an MPP system provided by an embodiment of the present invention;

Fig. 3 is a schematic structural diagram of an ETL dispatching device provided by an embodiment of the present invention.

Detailed description of embodiments

To make the objectives, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.

Fig. 1 is a flowchart of an ETL dispatching method provided by an embodiment of the present invention. The method is applicable to scenarios that include multiple data warehouses and is executed by an ETL dispatching device, which may be an ETL tool. The method specifically includes the following steps:

S101: Determine the first data warehouse corresponding to the task execution rule of each stage, where the first data warehouse is the source data warehouse or the destination data warehouse among the data warehouses of each stage.

Specifically, a system such as a massively parallel processing (MPP) system usually includes multiple data warehouses, and each stage that the data flow passes through includes a source data warehouse and a destination data warehouse. Each data warehouse has its own task execution rule, which here includes the execution time, the execution manner and so on. The ETL device selects either the source data warehouse or the destination data warehouse as the first data warehouse, and the task of that stage is carried out according to the task execution rule of the first data warehouse. This embodiment does not limit how the ETL device determines the first data warehouse.
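As an illustration of S101, the following minimal sketch models a stage with a source and a destination warehouse, each carrying its own execution rule, and records which of the two serves as the first data warehouse. The class and field names (`ExecutionRule`, `Warehouse`, `Stage`, `first_is_source`) and the sample rule values are assumptions made for illustration. Because the embodiment does not prescribe how the choice is made, the sketch simply takes the choice as an input flag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExecutionRule:
    """Hypothetical task execution rule: when and how a task runs."""
    execution_time: str  # illustrative value, e.g. a daily start time
    execution_mode: str  # illustrative value, e.g. "batch"


@dataclass(frozen=True)
class Warehouse:
    name: str
    rule: ExecutionRule


@dataclass(frozen=True)
class Stage:
    source: Warehouse
    destination: Warehouse
    first_is_source: bool  # how this is decided is outside the sketch

    @property
    def first_warehouse(self) -> Warehouse:
        """The warehouse whose execution rule governs this stage's task."""
        return self.source if self.first_is_source else self.destination


data_source = Warehouse("data source", ExecutionRule("01:00", "batch"))
detail_db = Warehouse("detailed-record database", ExecutionRule("03:00", "batch"))
stage_one = Stage(data_source, detail_db, first_is_source=True)
print(stage_one.first_warehouse.name, stage_one.first_warehouse.rule.execution_time)
```

Keeping the selection as plain data means that the later table-building steps only need to read `first_warehouse`, whatever policy produced it.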

S102: Establish a task duplication table according to the logical relation between the source data warehouse and the destination data warehouse and according to the first data warehouse.

The task duplication table includes an entry for the source data warehouse and an entry for the destination data warehouse. The task duplication table further includes a first parameter and a second parameter: the first parameter indicates that the first data warehouse is the source data warehouse of the stage, and the second parameter indicates that the first data warehouse is the destination data warehouse of the stage.

Optionally, establishing the task duplication table according to the logical relation between the data warehouses in each stage and according to the first data warehouse specifically includes: determining the entry for the source data warehouse and the entry for the destination data warehouse according to the logical relation between the source data warehouse and the destination data warehouse; determining the first parameter and the second parameter according to the first data warehouse; and establishing the task duplication table according to the entry for the source data warehouse, the entry for the destination data warehouse, the first parameter and the second parameter.

For example, Fig. 2 is a schematic structural diagram of an MPP system provided by an embodiment of the present invention. Assume that the MPP system includes the following data warehouses: a data source (Data Source) 201, a detailed-record database 202, an analysis database (Analysis Database) 203 and a user feature database 204, which correspond to a file server, a Hive server, a Sybase IQ server and an RTANA server respectively. There are three file servers, three Sybase IQ servers and three RTANA servers. The Hive cluster relies on a Hadoop cluster for distributed internal scheduling and exposes a single unified entry point, so it can be regarded as a single Hive server from the point of view of the ETL device. As shown in Fig. 2, the logical relation between the data warehouses in each stage is as follows: in the first stage, the source data warehouse is the data source 201 and the destination data warehouse is the detailed-record database 202; in the second stage, the source data warehouse is the detailed-record database 202 and the destination data warehouse is the analysis database 203; in the third stage, the source data warehouse is the analysis database 203 and the destination data warehouse is the user feature database 204.

The task duplication table includes a source data warehouse entry and a destination data warehouse entry. The task duplication table further includes a first parameter and a second parameter: the first parameter indicates that the first data warehouse is the source data warehouse of the stage, and the second parameter indicates that the first data warehouse is the destination data warehouse of the stage. Assume that the first parameter is 1 and the second parameter is 2. For example, if step S101 determines that the first data warehouse corresponding to the task execution rule of the first stage is the data source, that is, the task is executed according to the execution rule of the data source, then the cell in the first row and first column of the table is set to the first parameter 1. Likewise, the cell in the second row and second column is the first parameter 1, and the cell in the third row and third column is the second parameter 2. The task duplication table can be established by following this rule.
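The construction rule above can be sketched in a few lines. This is a simplified illustration rather than the layout used in the embodiment: it stores one row per stage with the source entry, the destination entry and a flag, where the flag values 1 and 2 follow the assumed first and second parameters. The function name `build_duplication_table`, the dictionary keys and the choice of first data warehouse for each stage are assumptions.

```python
SOURCE_FLAG = 1       # first parameter: the first data warehouse is the stage's source
DESTINATION_FLAG = 2  # second parameter: the first data warehouse is the stage's destination


def build_duplication_table(stages):
    """Build one row per stage from (source, destination, first_warehouse) triples."""
    table = []
    for source, destination, first in stages:
        if first == source:
            flag = SOURCE_FLAG
        elif first == destination:
            flag = DESTINATION_FLAG
        else:
            raise ValueError(f"{first!r} is neither the source nor the destination")
        table.append({"source": source, "destination": destination, "flag": flag})
    return table


# Assumed choice of first data warehouse per stage, for illustration only.
stages = [
    ("data source", "detailed-record database", "data source"),
    ("detailed-record database", "analysis database", "detailed-record database"),
    ("analysis database", "user feature database", "user feature database"),
]
duplication_table = build_duplication_table(stages)
for row in duplication_table:
    print(row)
```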

The task duplication table provided in this embodiment is as follows:

S103: Establish a task distribution table according to the distribution mode used by the server corresponding to the second data warehouse.

Specifically, the second data warehouse is the source data warehouse or the destination data warehouse among the data warehouses of each stage, and the task distribution table includes the distribution mode used by the server corresponding to each second data warehouse. The distribution mode includes a shared-nothing distribution mode and a shared-disk distribution mode. In the present invention, the deployment mode between servers may also be, for example, an active/standby mode; the distribution mode is not limited to those listed. The task distribution table includes the server corresponding to the second data warehouse of each stage. Assume that 3 denotes the shared-nothing distribution mode and 4 denotes the shared-disk distribution mode. In this embodiment the second data warehouse happens to be the destination data warehouse, so the first row of the task distribution table lists, from left to right, the file server, the Sybase IQ server and the RTANA server, which use the shared-disk, shared-disk and shared-nothing distribution modes respectively.

The task distribution table provided in this embodiment is as follows:

Hive server	Sybase IQ server	RTANA server
			3
	3
				4
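One way to hold such a table in code is a mapping from server type to distribution mode. The sketch below is an assumption-laden illustration: it uses symbolic mode names instead of numeric codes, takes the three server types from the table header, and assigns the modes in the order the description lists them (shared-disk, shared-disk, shared-nothing). The names `DistributionMode` and `build_distribution_table` are hypothetical.

```python
from enum import Enum


class DistributionMode(Enum):
    SHARED_NOTHING = "shared-nothing"
    SHARED_DISK = "shared-disk"


def build_distribution_table(second_warehouse_servers):
    """Record the distribution mode of each second data warehouse's server type."""
    table = {}
    for server, mode in second_warehouse_servers:
        # Enum lookup by value raises ValueError for a mode that is not modelled here.
        table[server] = DistributionMode(mode)
    return table


distribution_table = build_distribution_table([
    ("Hive server", "shared-disk"),
    ("Sybase IQ server", "shared-disk"),
    ("RTANA server", "shared-nothing"),
])
for server, mode in distribution_table.items():
    print(f"{server}: {mode.value}")
```

Further deployment modes, such as active/standby, could be modelled by adding members to the enumeration.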

S104: Schedule the task of each stage according to the task duplication table and the task distribution table.

Optionally, scheduling the task of each stage according to the task duplication table and the task distribution table specifically includes: scheduling the task of each stage between the source data warehouse and the destination data warehouse of that stage according to the determined distribution mode.

Continuing the above example, assume that the tasks of the three stages are as follows:

Task one: download the original detailed records from the file server and load the data directly into the detailed-record database on the Hive server.

Task two: export the original detailed records from the detailed-record database on the Hive server and, after cleaning and aggregation, load the data into the Sybase IQ server.

Task three: export user attributes from the Sybase IQ server, load them into the RTANA server, and compute user features on the RTANA server.

From the tasks of the three stages, the logical relation between the data warehouses is determined as follows: in the first stage, the source data warehouse is the data source and the destination data warehouse is the detailed-record database; in the second stage, the source data warehouse is the detailed-record database and the destination data warehouse is the analysis database; in the third stage, the source data warehouse is the analysis database and the destination data warehouse is the user feature database. Finally, the task duplication table is established according to the logical relation between the data warehouses of each stage and according to the first data warehouse.

Because the first row of the established task distribution table lists, from left to right, the Hive server corresponding to the data source, the Sybase IQ server and the RTANA server, which use the shared-disk, shared-disk and shared-nothing distribution modes respectively, the specific scheduling steps for the three tasks are as follows:

1. According to the task duplication table, task one is replicated into three copies, matching the number of file servers; then, according to the task distribution table, all of the copies are executed on the Hive server.

2. According to the task duplication table, task two is replicated into one copy, matching the number of Hive servers; then, according to the task distribution table, this copy can be scheduled by any idle Sybase IQ server.

3. According to the task duplication table, task three is replicated into three copies, matching the number of RTANA servers; then, according to the task distribution table, each RTANA server schedules one copy of task three.
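The three scheduling steps can be sketched as a replicate-then-place loop. This is an interpretation under stated assumptions, not the embodiment's implementation: each task is copied once per server of the type named in `replicate_by`, copies bound for a shared-nothing target are pinned to distinct nodes, and copies bound for a shared-disk target are left to any idle node. The function name `schedule`, the dictionary keys and the placement strings are hypothetical; the server counts and modes are those given in the example.

```python
SHARED_NOTHING = "shared-nothing"
SHARED_DISK = "shared-disk"


def schedule(tasks, server_counts, distribution):
    """Replicate each stage's task, then place the copies on target servers."""
    assignments = []
    for task in tasks:
        copies = server_counts[task["replicate_by"]]
        target = task["target"]
        for copy_index in range(copies):
            if distribution[target] == SHARED_NOTHING:
                # Nodes hold separate data, so pin each copy to its own node.
                node_number = copy_index % server_counts[target] + 1
                placement = f"{target} #{node_number}"
            else:
                # Nodes share storage, so any idle node of this type can run the copy.
                placement = f"{target} (any idle node)"
            assignments.append((task["name"], copy_index + 1, placement))
    return assignments


server_counts = {"file server": 3, "Hive server": 1,
                 "Sybase IQ server": 3, "RTANA server": 3}
distribution = {"Hive server": SHARED_DISK,
                "Sybase IQ server": SHARED_DISK,
                "RTANA server": SHARED_NOTHING}
tasks = [
    {"name": "task one", "replicate_by": "file server", "target": "Hive server"},
    {"name": "task two", "replicate_by": "Hive server", "target": "Sybase IQ server"},
    {"name": "task three", "replicate_by": "RTANA server", "target": "RTANA server"},
]
plan = schedule(tasks, server_counts, distribution)
for entry in plan:
    print(entry)
```

With these inputs the plan contains three copies of task one on the Hive server, one copy of task two for any idle Sybase IQ server, and one copy of task three per RTANA server.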

This embodiment provides an ETL dispatching method, comprising: first, determining the first data warehouse corresponding to the task execution rule of each stage, where the first data warehouse is the source data warehouse or the destination data warehouse among the data warehouses of each stage; second, establishing a task duplication table according to the logical relation between the source data warehouse and the destination data warehouse and according to the first data warehouse, and establishing a task distribution table according to the distribution mode used by the server corresponding to the second data warehouse; and finally, scheduling the task of each stage according to the task duplication table and the task distribution table. Because the stages of the MPP system no longer need multiple independent ETL devices and a single ETL device schedules the task of each stage by means of the task duplication table and the task distribution table, the efficiency of managing the ETL device in the MPP system is improved and maintenance complexity is reduced.

Fig. 3 is a schematic structural diagram of an ETL dispatching device provided by an embodiment of the present invention. The device comprises: a determining module 301, configured to determine the first data warehouse corresponding to the task execution rule of each stage, the first data warehouse being the source data warehouse or the destination data warehouse among the data warehouses of each stage; an establishing module 302, configured to establish a task duplication table according to the logical relation between the source data warehouse and the destination data warehouse and according to the first data warehouse, the task duplication table comprising an entry for the source data warehouse and an entry for the destination data warehouse; the establishing module 302 being further configured to establish a task distribution table according to the distribution mode used by the server corresponding to a second data warehouse, the second data warehouse being the source data warehouse or the destination data warehouse among the data warehouses of each stage, and the task distribution table comprising the distribution mode used by the server corresponding to each second data warehouse; and a scheduling module 303, configured to schedule the task of each stage according to the task duplication table and the task distribution table.

Further, the task duplication table also comprises a first parameter and a second parameter; the first parameter indicates that the first data warehouse is the source data warehouse of the stage; the second parameter indicates that the first data warehouse is the destination data warehouse of the stage.

Optionally, the establishing module 302 is specifically configured to: determine the entry for the source data warehouse and the entry for the destination data warehouse according to the logical relation between the source data warehouse and the destination data warehouse; determine the first parameter and the second parameter according to the first data warehouse; and establish the task duplication table according to the entry for the source data warehouse, the entry for the destination data warehouse, the first parameter and the second parameter.

Optionally, the distribution mode comprises a shared-nothing distribution mode and a shared-disk distribution mode.

Optionally, the scheduling module 303 is specifically configured to: schedule the task of each stage between the source data warehouse and the destination data warehouse of that stage according to the determined distribution mode.

The ETL dispatching device provided in this embodiment can be used to execute the technical solution of the ETL dispatching method corresponding to Fig. 1. Its implementation principle and technical effect are similar and are not described again here.
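The module structure of Fig. 3 can be pictured as a small class that wires three callables together. This is only a structural sketch under assumptions: the class name `EtlDispatchingDevice`, the `run` method and the stub callables are hypothetical, and the stubs stand in for the real determining, establishing and scheduling logic.

```python
class EtlDispatchingDevice:
    """Hypothetical wrapper mirroring modules 301, 302 and 303."""

    def __init__(self, determine, establish, schedule):
        self.determine = determine  # determining module 301
        self.establish = establish  # establishing module 302 (builds both tables)
        self.schedule = schedule    # scheduling module 303

    def run(self, stages):
        first_warehouses = self.determine(stages)
        duplication_table, distribution_table = self.establish(stages, first_warehouses)
        return self.schedule(duplication_table, distribution_table)


# Stub modules used only to show the order in which the modules are invoked.
device = EtlDispatchingDevice(
    determine=lambda stages: {name: "source" for name in stages},
    establish=lambda stages, firsts: (firsts, {name: "shared-disk" for name in stages}),
    schedule=lambda dup, dist: [(name, dup[name], dist[name]) for name in dup],
)
result = device.run(["stage one", "stage two"])
print(result)
```

Passing the modules in as callables keeps one device instance responsible for every stage, which is the single-device arrangement the embodiment argues for.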

Finally, it should be noted that the above embodiments are merely intended to illustrate the technical solutions of the present invention rather than to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. An extract-transform-load (ETL) dispatching method, characterized by comprising:

determining the first data warehouse corresponding to the task execution rule of each stage, the first data warehouse being the source data warehouse or the destination data warehouse among the data warehouses of each stage;

establishing a task duplication table according to the logical relation between the source data warehouse and the destination data warehouse and according to the first data warehouse, the task duplication table comprising an entry for the source data warehouse and an entry for the destination data warehouse, and the task duplication table further comprising the first parameter or the second parameter of each stage, wherein the first parameter indicates that the first data warehouse is the source data warehouse of the stage and the second parameter indicates that the first data warehouse is the destination data warehouse of the stage;

establishing a task distribution table according to the distribution mode used by the server corresponding to a second data warehouse, the second data warehouse being the source data warehouse or the destination data warehouse among the data warehouses of each stage, and the task distribution table comprising the distribution mode used by the server corresponding to each second data warehouse; and

scheduling the task of each stage according to the task duplication table and the task distribution table.

2. The method according to claim 1, characterized in that establishing the task duplication table according to the logical relation between the source data warehouse and the destination data warehouse and according to the first data warehouse specifically comprises:

determining the entry for the source data warehouse and the entry for the destination data warehouse according to the logical relation between the source data warehouse and the destination data warehouse;

determining the first parameter and the second parameter according to the first data warehouse; and

establishing the task duplication table according to the entry for the source data warehouse, the entry for the destination data warehouse, the first parameter and the second parameter.

3. The method according to claim 1 or 2, characterized in that:

the distribution mode comprises a shared-nothing distribution mode and a shared-disk distribution mode.

4. The method according to claim 3, characterized in that scheduling the task of each stage according to the task duplication table and the task distribution table specifically comprises:

scheduling the task of each stage between the source data warehouse and the destination data warehouse of that stage according to the determined distribution mode.

5. An ETL dispatching device, characterized by comprising:

a determining module, configured to determine the first data warehouse corresponding to the task execution rule of each stage, the first data warehouse being the source data warehouse or the destination data warehouse among the data warehouses of each stage;

an establishing module, configured to establish a task duplication table according to the logical relation between the source data warehouse and the destination data warehouse and according to the first data warehouse, the task duplication table comprising an entry for the source data warehouse and an entry for the destination data warehouse, and the task duplication table further comprising the first parameter or the second parameter of each stage, wherein the first parameter indicates that the first data warehouse is the source data warehouse of the stage and the second parameter indicates that the first data warehouse is the destination data warehouse of the stage;

the establishing module being further configured to establish a task distribution table according to the distribution mode used by the server corresponding to a second data warehouse, the second data warehouse being the source data warehouse or the destination data warehouse among the data warehouses of each stage, and the task distribution table comprising the distribution mode used by the server corresponding to each second data warehouse; and

a scheduling module, configured to schedule the task of each stage according to the task duplication table and the task distribution table.

6. The device according to claim 5, characterized in that the establishing module is specifically configured to:

7. The device according to claim 5 or 6, characterized by further comprising:

8. The device according to claim 7, characterized in that the scheduling module is specifically configured to: