CN101567013B - Method and apparatus for implementing ETL scheduling - Google Patents

Publication number
CN101567013B
Authority
CN
China
Prior art keywords
program
flow
subtask
flows
task
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN2009102032769A
Other languages
Chinese (zh)
Other versions
CN101567013A (en)
Inventor
蒋杰
陈荣松
蒋萃林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Advanced New Technologies Co Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN2009102032769A priority Critical patent/CN101567013B/en
Publication of CN101567013A publication Critical patent/CN101567013A/en
Priority to HK10104106.1A priority patent/HK1137244A1/en
Application granted granted Critical
Publication of CN101567013B publication Critical patent/CN101567013B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Stored Programmes (AREA)

Abstract

The invention discloses a method for implementing ETL scheduling, comprising: when executing the plural task flows included in the ETL scheduling, determining, for any one task flow and based on a preset configuration file, the triggering mode, execution sequence and mutual dependency relationship of each subtask flow included in the task flow; and triggering the corresponding subtask flows in turn according to the set triggering mode and executing the triggered subtask flows in the set sequence, wherein, after it is determined that at least one subtask flow has finished executing, the other subtask flows that depend on that subtask flow and have been triggered start to execute, based on the dependency relationship between subtask flows. The subtask flows in each task flow thus have clear service logic and clear service functions, which effectively improves the execution efficiency of the ETL scheduling flow. The invention also discloses an apparatus for implementing ETL scheduling.

Description

ETL scheduling implementation method and device
Technical Field
The present disclosure relates to the field of computers, and in particular, to a method and an apparatus for controlling a process.
Background
A Data Warehouse (DW) is a subject-oriented, integrated, relatively stable collection of data that reflects historical changes and is used to support management decisions. The data warehouse is an independent data environment, and data Extraction, Transformation and Loading (ETL) is an important link in constructing the data warehouse.
The ETL is used to extract data (e.g., relational data, flat data files, etc.) in distributed and heterogeneous data sources to a temporary intermediate layer, and then perform cleaning, conversion, and integration, and finally load the data into a data warehouse according to a predefined data warehouse model, so that the constructed data warehouse becomes the basis of online analysis processing and data mining. Technically, ETL mainly involves several aspects such as association, conversion, increment, scheduling and monitoring. Generally, the data in the data warehouse does not require real-time synchronization with the data in the online transaction processing system, so the ETL can be performed in a timing manner, but the operation time, sequence and success or failure of a plurality of ETLs have a crucial influence on the validity of the data in the data warehouse, thereby directly influencing the quality of the online analysis processing result and the data mining result.
Currently, the ETL performed when building a data warehouse mostly uses a Java Business Process Management (JBPM) engine, based on the Java programming language, to control the flow of program code. The JBPM engine is a J2EE-based lightweight workflow management system that serves as an enterprise-level flow engine and is extended externally with modules such as identity components, compatible data packets, and task management. When the JBPM engine runs, tokens (namely authentication authorities such as user names, passwords and the like) are passed sequentially among nodes as the medium: whichever node a token arrives at, the program mapped to that node starts to execute, and the token is passed onward after the program finishes. A token can be cloned into two during transmission and passed to two nodes with independent task programs, and the two nodes can then execute their task programs in parallel; if the task program executed by a node depends on the execution results of both task programs, that node must collect the two tokens indicating that the two task programs have finished before it can start executing its own task program.
The JBPM engine is not a process engine designed for the process scheduling of ETL, so the process scheduling management of ETL using the JBPM engine will reduce the execution efficiency of ETL, for example:
The existing JBPM engine is an open flow engine that is applied not only in the ETL field but also in office automation (OA), customer relationship management (CRM), and other systems. Many functions unrelated to ETL flow scheduling management are therefore built into the JBPM engine, such as swimlanes, security authentication management, and message services. Moreover, the JBPM engine kernel passes tokens sequentially and cannot meet practical requirements of ETL flow scheduling such as 'task rollback' and 'task jump forward'.
On the other hand, the process description language (JPDL) adopted by the JBPM engine cannot describe the dependency relationship between parent and child processes; a programmer can only implement a similar function by extending the node class in a Java program, so the child process has to be created repeatedly by the extension class while the task program executes, which seriously reduces the execution efficiency of the parent process. Meanwhile, the JBPM process designer does not provide visual management of the whole ETL scheduling process, which increases the technical difficulty of process design and later maintenance.
In the embodiment of the application, when a plurality of task flows included in ETL scheduling are executed, for any one task flow, a triggering mode, an execution sequence and a mutual dependency relationship of each subtask flow included in the task flow are determined according to a preset configuration file; and sequentially triggering the corresponding subtask flows according to a set triggering mode, and executing the triggered subtask flows according to a set sequence, wherein when at least one subtask flow is determined to be executed completely, other subtask flows which depend on the at least one subtask flow and are triggered are started to be executed according to the dependency relationship among the subtask flows. Therefore, the business logic between the subtask flows in each task flow is clear, and the business function is clear, so that the execution efficiency of the ETL scheduling flow is effectively improved.
Disclosure of Invention
The embodiment of the application provides a flow scheduling method and device of an ETL (extract-transform-load) for improving the execution efficiency of an ETL scheduling flow.
The embodiment of the application provides the following specific technical scheme:
an implementation method of ETL scheduling, wherein the ETL scheduling comprises a plurality of task flows, each task flow comprises a plurality of subtask flows, and, for one task flow, the method comprises the following steps:
determining a triggering mode, an execution sequence and a mutual dependency relationship of each subtask flow contained in a task flow according to a preset configuration file;
and sequentially triggering the corresponding subtask flows according to a set triggering mode, and executing the triggered subtask flows according to a set sequence, wherein when at least one subtask flow is determined to be executed completely, other subtask flows which depend on the at least one subtask flow and are triggered are started to be executed according to the dependency relationship among the subtask flows.
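The first step above can be illustrated with a minimal sketch. The patent does not specify the format of the configuration file, so the JSON layout, the key names (`trigger`, `depends_on`, and so on) and the function `describe_task_flow` are all assumptions made for illustration; the task names and trigger times follow the task flow 1 example given later in the Detailed Description.

```python
import json
from graphlib import TopologicalSorter

# Hypothetical configuration file content; the real file format is not given in the patent.
CONFIG_TEXT = """
{
  "task_flow_1": {
    "task_1": {"trigger": {"type": "time", "at": "03:00", "every": "day"},
               "depends_on": []},
    "task_2": {"trigger": {"type": "time", "at": "04:00", "every": "month", "day": 13},
               "depends_on": ["task_1"]},
    "task_3": {"trigger": {"type": "time", "at": "04:00", "every": "day"},
               "depends_on": ["task_1"]},
    "task_4": {"trigger": {"type": "time", "at": "05:00", "every": "day"},
               "depends_on": ["task_2", "task_3"]}
  }
}
"""


def describe_task_flow(config_text, flow_name):
    """Return (triggering modes, one valid execution order, dependencies) for a task flow."""
    flow = json.loads(config_text)[flow_name]
    triggers = {name: spec["trigger"] for name, spec in flow.items()}
    dependencies = {name: set(spec["depends_on"]) for name, spec in flow.items()}
    # A topological order places every subtask flow after the ones it depends on.
    order = list(TopologicalSorter(dependencies).static_order())
    return triggers, order, dependencies


triggers, order, dependencies = describe_task_flow(CONFIG_TEXT, "task_flow_1")
print(order[0], order[-1])  # task_1 task_4
```

Deriving the execution order from the dependencies is one possible design; a configuration file could equally state the order explicitly.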
An apparatus for performing ETL scheduling, comprising:
the storage unit is used for storing a configuration file, and the configuration file at least comprises triggering modes of sub task flows in a plurality of task flows belonging to ETL scheduling and a mutual dependency relationship;
a determining unit, configured to determine, according to the configuration file, a triggering manner, an execution order, and a dependency relationship between each subtask flow included in a certain task flow;
and the processing unit is used for sequentially triggering the corresponding subtask flows according to a set triggering mode and executing the triggered subtask flows according to a set sequence, wherein when at least one subtask flow is determined to be executed completely, other subtask flows which depend on the at least one subtask flow and are triggered are started to be executed according to the dependency relationship among the subtask flows.
The embodiment of the invention provides a nested ETL scheduling implementation method, namely when a plurality of task flows contained in ETL scheduling are executed, aiming at any one task flow, determining a triggering mode, an execution sequence and a mutual dependency relationship of each subtask flow contained in the task flow according to a preset configuration file; and sequentially triggering the corresponding subtask flows according to a set triggering mode, and executing the triggered subtask flows according to a set sequence, wherein when at least one subtask flow is determined to be executed completely, other subtask flows which depend on the at least one subtask flow and are triggered are started to be executed according to the dependency relationship among the subtask flows. Therefore, the business logic between the subtask flows in each task flow is clear, and the business function is clear, so that the execution efficiency of the ETL scheduling flow is effectively improved. The invention also discloses a device for executing the ETL scheduling.
Drawings
FIG. 1 is a schematic diagram illustrating a task flow involved in ETL scheduling in an embodiment of the present application;
FIG. 2 is a schematic diagram of a task flow component structure in an embodiment of the present application;
FIG. 3A is a functional structure diagram of an ETL scheduling server in the embodiment of the present application;
FIGS. 3B and 3C are functional block diagrams of the processing unit in the ETL scheduling server according to the embodiment of the present application;
FIG. 4 is a schematic diagram illustrating a setting manner of a task flow 1 in an embodiment of the present application;
FIG. 5 is a flowchart illustrating a first manner of task execution flow of an ETL scheduling server according to an embodiment of the present application;
FIG. 6 is a flowchart illustrating a second manner of task execution by an ETL scheduling server according to an embodiment of the present application;
FIG. 7 is a schematic diagram of the running status log management of the ETL scheduling flow provided by the ETL scheduling server in an embodiment of the present application.
Detailed Description
In order to improve the execution efficiency of an Extraction-Transformation-Loading (ETL) scheduling process when a data warehouse is constructed, in the embodiment of the present invention, for any one task flow in a plurality of task flows included in ETL scheduling, the following operations are performed: determining a triggering mode, an execution sequence and a mutual dependency relationship of each subtask flow contained in a task flow according to a preset configuration file; and sequentially triggering the corresponding subtask flows according to a set triggering mode, and executing the triggered subtask flows according to a set sequence, wherein when at least one subtask flow is determined to be executed completely, other subtask flows which depend on the at least one subtask flow and are triggered are started to be executed according to the dependency relationship among the subtask flows.
In the embodiment of the application, concepts of task flows, subtask flows and program flows are defined in the ETL scheduling model.
Referring to fig. 1, the ETL scheduling work is composed of a plurality of task flows, wherein each task flow runs according to a set dependency relationship, and the task flows may be in series or in parallel.
Referring to fig. 2, a task flow is a basic unit of ETL scheduling and is composed of one or more subtask flows (hereinafter referred to as task blocks). A task block describes the execution purpose of a service, and an independent task block includes one or more program flows (in fig. 2, each task block is shown with only one program flow, as an example). A flow composed of multiple task blocks in a logical dependency relationship is a task flow. For example, referring to fig. 2, four task blocks, task 1, task 2, task 3, and task 4, constitute task flow 1.
As shown in fig. 2, a program flow is composed of one or more independent program blocks linked by logical dependencies, and each program block must ensure the atomicity, isolation, consistency, and durability of its transaction, i.e., each program block is an atomic-level process. A program block can be a shell script, a Java program, an Oracle stored procedure, an SQL block, and the like; the one or more program flows within a task block are the specific implementation of the methods and steps for accomplishing a task. For example, referring to fig. 1, four program blocks, program 11, program 12, program 13, and program 14, constitute a program flow that completes task 1.
In this embodiment, a specific description will be given by taking an example in which one task block includes one program flow.
The beginning of the first block in the program flow contained in the task block indicates that the task block begins execution, and the end of the last block indicates that the task block ends execution. For example, as shown in FIG. 2, the beginning of the program 11 indicates the beginning of the execution of task 1, and the ending of the program 14 indicates the ending of the execution of task 1.
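The four-level nesting described above (task flow, task block, program flow, program block) can be modelled with a few plain data classes. This is only an illustrative sketch: the class and field names are assumptions and do not come from the patent, and dependencies between items are left out to keep the structure visible.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ProgramBlock:
    """Atomic unit of work, e.g. a wrapper around a shell script or an SQL block."""
    name: str
    action: Callable[[], None]


@dataclass
class ProgramFlow:
    name: str
    blocks: List[ProgramBlock] = field(default_factory=list)


@dataclass
class TaskBlock:
    """A subtask flow; holds one or more program flows."""
    name: str
    program_flows: List[ProgramFlow] = field(default_factory=list)


@dataclass
class TaskFlow:
    name: str
    task_blocks: List[TaskBlock] = field(default_factory=list)


def count_program_blocks(flow: TaskFlow) -> int:
    """Count every program block nested anywhere inside a task flow."""
    return sum(len(pf.blocks) for tb in flow.task_blocks for pf in tb.program_flows)


# Task 1 holds one program flow made of programs 11 to 14; tasks 2 to 4 are left empty here.
program_flow_1 = ProgramFlow(
    "program flow of task 1",
    [ProgramBlock(f"program {n}", lambda: None) for n in (11, 12, 13, 14)],
)
task_flow_1 = TaskFlow(
    "task flow 1",
    [TaskBlock("task 1", [program_flow_1]), TaskBlock("task 2"),
     TaskBlock("task 3"), TaskBlock("task 4")],
)
print(count_program_blocks(task_flow_1))  # 4
```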
In this embodiment, the execution of each task block is divided into two stages, triggering and running. A task can be triggered by time or by an event. The triggering stage only generates a task instance; the task instance cannot run until its predecessor tasks have finished running.
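The separation between triggering and running can be sketched as a small state machine. The state names and the `TaskInstance` class below are hypothetical; the sketch only shows that creating an instance (triggering) does not by itself start it, and that running is gated on the predecessor instances having finished.

```python
from enum import Enum


class State(Enum):
    TRIGGERED = "triggered"
    RUNNING = "running"
    DONE = "done"


class TaskInstance:
    def __init__(self, name, predecessors=()):
        self.name = name
        self.predecessors = list(predecessors)
        # The triggering stage only creates the instance; nothing runs yet.
        self.state = State.TRIGGERED

    def try_run(self):
        """Enter the running stage only if every predecessor instance is done."""
        ready = all(p.state is State.DONE for p in self.predecessors)
        if self.state is State.TRIGGERED and ready:
            self.state = State.RUNNING
            return True
        return False

    def finish(self):
        if self.state is not State.RUNNING:
            raise RuntimeError(f"{self.name} is not running")
        self.state = State.DONE


task_1 = TaskInstance("task 1")
task_3 = TaskInstance("task 3", predecessors=[task_1])
print(task_3.try_run())  # False: task 1 has been triggered but has not finished
task_1.try_run()
task_1.finish()
print(task_3.try_run())  # True
```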
The following description takes a time-triggered task block as an example. In this embodiment, each task block in the ETL scheduling flow may trigger itself at a set time point, for example, 8:00 am every day, 12:00 am every Monday, or 13:00 on the first day of every month; it may also trigger itself according to a set cycle period, for example, every 24 hours, every week, or every month.
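The two kinds of time trigger (a set time point and a set cycle period) could be checked as in the following sketch. The function names and parameters are assumptions for illustration; a real scheduler would also need to handle missed ticks and time zones, which are ignored here.

```python
from datetime import datetime, timedelta
from typing import Optional


def due_at_time_point(now: datetime, hour: int, minute: int,
                      weekday: Optional[int] = None,
                      day_of_month: Optional[int] = None) -> bool:
    """True if `now` matches a set time point (weekday: Monday is 0)."""
    if (now.hour, now.minute) != (hour, minute):
        return False
    if weekday is not None and now.weekday() != weekday:
        return False
    if day_of_month is not None and now.day != day_of_month:
        return False
    return True


def due_by_period(now: datetime, last_triggered: Optional[datetime],
                  period: timedelta) -> bool:
    """True if the task was never triggered or a full cycle period has elapsed."""
    return last_triggered is None or now - last_triggered >= period


# A task set to trigger at 04:00 on the 13th of every month.
print(due_at_time_point(datetime(2009, 5, 13, 4, 0), 4, 0, day_of_month=13))  # True
print(due_at_time_point(datetime(2009, 5, 12, 4, 0), 4, 0, day_of_month=13))  # False
# A task set to trigger every 24 hours.
print(due_by_period(datetime(2009, 5, 13, 8, 0), datetime(2009, 5, 12, 8, 0),
                    timedelta(hours=24)))  # True
```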
For example, as shown in fig. 1, task 1 is a predecessor task of tasks 2 and 3 and is set to be executed once per month, while tasks 2 and 3 are set to be executed once per day. Before each execution, tasks 2 and 3 need to determine whether task 1 has already completed in the current month, and they perform their subsequent operations only after determining that it has. The advantage of this design is that it reduces the tight coupling between tasks and task flows: a task can represent a certain link in a task flow and can also be called independently and repeatedly, and, because of the defined cycle periods, the whole task flow can skip tasks that do not need to be executed again, which improves execution performance.
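The check that a daily task makes against its monthly predecessor might look like the sketch below. The function name and the idea of storing the predecessor's last completion date are assumptions; the patent does not say how completion is recorded.

```python
from datetime import date
from typing import Optional


def completed_in_current_month(last_completed: Optional[date], today: date) -> bool:
    """True if the monthly predecessor already finished during the month of `today`."""
    if last_completed is None:
        return False
    return (last_completed.year, last_completed.month) == (today.year, today.month)


# Task 1 (monthly) finished on 1 May; tasks 2 and 3 (daily) may proceed on any later day in May.
print(completed_in_current_month(date(2009, 5, 1), date(2009, 5, 20)))  # True
# On 1 June the May result no longer counts, so tasks 2 and 3 must wait for task 1.
print(completed_in_current_month(date(2009, 5, 1), date(2009, 6, 1)))   # False
```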
Preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
Referring to fig. 3A, in the present embodiment, an ETL scheduling server for executing an ETL scheduling procedure includes a storage unit 300, a determination unit 301, and a processing unit 302, wherein,
a storage unit 300, configured to store a configuration file, where the configuration file at least includes trigger modes and dependency relationships between sub-task flows in a plurality of task flows belonging to ETL scheduling;
a determining unit 301, configured to determine, according to the configuration file, a triggering manner, an execution order, and a dependency relationship between each subtask flow included in a certain task flow;
and the processing unit 302 is configured to sequentially trigger corresponding subtask flows according to a set triggering manner, and execute the triggered subtask flows according to a set order, where when it is determined that at least one subtask flow has been executed, other subtask flows that depend on the at least one subtask flow and have been triggered start to be executed according to a dependency relationship between the subtask flows.
In the ETL scheduling server, the storage unit 300 is further configured to store the triggering manner, the execution order, and the dependency relationship among the task blocks included in the task flow, and the triggering manner, the execution order, and the dependency relationship among the program blocks included in the program flow; as shown in fig. 3, in the embodiment of the present application, the ETL scheduling server further includes a monitoring unit 303, configured to monitor execution conditions of each task flow, each subtask flow, each program flow, and each program block, and notify an authorized user of a monitoring result in an email or short message manner.
Referring to fig. 3B, the processing unit 302 further comprises a first processing sub-unit 3020, a second processing sub-unit 3021 and a third processing sub-unit 3022, wherein,
the first processing subunit 3020 is configured to, when a certain task flow is executed, sequentially trigger corresponding task blocks (i.e., corresponding sub-task flows) according to a set trigger manner, and execute the triggered task blocks in a set order, where, when it is determined that at least one task block has been executed, other task blocks that depend on the at least one task block and have been triggered start to be executed according to a dependency relationship between the task blocks.
A second processing subunit 3021, configured to, when executing a certain task block (i.e., a certain subtask flow), sequentially trigger corresponding program flows according to a set trigger manner, and execute the triggered program flows according to a set order, where, when it is determined that at least one program flow has been executed, execution of other program flows that depend on the at least one program flow and have been triggered is started according to a preset dependency relationship between the program flows.
And a third processing subunit 3022, configured to, when a certain program flow is executed, sequentially trigger corresponding program blocks according to a set trigger manner, and execute the triggered program blocks according to a set order, where, when it is determined that at least one program block is executed completely, according to a preset dependency relationship between the program blocks, execution of other program blocks that depend on the at least one program block and are triggered is started, where the program block is an atomic-level process.
Or,
referring to fig. 3C, if a certain task flow is composed of a single program flow, that is, it can be regarded as being directly composed of a plurality of program blocks, the processing unit 302 may process the program blocks included in the task flow through an internally disposed processing subunit, where the processing subunit is configured to, when executing a certain program flow, sequentially trigger corresponding program blocks according to the trigger manner and the sequence stored in the storage unit 300, and execute the triggered program blocks, where, when it is determined that at least one program block has been executed, according to a preset dependency relationship between the program blocks, execution of other program blocks that depend on the at least one program block and have been triggered is started, where the program blocks are atomic-level processes.
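The nested processing performed by the processing subunits can be sketched as one recursive routine applied at every level. This is a simplified illustration under several assumptions: every item is treated as already triggered, items run sequentially rather than in parallel, and the nested dictionary layout (`children`, `deps`) and the function `execute` are hypothetical.

```python
from graphlib import TopologicalSorter


def execute(node, name, log):
    """Run a node; a callable is a program block, a dict is a flow with children."""
    if callable(node):
        node()            # atomic program block
        log.append(name)  # record completion order
        return
    # Order the children so that each runs after the children it depends on.
    graph = {child: node["deps"].get(child, ()) for child in node["children"]}
    for child in TopologicalSorter(graph).static_order():
        execute(node["children"][child], child, log)


def block():
    pass  # placeholder for a shell script, stored procedure, SQL block, etc.


program_flow = {
    "children": {"program 11": block, "program 12": block,
                 "program 13": block, "program 14": block},
    "deps": {"program 12": ["program 11"], "program 13": ["program 11"],
             "program 14": ["program 12", "program 13"]},
}
task_1 = {"children": {"program flow 1": program_flow}, "deps": {}}
task_flow_1 = {"children": {"task 1": task_1}, "deps": {}}

log = []
execute(task_flow_1, "task flow 1", log)
print(log[0], log[-1])  # program 11 program 14
```

Using the same routine at each level mirrors the alternative of fig. 3C, where a task flow made of a single program flow is handled by one processing subunit.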
Based on the ETL scheduling server, this embodiment takes the task flow 1 shown in fig. 1 as an example. Referring to fig. 2, task flow 1 includes four tasks to be executed, namely task 1, task 2, task 3, and task 4, with the following dependency relationship: the triggering of task 2 and task 3 depends on the running result of task 1, task 2 and task 3 are in a parallel relation, and the triggering of task 4 depends on the running results of task 2 and task 3. Assume that task 1 is set to be triggered at 03:00 every day, task 2 at 04:00 on the 13th of every month, task 3 at 04:00 every day, and task 4 at 05:00 every day. As shown in fig. 4 and fig. 5, assuming that the execution date is the 12th of a month, the detailed flow of the first mode in which the ETL scheduling server executes task flow 1 is as follows:
step 500: when the time reaches 03:00, task 1 is triggered and starts to run; the start of task 1 marks the start of task flow 1.
Assume that the present application is applied to a network payment system. In practice, a user of the network payment system may use a credit card to cash out illegally, and this embodiment may be used to collect statistics on user behavior in the network payment system so as to identify such illegal cash-out behavior. The content executed by task flow 1 may vary; for example, task flow 1 is "cash-out data extraction", in which task 1 is "count previous month card information", task 2 is "count contracted merchant list", task 3 is "count white list", and task 4 is "count filtered credit card user number", and the like.
Step 510: when the time reaches 04:00, it is determined that the current date is not the 13th of the month, so task 2 is skipped and does not run. Task 3 is triggered at the same time but does not run: task 3 is a successor task of task 1, the running result of task 1 has not yet been obtained, and so task 3 cannot run immediately after being triggered.
Step 520: when the time reaches 04:35, the task 1 is finished, the running result of the task 1 is obtained, and the task 3 is started to run.
Step 530: when the time reaches 04:50, the task 3 is finished running, and the running result of the task 3 is obtained.
Step 540: at time 05:00, task 4 is triggered and begins running.
In this embodiment, since task 3 has already finished running before task 4 is triggered, task 4 can run immediately after being triggered. Although task 4 is a successor task of task 2, task 2 is skipped in the current flow, so the running result of task 2 can be ignored and task 4 can start running based only on the running result of task 3.
On the other hand, if task 3 has not yet finished running when task 4 is triggered, task 4 needs to wait until task 3 finishes and its running result is obtained before it can start to run.
Step 550: at time 06:15, task 4 is done, indicating that task flow 1 is finished.
In steps 500 to 550 above, each task is triggered at a set time point. In practical applications, a task may also be set to be triggered by an event. For example, if any task becomes abnormal during the flow described in steps 500 to 550, the error collector installed in the ETL scheduling server automatically invokes the task bound to that abnormal condition to handle the abnormal event.
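Event triggering through an error collector could be sketched as follows. The patent does not describe the collector's interface, so the class `ErrorCollector`, its methods, and the use of exception types as the "abnormal condition" are all assumptions made for illustration.

```python
class ErrorCollector:
    """Maps an abnormal condition (here: an exception type) to a handling task."""

    def __init__(self):
        self._handlers = {}

    def bind(self, exc_type, handler):
        """Bind a handling task to one kind of abnormal condition."""
        self._handlers[exc_type] = handler

    def run(self, task_name, task):
        """Run a task; on failure, invoke the handling task bound to the exception."""
        try:
            task()
        except Exception as exc:
            for exc_type, handler in self._handlers.items():
                if isinstance(exc, exc_type):
                    handler(task_name, exc)  # the event-triggered task
                    return False
            raise  # no handling task is bound to this condition
        return True


handled = []
collector = ErrorCollector()
collector.bind(ConnectionError, lambda name, exc: handled.append(name))


def failing_task():
    raise ConnectionError("source database unreachable")


print(collector.run("task 3", failing_task))  # False
print(handled)                                # ['task 3']
```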
In another case, assuming that the execution date is 13 th of a month, referring to fig. 6, the detailed flow of the second mode of the ETL scheduling server executing the task flow 1 is as follows:
step 600: when the time reaches 03:00, task 1 is triggered and starts to run; the start of task 1 marks the start of task flow 1.
Step 610: when the time reaches 04:00, it is determined that the current date is the 13th of the month, so task 2 and task 3 are both triggered but do not run: task 2 and task 3 are successor tasks of task 1, the running result of task 1 has not yet been obtained, and so task 2 and task 3 cannot run immediately after being triggered.
Step 620: when the time reaches 04:35, the task 1 is finished, the running result of the task 1 is obtained, and the task 2 and the task 3 are started to run.
Step 630: and when the time reaches 04:50, completing the operation of the task 2 and the task 3, and obtaining the operation results of the task 2 and the task 3.
In this embodiment, for convenience of explanation, it is assumed that task 2 and task 3 are completed at the same time.
Step 640: at time 05:00, task 4 is triggered and begins running.
In this embodiment, since task 2 and task 3 are already run before task 4 is triggered, task 4 can be run directly after being triggered.
Step 650: at time 06:15, task 4 is done, indicating that task flow 1 is finished.
Similarly, in the above steps 600 to 650, each task is triggered by a set time point, and in practical applications, each task may also be set to be triggered by an event, which is not described herein again.
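Both execution modes can be reproduced with a small simulation. The trigger times and dependencies come from the example above; the run durations (95, 15, 15 and 75 minutes) are assumptions derived from the start and end times stated in the steps, and the table layout and function names are hypothetical.

```python
# name: (trigger time, day of month or None for every day, duration in minutes, predecessors)
TASKS = {
    "task 1": ("03:00", None, 95, []),
    "task 2": ("04:00", 13, 15, ["task 1"]),
    "task 3": ("04:00", None, 15, ["task 1"]),
    "task 4": ("05:00", None, 75, ["task 2", "task 3"]),
}


def to_minutes(hhmm):
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def to_hhmm(minutes):
    return f"{minutes // 60:02d}:{minutes % 60:02d}"


def simulate(day_of_month):
    """Return {task: (start, end)}, or None for a task skipped on that day."""
    schedule = {}
    for name, (trigger, only_day, duration, preds) in TASKS.items():  # listed in dependency order
        if only_day is not None and only_day != day_of_month:
            schedule[name] = None  # not triggered today, so it is skipped
            continue
        start = to_minutes(trigger)
        for pred in preds:
            if schedule[pred] is not None:  # a skipped predecessor is ignored
                start = max(start, to_minutes(schedule[pred][1]))
        schedule[name] = (to_hhmm(start), to_hhmm(start + duration))
    return schedule


print(simulate(12))
# {'task 1': ('03:00', '04:35'), 'task 2': None,
#  'task 3': ('04:35', '04:50'), 'task 4': ('05:00', '06:15')}
print(simulate(13)["task 2"])  # ('04:35', '04:50')
```

A task starts at the later of its trigger time and the finish time of its non-skipped predecessors, which is why task 3 starts at 04:35 although it is triggered at 04:00.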
In the above embodiment, each program flow included in each task block also needs to be run one by one according to a set dependency relationship, for example, when a certain task block including a plurality of program flows is executed, corresponding program flows are sequentially triggered according to a set triggering manner, and the triggered program flows are executed according to a set order, wherein when it is determined that at least one program flow is executed, other program flows which depend on the at least one program flow and are triggered are started to be executed according to a preset dependency relationship between the program flows; the program flows without dependency relationship can be triggered in parallel, and are not described herein again.
Further, when a program flow including a plurality of program blocks is executed, the corresponding program blocks are sequentially triggered according to a set triggering manner, and the triggered program blocks are executed in a set sequence. When it is determined that at least one program block has finished executing, the other program blocks that depend on that program block and have been triggered start to execute, according to a preset dependency relationship among the program blocks; program blocks with no dependency relationship can be triggered in parallel. The program blocks are atomic-level processes. The following description takes task 1 as an example. As shown in fig. 1, assume that task 1 includes four program blocks, namely program 11, program 12, program 13, and program 14, with the following dependency relationship: the triggering of program 12 and program 13 depends on the running result of program 11, program 12 and program 13 are in a parallel relation, and the triggering of program 14 depends on the running results of program 12 and program 13. Then, referring to fig. 1, after task 1 starts to run, program 11 is triggered first. After program 11 finishes, programs 12 and 13, if triggered, start to run synchronously based on the running result of program 11. After programs 12 and 13 finish, program 14, if triggered, starts to run based on their running results and runs until it finishes; the completion of program 14 also marks the end of task 1.
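The grouping of program blocks into sets that can run in parallel follows directly from the dependency relationship. The sketch below uses Python's standard `graphlib` module to compute those groups for the four program blocks of task 1; the function name `execution_waves` is an assumption, and triggering is ignored (every block is treated as already triggered).

```python
from graphlib import TopologicalSorter


def execution_waves(dependencies):
    """Group items into successive waves; items in one wave can run in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose predecessors are done
        waves.append(ready)
        sorter.done(*ready)                 # mark the whole wave as finished
    return waves


dependencies = {
    "program 11": [],
    "program 12": ["program 11"],
    "program 13": ["program 11"],
    "program 14": ["program 12", "program 13"],
}
print(execution_waves(dependencies))
# [['program 11'], ['program 12', 'program 13'], ['program 14']]
```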
Based on the above technical scheme, in the embodiment of the application, the ETL scheduling server provides an external user interface, so that maintenance personnel can access the ETL scheduling server remotely and define the ETL scheduling flow and manage its logs through a Web page-based operation interface provided by the server. Fig. 7 shows the running status log management of the ETL scheduling flow: through the visual operation interface, maintenance personnel can perform operations such as retry, skip, and suspend on a running program block, a task block, or the whole task flow. They can also judge program execution performance from the start time, end time, and interval time of program execution, and continuously optimize and adjust the program flow using the flow designer. Further, a message notification mode can be configured on the ETL scheduling server, so that when maintenance personnel are absent, the relevant personnel can be notified of the running status of program blocks, task blocks, and the whole task flow by short message, e-mail, or the like.
In summary, the embodiment of the present application adopts a nested ETL scheduling flow implementation method, so that the service logic and service functions between subtask flows (task blocks), between program flows, and between program blocks in each task flow are clear, which effectively improves the execution efficiency of the ETL scheduling flow. In addition, because a Web page-based operation interface is used to perform independent operations such as retry and suspend on each task flow, subtask flow, program flow, and program block in the ETL scheduling, the coupling among task flows, subtask flows, and program blocks is reduced, which lowers the learning difficulty of later maintenance and management work and saves learning and maintenance costs.
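Judging execution performance from logged start and end times could be as simple as ranking items by run duration. The record layout (name, start, end as "HH:MM" on the same day) and the function `run_durations` are assumptions; the patent does not describe the log format. The sample times reuse the 12th-of-the-month example.

```python
from datetime import datetime


def run_durations(records):
    """Return (name, minutes) pairs sorted from slowest to fastest."""
    fmt = "%H:%M"
    durations = []
    for name, start, end in records:
        delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
        durations.append((name, delta.total_seconds() / 60))
    return sorted(durations, key=lambda item: item[1], reverse=True)


log_records = [
    ("task 1", "03:00", "04:35"),
    ("task 3", "04:35", "04:50"),
    ("task 4", "05:00", "06:15"),
]
print(run_durations(log_records))
# [('task 1', 95.0), ('task 4', 75.0), ('task 3', 15.0)]
```

The slowest items at the top of such a ranking are natural first candidates when adjusting a program flow.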
It will be apparent to those skilled in the art that various changes and modifications may be made in the embodiments herein without departing from the spirit and scope of the application. Thus, if such modifications and variations in the embodiments of the present application fall within the scope of the claims of the present application and their equivalents, the embodiments of the present application are intended to include such modifications and variations.

Claims (12)

1. A method for implementing data extraction, transformation and loading (ETL) scheduling, wherein the ETL scheduling comprises a plurality of task flows, each task flow comprises a plurality of subtask flows, and the method is characterized in that, for one task flow, the method comprises the following steps:
determining a triggering mode, an execution sequence and a mutual dependency relationship of each subtask flow contained in a task flow according to a preset configuration file;
and sequentially triggering the corresponding subtask flows according to a set triggering mode, and executing the triggered subtask flows according to a set sequence, wherein when at least one subtask flow is determined to be executed completely, other subtask flows which depend on the at least one subtask flow and are triggered are started to be executed according to the dependency relationship among the subtask flows.
2. The method of claim 1, wherein a subtask flow comprises at least one program flow.
3. The method according to claim 2, wherein if a subtask flow includes at least two program flows, when executing the subtask flow, the corresponding program flows are sequentially triggered according to a set triggering manner, and the triggered program flows are executed according to a set sequence, wherein when it is determined that at least one program flow has been executed, other program flows that depend on the at least one program flow and are triggered start to be executed according to a preset dependency relationship between the program flows.
4. The method of claim 2 or 3, wherein a program flow includes at least one program block, and if a program flow includes a plurality of program blocks, then when the program flow is executed, the corresponding program blocks are sequentially triggered according to a set triggering manner and the triggered program blocks are executed according to a set sequence, wherein when it is determined that at least one program block has been executed, other program blocks that depend on the at least one program block and have been triggered start to be executed according to a preset dependency relationship between the program blocks, the program blocks being atomic-level processes.
5. The method of claim 4, wherein management operations are performed on task flows, subtask flows, program flows, and program blocks through a Web-page-based operation interface.
6. The method of claim 4, wherein the execution of each task flow, each subtask flow, each program flow, and each program block is monitored, and the authorized user is notified of the monitoring result by mail or short message.
7. An apparatus for performing ETL scheduling for data extraction, transformation, and loading, comprising:
the storage unit is used for storing a configuration file, and the configuration file at least comprises the triggering modes of, and the mutual dependency relationships among, the subtask flows in a plurality of task flows belonging to the ETL scheduling;
a determining unit, configured to determine, according to the configuration file, a triggering manner, an execution order, and a dependency relationship between each subtask flow included in a certain task flow;
and the processing unit is used for sequentially triggering the corresponding subtask flows according to a set triggering mode and executing the triggered subtask flows according to a set sequence, wherein when at least one subtask flow is determined to be executed completely, other subtask flows which depend on the at least one subtask flow and are triggered are started to be executed according to the dependency relationship among the subtask flows.
8. The apparatus according to claim 7, wherein the storage unit is further configured to store triggering manners, execution orders, and dependencies among the program flows included in the respective subtask flows, and further configured to store triggering manners, execution orders, and dependencies among the program blocks included in the respective program flows, the program blocks being atomic-level processes.
9. The apparatus of claim 8, wherein the processing unit further comprises:
the first processing subunit is used for sequentially triggering the corresponding subtask flows according to the triggering mode and the sequence stored in the storage unit when executing a certain task flow and executing the triggered subtask flows, wherein when determining that at least one subtask flow is executed completely, other subtask flows which depend on the at least one subtask flow and are triggered are started to be executed according to the dependency relationship among the subtask flows;
the second processing subunit is used for sequentially triggering the corresponding program flows according to the triggering mode and the sequence stored in the storage unit and executing the triggered program flows when executing a certain subtask flow, wherein when determining that at least one program flow is executed completely, other program flows which depend on the at least one program flow and are triggered are started to be executed according to the preset dependency relationship among the program flows;
and the third processing subunit is used for sequentially triggering the corresponding program blocks according to the triggering mode and the sequence stored in the storage unit and executing the triggered program blocks when executing a certain program flow, wherein when determining that at least one program block is executed, other triggered program blocks which depend on the at least one program block are started to be executed according to the preset dependency relationship among the program blocks.
10. The apparatus as claimed in claim 8, wherein the processing unit further comprises a processing subunit, and the processing subunit is configured to, when executing a certain program flow, sequentially trigger the corresponding program blocks according to the trigger manner and the sequence stored in the storage unit, and execute the triggered program blocks, wherein, when it is determined that at least one program block has been executed, the execution of other program blocks that depend on the at least one program block and have been triggered is started according to the preset dependency relationship between the program blocks.
11. The apparatus of any one of claims 7-10, wherein the apparatus further comprises:
and the user interface unit is used for providing an operation interface based on the Web page for a user and receiving the management operation of the user on each task flow, each subtask flow, each program flow and each program block through the operation interface.
12. The apparatus of claim 10, wherein the apparatus further comprises:
and the monitoring unit is used for monitoring the execution conditions of each task flow, each subtask flow, each program flow and each program block and informing the authorized user of the monitoring result in a mail or short message mode.
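The dependency-driven execution recited in claim 1 can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: the configuration is modelled as a plain dictionary with hypothetical keys `order` and `depends_on`, the set of triggered subtask flows is passed in directly, and a subtask flow is assumed to start only once all of its listed dependencies have finished.

```python
from typing import Dict, List, Set


def schedule(config: Dict[str, dict], triggered: Set[str]) -> List[str]:
    """Return the order in which triggered subtask flows would execute.

    config maps each subtask flow name to a dict with the hypothetical keys
    "order" (configured execution order) and "depends_on" (prerequisite names).
    """
    done: List[str] = []
    # Only triggered subtask flows are candidates, taken in configured order.
    pending = sorted(
        (name for name in config if name in triggered),
        key=lambda name: config[name]["order"],
    )
    while pending:
        # A subtask flow is ready once every flow it depends on has completed.
        ready = [
            name for name in pending
            if all(dep in done for dep in config[name]["depends_on"])
        ]
        if not ready:
            # The remaining flows wait on something that never ran
            # (an untriggered prerequisite or a dependency cycle).
            break
        nxt = ready[0]
        done.append(nxt)
        pending.remove(nxt)
    return done
```

The same routine could in principle be applied one level down, to the program flows inside a subtask flow or to the program blocks inside a program flow, since claims 3 and 4 describe the same trigger, order, and dependency pattern at those levels.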
CN2009102032769A 2009-06-02 2009-06-02 Method and apparatus for implementing ETL scheduling Active CN101567013B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN2009102032769A CN101567013B (en) 2009-06-02 2009-06-02 Method and apparatus for implementing ETL scheduling
HK10104106.1A HK1137244A1 (en) 2009-06-02 2010-04-27 Method for implementing etl scheduling and apparatus thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2009102032769A CN101567013B (en) 2009-06-02 2009-06-02 Method and apparatus for implementing ETL scheduling

Publications (2)

Publication Number Publication Date
CN101567013A CN101567013A (en) 2009-10-28
CN101567013B true CN101567013B (en) 2011-09-28

Family

ID=41283162

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2009102032769A Active CN101567013B (en) 2009-06-02 2009-06-02 Method and apparatus for implementing ETL scheduling

Country Status (2)

Country Link
CN (1) CN101567013B (en)
HK (1) HK1137244A1 (en)

Families Citing this family (50)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102236580B (en) * 2010-04-26 2013-03-20 阿里巴巴集团控股有限公司 Method for distributing node to ETL (Extraction-Transformation-Loading) task and dispatching system
CN102004666A (en) * 2010-11-25 2011-04-06 中国工商银行股份有限公司 Large-scale computer operation scheduling equipment and system
WO2012075622A1 (en) * 2010-12-07 2012-06-14 Sap Ag Implemantion of a process based on a user-defined sub-task sequence
CN102073540B (en) * 2010-12-15 2013-05-08 北京新媒传信科技有限公司 Distributed affair submitting method and device thereof
CN102541959B (en) * 2010-12-31 2014-03-12 中国移动通信集团安徽有限公司 Method, device and system for scheduling extract-transform-load (ETL)
CN102750179B (en) * 2011-04-22 2014-10-01 中国移动通信集团河北有限公司 Method and device for scheduling tasks between cloud computing platform and data warehouse
CN102279888B (en) * 2011-08-24 2014-04-30 北京新媒传信科技有限公司 Method and system for scheduling tasks
CN102375891A (en) * 2011-11-15 2012-03-14 山东浪潮金融信息系统有限公司 Implementation tool for unloading and loading incremental data
CN103514028B (en) * 2012-06-14 2016-12-21 北京新媒传信科技有限公司 A kind of method and apparatus processing distributed transaction
CN102999816B (en) * 2012-12-05 2016-02-24 中邮科通信技术股份有限公司 The workflow engine of personalized operation flow
CN103902574A (en) * 2012-12-27 2014-07-02 中国移动通信集团内蒙古有限公司 Real-time data loading method and device based on data flow technology
CN103034554B (en) * 2012-12-30 2015-11-18 焦点科技股份有限公司 The ETL dispatching system that a kind of error correction is restarted and automatic decision starts and method
CN103164337B (en) * 2013-02-28 2015-12-09 汉柏科技有限公司 Based on the cloud computing method for testing software of finite state machine
CN104679482A (en) * 2013-11-27 2015-06-03 北京拓尔思信息技术股份有限公司 OSGI (Open Service Gateway Initiative)-based ETL (Extraction-Transformation-Loading) processing device and method
CN104778074B (en) * 2014-01-14 2019-02-26 腾讯科技(深圳)有限公司 A kind of calculating task processing method and processing device
CN105095327A (en) * 2014-05-23 2015-11-25 深圳市珍爱网信息技术有限公司 Distributed ELT system and scheduling method
CN104572257A (en) * 2014-07-30 2015-04-29 南京坦道信息科技有限公司 United coordination dispatching algorithm based on finite state automata for various high-concurrency jobs
CN104462243B (en) * 2014-11-19 2018-09-07 上海烟草集团有限责任公司 A kind of ETL scheduling system and methods of combination data check
CN104484167B (en) * 2014-12-05 2018-03-09 广州华多网络科技有限公司 Task processing method and device
CN105808619B (en) * 2014-12-31 2019-08-06 华为技术有限公司 Method, impact analysis computing device and the key reset apparatus that task based on impact analysis is reformed
CN104750522B (en) * 2015-03-12 2018-01-05 用友网络科技股份有限公司 The Dynamic Execution method and system of task or flow
CN106528275A (en) * 2015-09-10 2017-03-22 网易(杭州)网络有限公司 Processing method of data tasks and task scheduler
CN105321045A (en) * 2015-11-04 2016-02-10 北京知聚科技有限公司 Service process formal model construction method and system
CN105446808B (en) * 2015-11-12 2019-05-21 国云科技股份有限公司 A kind of method that combined task completes complex task
CN106708854B (en) * 2015-11-13 2020-05-22 博雅网络游戏开发(深圳)有限公司 Data export method and device
CN106712924B (en) * 2015-11-16 2021-03-19 方正国际软件(北京)有限公司 Method and device for realizing universal time sequence communication
CN105677462A (en) * 2015-12-30 2016-06-15 生迪光电科技股份有限公司 Distributed task system based on internet of things and business processing method
CN107025224B (en) * 2016-01-29 2020-10-16 阿里巴巴集团控股有限公司 Method and equipment for monitoring task operation
CN105976158A (en) * 2016-04-26 2016-09-28 中国电子科技网络信息安全有限公司 Visual ETL flow management and scheduling monitoring method
CN107479962B (en) * 2016-06-08 2021-05-07 阿里巴巴集团控股有限公司 Method and equipment for issuing task
CN106293920A (en) * 2016-08-15 2017-01-04 北京票之家科技有限公司 Method for scheduling task and device
CN107145576B (en) * 2017-05-08 2020-06-23 科技谷(厦门)信息技术有限公司 Big data ETL scheduling system supporting visualization and process
CN109408204A (en) * 2017-08-15 2019-03-01 阿里巴巴集团控股有限公司 A kind of method for scheduling task and device of distributed task scheduling system
CN108564281B (en) * 2018-04-13 2022-04-05 浙江传媒学院 Method for realizing outsourcing work task scheduling system based on structuralization
CN109240810B (en) * 2018-08-03 2021-02-23 腾讯科技(深圳)有限公司 Task processing method and device and storage medium
CN109359949B (en) * 2018-10-30 2022-05-27 中国建设银行股份有限公司 Flow display method and device
CN109445929A (en) * 2018-11-16 2019-03-08 杭州数澜科技有限公司 A kind of method and system of scheduler task
CN109857794A (en) * 2018-12-29 2019-06-07 南瑞集团有限公司 A kind of implementation method and its system of the high concurrent lightweight data integration framework based on response
CN111176802B (en) * 2019-07-26 2023-03-14 腾讯科技(深圳)有限公司 Task processing method and device, electronic equipment and storage medium
CN111082976B (en) * 2019-12-02 2022-07-29 东莞数汇大数据有限公司 Method for supporting ETL task scheduling visualization
CN113127522B (en) * 2019-12-31 2024-05-10 阿里巴巴集团控股有限公司 Data processing method, device, system and storage medium
CN111427943A (en) * 2020-03-27 2020-07-17 北京明略软件系统有限公司 Task management method and device in ETL system
CN111930814B (en) * 2020-05-29 2024-02-27 武汉达梦数据库股份有限公司 File event scheduling method based on ETL system and ETL system
CN111857984A (en) * 2020-06-01 2020-10-30 北京文思海辉金信软件有限公司 Job calling processing method and device in bank system and computer equipment
CN111914010B (en) * 2020-08-04 2024-02-20 北京百度网讯科技有限公司 Method, device, equipment and storage medium for processing business
CN112486502A (en) * 2020-11-30 2021-03-12 京东方科技集团股份有限公司 Distributed task deployment method and device, computer equipment and storage medium
CN112667383B (en) * 2020-12-31 2024-02-09 北京高途云集教育科技有限公司 Task execution and scheduling method, system, device, computing equipment and medium
CN113111106A (en) * 2021-04-06 2021-07-13 创意信息技术股份有限公司 ETL design data access method and data access module based on Web
CN113138807B (en) * 2021-04-25 2022-09-09 上海淇玥信息技术有限公司 Method and device for executing multi-node service task and electronic equipment
CN117112668B (en) * 2023-08-23 2024-02-20 广州嘉磊元新信息科技有限公司 ETL-based RPA flow management method and system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1897025A (en) * 2006-04-27 2007-01-17 南京联创科技股份有限公司 Parallel ETL technology of multi-thread working pack in mass data process
CN1953490A (en) * 2006-09-06 2007-04-25 南京中兴软创科技有限责任公司 A method to extract and provide the charging data with the technology of ETL
CN101216782A (en) * 2007-12-29 2008-07-09 中国建设银行股份有限公司 Method and system for financial data accomplishing ETL processing

Also Published As

Publication number Publication date
CN101567013A (en) 2009-10-28
HK1137244A1 (en) 2010-07-23

Similar Documents

Publication Publication Date Title
CN101567013B (en) Method and apparatus for implementing ETL scheduling
US9588822B1 (en) Scheduler for data pipeline
US10101991B2 (en) Managing a software-patch submission queue
CN100487700C (en) Data processing method and system of data library
CN111125444A (en) Big data task scheduling management method, device, equipment and storage medium
US8538793B2 (en) System and method for managing real-time batch workflows
US20100223446A1 (en) Contextual tracing
CN110287052A (en) A kind of root of abnormal task determines method and device because of task
CN107577586B (en) Method and equipment for determining service execution link in distributed system
CN101477524A (en) System performance optimization method and system based on materialized view
CN105630588A (en) Distributed job scheduling method and system
CN110569090A (en) data processing method and device, electronic equipment and storage medium
CN101751288A (en) Method, device and system applying process scheduler
CN105719126A (en) System and method for internet big data task scheduling based on life cycle model
CN104536819A (en) Task scheduling method based on WEB service
CN101639803A (en) Exception handling method and exception handling device for multithread application system
CN111930354B (en) Framework component system for software development and construction method thereof
CN111984447B (en) Registration compensation system and method in overtime or abnormal situation of bank transaction
CN103744730A (en) Task scheduling method and device
US20130239123A1 (en) Milestone manager
CN105446812A (en) Multitask scheduling configuration method
CN112948096A (en) Batch scheduling method, device and equipment
CN105450737B (en) A kind of data processing method, device and system
US20070074225A1 (en) Apparatus, method and computer program product providing integration environment having an integration control functionality coupled to an integration broker
CN105354083A (en) Method and apparatus for checking precondition of scheduling task

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 1137244; Country of ref document: HK)

C14 Grant of patent or utility model
GR01 Patent grant
REG Reference to a national code (Ref country code: HK; Ref legal event code: GR; Ref document number: 1137244; Country of ref document: HK)

TR01 Transfer of patent right

Effective date of registration: 20191209

Address after: P.O. Box 31119, grand exhibition hall, hibiscus street, 802 West Bay Road, Grand Cayman, Cayman Islands

Patentee after: Innovative advanced technology Co., Ltd

Address before: Greater Cayman Islands, British Cayman Islands

Patentee before: Alibaba Group Holding Co., Ltd.
