CN110659116A

CN110659116A - Big data ETL task scheduling method

Info

Publication number: CN110659116A
Application number: CN201910752971.4A
Authority: CN
Inventors: 朱小杰; 沈志宏; 杜一; 周园春
Original assignee: Computer Network Information Center of CAS
Current assignee: Computer Network Information Center of CAS
Priority date: 2019-08-15
Filing date: 2019-08-15
Publication date: 2020-01-07

Abstract

The invention provides a scheduling method for big data ETL tasks and relates to the technical fields of big data, pipelines and task scheduling. A plurality of ETL tasks orchestrated by a user are organized hierarchically and are then scheduled and executed according to the organized hierarchy and order. Organizing, scheduling and executing ETL tasks in this hierarchical manner helps reduce manual intervention and lower operation and maintenance costs.

Description

Big data ETL task scheduling method

Technical Field

The invention relates to the technical fields of big data, pipelines and task scheduling, and in particular to a method for scheduling big data Extract-Transform-Load (ETL) tasks.

Background

In some complex scenarios a single pipeline cannot meet the requirements, and complex logic must be completed by several cooperating pipelines, whose number grows as the actual business changes. Running the pipelines involves processes such as data acquisition, cleaning and aggregation, which must occur in a defined order. As the number of pipelines keeps growing, grouping the pipelines and supporting scheduling both between pipelines and between pipeline groups can greatly reduce manual intervention and improve later operation and maintenance efficiency.

Among published Chinese invention patents, "an automatic scheduling method for multi-level data conversion tasks" (application number: 201610066935.9) proposes layering tasks by data level and setting priorities for scheduling, but does not propose a scheduling strategy for the order of data conversion tasks within the same layer. "An ETL-based data task scheduling method and system" (application number: 201710162185.X) divides a data task into four layers, get, dwsdata, dwddata and dwmart, and completes a single ETL task covering data source acquisition, data storage, data cleaning and data conversion, but does not describe the scheduling of multiple pipelines. "A big data ETL task scheduling method and device" (application number: 201710570908.X) solves the data confusion caused by timed scheduling of multiple pipelines, but does not address complex scheduling among multiple pipelines. "A method and system for arranging big data ETL tasks" (application number: 201910359658.4) proposes an orchestration method and system for a single pipeline, together with the pipeline model description language ETLDL, but does not describe multiple pipelines or scheduling among them. If pipelines are scheduled manually in a real deployment, labor costs rise and the error rate increases greatly.

Therefore, how to schedule multiple pipelines, and multiple pipeline groups, so as to reduce manual intervention and lower operation and maintenance costs is a problem that urgently needs to be solved.

Disclosure of Invention

The invention aims to provide a scheduling method of big data ETL tasks, which organizes and schedules the ETL tasks in a hierarchical manner, is beneficial to reducing manual intervention and reduces operation and maintenance cost.

In order to achieve the purpose, the invention adopts the following technical scheme:

A scheduling method for big data ETL tasks comprises the following steps:

1) organizing a plurality of ETL tasks orchestrated by a user according to a hierarchy;

2) scheduling and executing the ETL tasks according to the organized hierarchy and order.

Further, the method organizes the ETL tasks in three layers: Project, FlowGroup and Flow. A Flow is the basic unit of pipeline scheduling, namely one ETL task. Pipelines with similar functions in the same Project are organized into a FlowGroup, and a Project is the project-level collection of FlowGroups and Flows.
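The three-layer organization can be pictured as a small data model. The following Python sketch is illustrative only: the class layout, the `members` field and the `all_flows` helper are assumptions and are not taken from ETLDL, and the names P1, G1 and F1 to F6 are borrowed from the embodiment described later in this document.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Flow:
    """Basic scheduling unit: a single ETL task (pipeline)."""
    name: str


@dataclass
class FlowGroup:
    """Flows with similar functions inside one Project."""
    name: str
    flows: List[Flow] = field(default_factory=list)


@dataclass
class Project:
    """Project-level collection of FlowGroups and standalone Flows."""
    name: str
    members: List[Union[Flow, FlowGroup]] = field(default_factory=list)

    def all_flows(self) -> List[str]:
        """Flatten the hierarchy into the names of every Flow it contains."""
        names: List[str] = []
        for member in self.members:
            if isinstance(member, FlowGroup):
                names.extend(flow.name for flow in member.flows)
            else:
                names.append(member.name)
        return names


g1 = FlowGroup("G1", [Flow("F2"), Flow("F3"), Flow("F4")])
p1 = Project("P1", [Flow("F1"), g1, Flow("F5"), Flow("F6")])
print(p1.all_flows())
```

Keeping FlowGroups and standalone Flows in one `members` list mirrors the statement that a Project contains both kinds of element at the same level.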

Furthermore, the hierarchical representation is obtained by extending the pipeline model description language ETLDL. The descriptions of Flow, FlowGroup and Project are given in Figs. 1, 2 and 3, respectively. A FlowGroup contains a plurality of Flows and defines the execution order among them. A Project contains a plurality of Flows and FlowGroups and defines their execution order. In the XML Schema diagrams of the extended model description language, rectangular boxes labeled 'S', 'C' and 'A' represent the three content models 'Sequence', 'Choice' and 'All', respectively.

Further, the method uses Condition to indicate the execution order between one Flow and another, and between a Flow and a FlowGroup. The execution order is described using Current and After, where Current denotes the current Flow/FlowGroup and After denotes the Flow/FlowGroup that must be executed before the current Flow/FlowGroup, as shown in Fig. 4.
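As a rough illustration of how Current/After pairs might be read into a scheduler, the sketch below parses a hypothetical XML fragment. Only the names Condition, Current and After come from the text; the `Conditions` wrapper element, the nesting and the `parse_conditions` helper are assumptions, since the actual ETLDL schema is defined in the figures and is not reproduced here.

```python
import xml.etree.ElementTree as ET

# Hypothetical ETLDL-style fragment. The element names Condition, Current and
# After appear in the text; the wrapper element and the nesting are assumed.
ETLDL_FRAGMENT = """
<Conditions>
  <Condition><Current>G1</Current><After>F1</After></Condition>
  <Condition><Current>F5</Current><After>G1</After></Condition>
  <Condition><Current>F6</Current><After>G1</After></Condition>
</Conditions>
"""


def parse_conditions(xml_text):
    """Return {current: set of names that must be executed before it}."""
    after = {}
    for cond in ET.fromstring(xml_text).iter("Condition"):
        current = cond.findtext("Current").strip()
        for node in cond.findall("After"):
            after.setdefault(current, set()).add(node.text.strip())
    return after


conditions = parse_conditions(ETLDL_FRAGMENT)
print(conditions)
```

A mapping from each Current name to its set of After names is convenient because the readiness test then reduces to a subset check against the completed pipelines.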

Further, the method also includes timed scheduling of pipelines. Fig. 5 shows the model description of timed scheduling, in which Expression represents the timed scheduling policy and is written as a cron expression, and Entry is the entity to be executed, which is one of Project, FlowGroup and Flow.
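The text does not state which cron dialect is used. The sketch below assumes the classic five-field form (minute, hour, day of month, month, day of week, with 0 meaning Sunday) and supports only `*`, single values, lists, ranges and steps. It simply requires every field to match, which is a simplification of real cron behaviour when both day fields are restricted. The function names and the `schedule` dictionary are hypothetical.

```python
from datetime import datetime

# Allowed (low, high) bounds for: minute, hour, day of month, month, weekday.
FIELD_RANGES = [(0, 59), (0, 23), (1, 31), (1, 12), (0, 6)]


def expand(field, lo, hi):
    """Expand one cron field such as '*/15', '1-5' or '0,30' into a set."""
    values = set()
    for part in field.split(","):
        rng, _, step_text = part.partition("/")
        step = int(step_text) if step_text else 1
        if rng == "*":
            start, end = lo, hi
        elif "-" in rng:
            low_text, high_text = rng.split("-")
            start, end = int(low_text), int(high_text)
        else:
            start = int(rng)
            end = hi if step_text else start
        values.update(range(start, end + 1, step))
    return values


def cron_matches(expression, moment):
    """Return True if the datetime falls on a minute selected by the expression."""
    fields = expression.split()
    if len(fields) != 5:
        raise ValueError("expected 5 fields: minute hour day month weekday")
    actual = [moment.minute, moment.hour, moment.day, moment.month,
              moment.isoweekday() % 7]  # cron convention: 0 = Sunday
    return all(value in expand(text, lo, hi)
               for text, value, (lo, hi) in zip(fields, actual, FIELD_RANGES))


# Hypothetical timed-scheduling entry: run Project P1 every day at 02:00.
schedule = {"expression": "0 2 * * *", "entry": "P1"}
print(cron_matches(schedule["expression"], datetime(2019, 8, 15, 2, 0)))
```

A scheduler built this way would evaluate `cron_matches` once per minute and hand the Entry to the state-pool mechanism whenever it returns True.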

Further, the method implements scheduling based on state pools and a monitoring mechanism, as shown in Fig. 6. Pipelines are placed into three resource pools according to their state: the Waiting Pool (waiting to be scheduled), the Started Pool (running) and the Completed Pool (finished). Two monitors, the Condition Monitor and the Task Monitor, are set up at the same time. The Condition Monitor pulls the pipelines in the Waiting Pool and the Completed Pool, identifies the pipelines whose After condition is satisfied, starts them and places them into the Started Pool. The Task Monitor pulls the pipelines in the Started Pool and moves each pipeline whose execution has completed into the Completed Pool. The whole scheduling task ends when the Waiting Pool is empty.
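The interplay of the three pools and the two monitors can be sketched as a single loop. This is a simplified, single-threaded simulation under stated assumptions: each started pipeline is taken to have finished by the next Task Monitor pass, the `run` callback stands in for real pipeline execution, and the check for unsatisfiable conditions is an addition that the text does not mention. The function name `run_schedule` is hypothetical.

```python
def run_schedule(members, after, run=lambda name: None):
    """Simulate state-pool scheduling; return (start order, Completed Pool)."""
    waiting, started, completed = set(members), set(), set()
    start_order = []
    while waiting:
        # Condition Monitor: find waiting pipelines whose After set is
        # already contained in the Completed Pool, then start them.
        ready = sorted(m for m in waiting if after.get(m, set()) <= completed)
        if not ready:
            # Added safeguard (not in the source): a cycle or a reference to
            # an unknown pipeline would otherwise loop forever.
            raise ValueError("unsatisfiable After conditions: %s" % sorted(waiting))
        for name in ready:
            waiting.remove(name)
            started.add(name)
            start_order.append(name)
            run(name)
        # Task Monitor: move finished pipelines from Started to Completed.
        finished = set(started)
        started -= finished
        completed |= finished
    return start_order, completed


order, completed = run_schedule(
    ["F1", "G1", "F5", "F6"],
    {"G1": {"F1"}, "F5": {"G1"}, "F6": {"G1"}},
)
print(order)
```

In a real deployment the two monitors would presumably run concurrently and poll the pools, so the Task Monitor step would test actual completion rather than assume it.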

The present invention also provides a scheduling system for big data ETL tasks, the system comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the program comprising instructions for performing the steps of the above method.

The present invention also provides a computer readable storage medium storing a computer program comprising instructions which, when executed by a processor of a system, cause the system to perform the steps of the above-described method.

The invention has the following beneficial effects:

The scheduling method for big data ETL tasks can organize pipelines in multiple layers and supports both sequential scheduling and timed scheduling of pipelines, which greatly reduces labor costs and improves later operation and maintenance efficiency.

Drawings

Fig. 1 is a schematic diagram of the Flow description based on the model description language ETLDL.

Fig. 2 is a schematic diagram of the FlowGroup description based on the model description language ETLDL.

Fig. 3 is a schematic diagram of the Project description based on the model description language ETLDL.

Fig. 4 is a schematic diagram of the Condition description based on the model description language ETLDL.

Fig. 5 is a schematic diagram of timed scheduling based on the model description language ETLDL.

Fig. 6 is a schematic diagram of pipeline scheduling.

Fig. 7 is a schematic diagram of the pipeline scheduling embodiment.

Fig. 8 is a schematic diagram of the initial state of the pipeline scheduling embodiment.

Fig. 9 is a schematic diagram of the first running state of the pipeline scheduling embodiment.

Fig. 10 is a schematic diagram of the second running state of the pipeline scheduling embodiment.

Fig. 11 is a schematic diagram of the third running state of the pipeline scheduling embodiment.

Fig. 12 is a schematic diagram of the fourth running state of the pipeline scheduling embodiment.

Fig. 13 is a schematic diagram of the fifth running state of the pipeline scheduling embodiment.

Fig. 14 is a schematic diagram of the completed state of the pipeline scheduling embodiment.

Detailed Description

In order to make the aforementioned and other features and advantages of the invention more comprehensible, embodiments accompanied with figures are described in detail below.

The embodiment provides a scheduling method of a big data ETL task. The method comprises the following specific steps:

1) The user orchestrates Projects, FlowGroups and Flows as required.

2) The method places the pipelines orchestrated by the user into three resource pools according to their state: the Waiting Pool (waiting to be scheduled), the Started Pool (running) and the Completed Pool (finished). Two monitors, the Condition Monitor and the Task Monitor, are set up at the same time. The Condition Monitor pulls the pipelines in the Waiting Pool and the Completed Pool, identifies the pipelines whose After condition is satisfied, starts them and places them into the Started Pool. The Task Monitor pulls the pipelines in the Started Pool and moves each pipeline whose task has completed into the Completed Pool. The whole scheduling task ends when the Waiting Pool is empty.

3) The method uses standard cron expressions for timed scheduling of Projects, FlowGroups and Flows.

The method is described below using one user's specific requirement as an example. To fulfil the requirement, the user creates a project P1 in which six pipelines must be orchestrated: F1, F2, F3, F4, F5 and F6. F2, F3 and F4 are pipelines with similar functions and are placed in the same pipeline group G1. The pipelines in G1 have the following execution order: F3 After F2 and F4 After F2. The execution order within project P1 is: G1 After F1, F5 After G1 and F6 After G1. See Fig. 7 for details.
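The example configuration can be written down and checked mechanically. The sketch below encodes the After relations of P1 and G1 as dictionaries and uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later) to group the members into batches that may start together. This is a swapped-in technique for illustration, not the state-pool mechanism itself, and the `waves` helper is a hypothetical name.

```python
from graphlib import TopologicalSorter

# After relations from the example: {current: names that must run first}.
P1_MEMBERS = ["F1", "G1", "F5", "F6"]
P1_AFTER = {"G1": {"F1"}, "F5": {"G1"}, "F6": {"G1"}}
G1_MEMBERS = ["F2", "F3", "F4"]
G1_AFTER = {"F3": {"F2"}, "F4": {"F2"}}


def waves(members, after):
    """Group members into successive batches whose After sets are satisfied."""
    sorter = TopologicalSorter(after)
    for name in members:
        sorter.add(name)  # make sure members without conditions are included
    sorter.prepare()      # raises graphlib.CycleError if the order is cyclic
    result = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        result.append(ready)
        sorter.done(*ready)
    return result


print(waves(P1_MEMBERS, P1_AFTER))  # [['F1'], ['G1'], ['F5', 'F6']]
print(waves(G1_MEMBERS, G1_AFTER))  # [['F2'], ['F3', 'F4']]
```

Running such a check when the user saves the orchestration would catch cyclic After conditions before any pipeline enters the Waiting Pool.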

First, project P1 is run, and all Flows and FlowGroups in P1 enter the Waiting Pool, see Fig. 8.

Second, the Condition Monitor pulls the pipelines in the Waiting Pool and the Completed Pool and finds all pipelines that satisfy their After condition. At this point F1 satisfies the condition, so F1 is started and added to the Started Pool, see Fig. 9.

Third, the Task Monitor pulls the pipelines in the Started Pool and checks whether each has finished executing. Once F1 has finished, it enters the Completed Pool, see Fig. 10.

Fourth, the Condition Monitor pulls pipelines G1, F5 and F6 from the Waiting Pool and F1 from the Completed Pool, and finds all pipelines that satisfy their After condition. At this point G1 satisfies the condition, so G1 is started and added to the Started Pool, see Fig. 11.

Fifth, the Task Monitor pulls the pipelines in the Started Pool and checks whether each has finished executing. Once G1 has finished, it enters the Completed Pool, see Fig. 12.

Sixth, the Condition Monitor pulls pipelines F5 and F6 from the Waiting Pool and pipelines F1 and G1 from the Completed Pool, and finds all pipelines that satisfy their After condition. At this point both F5 and F6 satisfy the condition, so F5 and F6 are started and added to the Started Pool, see Fig. 13.

Seventh, the Task Monitor pulls the pipelines in the Started Pool and checks whether each has finished executing. Once F5 and F6 have finished, they enter the Completed Pool, see Fig. 14.

Finally, the Waiting Pool is empty and the scheduling of P1 ends. G1 is executed with the same policy as P1, so its execution is not described again.
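Because G1 follows the same policy as P1, one way to picture the complete run is a scheduler that recurses into a FlowGroup whenever it starts one. The sketch below is a simplified sequential simulation: the `schedule` function, the `groups` mapping and the returned log are hypothetical, and members that become ready together are run one after another rather than concurrently.

```python
def schedule(members, after, groups, log=None):
    """Run one level; FlowGroups are scheduled recursively with the same policy."""
    log = [] if log is None else log
    waiting, completed = list(members), set()
    while waiting:
        # Members whose After set is already completed may start now.
        ready = [m for m in waiting if after.get(m, set()) <= completed]
        if not ready:
            raise ValueError("unsatisfiable After conditions")
        for name in ready:
            waiting.remove(name)
            if name in groups:
                # A FlowGroup completes once all of its own Flows have run.
                sub_members, sub_after = groups[name]
                schedule(sub_members, sub_after, groups, log)
            else:
                log.append(name)  # stand-in for executing the ETL task
            completed.add(name)
    return log


groups = {"G1": (["F2", "F3", "F4"], {"F3": {"F2"}, "F4": {"F2"}})}
trace = schedule(
    ["F1", "G1", "F5", "F6"],
    {"G1": {"F1"}, "F5": {"G1"}, "F6": {"G1"}},
    groups,
)
print(trace)  # ['F1', 'F2', 'F3', 'F4', 'F5', 'F6']
```

The trace shows F2 running before F3 and F4 inside G1, and F5 and F6 starting only after every Flow of G1 has finished.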

In the invention, besides the scheduling strategy based on the After configuration, other relations such as Before (representing a predecessor relation between pipelines) may also be used for scheduling.
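The text does not define the alternative relation precisely. Assuming that "X Before Y" means X must be executed before Y, a Before configuration can be converted into the equivalent After form and then scheduled unchanged. The helper name and the dictionary layout below are hypothetical.

```python
def before_to_after(before):
    """Convert {x: names that must run after x} into {y: names that run before y}."""
    after = {}
    for first, followers in before.items():
        for follower in followers:
            after.setdefault(follower, set()).add(first)
    return after


# Assumed reading: F1 Before G1, and G1 Before F5 and F6.
before = {"F1": {"G1"}, "G1": {"F5", "F6"}}
print(before_to_after(before))
```

Under this reading the two notations describe the same ordering, so the choice between them is a matter of which is more convenient for the user to write.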

The above embodiments are only intended to illustrate the technical solution of the present invention and not to limit the same, and a person skilled in the art can modify the technical solution of the present invention or substitute the same without departing from the spirit and scope of the present invention, and the scope of the present invention should be determined by the claims.

Claims

1. A scheduling method for big data ETL tasks, characterized by comprising the following steps:

carrying out hierarchical representation on a plurality of ETL tasks arranged by a user, organizing according to the hierarchy, and defining an execution sequence;

and scheduling and executing the ETL tasks according to the organized hierarchy and the defined sequence.

2. The method of claim 1, wherein the plurality of ETL tasks orchestrated by the user are organized in three levels: Project, FlowGroup and Flow; wherein a FlowGroup comprises a plurality of Flows and defines the execution order among the Flows; and a Project comprises a plurality of Flows and FlowGroups and defines their execution order.

3. The method of claim 1 or 2, wherein the hierarchical representation is extended by using a pipeline model description language ETLDL, and the extended model description language uses rectangular boxes to represent three models of "Sequence", "Choice" and "All" in the description of the XML Schema.

4. The method of claim 2, wherein Project, FlowGroup and Flow are scheduled on a timed basis using CRON.

5. The method of claim 2, wherein the scheduling is implemented based on state pools and a monitoring mechanism; Flows are placed into three resource pools according to their state, namely a Waiting Pool for Flows waiting to be scheduled, a Started Pool for running Flows and a Completed Pool for finished Flows; two monitors are set up at the same time, namely a Condition Monitor and a Task Monitor, wherein the Condition Monitor pulls the Flows in the Waiting Pool and the Completed Pool, identifies the Flows whose After condition is satisfied, starts them and places them in the Started Pool; and the Task Monitor pulls the Flows in the Started Pool and places each Flow whose execution has completed in the Completed Pool; the whole scheduling task ends when the Waiting Pool is empty.

6. A scheduling system for big-data ETL tasks, comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the program comprising instructions for performing the steps of the method of any of claims 1 to 5.

7. A computer-readable storage medium storing a computer program, characterized in that the computer program comprises instructions which, when executed by a processor of a system, cause the system to perform the steps of the method of any of claims 1 to 5.