CN110659116A - Big data ETL task scheduling method - Google Patents

Publication number
CN110659116A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910752971.4A
Other languages
Chinese (zh)
Inventor
朱小杰
沈志宏
杜一
周园春
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Computer Network Information Center of CAS
Original Assignee
Computer Network Information Center of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2019-08-15
Filing date
2019-08-15
Publication date
2020-01-07
Application filed by Computer Network Information Center of CAS
Priority to CN201910752971.4A
Publication of CN110659116A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 Task transfer initiation or dispatching
    • G06F9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/254 Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses

Abstract

The invention provides a scheduling method for big data ETL tasks, relating to the technical fields of big data, pipelines, and task scheduling. The method organizes a plurality of ETL tasks orchestrated by a user into hierarchical levels, then schedules and executes them according to that hierarchy and the defined order. Organizing, scheduling, and executing ETL tasks hierarchically helps reduce manual intervention and lower operation and maintenance costs.

Description

Big data ETL task scheduling method
Technical Field
The invention relates to the technical fields of big data, pipelines, and task scheduling, and in particular to a method for scheduling big data Extract-Transform-Load (ETL) tasks.
Background
In some complex scenarios a single pipeline cannot meet the requirement, and complex logic must be completed by several pipelines working together. The number of pipelines also grows as the actual business changes. Running the pipelines involves data acquisition, cleaning, aggregation, and similar processes, which must happen in a fixed order. As pipelines keep being added, grouping them and supporting scheduling both between pipelines and between pipeline groups can greatly reduce manual intervention and improve later operation and maintenance efficiency.
The published Chinese invention patent application "An automatic scheduling method for multi-level data conversion tasks" (application number 201610066935.9) proposes layering tasks by data level and assigning priorities for scheduling, but does not propose a scheduling strategy for ordering data conversion tasks within the same layer. "An ETL-based data task scheduling method and system" (application number 201710162185.X) divides a data task into four layers (get, dwsdata, dwddata, and dwmart) and completes a single ETL task covering data source acquisition, data storage, data cleaning, and data conversion, but does not describe the scheduling of multiple pipelines. "A big data ETL task scheduling method and device" (application number 201710570908.X) solves the data confusion caused by timed scheduling of multiple pipelines, but does not address complex scheduling among multiple pipelines. "A method and system for orchestrating big data ETL tasks" (application number 201910359658.4) proposes an orchestration method and system for a single pipeline, together with the pipeline model description language ETLDL, but does not describe multiple pipelines or scheduling among them. If pipelines are scheduled manually in a real deployment, labor costs rise and the error rate increases greatly.
Therefore, how to schedule multiple pipelines, both among pipelines and among pipeline groups, so as to reduce manual intervention and lower operation and maintenance costs is an urgent problem to be solved.
Disclosure of Invention
The invention aims to provide a scheduling method for big data ETL tasks that organizes and schedules the tasks hierarchically, which helps reduce manual intervention and lower operation and maintenance costs.
To achieve this purpose, the invention adopts the following technical solution:
A scheduling method for big data ETL tasks comprises the following steps:
1) organizing a plurality of ETL tasks orchestrated by a user into a hierarchy;
2) scheduling and executing the ETL tasks according to the organized hierarchy and order.
Further, the method organizes the ETL tasks into three levels: Project, FlowGroup, and Flow. A Flow is the basic unit of pipeline scheduling, namely one ETL task. Pipelines with similar functions in the same Project are organized into a FlowGroup, and a Project is the project-level set of FlowGroups and Flows.
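The three-level organization can be pictured as nested containers. The following is a minimal Python sketch, not the ETLDL representation used by the invention; the class layout, the `members` field, and the `all_flows` helper are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Flow:
    """Basic scheduling unit: one ETL pipeline."""
    name: str


@dataclass
class FlowGroup:
    """Flows with similar functions inside one Project."""
    name: str
    flows: List[Flow] = field(default_factory=list)


@dataclass
class Project:
    """Project-level set of FlowGroups and standalone Flows."""
    name: str
    members: List[Union[Flow, FlowGroup]] = field(default_factory=list)

    def all_flows(self) -> List[Flow]:
        """Flatten the hierarchy into the Flows it contains (hypothetical helper)."""
        result: List[Flow] = []
        for member in self.members:
            if isinstance(member, FlowGroup):
                result.extend(member.flows)
            else:
                result.append(member)
        return result


group = FlowGroup("G1", [Flow("F2"), Flow("F3")])
project = Project("P1", [Flow("F1"), group])
print([f.name for f in project.all_flows()])  # ['F1', 'F2', 'F3']
```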
Further, the method extends the pipeline model description language ETLDL to represent the hierarchy. The descriptions of Flow, FlowGroup, and Project are given in Fig. 1, Fig. 2, and Fig. 3, respectively. A FlowGroup contains a plurality of Flows and defines the execution order among them. A Project contains a plurality of Flows and FlowGroups and defines their execution order. In the XML Schema diagrams, the extended model description language uses rectangular boxes labeled "S", "C", and "A" to represent the three models "Sequence", "Choice", and "All", respectively.
Further, the method uses Condition to express the execution order between one Flow and another, and between a Flow and a FlowGroup. The order is described using Current and After: Current denotes the current Flow/FlowGroup, and After denotes the Flow/FlowGroup that must be executed before the current one, as shown in Fig. 4.
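A Condition can be read as a pair (Current, After), and a pipeline becomes eligible to start once every After target named for it has completed. The sketch below illustrates that reading in Python; the `Condition` class and the `is_ready` function are hypothetical names, and the pipeline names A, B, and C are invented for the example.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class Condition:
    current: str  # the Flow/FlowGroup being constrained
    after: str    # the Flow/FlowGroup that must finish first


def is_ready(name: str, conditions: Iterable[Condition], completed: Set[str]) -> bool:
    """True when every After target of `name` is already completed."""
    return all(c.after in completed for c in conditions if c.current == name)


conditions: List[Condition] = [
    Condition(current="B", after="A"),
    Condition(current="C", after="A"),
    Condition(current="C", after="B"),
]

print(is_ready("A", conditions, set()))       # True: no Condition names A as Current
print(is_ready("C", conditions, {"A"}))       # False: B has not finished
print(is_ready("C", conditions, {"A", "B"}))  # True
```

A pipeline with no Condition is treated as ready immediately, which matches how F1 is handled in the embodiment.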
Further, the method also supports timed scheduling of pipelines. Fig. 5 shows the model description of timed scheduling, in which Expression represents the timed scheduling policy and is written as a cron expression (the syntax of the UNIX cron scheduler). Entry is the entity to execute, which is one of Project, FlowGroup, or Flow.
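To make the Expression/Entry pairing concrete, the sketch below checks whether a five-field cron expression is due at a given time. It is a simplified illustration rather than a full cron implementation: the `Schedule` class and `field_matches` function are hypothetical, only `*`, single values, ranges, and comma lists are handled, and all five fields must match (standard cron treats day-of-month and day-of-week differently when both are restricted).

```python
from dataclasses import dataclass
from datetime import datetime


def field_matches(spec: str, value: int) -> bool:
    """Match one cron field; supports '*', single values, 'a-b', and comma lists."""
    for part in spec.split(","):
        if part == "*":
            return True
        if "-" in part:
            low, high = part.split("-")
            if int(low) <= value <= int(high):
                return True
        elif int(part) == value:
            return True
    return False


@dataclass
class Schedule:
    expression: str  # five fields: minute hour day-of-month month day-of-week
    entry: str       # name of the Project, FlowGroup, or Flow to trigger

    def is_due(self, now: datetime) -> bool:
        minute, hour, dom, month, dow = self.expression.split()
        cron_dow = (now.weekday() + 1) % 7  # cron counts Sunday as 0
        return (field_matches(minute, now.minute)
                and field_matches(hour, now.hour)
                and field_matches(dom, now.day)
                and field_matches(month, now.month)
                and field_matches(dow, cron_dow))


nightly = Schedule(expression="30 2 * * *", entry="P1")
print(nightly.is_due(datetime(2019, 8, 15, 2, 30)))  # True
print(nightly.is_due(datetime(2019, 8, 15, 3, 30)))  # False
```

A production system would more likely rely on an existing cron library or the operating system scheduler than on a hand-written matcher.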
Further, the method implements scheduling based on state pools and a monitoring mechanism, as shown in Fig. 6. Pipelines are placed into one of three resource pools according to their state: the Waiting Pool (pipelines waiting to be scheduled), the Started Pool (running pipelines), and the Completed Pool (finished pipelines). Two monitors are set up at the same time, the Condition Monitor and the Task Monitor. The Condition Monitor pulls the pipelines in the Waiting Pool and the Completed Pool, identifies the waiting pipelines whose After Condition is satisfied, starts them, and places them into the Started Pool. The Task Monitor pulls the pipelines in the Started Pool and places each one into the Completed Pool once its execution has finished. The whole scheduling task ends when the Waiting Pool is empty.
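The pool-and-monitor mechanism can be sketched as a single loop that alternates the two monitors. This is a simplified single-threaded illustration under stated assumptions: the function name `run_schedule`, the `after` mapping, and the `is_finished` callback are hypothetical, real monitors would likely run concurrently and poll with a delay, and the loop here also waits for the Started Pool to drain, which goes slightly beyond the stated end condition. The guard against unsatisfiable conditions is an addition not described in the source.

```python
from typing import Callable, Dict, List, Set


def run_schedule(
    after: Dict[str, Set[str]],
    is_finished: Callable[[str], bool],
) -> List[str]:
    """Drive pipelines through the Waiting, Started, and Completed pools.

    `after` maps each pipeline name to the names that must complete first.
    Returns the order in which pipelines were started.
    """
    waiting: Set[str] = set(after)
    started: Set[str] = set()
    completed: Set[str] = set()
    start_order: List[str] = []

    while waiting or started:
        # Condition Monitor: start every waiting pipeline whose After set is done.
        ready = sorted(p for p in waiting if after[p] <= completed)
        for p in ready:
            waiting.remove(p)
            started.add(p)
            start_order.append(p)

        # Task Monitor: move finished pipelines to the Completed Pool.
        done = {p for p in started if is_finished(p)}
        started -= done
        completed |= done

        # Nothing running and nothing startable: the conditions can never be met.
        if not ready and not done and not started:
            raise RuntimeError(f"unsatisfiable After conditions: {sorted(waiting)}")

    return start_order


order = run_schedule(
    {"extract": set(), "clean": {"extract"}, "load": {"clean"}},
    is_finished=lambda name: True,  # stand-in: every pipeline finishes immediately
)
print(order)  # ['extract', 'clean', 'load']
```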
The present invention also provides a scheduling system for big data ETL tasks, the system comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the program comprising instructions for performing the steps of the above method.
The present invention also provides a computer readable storage medium storing a computer program comprising instructions which, when executed by a processor of a system, cause the system to perform the steps of the above-described method.
The invention has the following beneficial effects:
The scheduling method for big data ETL tasks can organize pipelines in multiple levels and supports both sequential and timed scheduling of pipelines, which greatly reduces labor cost and improves later operation and maintenance efficiency.
Drawings
Fig. 1 is a schematic diagram of the Flow description based on the model description language ETLDL.
Fig. 2 is a schematic diagram of the FlowGroup description based on the model description language ETLDL.
Fig. 3 is a schematic diagram of the Project description based on the model description language ETLDL.
Fig. 4 is a schematic diagram of the Condition description based on the model description language ETLDL.
Fig. 5 is a schematic diagram of the timed scheduling description based on the model description language ETLDL.
Fig. 6 is a schematic diagram of pipeline scheduling.
Fig. 7 is a schematic diagram of the pipeline scheduling embodiment.
Fig. 8 is a schematic diagram of the initial state of the pipeline scheduling embodiment.
Fig. 9 is a schematic diagram of running state one of the pipeline scheduling embodiment.
Fig. 10 is a schematic diagram of running state two of the pipeline scheduling embodiment.
Fig. 11 is a schematic diagram of running state three of the pipeline scheduling embodiment.
Fig. 12 is a schematic diagram of running state four of the pipeline scheduling embodiment.
Fig. 13 is a schematic diagram of running state five of the pipeline scheduling embodiment.
Fig. 14 is a schematic diagram of the completed scheduling of the pipeline scheduling embodiment.
Detailed Description
To make the above and other features and advantages of the invention easier to understand, embodiments are described in detail below with reference to the figures.
This embodiment provides a scheduling method for big data ETL tasks. The specific steps are as follows:
1) The user orchestrates tasks as Projects, FlowGroups, and Flows according to requirements.
2) The method places the pipelines orchestrated by the user into three resource pools according to their state: the Waiting Pool (waiting to be scheduled), the Started Pool (running), and the Completed Pool (finished). Two monitors are set up at the same time, the Condition Monitor and the Task Monitor. The Condition Monitor pulls the pipelines in the Waiting Pool and the Completed Pool, identifies the waiting pipelines whose After Condition is satisfied, starts them, and places them into the Started Pool. The Task Monitor pulls the pipelines in the Started Pool and places each one into the Completed Pool once its task has completed. The whole scheduling task ends when the Waiting Pool is empty.
3) The method uses the well-known cron mechanism for timed scheduling of Projects, FlowGroups, and Flows.
The method is described below using a specific user requirement as an example. To meet the requirement, the user creates a project P1 in which six pipelines must be orchestrated: F1, F2, F3, F4, F5, and F6. F2, F3, and F4 are pipelines with similar functions and are placed in the same pipeline group G1. The pipelines in G1 have the execution order F3 After F2 and F4 After F2. The execution order within project P1 is G1 After F1, F5 After G1, and F6 After G1. See Fig. 7 for details.
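Before scheduling, it can be useful to confirm that a set of After relations admits some valid execution order. The sketch below encodes the example's relations as plain dictionaries and checks them with the standard library's `graphlib.TopologicalSorter` (Python 3.9 or later). This validation step and the `check_order` helper are assumptions added for illustration; the source does not describe such a check.

```python
from graphlib import CycleError, TopologicalSorter

# After relations from the example: each key must run after every name in its set.
g1_after = {"F2": set(), "F3": {"F2"}, "F4": {"F2"}}
p1_after = {"F1": set(), "G1": {"F1"}, "F5": {"G1"}, "F6": {"G1"}}


def check_order(after):
    """Return one valid execution order, or raise CycleError if none exists."""
    return list(TopologicalSorter(after).static_order())


p1_order = check_order(p1_after)
g1_order = check_order(g1_after)
print(p1_order)  # F1 first, then G1, then F5 and F6 in either order
print(g1_order)  # F2 first, then F3 and F4 in either order

# A circular pair of After relations is rejected.
try:
    check_order({"A": {"B"}, "B": {"A"}})
except CycleError:
    print("cycle detected")
```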
First, project P1 is run, and all Flows and FlowGroups in P1 enter the Waiting Pool, see Fig. 8.
Second, the Condition Monitor pulls the pipelines in the Waiting Pool and the Completed Pool and finds all pipelines that satisfy their After condition. At this point F1 satisfies the condition, so F1 is started and added to the Started Pool, see Fig. 9.
Third, the Task Monitor pulls the pipelines in the Started Pool and checks whether each has finished executing. Once F1 has finished, it enters the Completed Pool, see Fig. 10.
Fourth, the Condition Monitor pulls pipelines G1, F5, and F6 from the Waiting Pool and F1 from the Completed Pool, and finds all pipelines that satisfy their After condition. At this point G1 satisfies the condition, so G1 is started and added to the Started Pool, see Fig. 11.
Fifth, the Task Monitor pulls the pipelines in the Started Pool and checks whether each has finished executing. Once G1 has finished, it enters the Completed Pool, see Fig. 12.
Sixth, the Condition Monitor pulls pipelines F5 and F6 from the Waiting Pool and pipelines F1 and G1 from the Completed Pool, and finds all pipelines that satisfy their After condition. At this point both F5 and F6 satisfy the condition, so both are started and added to the Started Pool, see Fig. 13.
Seventh, the Task Monitor pulls the pipelines in the Started Pool and checks whether each has finished executing. Once F5 and F6 have finished, they enter the Completed Pool, see Fig. 14.
Finally, the Waiting Pool is empty and the scheduling of P1 ends. G1 is executed under the same policy as P1, so its execution is not described again.
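Because a FlowGroup is scheduled under the same policy as its Project, the walkthrough can be sketched as a recursive procedure. The following Python sketch is an illustration only: the `CONTAINERS` encoding and the `run` function are hypothetical, appending to a log stands in for launching a real pipeline, and pipelines that become ready together are run one after another here rather than started at the same time.

```python
from typing import Dict, List, Set

# Hypothetical encoding of the example: each container maps member -> After set.
CONTAINERS: Dict[str, Dict[str, Set[str]]] = {
    "P1": {"F1": set(), "G1": {"F1"}, "F5": {"G1"}, "F6": {"G1"}},
    "G1": {"F2": set(), "F3": {"F2"}, "F4": {"F2"}},
}


def run(name: str, log: List[str]) -> None:
    """Run a Flow directly, or schedule the members of a Project/FlowGroup."""
    if name not in CONTAINERS:
        log.append(name)  # a Flow: stand-in for launching the real pipeline
        return
    after = CONTAINERS[name]
    waiting: Set[str] = set(after)
    completed: Set[str] = set()
    while waiting:
        ready = sorted(m for m in waiting if after[m] <= completed)
        if not ready:
            raise RuntimeError(f"unsatisfiable After conditions in {name}")
        for member in ready:
            waiting.remove(member)
            run(member, log)  # a FlowGroup recurses with the same policy
            completed.add(member)


log: List[str] = []
run("P1", log)
print(log)  # ['F1', 'F2', 'F3', 'F4', 'F5', 'F6']
```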
In the invention, the scheduling strategy based on the After configuration may also be replaced by other configurations, such as Before (expressing the predecessor relation between pipelines).
The above embodiments are intended only to illustrate the technical solution of the invention, not to limit it. A person skilled in the art may modify the technical solution or substitute equivalents without departing from the spirit and scope of the invention, and the scope of protection shall be determined by the claims.

Claims (7)

1. A scheduling method for big data ETL tasks, characterized by comprising the following steps:
representing a plurality of ETL tasks orchestrated by a user hierarchically, organizing them according to the hierarchy, and defining an execution order;
scheduling and executing the ETL tasks according to the organized hierarchy and the defined order.
2. The method of claim 1, wherein the plurality of ETL tasks orchestrated by the user are organized in three levels: Project, FlowGroup, and Flow; wherein a FlowGroup contains a plurality of Flows and defines the execution order among them; and a Project contains a plurality of Flows and FlowGroups and defines their execution order.
3. The method of claim 1 or 2, wherein the hierarchical representation is obtained by extending the pipeline model description language ETLDL, and the extended model description language uses rectangular boxes to represent the three models "Sequence", "Choice", and "All" in the XML Schema description.
4. The method of claim 2, wherein timed scheduling of Project, FlowGroup, and Flow is performed using cron.
5. The method of claim 2, wherein scheduling is implemented based on state pools and a monitoring mechanism; Flows are placed into three resource pools according to their state, namely the Waiting Pool for Flows waiting to be scheduled, the Started Pool for running Flows, and the Completed Pool for finished Flows; two monitors are set up at the same time, namely a Condition Monitor and a Task Monitor; the Condition Monitor pulls the Flows in the Waiting Pool and the Completed Pool, identifies the Flows that satisfy the After Condition, starts them, and places them into the Started Pool; the Task Monitor pulls the Flows in the Started Pool and places each Flow whose execution has finished into the Completed Pool; and the whole scheduling task ends when the Waiting Pool is empty.
6. A scheduling system for big-data ETL tasks, comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the program comprising instructions for performing the steps of the method of any of claims 1 to 5.
7. A computer-readable storage medium storing a computer program, characterized in that the computer program comprises instructions which, when executed by a processor of a system, cause the system to perform the steps of the method of any of claims 1 to 5.
CN201910752971.4A 2019-08-15 2019-08-15 Big data ETL task scheduling method Pending CN110659116A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910752971.4A CN110659116A (en) 2019-08-15 2019-08-15 Big data ETL task scheduling method

Publications (1)

Publication Number Publication Date
CN110659116A true CN110659116A (en) 2020-01-07

Family

ID=69037500

Country Status (1)

Country Link
CN (1) CN110659116A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8387066B1 (en) * 2007-09-28 2013-02-26 Emc Corporation Dependency-based task management using set of preconditions to generate scheduling data structure in storage area network
CN104252381A (en) * 2013-06-30 2014-12-31 北京百度网讯科技有限公司 Method and equipment for scheduling ETL (Extraction-Transform-Loading) task
CN106371918A (en) * 2016-08-23 2017-02-01 北京云纵信息技术有限公司 Task cluster scheduling management method and apparatus
US20180150529A1 (en) * 2016-11-27 2018-05-31 Amazon Technologies, Inc. Event driven extract, transform, load (etl) processing
CN109997126A (en) * 2016-11-27 2019-07-09 亚马逊科技公司 Event-driven is extracted, transformation, loads (ETL) processing
CN107423122A (en) * 2017-07-25 2017-12-01 苏州博纳讯动软件有限公司 A kind of complicated O&M operation layout and scheduling system and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200107