CN112148442B - ETL flow scheduling method and device - Google Patents

ETL flow scheduling method and device

Publication number
CN112148442B
Authority
CN
China
Prior art keywords
scheduling
flow
time
delay queue
model object
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010782562.1A
Other languages
Chinese (zh)
Other versions
CN112148442A (en)
Inventor
梅纲
高东升
黄海明
陈琦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Dream Database Co ltd
Original Assignee
Wuhan Dream Database Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Dream Database Co ltd
Priority to CN202010782562.1A
Publication of CN112148442A
Application granted
Publication of CN112148442B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/254 Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 Task transfer initiation or dispatching
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 Task transfer initiation or dispatching
    • G06F9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Debugging And Monitoring (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to the field of data processing, and in particular to a method and a device for ETL flow scheduling. The method mainly comprises: creating a delay queue; packaging each scheduling task that must be executed at a scheduled time into a scheduling element and putting the scheduling element into the delay queue, wherein the scheduling element comprises a scheduling time, a flow model object and a scheduling model object; when the scheduling time of a scheduling element is reached, taking the scheduling element out of the delay queue and executing its flow model object; and, according to the scheduling model object type of the scheduling element, calculating the next scheduling time of the scheduling element at a preset time node and, if the next scheduling time is a valid time, setting the scheduling time of the scheduling element to the next scheduling time and putting the scheduling element back into the delay queue for the next scheduling. By means of the delay queue and the flow scheduling model object list, the invention manages the periodic scheduling of flows conveniently and effectively and ensures the timeliness and accuracy of flow scheduling.

Description

ETL flow scheduling method and device
Technical Field
The invention relates to the field of data processing, in particular to a method and a device for ETL flow scheduling.
Background
When ETL is used to process data and build a data warehouse, the specific flow for extracting data from heterogeneous sources, cleaning and converting it, and loading it is generally configured in a flow designer and is scheduled and executed periodically in the background on a server.
Existing ETL tools generally use the Timer mechanism of a programming language to manage the periodic scheduling of flows: a Timer times the interval of each flow model object that needs periodic scheduling and notifies the system to start executing the flow model object once the preset interval has elapsed. However, the basic processing model of Timer is a single-threaded task queue, so when one task runs for a long time, the timeliness of all other tasks is affected. Moreover, if a Timer task throws an exception, the whole Timer thread is cancelled and no task can be scheduled any more. In addition, the flow model objects in the task queue cannot be queried or analyzed, and the list of valid flow scheduling model objects cannot be adjusted dynamically for periodic scheduling.
In view of this, overcoming these defects of the prior art and avoiding the problems that arise when Timer is used for scheduling is a problem to be solved in this technical field.
Disclosure of Invention
In view of the above defects or improvement needs of the prior art, the invention addresses the incorrect scheduling intervals and insufficient scheduling timeliness that Timer-based scheduling may cause.
Furthermore, the invention addresses the following problems in ETL scheduling: the set of valid flow scheduling model objects must change in step with dynamic changes to flow model objects and scheduling model objects, flow scheduling model objects cannot be queried, and exceptions cannot be handled.
The embodiment of the invention adopts the following technical scheme:
In a first aspect, the invention provides a method for ETL flow scheduling, specifically: creating a delay queue; packaging each scheduling task that must be executed at a scheduled time into a scheduling element and putting the scheduling element into the delay queue, wherein the scheduling element comprises a scheduling time, a flow model object and a scheduling model object; when the scheduling time of a scheduling element is reached, taking the scheduling element out of the delay queue and executing its flow model object; and, according to the scheduling model object type of the scheduling element, calculating the next scheduling time of the scheduling element at a preset time node and, if the next scheduling time is a valid time, setting the scheduling time of the scheduling element to the next scheduling time and putting the scheduling element back into the delay queue for the next scheduling.
Preferably, calculating the next scheduling time of the scheduling element at the preset time node specifically includes: if the scheduling model object type of the scheduling element is a fixed time point or per interval time, calculating the next scheduling time with the current scheduling time as the reference; and if the scheduling model object type of the scheduling element is at intervals, that is, the flow model object is executed again a fixed interval after its previous execution completes, waiting for the execution of the flow model object to complete and then calculating the next scheduling time with the completion time as the reference.
Preferably, the method further comprises a flow scheduling model list and a previous flow scheduling model list; when the system scheduling tasks change, the corresponding scheduling elements in the delay queue are modified according to the flow scheduling model list and the previous flow scheduling model list; after the scheduling time of a scheduling element is set to the next scheduling time and before the scheduling element is put back into the delay queue, it is determined whether the flow scheduling model object corresponding to the scheduling element exists in the flow scheduling model list; if it exists, the scheduling element is put back into the delay queue; if it does not exist, the scheduling element is not put back into the delay queue.
Preferably, modifying the corresponding scheduling elements in the delay queue according to the flow scheduling model list and the previous flow scheduling model list specifically includes: refreshing the flow scheduling model list and the previous flow scheduling model list, wherein the flow scheduling model list stores the flow scheduling model objects after the system scheduling tasks change, and the previous flow scheduling model list stores the flow scheduling model objects before the system scheduling tasks change; comparing the two lists to find the flow scheduling model objects that differ; if a flow scheduling model object exists in the flow scheduling model list but not in the previous flow scheduling model list, adding the scheduling element corresponding to the flow scheduling model object to the delay queue; and if a flow scheduling model object does not exist in the flow scheduling model list but exists in the previous flow scheduling model list, deleting the scheduling element corresponding to the flow scheduling model object from the delay queue.
Preferably, the method further comprises a scheduling task acquisition thread; the scheduling task acquisition thread packages each scheduling task that must be executed at a scheduled time into a scheduling element, puts the scheduling element into the delay queue, and modifies the corresponding scheduling elements in the delay queue when the system scheduling tasks change.
Preferably, the method further comprises a delay queue read thread; the delay queue read thread accesses the delay queue in a loop-wait-read mode, and when the scheduling time of a scheduling element is reached, the scheduling element is taken out of the delay queue.
Preferably, after the delay queue reading thread takes out the scheduling element from the delay queue, the scheduling task corresponding to the scheduling element is executed in an asynchronous submission mode.
Preferably, the delay queue read thread further provides a scheduling element read interface through which scheduling elements can be read.
Preferably, the method further comprises an exception handling thread, wherein the exception handling thread obtains the scheduling state of the flow model object of each scheduling element through the scheduling element read interface and handles the scheduling elements whose flow model objects are in an abnormal state.
In another aspect, the invention provides an ETL flow scheduling device, which specifically comprises at least one processor and a memory connected through a data bus, wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the processor, carry out the ETL flow scheduling method provided in the first aspect.
Compared with the prior art, the embodiments of the invention have the following beneficial effects: scheduling tasks are held in an independent delay queue, so that flow execution and scheduling task acquisition are independent of each other, there is no need to wait for a flow model object to finish executing, and an overlong execution of one flow model object does not reduce the timeliness of other flows. The delay queue and the list of valid flow scheduling model objects together manage the periodic scheduling of flows conveniently and effectively and ensure the timeliness and accuracy of flow scheduling.
Further, the embodiments of the invention keep the scheduled flows synchronized in real time with the flows configured by the user by adjusting the scheduling elements in the delay queue in real time, allow the states of flow model objects and scheduling model objects to be queried by providing a flow model object query interface, and avoid scheduling errors caused by an abnormal delay queue thread by providing an exception handling interface.
Drawings
In order to more clearly illustrate the technical solution of the embodiments of the present invention, the drawings that are required to be used in the embodiments of the present invention will be briefly described below. It is evident that the drawings described below are only some embodiments of the present invention and that other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art.
Fig. 1 is a flowchart of a method for ETL flow scheduling according to an embodiment of the present invention;
Fig. 2 is a flowchart of another method for ETL flow scheduling according to an embodiment of the present invention;
Fig. 3 is a flowchart of another method for ETL flow scheduling according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of an apparatus for ETL flow scheduling according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The present invention is an architecture of a specific functional system, so that in a specific embodiment, functional logic relationships of each structural module are mainly described, and specific software and hardware implementations are not limited.
In addition, the technical features of the embodiments of the present invention described below may be combined with each other as long as they do not collide with each other. The invention will be described in detail below with reference to the drawings and examples.
Example 1:
When an ETL tool performs flow scheduling, periodically scheduled tasks must be timed, and when a preset scheduling time point is reached, the flow model object corresponding to the scheduling task is executed according to its scheduling model object. However, existing solutions that use Timer for timing are based on absolute time, do not catch exceptions, and schedule on a single thread, so a flow model object may fail to start executing at the correct time. This embodiment therefore provides a new scheduling manner that avoids the incorrect scheduling that may occur with Timer.
As shown in fig. 1, the method for ETL flow scheduling provided by the embodiment of the present invention specifically includes the following steps:
Step 101: A delay queue is created.
In current flow scheduling methods, the basic processing model of Timer is a single-threaded task queue: the Timer keeps receiving scheduled tasks, adds every task it receives to a task queue, and the Timer thread takes tasks from the queue according to their scheduling times and executes them. The disadvantage is that the Timer thread can process only one task at a time, so when one task runs long enough to pass the scheduling time of the next task in the queue, the subsequent tasks are not scheduled on time and their timeliness suffers. In the method of this embodiment, scheduling tasks are organized in a delay queue, which separates the storage of scheduling tasks from their execution and keeps the execution of a flow from interfering with the timing of scheduling tasks. In some implementation scenarios, the delay queue may be implemented with a mature library such as the DelayQueue of the JDK, which improves execution stability and simplifies resource management. An element can be taken from the JDK DelayQueue only after its delay has expired, so the class is commonly used for cache expiry and cleanup. In the method of this embodiment, the delay queue stores scheduling tasks, and a scheduling task can be taken out only once its scheduling time has been reached.
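The expiry behaviour described above can be illustrated with a minimal sketch against `java.util.concurrent.DelayQueue`. The element class `Expiring` and the delays are hypothetical and chosen only for the demonstration; they are not part of the described method.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

final class DelayQueueBasics {
    // Hypothetical element type, used only for this illustration.
    static final class Expiring implements Delayed {
        final String name;
        final long dueAtMillis; // absolute time at which the element becomes available

        Expiring(String name, long delayMillis) {
            this.name = name;
            this.dueAtMillis = System.currentTimeMillis() + delayMillis;
        }

        @Override
        public long getDelay(TimeUnit unit) {
            // Zero or negative means the element has expired and may be taken.
            return unit.convert(dueAtMillis - System.currentTimeMillis(), TimeUnit.MILLISECONDS);
        }

        @Override
        public int compareTo(Delayed other) {
            return Long.compare(getDelay(TimeUnit.MILLISECONDS), other.getDelay(TimeUnit.MILLISECONDS));
        }
    }

    // Returns what the queue hands out, in order.
    static List<String> releaseOrder() {
        DelayQueue<Expiring> queue = new DelayQueue<>();
        queue.put(new Expiring("later", 600));
        queue.put(new Expiring("sooner", 300));

        List<String> order = new ArrayList<>();
        // poll() does not block; it returns null while no element has expired.
        order.add(queue.poll() == null ? "nothing due yet" : "unexpected");
        try {
            order.add(queue.take().name); // blocks until the earliest element expires
            order.add(queue.take().name);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return order;
    }
}
```

Insertion order does not matter here: the queue releases elements strictly by expiry, which is the property the scheduling method relies on.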
In an actual usage scenario, ETL flow scheduling is performed by a scheduler. The ETL service system starts the scheduler when it starts, and the scheduler initializes the delay queue delayQueue. In scenarios that use the flow scheduling model list effectFlowScheduleList and the previous flow scheduling model list previousFlowScheduleList, both lists must be initialized as well.
Step 102: Each scheduling task that must be executed at a scheduled time is packaged into a scheduling element, and the scheduling element is put into the delay queue, wherein the scheduling element comprises a scheduling time, a flow model object and a scheduling model object.
To keep the attributes of a scheduling task in the delay queue, the scheduling time, the flow model object, the scheduling model object and other attributes are packaged into a scheduling element and placed in the delay queue, where later steps can use them. The flow model object is the specific ETL flow corresponding to the scheduling task. The scheduling model object type covers periodic scheduling types, such as per interval time or a valid date and time, as well as non-periodic one-time scheduling. The scheduling time is the time node at which execution of the scheduling task starts; when the scheduling model object type is per interval time or at intervals, the scheduling time is calculated from the relative interval between two schedulings.
In a specific usage scenario of this embodiment, the attribute startTime of the scheduling element ScheduledTask represents the scheduling time; flowBean is the flow model object, which holds the flow attributes; scheduleBean is the scheduling model object, which holds the detailed scheduling information. Scheduling elements are ordered in the delay queue by scheduling time, with the element whose scheduling time arrives soonest at the head of the queue. If the scheduler stops working, all scheduling elements in the delay queue must be removed and access to the delay queue stopped.
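A scheduling element of this shape might be sketched as follows, assuming the JDK `DelayQueue` is used. The names `ScheduledTask`, `startTime`, `flowBean` and `scheduleBean` follow the text; the `FlowBean` and `ScheduleBean` classes, their fields and the sample flow names are placeholders, since the text does not specify them.

```java
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

final class SchedulingElementSketch {
    // Placeholder model objects; their real attributes are not given in the text.
    static final class FlowBean {
        final String flowName;
        FlowBean(String flowName) { this.flowName = flowName; }
    }

    static final class ScheduleBean {
        final String type;
        ScheduleBean(String type) { this.type = type; }
    }

    static final class ScheduledTask implements Delayed {
        long startTime;                 // scheduling time, epoch milliseconds
        final FlowBean flowBean;        // flow model object
        final ScheduleBean scheduleBean; // scheduling model object

        ScheduledTask(long startTime, FlowBean flowBean, ScheduleBean scheduleBean) {
            this.startTime = startTime;
            this.flowBean = flowBean;
            this.scheduleBean = scheduleBean;
        }

        @Override
        public long getDelay(TimeUnit unit) {
            return unit.convert(startTime - System.currentTimeMillis(), TimeUnit.MILLISECONDS);
        }

        @Override
        public int compareTo(Delayed other) {
            if (other instanceof ScheduledTask) {
                // Earliest scheduling time sorts first, so it sits at the head of the queue.
                return Long.compare(startTime, ((ScheduledTask) other).startTime);
            }
            return Long.compare(getDelay(TimeUnit.MILLISECONDS), other.getDelay(TimeUnit.MILLISECONDS));
        }
    }

    // Returns the flow name at the head of the queue without removing it.
    static String headFlowName() {
        long now = System.currentTimeMillis();
        DelayQueue<ScheduledTask> delayQueue = new DelayQueue<>();
        delayQueue.put(new ScheduledTask(now + 60_000, new FlowBean("hourly-load"), new ScheduleBean("PER_INTERVAL")));
        delayQueue.put(new ScheduledTask(now + 1_000, new FlowBean("nightly-extract"), new ScheduleBean("FIXED_TIME_POINT")));
        // peek() returns the element that expires next, even if it has not expired yet.
        return delayQueue.peek().flowBean.flowName;
    }
}
```

In this sketch `startTime` should only be changed while the element is outside the queue, because the queue orders elements when they are inserted.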
Step 103: When the scheduling time of a scheduling element is reached, the scheduling element is taken out of the delay queue and its flow model object is executed.
Reaching the scheduling time of a scheduling element means that it is time for the element's flow model object to start executing. The element no longer needs to wait in the delay queue, so it is taken out of the delay queue and its flow model object is executed.
Step 104: According to the scheduling model object of the scheduling element, the next scheduling time of the scheduling element is calculated at a preset time node, and if the next scheduling time is a valid time, the scheduling time of the scheduling element is set to the next scheduling time and the scheduling element is put back into the delay queue for the next scheduling.
To schedule periodically a scheduling element whose scheduling model object is of a periodic type, the element must be put back into the delay queue after it is taken out, to wait for the next scheduling. Scheduling elements with different scheduling model objects calculate relative time differently, so a different preset time node is set for each kind of scheduling model object; at that node the new scheduling time of the element is calculated, the scheduling time stored in the element is updated, and the updated element is put back into the delay queue. For a scheduling element with a limited scheduling period, it must also be determined, before the element is put back, whether the next scheduling time falls within the valid scheduling period. If it does, the next scheduling time is a valid time and the element is put back into the delay queue for the next scheduling; if it does not, no further scheduling takes place and the element is not put back. An element put back into the delay queue is scheduled at its updated scheduling time. Non-periodic scheduling, by contrast, happens only once: the scheduling element is taken out, its flow model object is executed, and the element is not put back into the delay queue.
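The put-back decision in Step 104 could look roughly like the sketch below. The `Element` class, the `periodic` flag and the `validUntil` bound are assumptions made for illustration; the text only states that a valid scheduling period exists, not how it is represented.

```java
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

final class PutBackSketch {
    // Hypothetical scheduling element reduced to the fields this decision needs.
    static final class Element implements Delayed {
        long startTime;         // scheduling time, epoch milliseconds
        final boolean periodic; // false for one-time (non-periodic) scheduling
        final long validUntil;  // assumed end of the valid scheduling period

        Element(long startTime, boolean periodic, long validUntil) {
            this.startTime = startTime;
            this.periodic = periodic;
            this.validUntil = validUntil;
        }

        @Override
        public long getDelay(TimeUnit unit) {
            return unit.convert(startTime - System.currentTimeMillis(), TimeUnit.MILLISECONDS);
        }

        @Override
        public int compareTo(Delayed other) {
            return Long.compare(getDelay(TimeUnit.MILLISECONDS), other.getDelay(TimeUnit.MILLISECONDS));
        }
    }

    // Returns true if the element was put back for another scheduling.
    static boolean putBackIfValid(DelayQueue<Element> queue, Element element, long nextTime) {
        if (!element.periodic) {
            return false; // one-time scheduling: executed once, never put back
        }
        if (nextTime > element.validUntil) {
            return false; // next scheduling time is outside the valid period
        }
        element.startTime = nextTime; // the element is outside the queue here, so updating is safe
        queue.put(element);
        return true;
    }
}
```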
In a specific implementation scenario of this embodiment, the scheduling model objects of the periodic scheduling type mainly include: (1) fixed time point: scheduling takes place at a fixed absolute time point; (2) per interval time: the flow model object is scheduled at a fixed interval, so that the start times of two consecutive schedulings are a fixed period apart; (3) at intervals: the next scheduled flow task starts a fixed period after the previous scheduled flow task completes.
For different scheduling model objects, the embodiment modifies the scheduling time of the scheduling element according to different preset time nodes:
(1) If the scheduling model object of the scheduling element is a fixed time point or per interval time, the next scheduling time is calculated when the scheduling element is taken out of the delay queue, with the current scheduling time as the reference. For these two types the next scheduling time depends only on the time of the previous scheduling, so it can be calculated and updated as soon as the previous scheduling starts.
(2) If the scheduling model object of the scheduling element is at intervals, the method waits for the execution of the flow model object to complete and then calculates the next scheduling time with the completion time as the reference. For this type the next scheduling time depends on the time at which the previously scheduled flow task finished, so the method must wait for the flow task to finish, obtain the completion time, and then calculate and update the next scheduling time.
During ETL data processing, the system scheduling tasks may change dynamically. To ensure that scheduling is executed correctly, when the system scheduling tasks change, the corresponding scheduling elements in the delay queue must be adjusted accordingly. In particular embodiments, the adjustments include: adding newly created scheduling tasks to the delay queue, removing deleted scheduling tasks from the delay queue, and updating the corresponding scheduling element when one or more of its scheduling time, flow model object and scheduling model object change. Because dynamic changes are applied through the delay queue, only the data in the delay queue has to change, so changes to system scheduling tasks are simple to manage, and ETL data processing that is already executing is not affected. Specifically, changes to system scheduling tasks include ETL metadata import, flow deletion, scheduling deletion, flow modification, scheduling modification, flow scheduling configuration modification, and the like.
To facilitate management of the system scheduling tasks, the method further uses a flow scheduling model list effectFlowScheduleList and a previous flow scheduling model list previousFlowScheduleList; when the system scheduling tasks change, the corresponding scheduling elements in the delay queue are modified according to these two lists. When the flow scheduling model list is initialized, the valid flow model objects and scheduling model objects are obtained from the ETL metadata base, the corresponding flow scheduling model objects are generated, and they are placed in the flow scheduling model list. In step 104, after the scheduling time of a scheduling element is set to the next scheduling time and before the element is put back into the delay queue, it is also determined whether the element exists in the flow scheduling model list; if it does, the scheduling element is put back into the delay queue; if it does not, the scheduling element is not put back.
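The difference between the two reference points can be made concrete with a small sketch. The enum names and the `periodMillis` parameter are hypothetical; in particular, treating a fixed time point as a fixed recurrence period (for example 24 hours) is an assumption of this illustration, not something the text specifies.

```java
final class NextTimeSketch {
    // Hypothetical names for the three periodic types described above.
    enum ScheduleType { FIXED_TIME_POINT, PER_INTERVAL, AT_INTERVALS }

    // All times are epoch milliseconds; finishedAt is only used for AT_INTERVALS.
    static long nextStartTime(ScheduleType type, long currentStartTime, long finishedAt, long periodMillis) {
        switch (type) {
            case FIXED_TIME_POINT:
            case PER_INTERVAL:
                // Reference is the current scheduling time, so execution time does not shift the schedule.
                return currentStartTime + periodMillis;
            case AT_INTERVALS:
                // Reference is the completion time, so a slow run pushes the next start back.
                return finishedAt + periodMillis;
            default:
                throw new IllegalArgumentException("unknown schedule type: " + type);
        }
    }
}
```

For a run that starts at 1000 and finishes at 1700 with a period of 500, the per-interval type yields 1500 and the at-intervals type yields 2200.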
In a specific application scenario of this embodiment, the adjustment of the corresponding scheduling element in the delay queue according to the change of the system scheduling task, as shown in fig. 2, specifically includes the following steps:
Step 201: Refresh the flow scheduling model list and the previous flow scheduling model list, wherein the flow scheduling model list stores the flow scheduling model objects after the system scheduling tasks change, and the previous flow scheduling model list stores the flow scheduling model objects before the system scheduling tasks change.
To make changes to scheduling tasks easy to compare, the scheduling tasks before and after a change can be stored in separate lists, and the change is determined by comparing the entries of the two lists. In a specific implementation scenario of this embodiment, when the scheduling configuration changes, the previous flow scheduling model list is emptied, the entries of the flow scheduling model list are stored in the previous flow scheduling model list, the valid flow model objects and scheduling model objects are then obtained from the ETL metadata base, and the updated flow scheduling model objects are generated and stored in the flow scheduling model list. Both lists are refreshed when the ETL system starts and whenever the system scheduling tasks change.
Step 202: Compare the flow scheduling model list with the previous flow scheduling model list to find the flow scheduling model objects that differ.
Step 203: if the flow scheduling model object exists in the flow scheduling model list, but does not exist in the previous flow scheduling model list, adding a scheduling element corresponding to the flow scheduling model object into a delay queue.
The scheduling elements exist in the flow scheduling model list and do not exist in the previous flow scheduling model list, and the flow scheduling model object is newly added after the system scheduling task changes, so that the scheduling elements corresponding to the flow scheduling model object are added into the delay queue. Only when the flow scheduling model object corresponding to the scheduling element exists in the flow scheduling model list and the scheduling model object of the scheduling element is completely consistent with the scheduling model object of the flow scheduling model object, the scheduling element can be added into the delay queue again.
Step 204: if the scheduling element does not exist in the flow scheduling model list, but exists in the prior flow scheduling model list, the scheduling element is deleted from the delay queue.
The scheduling element exists in the previous flow scheduling model list and does not exist in the flow scheduling model list, which means that the scheduling element is removed after the system scheduling task changes, so that the scheduling element is deleted from the delay queue.
Furthermore, the contents of each scheduling element in the flow scheduling model list and the previous flow scheduling model list can be compared. If the contents differ, the scheduling element has been modified, and the corresponding scheduling element in the delay queue must be modified as well. In a specific implementation of this embodiment, if a flow scheduling model object exists in both the flow scheduling model list and the previous flow scheduling model list, but the date, time, or frequency of its scheduling model object has changed, the two flow scheduling model objects are different. In that case the previous scheduling element must be deleted from the delay queue, and the new, valid flow scheduling model object must be converted into a scheduling element and added to the delay queue.
Comparing the scheduling elements before and after a scheduling task change through the flow scheduling model list and the previous flow scheduling model list makes it possible to find exactly the scheduling elements that changed and to synchronize the delay queue incrementally. This reduces the changes made to the delay queue and avoids excessive impact on scheduling flows that are already executing. The periodic scheduling of a flow is determined jointly by the delay queue and the flow scheduling model list, which ensures that only currently valid scheduling elements are scheduled and avoids scheduling errors.
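The comparison of the two lists can be illustrated with a minimal Java sketch. It assumes, purely for illustration, that each flow scheduling model object can be reduced to an identifier plus a string describing its date, time, and frequency; the names `FlowSchedule`, `SyncPlan`, and `ScheduleDiff` are hypothetical and do not come from the patent.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical stand-in for a flow scheduling model object: an id plus its
// scheduling configuration (date, time, frequency) flattened into a string.
record FlowSchedule(String id, String config) {}

// Result of the comparison: what to add to and what to delete from the delay queue.
record SyncPlan(List<FlowSchedule> toAdd, List<FlowSchedule> toDelete) {}

final class ScheduleDiff {
    // Compares the current list with the previous one and returns only the
    // incremental changes; unchanged entries produce no queue operation.
    static SyncPlan diff(List<FlowSchedule> previous, List<FlowSchedule> current) {
        Map<String, FlowSchedule> prevById = new HashMap<>();
        for (FlowSchedule p : previous) {
            prevById.put(p.id(), p);
        }
        List<FlowSchedule> toAdd = new ArrayList<>();
        List<FlowSchedule> toDelete = new ArrayList<>();
        for (FlowSchedule c : current) {
            FlowSchedule old = prevById.remove(c.id());
            if (old == null) {
                toAdd.add(c);          // exists only in the current list: newly added
            } else if (!Objects.equals(old.config(), c.config())) {
                toDelete.add(old);     // modified: delete the stale element
                toAdd.add(c);          // and add the new, valid one
            }
        }
        // Entries left over exist only in the previous list (step 204).
        toDelete.addAll(prevById.values());
        return new SyncPlan(toAdd, toDelete);
    }
}
```

An entry whose configuration is unchanged appears in neither result list, which is what keeps the synchronization incremental.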
To isolate scheduling task acquisition from the other functions of the ETL tool, a separate scheduling task acquisition thread may also be created. The scheduling task acquisition thread encapsulates each scheduling task that needs to be scheduled at a fixed time into a scheduling element, puts the scheduling element into the delay queue, and carries out steps 201 to 204, in which the corresponding scheduling elements in the delay queue are adjusted according to changes in the system scheduling tasks. Because this preliminary processing of the scheduling elements and the delay queue runs in its own thread, it can execute in parallel with the other functions of the ETL tool, which improves the efficiency of acquiring scheduling tasks and applying dynamic changes, and improves the real-time performance of scheduling.
To facilitate management of the scheduling elements stored in the delay queue, a delay queue read thread may also be created. The delay queue read thread accesses the delay queue in a loop-wait-read mode and takes a scheduling element out of the delay queue when its scheduling time is reached. In the database implementation scenario, when the ETL tool starts, it reads the flow scheduling configuration information, initializes the delay queue, and starts the delay queue read thread, which accesses the delay queue in the loop-wait-read mode so that periodic flow scheduling remains timely.
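A minimal Java sketch of the loop-wait-read mode follows. The text does not name a concrete queue implementation; this sketch assumes `java.util.concurrent.DelayQueue`, and the classes `ScheduleElement` and `QueueReader` are hypothetical, with only the fields needed for ordering shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

// Hypothetical scheduling element: only the fields needed for ordering are shown.
final class ScheduleElement implements Delayed {
    final String flowName;          // stands in for the flow model object
    final long scheduleTimeMillis;  // absolute scheduling time

    ScheduleElement(String flowName, long scheduleTimeMillis) {
        this.flowName = flowName;
        this.scheduleTimeMillis = scheduleTimeMillis;
    }

    @Override
    public long getDelay(TimeUnit unit) {
        // Remaining time until the scheduling time; zero or negative means due.
        return unit.convert(scheduleTimeMillis - System.currentTimeMillis(),
                TimeUnit.MILLISECONDS);
    }

    @Override
    public int compareTo(Delayed other) {
        return Long.compare(getDelay(TimeUnit.MILLISECONDS),
                other.getDelay(TimeUnit.MILLISECONDS));
    }
}

final class QueueReader {
    // Waits until `count` elements have become due and returns their flow
    // names in the order in which the delay queue released them.
    static List<String> readDue(DelayQueue<ScheduleElement> queue, int count) {
        List<String> names = new ArrayList<>();
        try {
            for (int i = 0; i < count; i++) {
                // take() blocks until the head element's scheduling time is reached.
                names.add(queue.take().flowName);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // stop reading, keep the interrupt flag
        }
        return names;
    }
}
```

In a long-running tool the loop would run in a dedicated thread without a fixed `count`; the bound here only keeps the sketch easy to exercise.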
Furthermore, to improve the execution efficiency of scheduling tasks, after the delay queue read thread takes a scheduling element out of the delay queue, the scheduling task corresponding to that scheduling element is executed by asynchronous submission. The delay queue read thread keeps track of the scheduling tasks and performs the operations of steps 103 and 104, namely taking scheduling elements out of the delay queue, modifying them, and putting them back, to complete the scheduling of the scheduling tasks in the delay queue.
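The take, submit, modify, and put-back sequence can be sketched in Java as follows. This is an illustration under stated assumptions rather than the patented implementation: it assumes `java.util.concurrent.DelayQueue` and an `ExecutorService`, uses a fixed interval as the scheduling model, and computes the next scheduling time from the current scheduling time. The names `PeriodicElement` and `AsyncDispatcher` are hypothetical.

```java
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical scheduling element whose scheduling model is a fixed interval.
final class PeriodicElement implements Delayed {
    final Runnable flow;               // stands in for the flow model object
    final long intervalMillis;         // stands in for the scheduling model object
    volatile long scheduleTimeMillis;  // absolute scheduling time

    PeriodicElement(Runnable flow, long intervalMillis, long scheduleTimeMillis) {
        this.flow = flow;
        this.intervalMillis = intervalMillis;
        this.scheduleTimeMillis = scheduleTimeMillis;
    }

    @Override
    public long getDelay(TimeUnit unit) {
        return unit.convert(scheduleTimeMillis - System.currentTimeMillis(),
                TimeUnit.MILLISECONDS);
    }

    @Override
    public int compareTo(Delayed other) {
        return Long.compare(getDelay(TimeUnit.MILLISECONDS),
                other.getDelay(TimeUnit.MILLISECONDS));
    }
}

final class AsyncDispatcher {
    // Takes one due element, submits its flow asynchronously, then moves the
    // scheduling time forward by one interval and puts the element back.
    // Returns null if the calling thread was interrupted while waiting.
    static Future<?> dispatchOnce(DelayQueue<PeriodicElement> queue, ExecutorService workers) {
        PeriodicElement element;
        try {
            element = queue.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
        // The read thread does not wait for the flow to finish.
        Future<?> running = workers.submit(element.flow);
        // The element is outside the queue here, so changing its time is safe.
        element.scheduleTimeMillis += element.intervalMillis;
        queue.put(element);
        return running;
    }
}
```

Because the flow runs on a worker thread, a slow flow does not delay the read thread from taking the next due element.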
To facilitate management of the scheduling elements in the delay queue, the delay queue read thread may further include a scheduling element read interface, which reads the content and execution state of the scheduling elements stored in the delay queue. The content of the scheduling elements in the delay queue can be obtained conveniently through the scheduling element read interface, which solves the problem that scheduling tasks cannot be queried and analyzed in existing ETL tools.
To handle abnormalities in flow scheduling and avoid the scheduling errors they cause, an exception handling thread may be created. The exception handling thread obtains the scheduling state of the flow model object of a scheduling element through the scheduling element read interface and processes the scheduling element corresponding to a flow model object in which an abnormality has occurred, for example by rescheduling the abnormal flow model object or recalculating the next scheduling time. The exception handling thread solves the problem that the Timer in existing ETL tools cannot capture exceptions, handles abnormalities in the scheduling process in time, and ensures that scheduling proceeds normally.
In the flow scheduling method provided by this embodiment, scheduling tasks are organized and periodically scheduled through the delay queue; changes to the system scheduling tasks are managed dynamically through the flow scheduling model list and the previous flow scheduling model list; scheduling tasks are acquired and preliminarily processed by the scheduling task acquisition thread; and the delay queue and the scheduling elements stored in it are scheduled and managed by the delay queue read thread. This reduces the complexity of managing scheduling tasks and improves scheduling efficiency. The scheduling elements in the delay queue can be inspected through the scheduling element read interface, and exceptions are handled by the exception handling thread, which avoids the problems that exceptions cannot be captured and the state of scheduling tasks cannot be read when a Timer is used for scheduling.
Example 2:
Based on the method for ETL flow scheduling provided in embodiment 1, the method can be supplemented and adjusted for different specific application scenarios according to different usage requirements or actual conditions.
In ETL flow scheduling, one flow model object can be configured with one or more scheduling tasks, so that the same flow model object is executed at different scheduling times, for example executing a file load once and also executing the file load at every interval. One scheduling task may also be referenced by the configuration of one or more flow model objects, so that multiple flow model objects are executed sequentially or concurrently at the same scheduling time, for example executing preset ETL processes at a fixed time point. Multiple scheduling tasks may also be configured for multiple flow model objects, for example executing a preset ETL process once per day up to the expiration date and time. In step 102, when a scheduling task is encapsulated as a scheduling element, the scheduling task and the flow model object need to be encapsulated in one-to-one correspondence. In a specific implementation scenario, flow model object 1 and flow model object 2 both need to be executed under scheduling task 1 and scheduling task 2; the interval of scheduling task 1 is 3 hours, the interval of scheduling task 2 is 10 hours, and the scheduling model object of both tasks is of the every-interval type. In this implementation scenario, the scheduling elements {scheduling time, flow model object, scheduling model object} that need to be put into the delay queue are: {3 hours, flow model object 1, every interval}, {10 hours, flow model object 1, every interval}, {3 hours, flow model object 2, every interval}, {10 hours, flow model object 2, every interval}. Every combination of scheduling task and flow model object is placed in the delay queue to ensure that each scheduling task configured for each flow model object is executed.
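The one-to-one encapsulation amounts to enumerating every (scheduling task, flow model object) pair. A minimal Java sketch under that reading follows; the records `SchedulingTask` and `QueuedElement` and the class `ElementBuilder` are hypothetical names, and plain strings stand in for the real model objects.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical scheduling task: a scheduling time plus the scheduling model type.
record SchedulingTask(String schedulingTime, String schedulingModel) {}

// Hypothetical scheduling element: {scheduling time, flow model object, scheduling model object}.
record QueuedElement(String schedulingTime, String flowModelObject, String schedulingModel) {}

final class ElementBuilder {
    // Builds one scheduling element per (flow model object, scheduling task) pair.
    static List<QueuedElement> pair(List<SchedulingTask> tasks, List<String> flows) {
        List<QueuedElement> elements = new ArrayList<>();
        for (String flow : flows) {
            for (SchedulingTask task : tasks) {
                elements.add(new QueuedElement(
                        task.schedulingTime(), flow, task.schedulingModel()));
            }
        }
        return elements;
    }
}
```

With the two scheduling tasks and two flow model objects of the scenario above, `pair` yields the four scheduling elements listed in the text, in the same order.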
To monitor and record the execution of scheduling tasks, in some specific embodiments that use the scheduling element read interface of the flow scheduling method provided in embodiment 1, the execution status of the scheduling elements in the delay queue may be read through the scheduling element read interface, to determine whether the flow task in a scheduling element is currently waiting for scheduling, executing, or, for a non-periodic scheduling task, completed. The historical execution of the scheduling tasks may also be saved, for example in the form of log files. Monitoring and recording the execution of scheduling tasks through the scheduling element read interface allows the ETL tool to monitor and confirm that execution and helps prevent execution errors.
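One possible shape for such state tracking is sketched below in Java. The three states come from the text; everything else is an assumption made for illustration: the names `ExecutionState` and `ExecutionMonitor` are hypothetical, elements are identified by a string id, and an in-memory list stands in for the log file.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// The three states named in the text.
enum ExecutionState { WAITING, EXECUTING, COMPLETED }

final class ExecutionMonitor {
    // Latest state per scheduling element id; safe to read from other threads.
    private final Map<String, ExecutionState> states = new ConcurrentHashMap<>();
    // In-memory stand-in for the history that the text suggests writing to a log file.
    private final List<String> history = new CopyOnWriteArrayList<>();

    // Called by the scheduling threads whenever an element changes state.
    void record(String elementId, ExecutionState state) {
        states.put(elementId, state);
        history.add(elementId + " -> " + state);
    }

    // Read side: current state of an element, or null if it was never recorded.
    ExecutionState stateOf(String elementId) {
        return states.get(elementId);
    }

    // Read side: immutable snapshot of the recorded history.
    List<String> history() {
        return List.copyOf(history);
    }
}
```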
In order to facilitate special handling of the schedule element when an exception occurs, in some implementations, the exception handling thread provided in embodiment 1 may handle exceptions thrown by the flow model object of the schedule element. Specifically, if the thrown exception has a corresponding processing mode, processing is performed according to the corresponding processing mode; if the thrown exception is unknown, the corresponding thread is re-created to continue to execute the flow scheduling.
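The distinction between exceptions with a known processing mode and unknown exceptions can be sketched in Java as a small handler registry. The class `ExceptionDispatcher` and its methods are hypothetical; the fallback `Runnable` stands in for whatever re-creates the corresponding thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;

final class ExceptionDispatcher {
    // Known exception types mapped to their processing mode, in registration order.
    private final Map<Class<? extends Exception>, Consumer<Exception>> handlers =
            new LinkedHashMap<>();
    // Action for unknown exceptions, e.g. re-creating the corresponding thread.
    private final Runnable fallback;

    ExceptionDispatcher(Runnable fallback) {
        this.fallback = fallback;
    }

    void register(Class<? extends Exception> type, Consumer<Exception> handler) {
        handlers.put(type, handler);
    }

    // Returns true if a registered handler processed the exception,
    // false if it was treated as unknown and the fallback ran.
    boolean handle(Exception thrown) {
        for (Map.Entry<Class<? extends Exception>, Consumer<Exception>> entry
                : handlers.entrySet()) {
            if (entry.getKey().isInstance(thrown)) {
                entry.getValue().accept(thrown);
                return true;
            }
        }
        fallback.run();
        return false;
    }
}
```

Matching with `isInstance` means a handler registered for a supertype, such as `java.io.IOException`, also covers its subclasses.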
In some specific embodiments that use the scheduling element read interface of the flow scheduling method provided in embodiment 1, when an abnormality occurs in a scheduling flow execution thread, the execution status of the scheduling elements in the delay queue before the abnormality can be obtained from the records that the scheduling element read interface keeps on the execution of the flow model objects of the scheduling tasks, so that the execution state of the scheduling elements can be restored. Obtaining and restoring the execution records of the scheduling elements prevents a scheduling element from being taken out but never put back because of the abnormality, and further reduces the influence of abnormalities on flow scheduling.
As shown in fig. 3, after exception handling is added to steps 101 to 104, the steps are as follows:
Step 301: create a delay queue.
Step 302: encapsulate the scheduling tasks that need to be executed at a fixed time into scheduling elements and put the scheduling elements into the delay queue, where each scheduling element comprises a scheduling time, a flow model object, and a scheduling model object.
Step 303: when the scheduling time of a scheduling element is reached, take the scheduling element out of the delay queue and execute its flow model object.
Step 304: judge whether an abnormality occurred during execution of the scheduling element.
Step 305: if an abnormality occurred, record it and judge whether a corresponding processing mode exists.
Step 306: if the thrown exception has a corresponding processing mode, process it according to that processing mode.
Step 307: if the thrown exception is unknown, reschedule the abnormal scheduling element.
Step 308: if no abnormality occurred, or once exception handling has completed or the corresponding thread has been re-created, modify the scheduling time of the scheduling element at a preset time node according to the scheduling model object of the scheduling element, and put the modified scheduling element back into the delay queue for the next scheduling.
On the basis of embodiment 1, the solution provided in this embodiment offers a way to configure multiple scheduling tasks for multiple flow model objects, which extends the usage scenarios of the flow scheduling method. It also provides a method for monitoring and recording scheduling tasks and specific methods for exception handling, which further improve the accuracy and stability of the flow scheduling method during execution.
Example 3:
On the basis of the method for ETL flow scheduling provided in embodiments 1 to 2, the present invention further provides a device for implementing that method. Fig. 4 is a schematic diagram of the device architecture of an embodiment of the present invention. The device for ETL flow scheduling in this embodiment includes one or more processors 21 and a memory 22. In fig. 4, one processor 21 is taken as an example.
The processor 21 and the memory 22 may be connected by a bus or in another manner; in fig. 4, connection by a bus is taken as an example.
As a non-volatile computer-readable storage medium, the memory 22 can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as those of the ETL flow scheduling method in embodiments 1 to 2. By running the non-volatile software programs, instructions, and modules stored in the memory 22, the processor 21 performs the various functional applications and data processing of the ETL flow scheduling device, that is, implements the ETL flow scheduling method of embodiments 1 to 2.
The memory 22 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, memory 22 may optionally include memory located remotely from processor 21, which may be connected to processor 21 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The program instructions/modules are stored in the memory 22 and when executed by the one or more processors 21 perform the methods of ETL flow scheduling in embodiments 1-2 described above, for example, performing the steps shown in fig. 1-3 described above.
Those of ordinary skill in the art will appreciate that all or some of the steps in the various methods of the embodiments may be implemented by a program that instructs the associated hardware. The program may be stored on a computer-readable storage medium, and the storage medium may include: read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disk, and the like.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, and alternatives falling within the spirit and principles of the invention.

Claims (10)

1. An ETL flow scheduling method, characterized by comprising the following steps:
creating a delay queue;
packaging scheduling tasks to be executed at fixed time into scheduling elements, and putting the scheduling elements into a delay queue, wherein the scheduling elements comprise scheduling time, a flow model object and a scheduling model object;
when the scheduling time of the scheduling element is reached, the scheduling element is taken out from the delay queue, and the flow model object of the scheduling element is executed;
acquiring a scheduling time calculation reference of next scheduling according to the scheduling model object type of the scheduling element, calculating the scheduling time of the scheduling element at a preset time node, and if the scheduling time of the next scheduling is effective time, modifying the scheduling time of the scheduling element into the next scheduling time and putting back the scheduling time into a delay queue so as to facilitate the next scheduling;
when the system scheduling task changes, corresponding scheduling elements in the delay queue are modified according to the flow scheduling model list and the previous flow scheduling model list, and then the scheduling time of the scheduling elements is modified to be the next scheduling time and is put back into the delay queue.
2. The method for ETL process scheduling according to claim 1, wherein the obtaining a scheduling time calculation reference for the next scheduling according to the scheduling model object type of the scheduling element specifically includes:
if the scheduling model object type of the scheduling element is a fixed time point or each interval time, calculating the scheduling time of the next scheduling by taking the current scheduling time as a reference;
and if the scheduling model object type of the scheduling element is that the flow model object is executed at intervals, waiting for the execution of the flow model object to complete, and when the execution of the flow model object is completed, calculating the next scheduling time by taking the time at which the execution of the flow model object completed as a reference.
3. The method of ETL process scheduling of claim 1, wherein:
modifying the scheduling time of the scheduling element into the next scheduling time and putting the scheduling element back into the delay queue further comprises the following steps:
judging whether the scheduling element to be put back exists in the flow scheduling model list;
if the flow scheduling model object corresponding to the scheduling element exists in the flow scheduling model list, the scheduling element is put back into the delay queue;
if the flow scheduling model object corresponding to the scheduling element does not exist in the flow scheduling model list, the scheduling element is not put back into the delay queue.
4. The ETL process scheduling method according to claim 1, wherein the modifying the corresponding scheduling element in the delay queue according to the process scheduling model list and the previous process scheduling model list specifically includes:
refreshing a flow scheduling model list and a previous flow scheduling model list, wherein the flow scheduling model list stores flow scheduling model objects after the system scheduling tasks are changed, and the previous flow scheduling model list stores flow scheduling models before the system scheduling tasks are changed;
comparing different flow scheduling model objects in the flow scheduling model list and the previous flow scheduling model list;
if the flow scheduling model object exists in the flow scheduling model list but does not exist in the previous flow scheduling model list, adding a scheduling element corresponding to the flow scheduling model object into a delay queue;
if the flow scheduling model object does not exist in the flow scheduling model list but exists in the previous flow scheduling model list, deleting the scheduling element corresponding to the flow scheduling model object from the delay queue.
5. The method of ETL process scheduling of claim 4, wherein:
the method also comprises a scheduling task acquisition thread;
the scheduling task obtaining thread encapsulates the scheduling task needing to be scheduled at regular time into a scheduling element, puts the scheduling element into a delay queue, and completes the method for modifying the corresponding scheduling element in the delay queue according to the change of the scheduling task of the system as in claim 4.
6. The ETL process scheduling method according to claim 1, wherein:
the method also comprises a delay queue reading thread;
the delay queue reading thread accesses the delay queue in a cyclic waiting reading mode, and takes out the scheduling element from the delay queue when the scheduling time of the scheduling element is reached.
7. The ETL process scheduling method of claim 6, wherein:
and after the delay queue reading thread takes out the scheduling element from the delay queue, executing the scheduling task corresponding to the scheduling element in an asynchronous submission mode.
8. The ETL process scheduling method of claim 7, wherein:
the delay queue read thread also includes a schedule element read interface to facilitate reading of the schedule element.
9. The ETL process scheduling method of claim 7, wherein:
the method also comprises an exception handling thread, wherein the exception handling thread obtains the scheduling state of the flow model object of the scheduling element through the scheduling element reading interface and processes the scheduling element corresponding to the flow model object in which the exception occurred.
10. An ETL process scheduling apparatus, wherein:
the apparatus comprises at least one processor and a memory, the at least one processor and the memory being connected by a data bus, the memory storing instructions executable by the at least one processor, and the instructions, when executed by the processor, performing the ETL flow scheduling method according to any one of claims 1-9.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010782562.1A CN112148442B (en) 2020-08-06 2020-08-06 ETL flow scheduling method and device

Publications (2)

Publication Number Publication Date
CN112148442A CN112148442A (en) 2020-12-29
CN112148442B true CN112148442B (en) 2023-07-21

Family

ID=73888424

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010782562.1A Active CN112148442B (en) 2020-08-06 2020-08-06 ETL flow scheduling method and device

Country Status (1)

Country Link
CN (1) CN112148442B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1318453A1 (en) * 2001-12-07 2003-06-11 Hewlett-Packard Company Scheduling system, method and apparatus for a cluster
CN108304257A (en) * 2018-02-09 2018-07-20 中国船舶重工集团公司第七六研究所 Hard real time hybrid tasks scheduling method based on Delay Service device

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8954985B2 (en) * 2012-06-05 2015-02-10 International Business Machines Corporation Dependency management in task scheduling
US9396018B2 (en) * 2014-10-10 2016-07-19 Salesforce.Com, Inc. Low latency architecture with directory service for integration of transactional data system with analytical data structures
CN106020951A (en) * 2016-05-12 2016-10-12 中国农业银行股份有限公司 Task scheduling method and system
CN108733462A (en) * 2017-04-18 2018-11-02 北京京东尚科信息技术有限公司 The method and apparatus of delay task
CN110119323A (en) * 2019-05-13 2019-08-13 重庆八戒电子商务有限公司 It is a kind of to take turns the method and system for executing delay queue based on the time
CN110515709B (en) * 2019-07-25 2022-06-10 北京达佳互联信息技术有限公司 Task scheduling system, method, device, electronic equipment and storage medium
CN111159268B (en) * 2019-12-19 2022-01-04 武汉达梦数据库股份有限公司 Method and device for running ETL (extract-transform-load) process in Spark cluster

Also Published As

Publication number Publication date
CN112148442A (en) 2020-12-29

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 430000 16-19 / F, building C3, future technology building, 999 Gaoxin Avenue, Donghu New Technology Development Zone, Wuhan, Hubei Province

Applicant after: Wuhan dream database Co.,Ltd.

Address before: 430000 16-19 / F, building C3, future technology building, 999 Gaoxin Avenue, Donghu New Technology Development Zone, Wuhan, Hubei Province

Applicant before: WUHAN DAMENG DATABASE Co.,Ltd.

CB03 Change of inventor or designer information

Inventor after: Mei Gang

Inventor after: Gao Dongsheng

Inventor after: Huang Haiming

Inventor after: Chen Qi

Inventor before: Fu Quan

Inventor before: Mei Gang

Inventor before: Gao Dongsheng

GR01 Patent grant