CN110798339A - Task disaster tolerance method based on distributed task scheduling framework

Info

Publication number: CN110798339A
Application number: CN201910954331.1A
Authority: CN
Inventors: 陈佳佳; 赵京虎; 孙云枫; 季学纯; 马德超; 李�昊; 赵宇; 闫妮
Original assignee: Nari Technology Co Ltd; NARI Nanjing Control System Co Ltd
Current assignee: Nari Technology Co Ltd; NARI Nanjing Control System Co Ltd
Priority date: 2019-10-09
Filing date: 2019-10-09
Publication date: 2020-02-14

Abstract

The invention discloses a task disaster tolerance method based on a distributed task scheduling framework, comprising the following steps: first, a task scheduling center is initialized, and a daemon thread is started during initialization to monitor the heartbeat state of the executors; second, a user registers task information through the task scheduling center; third, the scheduling center submits scheduling requests on schedule according to the Cron configuration of each task; fourth, an executor receives and runs the scheduling request submitted by the scheduling center; fifth, if the daemon thread detects that an executor has failed while executing a task, it determines whether that executor has a task in the running state and, if so, updates the running state of the task and triggers the task to be rescheduled to run on an online executor; and sixth, the task execution completes and a scheduling result is returned. The invention solves the problem that existing distributed task scheduling frameworks cannot automatically recover tasks in disaster tolerance scenarios.

Description

Task disaster tolerance method based on distributed task scheduling framework

Technical Field

The invention belongs to the technical field of big data, and particularly relates to a task disaster tolerance method based on a distributed task scheduling framework.

Background

In an enterprise-level big data platform, business-related tasks that must be scheduled to run periodically are ubiquitous. Such tasks are scheduled, run, and terminated automatically according to a time rule. Examples include periodically updating sampled data, running a spreadsheet task at a fixed time each morning, and generating a database report each month. For these scenarios, a number of open-source distributed task scheduling frameworks exist in the industry, such as LTS, XXL-JOB, and Elastic-JOB. These frameworks scale and extend well, provide a user-friendly operation and maintenance management interface, and support dynamic CRUD operations on tasks, which makes them a good choice for task scheduling on an enterprise-level big data platform.

XXL-JOB is a lightweight, easily extensible distributed task scheduling framework that is simple to operate and convenient to use, and it is currently a popular open-source choice. Its task disaster tolerance features are as follows: task scheduling is adjusted dynamically according to which executors are online, so that tasks are not scheduled to a faulty executor; and when the executor running a scheduled task fails, the task management interface provides an 'end task' button that can be clicked manually to trigger the task to be rescheduled and executed. XXL-JOB therefore provides task disaster tolerance to a certain extent, but only with manual intervention by operation and maintenance personnel.

Although some good distributed task scheduling frameworks exist, the following problem commonly arises in production environments: when a distributed task executor node goes offline because of a fault, or is restarted, any task that the scheduling center dispatched to that node and that is in the running state hangs and cannot resume automatically. Existing distributed task scheduling frameworks do not solve automatic task recovery in disaster tolerance scenarios well, yet the reliability of task execution is an important criterion when selecting a distributed task scheduling system in industries such as power grids, banking, and insurance. No effective solution to this problem has been proposed so far.

Disclosure of Invention

The invention aims to provide a task disaster tolerance method based on a distributed task scheduling framework that addresses the problems in the prior art, namely that a running task hangs and cannot be recovered automatically in a disaster tolerance scenario.

In order to achieve the purpose, the invention adopts the technical scheme that:

A task disaster tolerance method based on a distributed task scheduling framework comprises the following steps:

S1, deploying a plurality of executors, each of which is in communication connection with a scheduling center;

S2, registering task information through the scheduling center, and submitting a scheduling request to an executor based on the Cron configuration of the task;

S3, the executor receives and runs the scheduling request submitted by the scheduling center;

S4, monitoring the heartbeat states of the plurality of executors through the scheduling center;

S5, when a fault of an executor is detected while it is executing a task, confirming whether the executor has a task in the running state; if so, updating the running state of the task and triggering the task to be rescheduled to run on an online executor; if not, refreshing the online state of the executor;

S6, the executor completes the scheduled task and returns the scheduling result.

Specifically, in step S4, the heartbeat state of the executors is monitored by a daemon thread, which is started while the task scheduling center is initialized. The daemon thread monitors the heartbeat state as follows: once every heartbeat cycle, it queries the executor information registry in the database and judges from the update_time field of the registry whether an executor has gone offline; if the update_time field of an executor in the registry has not been updated for more than 3 heartbeat cycles, the executor is considered to be offline.
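The timeout rule above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the implementation of the invention: the executor information registry is modelled as an in-memory mapping from executor address to its update_time, and the names find_offline_executors, HEARTBEAT_PERIOD, and OFFLINE_AFTER_CYCLES are hypothetical. The 30 s value is the heartbeat cycle stated in the text.

```python
from datetime import datetime, timedelta

HEARTBEAT_PERIOD = timedelta(seconds=30)  # heartbeat cycle stated in the text
OFFLINE_AFTER_CYCLES = 3                  # timeout threshold stated in the text


def find_offline_executors(registry, now):
    """Return the addresses whose update_time is older than 3 heartbeat cycles.

    `registry` maps an executor address to its last update_time. It stands in
    for the executor information registry table (hypothetical shape).
    """
    deadline = now - OFFLINE_AFTER_CYCLES * HEARTBEAT_PERIOD
    return sorted(
        address
        for address, update_time in registry.items()
        if update_time < deadline
    )


now = datetime(2019, 10, 9, 12, 0, 0)
registry = {
    "executor-a": now - timedelta(seconds=20),  # fresh heartbeat
    "executor-b": now - timedelta(seconds=90),  # exactly 3 cycles, still online
    "executor-c": now - timedelta(seconds=91),  # more than 3 cycles, offline
}
offline = find_offline_executors(registry, now)
```

In a running scheduling center, a daemon thread would call such a function once per heartbeat cycle with rows read from the database.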

Specifically, in step S5, the faults that can occur in an executor include a disconnection fault and a restart caused by a fault.

Further, when an executor has a disconnection fault, the daemon thread queries the scheduling log information table in the database to determine whether a task in the running state exists on the executor. If so, the running state of the task is updated to failed; the scheduling center then performs retry scheduling according to the retry count configured for the task and triggers the task to be rescheduled to run on an online executor. If not, the executor information registry is refreshed.
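The disconnection branch can be sketched in Python as below. The data shapes are assumptions made for illustration: the scheduling log information table is a list of dictionaries with hypothetical keys log_id, executor, and status, and "refreshing the registry" is interpreted here as removing the stale executor entry, which the text does not spell out.

```python
def handle_offline_executor(address, schedule_log, registry):
    """Handle a disconnected executor and return the log ids marked as failed.

    If the executor has running tasks, their status is set to FAILED so that
    the retry mechanism can pick them up. Otherwise the registry is refreshed,
    interpreted here (an assumption) as dropping the stale executor entry.
    """
    running = [
        entry
        for entry in schedule_log
        if entry["executor"] == address and entry["status"] == "RUNNING"
    ]
    if running:
        for entry in running:
            entry["status"] = "FAILED"
    else:
        registry.pop(address, None)
    return [entry["log_id"] for entry in running]


schedule_log = [
    {"log_id": 1, "executor": "executor-a", "status": "RUNNING"},
    {"log_id": 2, "executor": "executor-a", "status": "SUCCESS"},
    {"log_id": 3, "executor": "executor-b", "status": "RUNNING"},
]
registry = {"executor-a": 1000, "executor-b": 1090, "executor-c": 900}

failed_ids = handle_offline_executor("executor-a", schedule_log, registry)
idle_ids = handle_offline_executor("executor-c", schedule_log, registry)
```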

Furthermore, when an executor restarts because of a fault, a rescheduling service interface of the scheduling center is called, and this interface judges whether the restart time of the executor exceeds 3 heartbeat cycles. If it does, the fault is handled as a disconnection fault. If it does not, the interface queries the scheduling log in the database to confirm whether a task in the running state exists on the executor. If so, the running state of the task is updated to failed; the scheduling center then performs retry scheduling according to the retry count configured for the task and triggers the task to be rescheduled to run on an online executor. If not, the executor information registry is refreshed.
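A minimal Python sketch of the restart branch follows. It assumes that the caller supplies the restart time as a number of seconds and that the scheduling log is a list of dictionaries; the function name reschedule_after_restart, the return labels, and the field names are hypothetical and are not taken from any real framework.

```python
HEARTBEAT_SECONDS = 30  # heartbeat cycle stated in the text
MAX_CYCLES = 3          # threshold stated in the text


def reschedule_after_restart(address, restart_seconds, schedule_log):
    """Decide how a restarted executor is handled.

    Returns a (path, failed_log_ids) pair. A restart longer than 3 heartbeat
    cycles is left to the disconnection handling and nothing is changed here.
    Otherwise running tasks on the executor are marked FAILED for retry.
    """
    if restart_seconds > MAX_CYCLES * HEARTBEAT_SECONDS:
        return "OFFLINE", []
    failed = []
    for entry in schedule_log:
        if entry["executor"] == address and entry["status"] == "RUNNING":
            entry["status"] = "FAILED"
            failed.append(entry["log_id"])
    return "RESTART", failed


quick_log = [{"log_id": 7, "executor": "executor-a", "status": "RUNNING"}]
slow_log = [{"log_id": 8, "executor": "executor-a", "status": "RUNNING"}]

quick = reschedule_after_restart("executor-a", 45, quick_log)
slow = reschedule_after_restart("executor-a", 120, slow_log)
```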

Further, performing retry scheduling according to the retry count configured for the task specifically includes: a daemon thread running in the background of the scheduling center periodically polls the scheduling log information table for each task in the task monitoring queue; if a monitored task is in the failed state and its retry count is greater than 0, the retry count is decremented by one and the task is resubmitted to the task scheduling center to be scheduled and run.
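One polling pass of this retry rule might look like the sketch below. The queue entries, the retries_left and status fields, the PENDING label, and the submit callback are all hypothetical stand-ins for the task monitoring queue and the resubmission call.

```python
def poll_failed_tasks(monitor_queue, submit):
    """Run one polling pass over the task monitoring queue.

    A task in the FAILED state with retries left has its retry count
    decremented by one and is handed back to the scheduler via `submit`.
    """
    for task in monitor_queue:
        if task["status"] == "FAILED" and task["retries_left"] > 0:
            task["retries_left"] -= 1
            task["status"] = "PENDING"  # hypothetical "waiting to be scheduled" state
            submit(task)


queue = [
    {"name": "daily_report", "status": "FAILED", "retries_left": 2},
    {"name": "sample_refresh", "status": "FAILED", "retries_left": 0},
    {"name": "monthly_report", "status": "RUNNING", "retries_left": 3},
]
submitted = []
poll_failed_tasks(queue, submitted.append)
resubmitted_names = [task["name"] for task in submitted]
```

A task whose retry count has reached 0 stays in the failed state, so the number of automatic reschedules is bounded by the configured retry count.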

In particular, the heartbeat cycle is 30 s.

Specifically, the scheduling center and the executors perform information registration and discovery (i.e., service registration and discovery) through a database (DB).
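Database-based registration and discovery can be illustrated with Python's built-in sqlite3 module. The table name executor_registry, its columns, and the helper names are assumptions chosen for this sketch; only the update_time field and the 3-cycle threshold (90 s with a 30 s heartbeat) come from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE executor_registry ("
    "address TEXT PRIMARY KEY, update_time INTEGER NOT NULL)"
)


def heartbeat(conn, address, now):
    """Executor side: register, or refresh update_time if already registered."""
    conn.execute(
        "INSERT OR REPLACE INTO executor_registry (address, update_time) "
        "VALUES (?, ?)",
        (address, now),
    )


def discover_online(conn, now, max_age=90):
    """Scheduling center side: list executors updated within 3 heartbeat cycles."""
    rows = conn.execute(
        "SELECT address FROM executor_registry "
        "WHERE update_time >= ? ORDER BY address",
        (now - max_age,),
    )
    return [row[0] for row in rows]


heartbeat(conn, "executor-a", 0)
heartbeat(conn, "executor-b", 100)
online = discover_online(conn, now=120)
```

Here timestamps are plain integers in seconds to keep the example deterministic; executor-a is not discovered because its last update is 120 s old.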

Corresponding to the task disaster tolerance method, the invention also provides a task disaster tolerance system based on the distributed task scheduling framework. The system comprises a scheduling center and a plurality of executors, which perform information registration and discovery through a database (DB). The scheduling center registers task information and submits scheduling requests to an executor based on the Cron configuration of each task; the executor receives and runs the scheduling requests. The scheduling center judges whether an executor has a fault by monitoring its heartbeat state. When it detects that an executor has failed while executing a task, it confirms whether a task in the running state exists on that executor; if so, it updates the running state of the task and triggers the task to be rescheduled to run on an online executor; if not, it refreshes the online state of the executor.

Specifically, the scheduling center monitors the heartbeat state of the executors through a daemon thread, which is started while the scheduling center is initialized. The daemon thread monitors the heartbeat state as follows: once every heartbeat cycle, it queries the executor information registry in the database and judges from the update_time field of the registry whether an executor has gone offline; if the update_time field of an executor in the registry has not been updated for more than 3 heartbeat cycles, the executor is considered to be offline.

Specifically, the faults that can occur in an executor include a disconnection fault and a restart caused by a fault.

When an executor has a disconnection fault, the daemon thread queries the scheduling log information table in the database to determine whether a task in the running state exists on the executor. If so, the running state of the task is updated to failed; the scheduling center then performs retry scheduling according to the retry count configured for the task and triggers the task to be rescheduled to run on an online executor. If not, the executor information registry is refreshed.

When an executor restarts because of a fault, a rescheduling service interface of the scheduling center is called, and this interface judges whether the restart time of the executor exceeds 3 heartbeat cycles. If it does, the fault is handled as a disconnection fault. If it does not, the interface queries the scheduling log in the database to confirm whether a task in the running state exists on the executor. If so, the running state of the task is updated to failed; the scheduling center then performs retry scheduling according to the retry count configured for the task and triggers the task to be rescheduled to run on an online executor. If not, the executor information registry is refreshed.

In particular, the heartbeat cycle is 30 s.

Specifically, the plurality of executors and the scheduling center perform information registration and discovery through a database (DB).

Compared with the prior art, the invention has the following beneficial effects: it responds to and handles various types of faults in a distributed task scheduling system in a timely manner; when an executor has a disconnection fault or restarts because of a fault, the task being executed on that executor is automatically recovered and executed again without manual intervention; the method thereby ensures high availability and high reliability of the distributed task scheduling system and solves the problem that existing distributed task scheduling frameworks cannot automatically recover tasks in disaster tolerance scenarios.

Drawings

FIG. 1 is a schematic flow chart of a task disaster recovery method based on a distributed task scheduling framework according to the present invention;

FIG. 2 is a detailed flowchart of a task disaster recovery method based on a distributed task scheduling framework according to an embodiment of the present invention;

FIG. 3 is a system architecture diagram for the task disaster tolerance method based on a distributed task scheduling framework according to the present invention.

Detailed Description

The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it is obvious that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

As shown in fig. 1 and 2, the present embodiment provides a task disaster tolerance method based on a distributed task scheduling framework, including the following steps:

S1, initializing a task scheduling center, and starting a daemon thread during initialization to monitor the heartbeat state of the executors;

S2, a user registers task information through the task scheduling center;

S3, the scheduling center submits scheduling requests according to the Cron configuration of each task;

S4, an executor receives and runs the scheduling request submitted by the scheduling center;

S5, if the daemon thread detects that an executor has failed while executing a task, it determines whether the executor has a task in the running state; if so, the running state of the task is updated and the task is triggered to be rescheduled to run on an online executor; if not, the online state of the executor is refreshed;

S6, the task execution completes and a scheduling result is returned.

Specifically, in step S1, the daemon thread monitors the heartbeat state of the executors as follows: once every heartbeat cycle, it queries the executor information registry in the database and judges from the update_time field of the registry whether an executor has gone offline; if the update_time field of an executor has not been updated for more than 3 heartbeat cycles (i.e., a heartbeat timeout), the executor is considered to be offline.

Specifically, in step S5, the faults that can occur in an executor include a disconnection fault and a restart caused by a fault.

In particular, the heartbeat cycle is 30 s.

Specifically, information registration and discovery between the scheduling center and the executors are performed through a database (DB).

In this embodiment, the user registers the information of a task to be executed either by operating a Web interface or by using a built-in database-loading script; the execution logic of the task must be a JobHandler implementation class that has already been developed for the business.
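The constraint that a registered task must point at existing execution logic can be illustrated with a small Python analogue. This is not XXL-JOB's Java JobHandler API: the HANDLERS registry, the job_handler decorator, register_task, and the example Cron string are all hypothetical and only mimic the idea of looking up an already-developed handler by name.

```python
HANDLERS = {}  # handler name -> callable, filled in by business code


def job_handler(name):
    """Decorator that registers a function as a named job handler."""
    def decorator(func):
        HANDLERS[name] = func
        return func
    return decorator


@job_handler("refresh_samples")
def refresh_samples(params):
    # Business logic would go here; this stub only echoes its input.
    return f"refreshed {params}"


def register_task(tasks, name, cron, handler):
    """Register task information, rejecting handlers that do not exist."""
    if handler not in HANDLERS:
        raise ValueError(f"unknown handler: {handler}")
    tasks[name] = {"cron": cron, "handler": handler}


tasks = {}
register_task(tasks, "nightly-refresh", "0 0 2 * * ?", "refresh_samples")
result = HANDLERS[tasks["nightly-refresh"]["handler"]]("sensor-data")
```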

As shown in FIG. 3, the present embodiment further provides a task disaster tolerance system based on a distributed task scheduling framework. The system comprises a scheduling center and a plurality of executors, which perform information registration and discovery through a database (DB). The scheduling center registers task information and submits scheduling requests to an executor based on the Cron configuration of each task; the executor receives and runs the scheduling requests. The scheduling center judges whether an executor has a fault by monitoring its heartbeat state. When it detects that an executor has failed while executing a task, it confirms whether a task in the running state exists on that executor; if so, it updates the running state of the task and triggers the task to be rescheduled to run on an online executor; if not, it refreshes the online state of the executor.

The scheduling center is responsible for managing scheduling information and issuing scheduling requests according to the scheduling configuration; it carries no business code. The executor is responsible for receiving scheduling requests and executing the task logic. The executors are deployed as a cluster, which ensures high availability of task execution.
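How a clustered deployment lets the scheduling center route around offline executors can be sketched as follows. The round-robin choice is an assumption made for this example, since the text does not name a routing strategy, and make_dispatcher and its arguments are hypothetical names.

```python
import itertools


def make_dispatcher(executors):
    """Build a dispatch function that rotates over the online executors."""
    counter = itertools.count()

    def dispatch(task_name, online):
        # Only executors currently known to be online are candidates.
        candidates = [address for address in executors if address in online]
        if not candidates:
            raise RuntimeError(f"no online executor available for {task_name}")
        return candidates[next(counter) % len(candidates)]

    return dispatch


dispatch = make_dispatcher(["executor-a", "executor-b", "executor-c"])
first = dispatch("daily_report", {"executor-a", "executor-b", "executor-c"})
# executor-a drops offline, so later requests go only to the remaining nodes.
second = dispatch("daily_report", {"executor-b", "executor-c"})
third = dispatch("daily_report", {"executor-b", "executor-c"})
```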

The scheduling center includes functional modules such as an executor management module, a task management module, and a log management module. The executor management module provides the registration service, the task management module provides the task scheduling service, and the log management module provides the log query service; the scheduling center also provides a task callback service.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and variations can be made without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as the protection scope of the present invention.

Claims

1. A task disaster tolerance method based on a distributed task scheduling framework is characterized by comprising the following steps:

S2, registering task information and submitting a scheduling request to an executor based on the Cron configuration of the task;

S4, monitoring the heartbeat states of a plurality of executors;

2. The task disaster tolerance method based on the distributed task scheduling framework according to claim 1, wherein in step S4 the heartbeat state of the executors is monitored by a daemon thread, and the daemon thread is started during initialization of the scheduling center; the daemon thread monitors the heartbeat state of the executors as follows: once every heartbeat cycle, the daemon thread queries the executor information registry in the database and judges from the update_time field of the registry whether an executor has gone offline; and if the update_time field of an executor in the registry has not been updated for more than 3 heartbeat cycles, the executor is considered to be offline.

3. The task disaster recovery method based on the distributed task scheduling framework as claimed in claim 1, wherein in step S5, the failures occurred in the actuator include a disconnection failure and a failure due to restart;

4. The task disaster recovery method based on the distributed task scheduling framework as claimed in claim 2 or 3, wherein the heartbeat period is 30 s.

5. The task disaster recovery method based on the distributed task scheduling framework as claimed in claim 1, wherein the information registration and discovery between the scheduling center and the executor are performed in a DB manner.

6. A task disaster tolerance system based on a distributed task scheduling framework, based on the task disaster tolerance method of any one of claims 1 to 5, characterized by comprising a scheduling center and a plurality of executors, wherein the plurality of executors are in communication connection with the scheduling center; the scheduling center is used for registering task information and submitting a scheduling request to an executor based on the Cron configuration of a task; the executor is used for receiving and running the scheduling request; the scheduling center judges whether an executor has a fault by monitoring the heartbeat state of the executor; when it is detected that an executor has failed while executing a task, whether a task in the running state exists on the executor is confirmed; if so, the running state of the task is updated and the task is triggered to be rescheduled to run on an online executor; if not, the online state of the executor is refreshed.

7. The task disaster tolerance system based on the distributed task scheduling framework according to claim 6, wherein the scheduling center monitors the heartbeat state of the executors through a daemon thread, and the daemon thread is started during initialization of the scheduling center; the daemon thread monitors the heartbeat state of the executors as follows: once every heartbeat cycle, the daemon thread queries the executor information registry in the database and judges from the update_time field of the registry whether an executor has gone offline; and if the update_time field of an executor in the registry has not been updated for more than 3 heartbeat cycles, the executor is considered to be offline.

8. The task disaster recovery system based on the distributed task scheduling framework as claimed in claim 6, wherein the failures occurred in the actuator include a disconnection failure and a restart failure due to a failure;

9. The task disaster recovery system based on the distributed task scheduling framework as claimed in claim 6, wherein the heartbeat period is 30 s.

10. The task disaster recovery system based on the distributed task scheduling framework as claimed in claim 7 or 8, wherein a plurality of the executors and the scheduling center perform information registration discovery in a DB manner.