CN112667375A

CN112667375A - Task scheduling method and system based on big data service

Info

Publication number: CN112667375A
Application number: CN202011523358.4A
Authority: CN
Inventors: 褚庆; 张炜; 张少杰; 王彦青; 王伟丽; 祝勇; 郝荟枫; 郝广
Original assignee: EB INFORMATION TECHNOLOGY Ltd
Current assignee: EB INFORMATION TECHNOLOGY Ltd
Priority date: 2020-12-22
Filing date: 2020-12-22
Publication date: 2021-04-16

Abstract

A task scheduling method and system based on big data services. The method includes: generating a DAG file from a DAG dependency graph constructed by a user, uploading the DAG file to a scheduling service device, and synchronizing the related dependency files to an execution device; the scheduling service device selects one scheduling process from its process pool and assigns the DAG file to the selected scheduling process; the scheduling process parses the DAG file, creates a process batch instance and task instances corresponding to all components in the process, then constructs one task state table, and finally extracts the task instances to be executed from the DAG file and pushes them to a task message queue; and the execution device executes the task instances in the task message queue in sequence and updates the states of the task instances in the task state table according to the execution results. The invention belongs to the field of information technology, supports plug and play of components, and meets complex dependency requirements among multiple tasks in multi-task scheduling of big data services.

Description

Task scheduling method and system based on big data service

Technical Field

The invention relates to a task scheduling method and system based on big data services, and belongs to the field of information technology.

Background

In the field of big data services, service flow processing often involves multiple branches and complex rules. Existing task scheduling is mostly time-triggered. Although complex timed scheduling can be implemented this way, it still cannot meet the complex dependency requirements among scheduled tasks of big data services (for example, the execution of task C requires, as a precondition, that one of tasks A and B has executed successfully), nor does it allow flexible extension and hot plugging of scheduling components.

Therefore, how to support plug and play of big data components in multi-task scheduling of big data services, while meeting the complex dependency requirements among the tasks of big data services, has become a technical problem of great concern to practitioners.

Disclosure of Invention

In view of this, an object of the present invention is to provide a task scheduling method and system based on big data service, which can support plug and play of big data components in multi-task scheduling of big data service, and simultaneously meet the complex dependency requirement between big data service multi-tasks.

In order to achieve the above object, the present invention provides a task scheduling method based on big data service, which includes a scheduling service device and an execution device, and the method includes:

step one, a user selects a plurality of big data components from a graphical interface and constructs one DAG dependency graph describing a process from the selected components, the task information of each component and the dependency relationships among the components; a corresponding DAG file is then generated from the DAG dependency graph constructed by the user, the generated DAG file is uploaded to the scheduling service device, and the related dependency files are synchronized to the execution device;

step two, the scheduling service device selects 1 scheduling process from the process pool of the scheduling service device, and distributes the received DAG file to the selected scheduling process;

step three, the scheduling process parses the assigned DAG file, creates for the DAG file a process batch instance and task instances corresponding to all components in the process, and then constructs one task state table, which records the state information of all task instances in the process; finally, it extracts the task instances that need to be executed from the DAG file and pushes them to a task message queue;

and step four, the execution device reads and executes each task instance in the task message queue in sequence, and updates the state information of the executed task instance in the task state table according to the execution result.
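Steps two to four can be illustrated with a minimal Python sketch. It is not the patented implementation: the dictionary-based DAG layout, the `Batch` class, the `schedule` and `execute_all` functions, and the state names `created`, `queued`, `succeeded` and `failed` are all assumptions made for illustration, and the selection of a scheduling process from the process pool is omitted.

```python
from collections import deque
from dataclasses import dataclass
from itertools import count

_batch_ids = count(1)  # hypothetical source of batch identifiers


@dataclass
class Batch:
    """One process batch instance together with its task state table."""
    batch_id: int
    deps: dict   # task name -> list of upstream task names (assumed DAG layout)
    state: dict  # task state table: task name -> state string


def schedule(dag, queue):
    """Step three: create the batch, the task state table, and queue the initial tasks."""
    batch = Batch(next(_batch_ids), dag, {task: "created" for task in dag})
    for task, upstream in dag.items():
        if not upstream:  # tasks without upstream dependencies can run immediately
            queue.append((batch.batch_id, task))
            batch.state[task] = "queued"
    return batch


def execute_all(queue, batch, run):
    """Step four: run queued tasks in order and record each result in the state table."""
    while queue:
        _, task = queue.popleft()
        batch.state[task] = "succeeded" if run(task) else "failed"


dag = {"read_hdfs": [], "check_partition": [], "submit_spark": ["read_hdfs", "check_partition"]}
queue = deque()
batch = schedule(dag, queue)
```

In this sketch only the two tasks without upstream dependencies are queued at first; how downstream tasks are released afterwards is governed by the state monitoring described later in the document.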

In order to achieve the above object, the present invention further provides a task scheduling system based on big data services, which includes a graphical interaction device, a scheduling service device, and an execution device, wherein:

the graphical interaction device is used for providing a graphical interface for a user; when the user selects a plurality of big data components from the graphical interface and sets flow information, task information of each component and the dependency relationships among the components based on the selected components so as to construct one DAG dependency graph describing the flow, a corresponding DAG file is generated from the DAG dependency graph constructed by the user, the generated DAG file is uploaded to the scheduling service device, and the related dependency files are synchronized to the execution device;

the scheduling service device is used for selecting 1 scheduling process unit from the process pool of the scheduling service device and distributing the received DAG file to the selected scheduling process unit;

the execution device is used for reading and executing each task instance in the task message queue in sequence and updating the state information of the executed task instance in the task state table according to the execution result;

the scheduling service device further comprises a plurality of scheduling process units, wherein:

the scheduling process unit is used for parsing the assigned DAG file, creating for the DAG file a process batch instance and task instances corresponding to all components in the process, and then constructing one task state table, which records the state information of all task instances in the process, and finally extracting the task instances that need to be executed from the DAG file and pushing them to the task message queue.

Compared with the prior art, the invention has the following beneficial effects: it builds a task scheduling method and system for big data services on a Directed Acyclic Graph (DAG), offers high efficiency, reliability and usability, and can effectively address the following problems: (1) time dependency: a task must wait for a certain point in time to trigger execution; (2) external system dependency: a task depends on an external system and must remotely call an interface to access external resources; (3) inter-task dependency: the execution of multiple tasks affects one another, for example, the execution of task C requires, as a precondition, that one of tasks A and B has executed successfully; (4) repeated component development: tasks with the same function are written repeatedly, extensibility is poor, and plug and play cannot be supported; (5) high task scheduling cost: task scheduling is implemented in a programming language, and the learning threshold is high.

Drawings

Fig. 1 is a flowchart of a task scheduling method based on big data service according to the present invention.

Fig. 2 is a detailed flowchart of the steps performed when a scheduling process detects that the state of a task instance in its own task state table has changed.

Fig. 3 is a detailed flowchart of the fault monitoring device selecting one scheduling service device from a plurality of scheduling service devices to start the scheduling service.

Fig. 4 is a schematic structural diagram of a task scheduling system based on big data service according to the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the accompanying drawings.

As shown in fig. 1, a task scheduling method based on big data service of the present invention includes a scheduling service device and an execution device, and the method includes:

step one, a user selects a plurality of big data components from a graphical interface and constructs one DAG dependency graph describing a process from the selected components, the flow information, the task information of each component and the dependency relationships among the components; a corresponding DAG file is then generated from the DAG dependency graph constructed by the user, the generated DAG file is uploaded to the scheduling service device, and the related dependency files are synchronized to the execution device, wherein the big data components can be reusable general-purpose code blocks that are encapsulated through an API or a command line and perform specific functions, such as HDFS data reading, Hive partition detection, and Spark task submission;

Step one may further comprise:

step 11, a user drags the selected components to a working area, and flow information, task information of each component and the dependency relationship among the components are supplemented to form a DAG dependency graph;

step 12, checking whether the process identifier already exists in the database, and if so, performing a reverse deployment operation, that is, deleting all task information and dependency information related to the process from the database;

step 13, generating a DAG file according to the database data;

step 14, using depth-first search to determine whether a cycle exists in the DAG file; if so, reporting that the DAG file is invalid, returning to the editing page, and handing the graph back to the user for modification; if not, continuing to the next step;

step 15, synchronizing the dependent file to the execution device;

and step 16, after the synchronization succeeds, deploying the DAG file to the directory specified by the scheduling service.
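The cycle check of step 14 can be sketched as a depth-first search with three node colors. This is a generic illustration rather than the code of the invention: the function name `find_cycle` and the adjacency-dictionary representation of the DAG file are assumptions. The sketch is recursive, so very deep graphs would need an iterative variant.

```python
def find_cycle(dag):
    """Return the nodes of one cycle (first node repeated at the end), or None.

    `dag` maps each node to the list of nodes that depend on it (assumed layout).
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    color = {node: WHITE for node in dag}
    for targets in dag.values():
        for target in targets:
            color.setdefault(target, WHITE)
    path = []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for nxt in dag.get(node, ()):
            if color[nxt] == GRAY:  # back edge: nxt is already on the current path
                return path[path.index(nxt):] + [nxt]
            if color[nxt] == WHITE:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in list(color):
        if color[node] == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


valid = find_cycle({"A": ["C"], "B": ["C"], "C": []})         # None: graph is acyclic
invalid = find_cycle({"A": ["B"], "B": ["C"], "C": ["A"]})    # ["A", "B", "C", "A"]
```

Returning the offending path, rather than only a boolean, would let the editing page show the user which dependencies form the loop.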

The invention allows components to be flexibly hot-plugged according to user needs, and the user can also create and delete components. When the user creates a new component on the graphical interface, the method further comprises:

establishing one state machine and one associated process record table for the newly created component, wherein the state machine identifies the current state of the component and the associated process record table records all process batch instances associated with the component; the state of the state machine is initialized to a preparation state; after the user finishes editing the new component, the dependency files required by the new component are synchronized to the execution device, and the state of the state machine is updated to an online state.

In step three, when the scheduling process parses the distributed DAG file and creates a process batch instance and task instances corresponding to all components in the process for the DAG file, the method further includes:

the scheduling process retrieves the associated process record table of each component in the process one by one, and stores the created process batch instance as a new record in the associated process record table of that component;

after the execution of the process batch instance corresponding to the DAG file is completed (whether it succeeds or fails), the method further includes:

the scheduling process retrieves the associated process record table of each component in the process one by one and deletes the record of the executed process batch instance from the associated process record table of that component.

The graphical interface only displays components in the online state to the user; when the user updates the state of a component's state machine to the offline state, the graphical interface no longer displays that component to the user.

When the user deletes the component on the graphical interface, the method further comprises the following steps:

reading the associated process record table of the deleted component and determining whether the table contains no records; if so, deleting the component and its dependency files; if not, deleting the component and its dependency files once the associated process record table becomes empty.
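The component life cycle described above (state machine, associated process record table, and deletion deferred until no batch is running) can be sketched as follows. The class and method names, the state strings, and the use of a set for the record table are assumptions for illustration; synchronizing and removing dependency files on the execution device is only indicated by comments.

```python
class Component:
    """Hypothetical model of a hot-pluggable big data component."""

    def __init__(self, name):
        self.name = name
        self.state = "preparing"       # state machine starts in the preparation state
        self.batches = set()           # associated process record table
        self.delete_requested = False
        self.deleted = False

    def publish(self):
        # Dependency files would be synchronized to the execution device here.
        self.state = "online"

    def take_offline(self):
        self.state = "offline"

    def attach(self, batch_id):
        """Record a newly created process batch instance that uses this component."""
        self.batches.add(batch_id)

    def detach(self, batch_id):
        """Remove a finished batch (succeeded or failed), then retry a pending delete."""
        self.batches.discard(batch_id)
        self._delete_if_idle()

    def request_delete(self):
        self.delete_requested = True
        self._delete_if_idle()

    def _delete_if_idle(self):
        # Deletion only happens once the record table is empty.
        if self.delete_requested and not self.batches:
            self.deleted = True        # the dependency files would be removed here


def visible(components):
    """Names of the components the graphical interface would display."""
    return [c.name for c in components if c.state == "online" and not c.deleted]


hive = Component("hive_partition_check")
hive.publish()
hive.attach(batch_id=1)
hive.request_delete()   # batch 1 still references the component: deletion is deferred
hive.detach(batch_id=1) # the table is now empty, so the deferred deletion takes effect
```

Deferring the deletion in this way means a running process batch instance never loses a component it depends on.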

The execution device may further comprise a message agent unit and a plurality of task execution units, wherein the message agent unit distributes the task instances in the task message queue to the task execution units in sequence for execution. When a task execution unit fails to execute a task instance, the message agent unit may also, based on the execution success rates of all task execution units, reselect a task execution unit with a high success rate for the failed task instance, thereby improving the success rate of rerunning failed tasks. Therefore, step four may further include:

the message agent unit extracts each task instance from the task message queue in order and determines whether the extracted task instance is being executed for the first time; if so, the extracted task instance is randomly assigned to one task execution unit for execution, and the task execution unit updates the state of the corresponding task instance in the task state table according to the execution result; if not, the execution success rates of all task execution units are calculated, where the execution success rate is the ratio of the number of successfully executed task instances to the number of all executed task instances, the task execution unit with the highest execution success rate is selected, the extracted task instance is assigned to the selected task execution unit for execution, and finally the task execution unit updates the state of the corresponding task instance in the task state table according to the execution result.
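The dispatch rule of the message agent unit can be sketched as below. The `Executor` class and the `pick_executor` function are hypothetical names, and treating a unit that has not executed anything yet as having a success rate of 0.0 is an assumption, since the text does not say how that case is handled.

```python
import random


class Executor:
    """Hypothetical task execution unit that tracks its own success statistics."""

    def __init__(self, name):
        self.name = name
        self.succeeded = 0
        self.executed = 0

    @property
    def success_rate(self):
        # Successful executions divided by all executions; 0.0 when idle (assumption).
        return self.succeeded / self.executed if self.executed else 0.0

    def record(self, ok):
        self.executed += 1
        self.succeeded += int(ok)


def pick_executor(executors, first_run, rng=random):
    """First run: random unit. Rerun: the unit with the highest success rate."""
    if first_run:
        return rng.choice(executors)
    return max(executors, key=lambda e: e.success_rate)


a, b = Executor("unit-a"), Executor("unit-b")
a.record(True)
a.record(False)   # unit-a: 1 success out of 2 executions
b.record(True)    # unit-b: 1 success out of 1 execution
rerun_target = pick_executor([a, b], first_run=False)   # selects unit-b
```

Passing the random source as a parameter keeps the first-run choice testable with a seeded `random.Random` instance.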

Each scheduling process monitors in real time whether the state of each task instance in its own task state table changes. The state of a task instance may include: queued, running, run succeeded, waiting for rerun, and run failed. Different states may be shown to the user in different colors; for example, light green represents running, dark green represents run succeeded, red represents run failed, and yellow represents waiting for rerun. As shown in fig. 2, when a scheduling process detects that the state of a task instance in its task state table has changed, the method further includes:

step A1, the scheduling process reads the state of the changed task instance and determines whether the state satisfies a trigger condition in the DAG file; if so, the task instances triggered by that trigger condition are extracted: when the extraction result is empty, the process batch instance has executed successfully and the procedure ends; when the extraction result is not empty, the extracted task instances are pushed to the task message queue, the state of each pushed task instance in the task state table is updated to queued, and the procedure ends; if not, continue to the next step;

step A2, the scheduling process determines whether the state of the changed task instance is waiting for rerun and its rerun count is less than the count threshold; if so, the task instance is pushed to the task message queue again, the state of the pushed task instance in the task state table is updated to queued, and the procedure ends; if not, the execution of the process batch instance has failed, the failure information is shown to the user, and execution of the process is terminated. The rerun count records the number of times the task instance has been run, and the count threshold can be set according to actual business needs.
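Steps A1 and A2 can be sketched as a single handler that is called whenever a task state changes. Several points are assumptions that the text does not spell out: trigger conditions are modelled as `("any" | "all", upstream tasks)` rules evaluated on success only, intermediate states (`queued`, `running`) are ignored, the batch is reported as succeeded only when nothing is still pending, and the rerun count is incremented when a task is re-queued. All names and the threshold value are hypothetical.

```python
from collections import deque

MAX_RERUNS = 3                                   # hypothetical count threshold
PENDING = {"queued", "running", "waiting_rerun"}


def ready(rule, state):
    """Evaluate an assumed trigger rule: ("any" | "all", [upstream task names])."""
    mode, upstream = rule
    done = [state[u] == "succeeded" for u in upstream]
    return any(done) if mode == "any" else all(done)


def on_state_change(task, state, rules, reruns, queue):
    """Return "running", "batch_succeeded" or "batch_failed" for the batch."""
    current = state[task]
    if current in ("queued", "running"):          # intermediate states: nothing to decide
        return "running"
    if current == "succeeded":                    # step A1
        triggered = [t for t, rule in rules.items()
                     if task in rule[1] and state[t] == "created" and ready(rule, state)]
        for t in triggered:
            queue.append(t)
            state[t] = "queued"
        still_pending = any(s in PENDING for s in state.values())
        return "running" if still_pending else "batch_succeeded"
    if current == "waiting_rerun" and reruns.get(task, 0) < MAX_RERUNS:   # step A2
        reruns[task] = reruns.get(task, 0) + 1
        queue.append(task)
        state[task] = "queued"
        return "running"
    return "batch_failed"


# Task C runs once either A or B has succeeded (the example from the Background).
rules = {"C": ("any", ["A", "B"])}
state = {"A": "succeeded", "B": "running", "C": "created"}
queue, reruns = deque(), {}
result = on_state_change("A", state, rules, reruns, queue)   # C is queued
```

With an `"all"` rule in place of `"any"`, the same call would leave C in `created` until B also succeeds.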

It should be noted that the present invention may further include a plurality of scheduling service devices and one fault monitoring device, so as to support real-time failover and provide multi-machine high availability. As shown in fig. 3, the fault monitoring device selects one scheduling service device from the plurality of scheduling service devices to start the scheduling service (i.e. step two and step three in fig. 1), which further includes:

step B1, each scheduling service device comprises one failover service unit, and the failover service units on all scheduling service devices register with the fault monitoring device and listen to it in real time;

to support broader high availability, no upper limit is placed on the number of scheduling service devices;

step B2, when the scheduling service is started for the first time or a fault of the current scheduling service is detected, the fault monitoring device notifies all registered and listening failover service units to connect to the fault monitoring device and performs a one-time start decision;

step B3, the fault monitoring device notifies the scheduling service device hosting the failover service unit that started first to start the scheduling service, while the other scheduling service devices remain in the listening state.
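Steps B1 to B3 can be sketched as a small in-memory election. This is only an illustration under stated assumptions: the `FaultMonitor` class is hypothetical, "started first" is interpreted as "first registered unit to connect", and the network race between failover service units is simulated by the `connect_order` list.

```python
class FaultMonitor:
    """Hypothetical fault monitoring device that keeps one scheduling service active."""

    def __init__(self):
        self.registered = []   # scheduling service devices whose failover units registered
        self.active = None     # device currently running the scheduling service

    def register(self, device):
        """Step B1: a failover service unit registers with the monitor."""
        self.registered.append(device)

    def elect(self, connect_order):
        """Steps B2 and B3: the first registered device to connect starts the service."""
        for device in connect_order:
            if device in self.registered:
                self.active = device
                break
        return self.active

    def report_failure(self, device, connect_order):
        """Re-run the election only when the failed device was the active one."""
        if device != self.active:
            return self.active
        self.active = None
        survivors = [d for d in connect_order if d != device]
        return self.elect(survivors)


monitor = FaultMonitor()
for name in ("sched-1", "sched-2", "sched-3"):
    monitor.register(name)

first = monitor.elect(["sched-2", "sched-1", "sched-3"])                     # sched-2 wins
after = monitor.report_failure("sched-2", ["sched-3", "sched-1", "sched-2"]) # sched-3 takes over
```

A failure of a standby device leaves the active device unchanged, which matches the text's statement that the other devices simply remain in the listening state.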

As shown in fig. 4, the task scheduling system based on big data service of the present invention includes a graphical interaction device, a scheduling service device, and an execution device, wherein:

the graphical interaction device is used for providing a graphical interface for a user; when the user selects a plurality of big data components from the graphical interface and sets flow information, task information of each component and the dependency relationships among the components based on the selected components so as to construct one DAG dependency graph describing the flow, a corresponding DAG file is generated from the DAG dependency graph constructed by the user, the generated DAG file is uploaded to the scheduling service device, and the related dependency files are synchronized to the execution device at the same time, wherein the big data components can be reusable general-purpose code blocks that are encapsulated through an API or a command line and perform specific functions, such as HDFS data reading, Hive partition detection, and Spark task submission;

The invention allows components to be flexibly hot-plugged according to user needs, and the user can also create and delete components. The graphical interaction device further comprises:

the component creation unit is used for establishing one state machine and one associated process record table for a newly created component when a user creates the component on the graphical interface, wherein the state machine identifies the current state of the component, the associated process record table records all process batch instances associated with the component, and the state of the state machine is initialized to a preparation state; after the user finishes editing the new component, the dependency files required by the new component are synchronized to the execution device, and the state of the state machine is updated to an online state;

the component deleting unit is used for reading the associated process record table of the deleted component when the user deletes the component on the graphical interface, judging whether the record in the associated process record table is empty, and deleting the component and the dependent file of the component if the record in the associated process record table is empty; and if not, deleting the component and the dependent file of the component when the record in the associated process record table is empty.

The scheduling process unit can include:

the component associated process updating component is used for, when the scheduling process unit parses the assigned DAG file and creates for the DAG file a process batch instance and task instances corresponding to all components in the process, retrieving the associated process record table of each component in the process one by one and storing the created process batch instance as a new record in the associated process record table of that component; and, after execution of the process batch instance corresponding to the DAG file is completed, retrieving the associated process record table of each component in the process one by one and deleting the record of the executed process batch instance from the associated process record table of that component.

The graphical interaction device only displays components in the online state to the user; when the user updates the state of a component's state machine to the offline state, that component is no longer displayed to the user.

The execution device may further comprise a message agent unit and a plurality of task execution units, wherein the message agent unit distributes the task instances in the task message queue to the task execution units in sequence for execution. When a task execution unit fails to execute a task instance, the message agent unit may also, based on the execution success rates of all task execution units, reselect a task execution unit with a high success rate for the failed task instance, thereby improving the success rate of rerunning failed tasks. Therefore, the message agent unit may further include:

the intelligent message distributing component is used for sequentially extracting each task instance from the task message queue and judging whether the extracted task instance is executed for the first time, if so, the extracted task instance is randomly distributed to 1 task execution unit for execution, and the task execution unit updates the state of the corresponding task instance in the task state table according to the execution result; if not, calculating the execution success rate of all task execution units, wherein the execution success rate is the ratio of the number of the task instances which are successfully executed to the number of all executed task instances, then selecting 1 task execution unit with the highest execution success rate, then distributing the extracted task instances to the selected task execution units for execution, and finally updating the states of the corresponding task instances in the task state table by the task execution units according to the execution results.

Each scheduling process unit monitors in real time whether the state of each task instance in its own task state table changes. The state of a task instance may include: queued, running, run succeeded, waiting for rerun, and run failed. Different states may be shown to the user in different colors; for example, light green represents running, dark green represents run succeeded, red represents run failed, and yellow represents waiting for rerun. The scheduling process unit may further include:

the task state monitoring component is used for monitoring whether the state of each task instance in its own task state table changes; when the state of a task instance changes, it reads the state of the changed task instance and determines whether the state satisfies a trigger condition in the DAG file; if so, it extracts the task instances triggered by that trigger condition, where an empty extraction result indicates that the process batch instance has executed successfully, and a non-empty extraction result causes the extracted task instances to be pushed to the task message queue and their states in the task state table to be updated to queued; if not, it further determines whether the state of the changed task instance is waiting for rerun and its rerun count is less than the count threshold: when both hold, the task instance is pushed to the task message queue and its state in the task state table is updated to queued; otherwise, the execution of the process batch instance has failed, the failure information is shown to the user, and execution of the process is terminated. The rerun count records the number of times the task instance has been run, and the count threshold can be set according to actual business needs.

It should be noted that the present invention may further include a plurality of scheduling service devices and one fault monitoring device, so as to support real-time failover and provide multi-machine high availability. The fault monitoring device can select one scheduling service device from the plurality of scheduling service devices to start the scheduling service; each scheduling service device comprises one failover service unit, and the failover service units register with the fault monitoring device and listen to it in real time;

when the scheduling service is started for the first time or a fault of the current scheduling service is detected, the fault monitoring device notifies all registered and listening failover service units to connect to the fault monitoring device and performs a one-time start decision, and then notifies the scheduling service device hosting the failover service unit that started first to start the scheduling service, while the other scheduling service devices remain in the listening state.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A task scheduling method based on big data service is characterized in that the task scheduling method comprises a scheduling service device and an execution device, and the method comprises the following steps:

2. The method of claim 1, when a user creates a new component on the graphical interface, further comprising:

creating 1 state machine and 1 associated process record table for the newly-built component, wherein the state machine is used for identifying the current state of the component, the associated process record table is used for recording all process batch instances associated with the component, initializing the state of the state machine into a preparation state, after a user finishes editing the newly-built component, synchronizing a dependent file required by the newly-built component into an execution device, and updating the state of the state machine into an online state,

reading the associated process record table of the deleted component, judging whether the record in the associated process record table is empty, and if so, deleting the component and the dependent file of the component; and if not, deleting the component and the dependent file of the component when the record in the associated process record table is empty.

3. The method according to claim 1, wherein in step three, when the scheduling process parses the allocated DAG file, and creates a process batch instance and task instances corresponding to all components in the process for the DAG file, the method further includes:

after the execution of the process batch instance corresponding to the DAG file is completed, the method further comprises the following steps:

4. The method of claim 1, wherein the execution device further includes a message agent unit and a plurality of task execution units, the message agent unit sequentially distributes task instances in the task message queue to the respective task execution units for execution, and step four further includes:

the message agent unit extracts each task instance from the task message queue in sequence, judges whether the extracted task instance is executed for the first time, and if so, randomly distributes the extracted task instance to 1 task execution unit for execution, and updates the state of the corresponding task instance in the task state table according to the execution result; if not, calculating the execution success rate of all task execution units, wherein the execution success rate is the ratio of the number of the task instances which are successfully executed to the number of all executed task instances, then selecting 1 task execution unit with the highest execution success rate, then distributing the extracted task instances to the selected task execution units for execution, and finally updating the states of the corresponding task instances in the task state table by the task execution units according to the execution results.

5. The method according to claim 1, wherein each scheduling process monitors whether the state of each task instance in its own task state table changes in real time, and when the scheduling process monitors that the state of 1 task instance in its own task state table changes, the method further comprises:

step A1, the scheduling process reads the state of the changed task instance and determines whether the state satisfies a trigger condition in the DAG file; if so, the task instances triggered by that trigger condition are extracted: when the extraction result is empty, the process batch instance has executed successfully and the procedure ends; when the extraction result is not empty, the extracted task instances are pushed to the task message queue, the state of each pushed task instance in the task state table is updated to queued, and the procedure ends; if not, continue to the next step;

step A2, the scheduling process determines whether the state of the changed task instance is waiting for rerun and its rerun count is less than the count threshold; if so, the task instance is pushed to the task message queue again, the state of the pushed task instance in the task state table is updated to queued, and the procedure ends; if not, the execution of the process batch instance has failed, the failure information is shown to the user, and execution of the process is terminated.

6. The method of claim 1, wherein the method involves a plurality of scheduling service devices and one fault monitoring device, the fault monitoring device selecting one scheduling service device from the plurality of scheduling service devices to start the scheduling service, further comprising:

7. A task scheduling system based on big data service is characterized by comprising a graphical interaction device, a scheduling service device and an execution device, wherein:

8. The system of claim 7, wherein the graphical interaction device further comprises:

9. The system of claim 7, wherein the scheduling process unit comprises:

10. The system of claim 7, wherein the execution device further includes a message agent unit and a plurality of task execution units, the message agent unit sequentially distributes task instances in the task message queue to the respective task execution units for execution, and the message agent unit further includes:

11. The system of claim 7, wherein the scheduling process unit further comprises:

the task state monitoring component is used for monitoring whether the state of each task instance in its own task state table changes; when the state of a task instance changes, it reads the state of the changed task instance and determines whether the state satisfies a trigger condition in the DAG file; if so, it extracts the task instances triggered by that trigger condition, where an empty extraction result indicates that the process batch instance has executed successfully, and a non-empty extraction result causes the extracted task instances to be pushed to the task message queue and their states in the task state table to be updated to queued; if not, it further determines whether the state of the changed task instance is waiting for rerun and its rerun count is less than the count threshold: when both hold, the task instance is pushed to the task message queue and its state in the task state table is updated to queued; otherwise, the execution of the process batch instance has failed, the failure information is shown to the user, and execution of the process is terminated.

12. The system of claim 7, comprising a plurality of dispatch service devices and 1 fault monitoring device, wherein the fault monitoring device selects 1 dispatch service device from the plurality of dispatch service devices to start dispatch service, each dispatch service device comprises 1 failover service unit, and the failover service units are registered with the fault monitoring device and monitor the fault monitoring device in real time,