CN112162841A - Distributed scheduling system, method and storage medium for big data processing - Google Patents
- Publication number
- CN112162841A (application CN202011069582.0A)
- Authority
- CN
- China
- Prior art keywords
- workflow
- leader
- task
- follower
- module
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F9/4881—Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
- G06F16/24568—Data stream processing; Continuous queries
- G06F16/27—Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
- G06F16/284—Relational databases
- G06F9/5016—Allocation of resources to service a request, the resource being the memory
- G06F9/505—Allocation of resources to service a request, the resource being a machine, considering the load
- G06F9/5083—Techniques for rebalancing the load in a distributed system
- G06F9/546—Message passing systems or structures, e.g. queues
- G06F2209/548—Queue (indexing scheme relating to G06F9/54)
Abstract
The invention discloses a distributed scheduling system, method and storage medium for big data processing. The system comprises: a scheduling center module, responsible for the dependency configuration and job development of workflows; a leader module, which serves as the task-flow segmentation and distribution node in the cluster, segments the workflow configured by the scheduling center according to its dependency relationships, and sends the segmented task nodes to the follower nodes; a follower module, which executes the specific computing tasks distributed by the leader module, submits task results and stores task execution logs; a coordinator module, which periodically takes tasks to be executed out of the database and load-balances the leader modules with a Round-Robin algorithm according to the current load of each leader module; a task queue module; and a metadata module. The invention takes dependencies between tasks into account, avoids downstream tasks running on empty data when an upstream task times out or runs empty, and benefits the data flow as a whole.
Description
Technical Field
The invention belongs to the technical field of big data computing task scheduling, and particularly relates to a distributed scheduling system and method for big data processing and a storage medium.
Background
With the rapid development of data technology, modern enterprises are moving from the IT era to the DT era, and whether they choose a public cloud or a self-built data center, a big data platform has become part of their infrastructure. Big data platforms have gradually iterated from the initial single execution engine, MapReduce, to an era of multiple execution engines such as MapReduce, Spark and Flink. As enterprises mine the value of their data, thousands of computing tasks are generated, and how to orchestrate and schedule these tasks, and how to construct the intricate network of dependencies between them, has become very important.
For example, patent document CN107506381A discloses a big data distributed scheduling and analysis method, system, device and storage medium: it describes a self-built big data distributed scheduling and analysis system whose core function is to encapsulate the technical process of big data handling, with some built-in scheduling capability. However, it provides no method for handling the dependencies and orchestration of complicated task flows in big data scenarios, the system as a whole has single points of failure, and no highly available fault-tolerance strategy is considered, so the following problems arise:
(1) The task distribution mode does not consider dependencies between tasks. Once an upstream task times out or runs empty, downstream tasks are likely to run on empty data, which harms the data flow as a whole and increases the burden on developers.
(2) The server that submits or executes computing tasks is a single point of failure; once it goes down, computing tasks cannot be triggered and the computing logic is affected.
Therefore, it is necessary to develop a distributed scheduling system, method and storage medium for big data processing.
Disclosure of Invention
In order to solve the above problems, the present invention provides a distributed scheduling system, method and storage medium for big data processing.
In a first aspect, the present invention provides a big data processing-oriented distributed scheduling system, including:
the scheduling center module, which is responsible for the dependency configuration and job development of workflows, and persists each configured workflow through an API into a to-be-executed workflow table in the relational database;
the leader module, which serves as the task-flow segmentation and distribution node in the cluster, segments the workflow configured by the scheduling center according to its dependency relationships, and sends the segmented task nodes to the follower nodes;
the follower module, also called the executor, which executes the specific computing tasks distributed by the leader module, submits task results, and stores task execution logs;
the coordinator module, which periodically takes tasks to be executed out of the database, and load-balances the leader modules with a Round-Robin algorithm according to the current load of each leader module;
the task queue module, which is a message queue comprising a workflow topic, a task topic and a task result topic, and is used to realize task dependencies between workflows; and
the metadata module, which comprises two databases, a relational database and a distributed in-memory database: the relational database persistently stores workflow execution records, and the distributed in-memory database takes workflow-related metadata out of the relational database and loads it into memory.
In a second aspect, the present invention provides a distributed scheduling method for big data processing, which adopts the distributed scheduling system for big data processing of the present invention, the method comprising the following steps:
receiving the dependency configuration and job development of a workflow, and persisting the configured workflow through an API into a to-be-executed workflow table in the relational database;
segmenting the workflow configured by the scheduling center according to its dependency relationships, and sending the segmented task nodes to follower nodes;
executing the specific computing tasks distributed by the leader module, submitting task results and storing task execution logs; and
periodically taking tasks to be executed out of the database, and load-balancing the leader modules with a Round-Robin algorithm according to the current load of each leader module.
Further, the coordinator service periodically scans the to-be-executed workflow table in the relational database to obtain commands to be executed, periodically requests the load information of the Leader cluster from the ZooKeeper cluster, and uses a Round-Robin algorithm to assign each workflow to a Leader according to the current CPU and memory margin of each Leader machine; finally, the workflow tagged with its Leader is sent to the process_instance topic in the message queue, where it waits for that Leader to consume the topic and execute the workflow.
Further, the Leader consumes the process_instance topic in the message queue and judges from the leader_host_name field of the message whether it should execute the workflow; if so, it splits the workflow into several computing tasks according to the workflow's dependency relationships and estimates the computing resources each task needs; it obtains the load information of each machine in the Follower cluster from the ZooKeeper cluster and uses a Round-Robin algorithm to balance which executor receives each task; each split-out computing task is tagged with its executor's follower_host_name and sent to the task_instance topic in the message queue; while the workflow execution thread is suspended, the Leader consumes the data that Followers return through the task_instance_result topic of the message queue and updates the workflow's execution result; once the execution state of the whole workflow reaches a final state, the execution result is persisted into the relational database.
Further, the Follower consumes the data of the task_instance topic in the message queue and matches the follower_host_name to the corresponding executor for execution; after a task finishes, its execution result is written back to the task_instance_result topic in the message queue to wait for the Leader to consume it, and once the Leader has written the completed task results back to the relational database, the execution of the whole task flow is complete.
Further, after the system starts, the Leaders and Followers register themselves under the leader_messages and follower_messages znodes on ZooKeeper, report the CPU and memory information of their own machines, and maintain heartbeats; every Leader and Follower watches these znodes, and once a Leader or Follower is found to be down, a workflow fault-tolerance flow is entered, which comprises a Leader fault-tolerance flow and a Follower fault-tolerance flow.
Further, the Leader fault-tolerance flow is specifically as follows:
Each machine in the Leader cluster watches the Leader znodes on the ZooKeeper cluster. Once a Leader is found to be down, a distributed-lock mechanism based on the ZooKeeper cluster is triggered: one surviving Leader acquires the distributed lock, triggers the workflow fault-tolerance logic, inserts the information of the workflows that need fault tolerance into a fault-tolerance command table in the relational database, and then takes over those workflows, completing the Leader's distributed fault-tolerance process.
Further, the Follower fault-tolerance flow is specifically as follows:
Each machine in the Follower cluster registers itself as a znode on the ZooKeeper cluster. If a Follower executing tasks goes down, the Leader's watch mechanism is triggered: all tasks running on the downed Follower are stopped, the Leader marks the workflow as being in a fault-tolerant state, and a surviving Follower is reselected as the executor of the workflow's remaining tasks.
In a third aspect, the present invention provides a storage medium storing a computer-readable program which, when invoked, executes the steps of the distributed scheduling method for big data processing according to the present invention.
The invention has the following advantages:
(1) The task distribution mode takes dependencies between tasks into account, which avoids downstream tasks running on empty data when an upstream task times out or runs empty, benefits the data flow as a whole, and reduces the burden on developers.
(2) A fault-tolerance strategy is designed: when a single point of failure occurs anywhere in the system, the corresponding workflows are taken over and continue to execute.
(3) The scheduling cluster as a whole supports linear scaling of both the Leader nodes and the Follower (worker) nodes.
Drawings
Fig. 1 is an architecture diagram of a Follower executor node in this embodiment;
FIG. 2 is the overall architecture diagram of this embodiment;
FIG. 3 is a flowchart of workflow execution in this embodiment;
FIG. 4 is a schematic diagram of Leader fault tolerance in this embodiment;
Fig. 5 is a schematic diagram of Follower fault tolerance in this embodiment.
Detailed Description
The invention will be further explained with reference to the drawings.
In this embodiment, a distributed scheduling system for big data processing includes:
The scheduling center module provides a scheduling center Web interface that gives users a simple and convenient visual window for configuring tasks, together with monitoring, operation and maintenance functions for jobs on the scheduling platform. The scheduling center module is responsible for the dependency configuration and job development of workflows, and persists each configured workflow through an API into a to-be-executed workflow table in the relational database (DB); workflow dependencies are described in JSON, with the ids of each job's predecessor and successor jobs stored in that job's JSON data.
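The JSON dependency encoding described above can be sketched as follows. This is a minimal illustration: the field names (`pre_job_ids`, `post_job_ids`, `workflow_id`) are assumptions, since the patent does not give the schema.

```python
import json

# Hypothetical JSON for a three-job workflow: each job stores the ids of its
# predecessor and successor jobs, which together encode the dependency DAG.
workflow_json = """
{
  "workflow_id": "wf_demo",
  "jobs": [
    {"job_id": 1, "name": "ingest", "pre_job_ids": [],  "post_job_ids": [2]},
    {"job_id": 2, "name": "clean",  "pre_job_ids": [1], "post_job_ids": [3]},
    {"job_id": 3, "name": "mine",   "pre_job_ids": [2], "post_job_ids": []}
  ]
}
"""

def root_jobs(workflow):
    """Jobs with no predecessors: the entry points from which execution starts."""
    return [j["job_id"] for j in workflow["jobs"] if not j["pre_job_ids"]]

wf = json.loads(workflow_json)
print(root_jobs(wf))  # [1]
```

A scheduler can start all root jobs immediately and release each remaining job once all of its `pre_job_ids` have finished.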
The leader module serves as the task-flow segmentation and distribution node in the cluster; it segments the workflow configured by the scheduling center according to its dependency relationships and sends the segmented task nodes to the follower nodes.
The follower module, also called the executor, executes the specific computing tasks distributed by the leader module, submits task results, and stores task execution logs.
The coordinator module periodically takes tasks to be executed out of the database and load-balances the leader modules with a Round-Robin algorithm according to the current load of each leader module.
The task queue module is a message queue (MQ) comprising a workflow topic, a task topic and a task result topic.
The metadata module comprises two databases, namely a relational database and a distributed memory database, wherein the relational database is used for persistently storing execution records of the workflow; the distributed memory database is used for taking out the metadata related to the workflow from the relational database and loading the metadata into the memory, so that the delay in the workflow running process is reduced, and the operation efficiency is improved.
The system is a distributed, multi-execution-engine task scheduling system aimed at the complex computing-task scenarios of big data platforms. Following a decentralized design, it segments user-defined computing-task workflows and distributes the segmented computing tasks to Followers in the cluster for execution. Task dependencies between workflows are realized with a message queue, giving the system the expressive power of complex dependency DAGs. High availability of the leader and follower modules is built on the distributed coordination service ZooKeeper, so the whole system can scale linearly during operation.
In a data platform, a complete data processing task comprises four stages: data access, data cleaning, data mining and analysis-result storage; that is, a complete data processing workflow involves several computing engines. As shown in fig. 1, the Follower is not itself the node that runs a computing task; instead, a gateway node of the big data platform, i.e. the node from which computing tasks are submitted, serves as the Follower node. The Follower node carries the gateway clients of the computing engines, such as the Sqoop client, Spark client, Flink client and Hive client. Because it is not directly the node that runs the computation, it can schedule computing tasks across multiple execution engines, decouples the scheduling platform from the computing platform, and avoids resource competition.
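A gateway Follower of this kind mainly needs to pick the right client for each task's engine. The sketch below assumes a simple mapping from engine name to submit command; the command templates are illustrative stand-ins, not taken from the patent (real clusters pass many more flags).

```python
# Map of engine name to a submit-command template on the gateway node.
# Templates are illustrative; a production gateway would add queue, resource
# and configuration flags per engine.
ENGINE_COMMANDS = {
    "sqoop": "sqoop import {args}",
    "hive":  "hive -f {args}",
    "spark": "spark-submit {args}",
    "flink": "flink run {args}",
}

def build_submit_command(engine, args):
    """Build the shell command the gateway would run for a task of this engine."""
    if engine not in ENGINE_COMMANDS:
        raise ValueError("no client installed for engine: " + engine)
    return ENGINE_COMMANDS[engine].format(args=args)

print(build_submit_command("spark", "etl_job.py"))  # spark-submit etl_job.py
```

Keeping this mapping on the gateway is what decouples the scheduler from the compute engines: adding an engine means installing its client and adding one entry.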
In this embodiment, the distributed scheduling method for big data processing adopts the distributed scheduling system for big data processing described in this embodiment, and comprises the following steps:
receiving the dependency configuration and job development of a workflow, and persisting the configured workflow through an API into a to-be-executed workflow table in the relational database;
segmenting the workflow configured by the scheduling center according to its dependency relationships, and sending the segmented task nodes to follower nodes;
executing the specific computing tasks distributed by the leader module, submitting task results and storing task execution logs; and
periodically taking tasks to be executed out of the database, and load-balancing the leader modules with a Round-Robin algorithm according to the current load of each leader module.
As shown in fig. 2 and 3, the specific flow of the method is as follows:
The scheduling center module is responsible for the dependency configuration and job development of workflows; each configured workflow is persisted through the API into the to-be-executed workflow table in the relational database. Workflow dependencies are described in JSON, with the ids of each job's predecessor and successor jobs stored in that job's JSON data.
The coordinator service (i.e., the coordinator) periodically scans the to-be-executed workflow table in the relational database to obtain commands to be executed, periodically requests the load information of the Leader cluster from ZooKeeper (ZK), and uses a Round-Robin algorithm to assign each workflow to a Leader according to the current CPU and memory margin of each Leader machine. Finally, the workflow tagged with its Leader is sent to the process_instance topic in the message queue, where it waits for that Leader to consume the topic and execute the workflow.
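One plausible reading of "Round-Robin according to the CPU and memory margin" is to rotate over only those Leaders that still have headroom. The sketch below follows that reading; the load threshold and message fields are assumptions.

```python
from itertools import cycle

def assign_workflows(workflows, leader_loads, max_load=0.8):
    """Round-robin workflows over leaders whose load is below max_load.
    leader_loads: {host: load in [0, 1]}. The 0.8 cutoff is an assumption;
    the patent only says assignment respects CPU and memory margin."""
    eligible = [h for h, load in leader_loads.items() if load < max_load]
    if not eligible:
        raise RuntimeError("no leader has spare capacity")
    rr = cycle(eligible)
    # Tag each workflow with its leader, like the patent's leader_host_name field.
    return [{"workflow": wf, "leader_host_name": next(rr)} for wf in workflows]

msgs = assign_workflows(["wf1", "wf2", "wf3"],
                        {"leader1": 0.2, "leader2": 0.9, "leader3": 0.4})
print([m["leader_host_name"] for m in msgs])  # ['leader1', 'leader3', 'leader1']
```

Each resulting message would then be published to the process_instance topic; only the Leader whose host name matches the tag executes the workflow.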
The Leader consumes the process_instance topic in the message queue and judges from the leader_host_name field of the message whether it should execute the workflow. If so, it splits the workflow into several computing tasks according to the workflow's dependency relationships and estimates the computing resources each task needs. It obtains the load information of each machine in the Follower cluster from the ZooKeeper cluster and uses a Round-Robin algorithm to balance which executor receives each task; each split-out computing task is tagged with its executor's follower_host_name and sent to the task_instance topic in the message queue. While the workflow execution thread is suspended, the Leader consumes the data that Followers return through the task_instance_result topic of the message queue and updates the workflow's execution result. Once the execution state of the whole workflow reaches a final state, the execution result is persisted into the relational database.
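Splitting a workflow into tasks ordered by their dependency relationships is a topological sort. The patent does not name an algorithm; the sketch below uses Kahn's algorithm as one standard way to do it.

```python
from collections import defaultdict, deque

def split_workflow(pre_deps):
    """Order tasks so every task runs after all of its predecessors.
    pre_deps: {task: [predecessor tasks]}. Kahn's topological sort."""
    indegree = {t: len(pres) for t, pres in pre_deps.items()}
    successors = defaultdict(list)
    for task, pres in pre_deps.items():
        for p in pres:
            successors[p].append(task)
    # Start from the tasks with no unmet dependencies (sorted for determinism).
    ready = deque(sorted(t for t, d in indegree.items() if d == 0))
    order = []
    while ready:
        t = ready.popleft()
        order.append(t)
        for s in successors[t]:
            indegree[s] -= 1
            if indegree[s] == 0:
                ready.append(s)
    if len(order) != len(pre_deps):
        raise ValueError("dependency cycle detected")
    return order

deps = {"ingest": [], "clean": ["ingest"], "mine": ["clean"], "store": ["mine"]}
print(split_workflow(deps))  # ['ingest', 'clean', 'mine', 'store']
```

Tasks that end up adjacent with no path between them can be dispatched to Followers in parallel; the cycle check guards against a misconfigured workflow.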
The Follower consumes the data of the task_instance topic in the message queue and matches the follower_host_name to the corresponding executor for execution. After a task finishes, its execution result is written back to the task_instance_result topic in the message queue to wait for the Leader to consume it. Once the Leader has written the completed task results back to the relational database, the execution of the whole task flow is complete.
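The Follower's consume-execute-report loop can be sketched with in-memory queues standing in for the task_instance and task_instance_result topics; a real Follower would consume a message-queue topic and run the task through an engine client.

```python
import queue

def follower_loop(host, task_q, result_q, run_task):
    """Drain the task queue: run tasks addressed to this host, publish results.
    queue.Queue stands in for the MQ topics; message field names are assumptions."""
    while True:
        try:
            msg = task_q.get_nowait()
        except queue.Empty:
            return
        if msg["follower_host_name"] != host:
            continue  # in a real MQ, consumer-group routing would handle this
        ok = run_task(msg["task"])
        result_q.put({"task": msg["task"],
                      "status": "success" if ok else "failed"})

task_q, result_q = queue.Queue(), queue.Queue()
task_q.put({"task": "t1", "follower_host_name": "follower1"})
task_q.put({"task": "t2", "follower_host_name": "follower2"})
follower_loop("follower1", task_q, result_q, run_task=lambda t: True)
res = result_q.get()
print(res)  # {'task': 't1', 'status': 'success'}
```

The Leader then consumes `result_q` (the task_instance_result topic) to update the workflow state.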
As a scheduling system with distributed capabilities, fault-tolerant design is a core concern of the whole system, because distributed systems are inherently unreliable. Distributed fault tolerance of the whole scheduling system is realized on top of ZooKeeper. After the system starts, the Leaders and Followers register themselves under the leader_mechs and follower_mechs znodes on ZooKeeper, report the CPU and memory information of their own machines, and maintain heartbeats. Every Leader and Follower watches these znodes, and once a Leader or Follower is found to be down, a workflow fault-tolerance flow is entered, which comprises a Leader fault-tolerance flow and a Follower fault-tolerance flow.
As shown in fig. 4, in this embodiment the Leader fault-tolerance flow is specifically as follows:
Each machine in the Leader cluster watches the Leader znodes on ZooKeeper. Once a Leader crash is detected, a distributed-lock mechanism based on ZooKeeper is triggered: one of the surviving Leaders acquires the distributed lock, triggers the workflow fault-tolerance logic, inserts the information of the workflows that need fault tolerance into the fault-tolerance command table in the relational database, and then takes over those workflows, completing the Leader's distributed fault-tolerance process. As shown in fig. 4, after Leader1 goes down, Leader2 acquires the distributed lock, triggers the fault-tolerance logic, inserts the workflow information into the fault-tolerance command table, and takes over the workflows.
In this embodiment, the Follower fault-tolerance flow is specifically as follows:
Each machine in the Follower cluster registers itself as a znode on the ZooKeeper cluster. If a Follower executing tasks goes down, the Leader's watch mechanism is triggered: all tasks running on the downed Follower are stopped, the Leader marks the workflow as being in a fault-tolerant state, and a surviving Follower is reselected as the executor of the workflow's remaining tasks.
As shown in fig. 5, when Follower1 goes down, its tasks are re-executed by Follower2.
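The reassignment step can be sketched as moving the downed Follower's running tasks to a survivor. Picking the least-loaded survivor is our assumption; the patent only says a surviving Follower is reselected.

```python
def failover_tasks(assignments, dead_follower):
    """Reassign a downed follower's running tasks to the least-loaded survivor.
    assignments: {follower_host: [running task ids]}. Returns new assignments."""
    survivors = {h: list(tasks) for h, tasks in assignments.items()
                 if h != dead_follower}
    orphaned = assignments.get(dead_follower, [])
    if survivors and orphaned:
        # Least-loaded survivor takes the orphaned tasks (an assumption).
        target = min(survivors, key=lambda h: len(survivors[h]))
        survivors[target].extend(orphaned)
    return survivors

before = {"follower1": ["t1", "t2"], "follower2": ["t3"]}
print(failover_tasks(before, "follower1"))  # {'follower2': ['t3', 't1', 't2']}
```

In the running system the Leader would also republish the orphaned tasks to the task_instance topic tagged with the new follower_host_name.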
In this embodiment, a storage medium stores a computer-readable program which, when invoked, executes the steps of the distributed scheduling method for big data processing described in this embodiment.
Claims (9)
1. A big data processing-oriented distributed scheduling system, comprising:
the scheduling center module, which is responsible for the dependency configuration and job development of workflows, and persists each configured workflow through an API into a to-be-executed workflow table in the relational database;
the leader module, which serves as the task-flow segmentation and distribution node in the cluster, segments the workflow configured by the scheduling center according to its dependency relationships, and sends the segmented task nodes to the follower nodes;
the follower module, also called the executor, which executes the specific computing tasks distributed by the leader module, submits task results, and stores task execution logs;
the coordinator module, which periodically takes tasks to be executed out of the database, and load-balances the leader modules with a Round-Robin algorithm according to the current load of each leader module;
the task queue module, which is a message queue comprising a workflow topic, a task topic and a task result topic, and is used to realize task dependencies between workflows; and
the metadata module, which comprises two databases, a relational database and a distributed in-memory database: the relational database persistently stores workflow execution records, and the distributed in-memory database takes workflow-related metadata out of the relational database and loads it into memory.
2. A big-data-processing-oriented distributed scheduling method, using the big-data-processing-oriented distributed scheduling system of claim 1, the method comprising the following steps:
receiving workflow dependency configuration and job development, and persisting the configured workflow into the to-be-executed workflow table of the relational database through an API (application programming interface);
segmenting the workflow configured by the scheduling center according to its dependency relationships, and sending the segmented task nodes to follower nodes;
executing the specific computing tasks distributed by the leader module, submitting task results, and storing task execution logs;
and fetching tasks to be executed from the database at regular intervals, and load-balancing them across the leader modules with a Round-Robin algorithm according to the current load of each leader module.
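The Round-Robin dispatch step above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `Leader` dataclass, the headroom thresholds, and the host names are all assumed for the example.

```python
from dataclasses import dataclass
from itertools import cycle

@dataclass
class Leader:
    host: str
    cpu_free: float   # fraction of CPU headroom (reported via the registry)
    mem_free: float   # fraction of memory headroom

def make_round_robin(leaders, cpu_min=0.2, mem_min=0.2):
    """Round-Robin over the leader ring, skipping overloaded machines."""
    ring = cycle(leaders)
    def pick():
        for _ in range(len(leaders)):
            cand = next(ring)
            if cand.cpu_free >= cpu_min and cand.mem_free >= mem_min:
                return cand
        raise RuntimeError("no leader has enough CPU/memory headroom")
    return pick

leaders = [Leader("leader-1", 0.6, 0.5),
           Leader("leader-2", 0.1, 0.9),   # low CPU headroom: skipped
           Leader("leader-3", 0.7, 0.8)]
pick = make_round_robin(leaders)
assigned = [pick().host for _ in range(4)]
# → ["leader-1", "leader-3", "leader-1", "leader-3"]
```

Keeping per-workflow state out of the picker means any coordinator instance can recreate it from the latest load snapshot.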
3. The big-data-processing-oriented distributed scheduling method of claim 2, wherein: the coordinator service periodically scans the to-be-executed workflow table in the relational database to obtain commands to be executed, periodically requests load information of the Leader cluster from ZooKeeper, and allocates each workflow to a Leader with a Round-Robin algorithm according to the current CPU and memory headroom of each Leader machine; finally, the workflow, tagged with its Leader, is sent to the process_instance topic in the message queue, where it waits for that Leader to consume the topic and execute the workflow.
4. The big-data-processing-oriented distributed scheduling method of claim 3, wherein: a Leader consumes the process_instance topic in the message queue and judges from the leader_host_name field of the message whether it should execute the workflow; if so, it splits the workflow into a plurality of computing tasks according to the workflow's dependency relationships and estimates the computing resources each task needs; it obtains load information of each machine in the Follower cluster from the ZooKeeper cluster and load-balances the tasks across executors with a Round-Robin algorithm; each independent computing task produced by the split is tagged with its executor's follower_host_name and sent to the task_instance topic in the message queue; while the workflow execution thread is suspended, the Leader consumes the data the Followers return through the task_instance_result topic of the message queue and updates the workflow's execution result; and once the execution state of the whole workflow reaches a final state, the workflow's result state is persisted into the relational database.
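The dependency-based split described above is, in essence, a topological ordering of the workflow DAG. A minimal sketch using Kahn's algorithm follows; the task names and the `split_workflow` helper are hypothetical, and the patent does not specify which ordering algorithm is used.

```python
from collections import defaultdict, deque

def split_workflow(deps):
    """Kahn's algorithm: order tasks so each runs after its dependencies.

    deps maps task -> list of upstream tasks it depends on.
    Returns a dispatchable order; raises on a dependency cycle.
    """
    indegree = {t: len(up) for t, up in deps.items()}
    downstream = defaultdict(list)
    for t, ups in deps.items():
        for u in ups:
            downstream[u].append(t)
    ready = deque(sorted(t for t, d in indegree.items() if d == 0))
    order = []
    while ready:
        t = ready.popleft()
        order.append(t)
        for d in sorted(downstream[t]):
            indegree[d] -= 1
            if indegree[d] == 0:
                ready.append(d)
    if len(order) != len(deps):
        raise ValueError("workflow has a dependency cycle")
    return order

# extract -> clean -> aggregate; report needs both clean and aggregate
deps = {"extract": [], "clean": ["extract"],
        "aggregate": ["clean"], "report": ["clean", "aggregate"]}
# split_workflow(deps) → ["extract", "clean", "aggregate", "report"]
```

In the claimed system, each task in this order would additionally carry its estimated resources and its follower_host_name tag before being published to the task topic.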
5. The big-data-processing-oriented distributed scheduling method of claim 4, wherein: after a task finishes executing, its execution result is written back to the task_instance_result topic in the message queue, where it waits for the Leader to consume it; after the Leader writes the executed task's result back to the relational database, execution of the whole task flow is complete.
6. The big-data-processing-oriented distributed scheduling method according to claim 4 or 5, wherein: after the system starts, each Leader and Follower registers itself with the leader_mechs and follower_mechs znodes on ZooKeeper, reports its machine's CPU and memory information, and maintains a heartbeat; each Leader and Follower watches these znodes; once a Leader or Follower is found to be down, the workflow fault-tolerance flow is entered, which comprises a Leader fault-tolerance flow and a Follower fault-tolerance flow.
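The register-and-heartbeat scheme above can be sketched without a live ZooKeeper by letting an in-memory registry stand in for the leader_mechs/follower_mechs znodes (in a real deployment these would be ephemeral znodes watched by every node). All class, field, and host names here are illustrative assumptions.

```python
class Registry:
    """In-memory stand-in for the leader_mechs / follower_mechs znodes."""

    def __init__(self, timeout=3.0):
        self.timeout = timeout   # seconds of silence before a node is "down"
        self.nodes = {}          # host -> role, last heartbeat, load info

    def register(self, host, role, cpu_free, mem_free, now):
        self.nodes[host] = {"role": role, "beat": now,
                            "cpu_free": cpu_free, "mem_free": mem_free}

    def heartbeat(self, host, now):
        self.nodes[host]["beat"] = now

    def down_nodes(self, now):
        """Hosts with a stale heartbeat; these trigger the fault-tolerance flow."""
        return sorted(h for h, info in self.nodes.items()
                      if now - info["beat"] > self.timeout)

reg = Registry(timeout=3.0)
reg.register("leader-1", "leader", 0.5, 0.5, now=0.0)
reg.register("follower-1", "follower", 0.8, 0.7, now=0.0)
reg.heartbeat("leader-1", now=5.0)   # follower-1 misses its heartbeats
# reg.down_nodes(now=5.0) → ["follower-1"]
```

ZooKeeper's ephemeral znodes give the same effect natively: when a session's heartbeats stop, the znode disappears and watchers on the parent are notified.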
7. The big-data-processing-oriented distributed scheduling method of claim 6, wherein the Leader fault-tolerance flow specifically comprises:
each machine of the Leader cluster watches the znode on ZooKeeper; once a Leader crash is detected, a ZooKeeper-based distributed lock mechanism is triggered: one of the surviving Leaders acquires the distributed lock, the workflow fault-tolerance logic fires, the information of the workflows needing fault tolerance is inserted into a fault-tolerant command table in the relational database, and the Leader holding the distributed lock takes over those workflows, completing the Leader's distributed fault-tolerance flow.
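The race among surviving Leaders can be sketched with a process-local lock standing in for the ZooKeeper lock recipe (e.g. kazoo's `Lock`); exactly one contender wins the non-blocking acquire and takes over. Every name below is an illustrative assumption, not the patent's implementation.

```python
import threading

takeover_lock = threading.Lock()   # stands in for the ZooKeeper distributed lock
fault_tolerant_commands = []       # stands in for the fault-tolerant command table
winner = []

def on_leader_crash(my_host, crashed_host):
    # Non-blocking acquire: exactly one surviving Leader wins the race.
    if takeover_lock.acquire(blocking=False):
        fault_tolerant_commands.append(
            {"workflow_host": crashed_host, "taken_over_by": my_host})
        winner.append(my_host)
        # In practice the lock is held until takeover completes, then released.

survivors = ["leader-2", "leader-3"]
threads = [threading.Thread(target=on_leader_crash, args=(h, "leader-1"))
           for h in survivors]
for t in threads: t.start()
for t in threads: t.join()
# exactly one survivor has taken over leader-1's workflows
```

Writing the takeover intent to a durable command table before acting means the takeover itself can be replayed if the winning Leader also fails mid-recovery.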
8. The big-data-processing-oriented distributed scheduling method of claim 7, wherein the Follower fault-tolerance flow specifically comprises:
each machine of the Follower cluster registers itself with a znode on ZooKeeper; if a Follower executing tasks goes down, the Leader's monitoring mechanism is triggered, all tasks currently running on the downed Follower are stopped, the Leader marks the workflow as being in a fault-tolerant state, and a surviving Follower is reselected as the executor of the workflow's remaining tasks.
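The reselection step above can be sketched as follows; the task names, host names, and the round-robin spread over survivors are illustrative assumptions (the patent only requires that a surviving Follower be chosen).

```python
def reassign_tasks(assignments, down_host, survivors):
    """Move every task assigned to the downed follower onto survivors.

    assignments maps task -> follower host. Surviving hosts are reused
    round-robin so the remaining work is spread evenly.
    """
    if not survivors:
        raise RuntimeError("no surviving follower to take over")
    moved = sorted(t for t, h in assignments.items() if h == down_host)
    for i, task in enumerate(moved):
        assignments[task] = survivors[i % len(survivors)]
    return moved

assignments = {"t1": "follower-1", "t2": "follower-2", "t3": "follower-1"}
moved = reassign_tasks(assignments, "follower-1", survivors=["follower-2"])
# t1 and t3 now run on follower-2; the Leader marks the workflow fault-tolerant
```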
9. A storage medium in which a computer-readable program is stored, wherein the computer-readable program, when invoked by a processor, performs the steps of the big-data-processing-oriented distributed scheduling method of any one of claims 2 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011069582.0A CN112162841A (en) | 2020-09-30 | 2020-09-30 | Distributed scheduling system, method and storage medium for big data processing |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112162841A true CN112162841A (en) | 2021-01-01 |
Family
ID=73861133
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112162841A (en) |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060101081A1 (en) * | 2004-11-01 | 2006-05-11 | Sybase, Inc. | Distributed Database System Providing Data and Space Management Methodology |
US20150067028A1 (en) * | 2013-08-30 | 2015-03-05 | Indian Space Research Organisation | Message driven method and system for optimal management of dynamic production workflows in a distributed environment |
WO2015081808A1 (en) * | 2013-12-03 | 2015-06-11 | Tencent Technology (Shenzhen) Company Limited | Method and apparatus for data transmission |
US20160182614A1 (en) * | 2014-12-23 | 2016-06-23 | Cisco Technology, Inc. | Elastic scale out policy service |
CN107766147A (en) * | 2016-08-23 | 2018-03-06 | 上海宝信软件股份有限公司 | Distributed data analysis task scheduling system |
CN110888719A (en) * | 2019-09-18 | 2020-03-17 | 广州市巨硅信息科技有限公司 | Distributed task scheduling system and method based on web service |
CN111209301A (en) * | 2019-12-29 | 2020-05-29 | 南京云帐房网络科技有限公司 | Method and system for improving operation performance based on dependency tree splitting |
CN111338774A (en) * | 2020-02-21 | 2020-06-26 | 华云数据有限公司 | Distributed timing task scheduling system and computing device |
CN111400017A (en) * | 2020-03-26 | 2020-07-10 | 华泰证券股份有限公司 | Distributed complex task scheduling method |
Non-Patent Citations (2)
Title |
---|
Fang Shudong et al.: "Hadoop Big Data Technology and Applications" (《Hadoop大数据技术与应用》), vol. 1, 31 January 2020, Zhejiang Science and Technology Press, pages: 285 - 288 * |
Xu Xin: "Information Analysis Methods Based on Text Feature Computation" (《基于文本特征计算的信息分析方法》), vol. 1, 30 November 2015, Shanghai Scientific and Technological Literature Press, pages: 19 - 20 * |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113010295A (en) * | 2021-03-30 | 2021-06-22 | 中信银行股份有限公司 | Stream computing method, device, equipment and storage medium |
CN113010295B (en) * | 2021-03-30 | 2024-06-11 | 中信银行股份有限公司 | Stream computing method, device, equipment and storage medium |
CN113438281A (en) * | 2021-06-05 | 2021-09-24 | 济南浪潮数据技术有限公司 | Storage method, device, equipment and readable medium of distributed message queue |
CN113438281B (en) * | 2021-06-05 | 2023-02-28 | 济南浪潮数据技术有限公司 | Storage method, device, equipment and readable medium of distributed message queue |
CN113535362A (en) * | 2021-07-26 | 2021-10-22 | 北京计算机技术及应用研究所 | Distributed scheduling system architecture and micro-service workflow scheduling method |
CN113821322A (en) * | 2021-09-10 | 2021-12-21 | 浙江数新网络有限公司 | Loosely-coupled distributed workflow coordination system and method |
CN115840631A (en) * | 2023-01-04 | 2023-03-24 | 中科金瑞(北京)大数据科技有限公司 | RAFT-based high-availability distributed task scheduling method and equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112162841A (en) | Distributed scheduling system, method and storage medium for big data processing | |
US11366797B2 (en) | System and method for large-scale data processing using an application-independent framework | |
CN107038069B (en) | Dynamic label matching DLMS scheduling method under Hadoop platform | |
US10042886B2 (en) | Distributed resource-aware task scheduling with replicated data placement in parallel database clusters | |
US9542223B2 (en) | Scheduling jobs in a cluster by constructing multiple subclusters based on entry and exit rules | |
US8914805B2 (en) | Rescheduling workload in a hybrid computing environment | |
US8739171B2 (en) | High-throughput-computing in a hybrid computing environment | |
US7650331B1 (en) | System and method for efficient large-scale data processing | |
CN114741207B (en) | GPU resource scheduling method and system based on multi-dimensional combination parallelism | |
Ju et al. | iGraph: an incremental data processing system for dynamic graph | |
CN107463442B (en) | Satellite-borne multi-core SoC task level load balancing parallel scheduling method | |
CN112114973B (en) | Data processing method and device | |
Shen et al. | Defuse: A dependency-guided function scheduler to mitigate cold starts on faas platforms | |
CN115373835A (en) | Task resource adjusting method and device for Flink cluster and electronic equipment | |
CN112256414A (en) | Method and system for connecting multiple computing storage engines | |
US11748164B2 (en) | FAAS distributed computing method and apparatus | |
CN111459622A (en) | Method and device for scheduling virtual CPU, computer equipment and storage medium | |
Bao et al. | BC-BSP: A BSP-based parallel iterative processing system for big data on cloud architecture | |
CN112948096A (en) | Batch scheduling method, device and equipment | |
CN113821322A (en) | Loosely-coupled distributed workflow coordination system and method | |
CN107528871A (en) | Data analysis in storage system | |
CN112148546A (en) | Static safety analysis parallel computing system and method for power system | |
CN114237858A (en) | Task scheduling method and system based on multi-cluster network | |
JP2021060707A (en) | Synchronization control system and synchronization control method | |
Xie et al. | A resource scheduling algorithm based on trust degree in cloud computing |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||