CN113900788A - Distributed work scheduling method and distributed workflow engine system - Google Patents

Distributed work scheduling method and distributed workflow engine system

Info

Publication number
CN113900788A
Authority
CN
China
Prior art keywords
leader
broker
partition
logical partition
worker
Prior art date
Legal status
Pending
Application number
CN202111220515.9A
Other languages
Chinese (zh)
Inventor
王宏志
周效军
Current Assignee
China Mobile Communications Group Co Ltd
MIGU Culture Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
MIGU Culture Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, MIGU Culture Technology Co Ltd filed Critical China Mobile Communications Group Co Ltd
Priority to CN202111220515.9A priority Critical patent/CN113900788A/en
Publication of CN113900788A publication Critical patent/CN113900788A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/18File system types
    • G06F16/1805Append-only file systems, e.g. using logs or journals to store data
    • G06F16/1815Journaling file systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061Partitioning or combining of resources
    • G06F9/5077Logical partitioning of resources; Management or configuration of virtualized resources
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083Techniques for rebalancing the load in a distributed system
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/54Interprogram communication
    • G06F9/542Event management; Broadcasting; Multicasting; Notifications

Abstract

The invention discloses a distributed work scheduling method and a distributed workflow engine system. The distributed work scheduling method comprises the following steps: a Broker in a workflow engine Broker cluster performs process scheduling and task publishing so that each task in a process instance is executed by exactly one Worker; the Broker cluster is divided into a plurality of logical partitions, and different logical partitions execute the scheduling tasks of different process instances; the Worker pulls executable tasks from the Broker cluster, executes them, returns the execution results to the corresponding Broker, and drives the process to continue executing. By separating workflow scheduling from the entities that execute work tasks and using a partitioned Broker cluster architecture, horizontal scalability of performance and high reliability are supported.

Description

Distributed work scheduling method and distributed workflow engine system
Technical Field
The invention relates to the technical field of communication, in particular to a distributed work scheduling method and a distributed workflow engine system.
Background
Orchestration of microservices requires the assistance of a workflow engine.
Workflow engines in the prior art suffer from low processing performance because they store their data in relational databases, so their throughput is bounded by the database's processing capacity and cannot be raised further.
Disclosure of Invention
In view of the above, embodiments of the present invention are proposed to provide a distributed work scheduling method and a distributed workflow engine system that overcome or at least partially solve the above problems.
According to an embodiment of the present invention, a distributed job scheduling method is provided, including:
a Broker in the Broker cluster performs process scheduling and task publishing, so that each task in a process instance is executed by exactly one Worker; the Broker cluster is divided into a plurality of logical partitions, and different logical partitions execute the scheduling tasks of different process instances;
and the Worker pulls executable tasks from the Broker cluster, executes them, returns the execution results to the corresponding Broker, and drives the process to continue executing.
According to an embodiment of the present invention, there is provided a distributed workflow engine system including:
a workflow engine Broker cluster and a Worker; the Broker cluster is divided into a plurality of logical partitions, and different logical partitions execute the scheduling tasks of different process instances;
the Broker is used for scheduling the process and issuing the tasks, so that each task in the process instance is executed by a unique Worker;
the Worker is used for pulling the executable task from the Broker cluster, executing and returning the execution result to the corresponding Broker, and driving the process to continue executing.
According to the scheme provided by the embodiments of the invention, workflow scheduling is separated from the entities that execute work tasks, and a partitioned Broker cluster architecture is used, so that horizontal scalability of performance and high reliability are supported.
The foregoing description is only an overview of the technical solutions of the embodiments of the present invention. In order that the technical means of the embodiments may be more clearly understood and implemented according to the content of the description, and in order to make the above and other objects, features, and advantages of the embodiments more readily apparent, detailed descriptions of specific embodiments are provided below.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the embodiments of the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
fig. 1 is a schematic flowchart illustrating a distributed job scheduling method according to an embodiment of the present invention;
fig. 2 is a schematic flow chart illustrating a process of acquiring node information through the gossip protocol in the distributed work scheduling method;
fig. 3 is a schematic flow chart illustrating sending a SYN message in a distributed work scheduling method according to an embodiment of the present invention;
fig. 4 is a schematic flowchart illustrating an ACK message reply in the distributed work scheduling method according to an embodiment of the present invention;
fig. 5 is a schematic flowchart illustrating a reply ACK2 message in the distributed work scheduling method according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a distributed workflow engine system according to a second embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a distributed workflow engine system according to a third embodiment of the present invention;
FIG. 8 is a flow chart illustrating a leader event loop in the distributed workflow engine system according to a third embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
Example one
Fig. 1 shows a flowchart of a distributed work scheduling method according to an embodiment of the present invention. The execution entities of this embodiment are a workflow engine Broker and a Worker, and the method is applicable to the distributed workflow engine systems shown in fig. 7 and fig. 8. As shown in fig. 1, the distributed work scheduling method includes:
step S110, the Broker in the Broker cluster carries out flow scheduling and task issuing so that each task in the flow example is executed by a unique Worker; the Broker cluster is divided into a plurality of logic partitions, and different logic partitions execute scheduling tasks of different process instances.
In this embodiment, the execution entities are divided into two roles according to their functions: a distributed workflow engine Broker, and a Worker that executes user-defined logic. The relationship between the two roles is shown in fig. 7. The Broker is responsible only for process scheduling and task publishing, and ensures that each task in a process instance is executed by exactly one Worker. Each Worker is responsible only for pulling executable process tasks from the cluster, executing them, and returning the task execution results to the Broker, which then drives the process to continue executing. The two roles are loosely coupled, which gives good functional and performance extensibility.
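For illustration, the following is a minimal Java sketch of a Worker-side pull/execute/complete loop under this division of responsibilities. The client interface and its method names (BrokerClient, pollTask, completeTask, failTask) are assumptions made for the example and are not prescribed by the embodiment.

```java
// Minimal sketch of a Worker loop, assuming a hypothetical BrokerClient interface.
// The Broker schedules the process; the Worker only pulls tasks, runs user logic,
// and reports the result so the Broker can drive the process forward.
import java.time.Duration;
import java.util.Optional;

interface BrokerClient {
    Optional<Task> pollTask(String taskType, Duration timeout); // pull an executable task
    void completeTask(long taskKey, String resultJson);         // report success
    void failTask(long taskKey, String errorMessage);           // report failure
}

record Task(long key, long processInstanceId, String variablesJson) {}

final class Worker implements Runnable {
    private final BrokerClient client;
    private final String taskType;

    Worker(BrokerClient client, String taskType) {
        this.client = client;
        this.taskType = taskType;
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            // 1. Pull an executable task from the Broker cluster (long poll).
            Optional<Task> task = client.pollTask(taskType, Duration.ofSeconds(5));
            if (task.isEmpty()) continue;
            try {
                // 2. Execute the user-defined logic.
                String result = handle(task.get());
                // 3. Return the result; the Broker drives the process to its next step.
                client.completeTask(task.get().key(), result);
            } catch (Exception e) {
                client.failTask(task.get().key(), e.getMessage());
            }
        }
    }

    private String handle(Task task) {
        // User-defined logic would go here; echo the variables as a placeholder.
        return task.variablesJson();
    }
}
```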
Each logical partition may contain a plurality of Brokers. All Brokers of the Broker cluster load the process definitions when a process is deployed; Brokers in the same logical partition store the same process instance data, and Brokers in different logical partitions store different process instance data.
Optionally, each logical partition may have its own log file and database. The log file may be a write-ahead log (WAL) file. The database may be the embedded key-value store RocksDB, which ensures reliable and efficient data transfer and storage.
Further, in order to reduce the amount of data stored in RocksDB and improve its read performance, RocksDB may store only the definitions of deployed processes and the data of process instances that are still executing, and not the data of process instances that have already completed.
Step S120, the Worker pulls executable tasks from the Broker cluster, executes them, returns the execution results to the corresponding Broker, and drives the process to continue executing.
The present embodiment supports horizontal scalability of performance and high reliability by separating workflow scheduling from the entities that execute work tasks and using a partitioned Broker cluster architecture.
In an alternative embodiment, one Broker in each logical partition is the leader and the remaining Brokers are followers; the followers are replicas of the leader, and the leader maintains a communication connection with the followers, for example a heartbeat connection. The following table shows an example of a distributed cluster containing 5 process engine instances, 5 logical partitions, and 2 followers (replicas) per logical partition (a code sketch of the round-robin assignment that produces such a layout follows the table):
            Broker0   Broker1   Broker2   Broker3   Broker4
Partition 0 Leader    Follower  Follower  -         -
Partition 1 -         Leader    Follower  Follower  -
Partition 2 -         -         Leader    Follower  Follower
Partition 3 Follower  -         -         Leader    Follower
Partition 4 Follower  Follower  -         -         Leader
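A layout like the one in the table can be produced by a simple round-robin assignment of each partition's leader and followers to the Brokers. The sketch below is an assumed way of computing such a layout; the embodiment only specifies the resulting distribution.

```java
// Sketch: round-robin assignment of partition leaders and followers to brokers,
// reproducing the 5-broker / 5-partition / replication-factor-3 layout above.
public final class PartitionLayout {
    public static void main(String[] args) {
        int brokers = 5, partitions = 5, replicationFactor = 3; // 1 leader + 2 followers
        String[][] table = new String[partitions][brokers];
        for (int p = 0; p < partitions; p++) {
            for (int r = 0; r < replicationFactor; r++) {
                int broker = (p + r) % brokers;      // leader at offset 0, followers after it
                table[p][broker] = (r == 0) ? "Leader" : "Follower";
            }
        }
        for (int p = 0; p < partitions; p++) {
            StringBuilder row = new StringBuilder("Partition " + p);
            for (int b = 0; b < brokers; b++) {
                row.append('\t').append(table[p][b] == null ? "-" : table[p][b]);
            }
            System.out.println(row);
        }
    }
}
```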
Correspondingly, step S110 specifically includes:
step S1101, the leader processes the command sent by the Worker in the same logical partition.
Step S1102, a follower in the same logical partition receives command and/or event synchronization messages from the leader of the same logical partition, writes them into the log file of the same logical partition, and backs up the data of each process instance into the database of the same logical partition.
In this embodiment, the scheduling tasks of different process instances are distributed across different logical partitions, which provides horizontal scalability of performance, and the data of each process instance is backed up by multiple replicas (followers) within its logical partition, which provides high reliability of the process scheduling.
In an optional embodiment, the method further comprises: the leader is generated by dynamic election through the ZooKeeper Atomic Broadcast (ZAB) protocol.
Specifically, in this embodiment, the division of the workflow engine cluster into logical partitions is static and may not be modified after the cluster is built, but the leader within a logical partition is generated by dynamic election through the ZAB protocol while the cluster is running. If a leader finds that its heartbeat messages with more than half of the nodes are abnormal, it stops serving external requests but keeps trying to reconnect to the other nodes (Brokers) of the logical partition; once heartbeats with more than half of the nodes are restored, a new round of election is initiated to produce a new leader. Similarly, if a follower detects a heartbeat anomaly with the leader, a new round of election is initiated to elect a new leader.
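The "more than half of the nodes" rule can be illustrated with a small quorum check that a leader could run against the heartbeats it receives. The timeout value and class shape below are assumptions for illustration only.

```java
// Sketch of the majority-heartbeat rule: the leader keeps serving external requests
// only while it has recent heartbeats from a strict majority of the partition's
// nodes (itself included).
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class LeaderQuorumCheck {
    private final int clusterSize;                        // nodes in this logical partition
    private final long heartbeatTimeoutMs = 3_000;        // assumed heartbeat timeout
    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

    LeaderQuorumCheck(int clusterSize) { this.clusterSize = clusterSize; }

    void onHeartbeat(String followerId) {
        lastHeartbeat.put(followerId, System.currentTimeMillis());
    }

    /** True while the leader may keep serving; false means it should step down and wait for recovery. */
    boolean hasQuorum() {
        long now = System.currentTimeMillis();
        long healthy = 1 + lastHeartbeat.values().stream()               // 1 = the leader itself
                .filter(t -> now - t <= heartbeatTimeoutMs).count();
        return healthy > clusterSize / 2;                                // strict majority
    }
}
```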
In an alternative embodiment, before step S1101 is performed, the method further comprises: configuring at least one seed node for each Broker; after starting, each Broker connects to its configured seed nodes to acquire the leader and follower information of each logical partition.
Specifically, a Worker may connect to any Broker in the workflow engine cluster and still access the cluster service, so each Broker in the cluster needs to be able to obtain the leader and follower information of every logical partition. Therefore, this embodiment configures several seed nodes for each Broker (the seed nodes are themselves Brokers). After a Broker starts, it connects to its seed nodes and acquires the latest leader and follower information of each logical partition through the Gossip protocol.
The Gossip protocol messages comprise three types: SYN, ACK, and ACK2. The specific acquisition process is shown in fig. 2, 3, 4, and 5.
Specifically, each node (Broker) in the workflow engine cluster maintains a Map named endpointMembers that stores the state of every node in the cluster. The key of the Map is a GossipMember containing the node address and the node state, and the value is an EndPointState containing a HeartbeatState and an ApplicationState. The HeartbeatState has two members: heartbeatTime, the local absolute time in milliseconds when the node sends a SYN message, and version, a version number that is incremented by 1 each time a SYN message is sent. When comparing which of two heartbeat states is newer, heartbeatTime is compared first: the one with the larger heartbeatTime is newer; if the heartbeatTimes are equal, the one with the larger version is newer. The ApplicationState contains the node's current ZAB protocol role and its partition number.
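The membership state described above can be sketched with the following Java types. The names (endpointMembers, GossipMember, EndPointState, HeartbeatState, ApplicationState) follow the description; their exact fields and shapes are assumptions.

```java
// Sketch of the gossip membership state: endpointMembers maps a GossipMember
// (node address + online state) to an EndPointState (heartbeat + application state).
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

enum ZabRole { LEADER, FOLLOWER }

final class GossipMember {
    final String address;
    volatile boolean online = true;        // node state; identity is the address only

    GossipMember(String address) { this.address = address; }

    @Override public boolean equals(Object o) {
        return o instanceof GossipMember m && m.address.equals(address);
    }
    @Override public int hashCode() { return address.hashCode(); }
}

final class HeartbeatState {
    long heartbeatTime;                    // local absolute time (ms) of the last SYN sent
    long version;                          // incremented by 1 for every SYN sent

    /** Newer if heartbeatTime is larger; on equal heartbeatTime, the larger version wins. */
    boolean isNewerThan(HeartbeatState other) {
        if (heartbeatTime != other.heartbeatTime) return heartbeatTime > other.heartbeatTime;
        return version > other.version;
    }
}

record ApplicationState(ZabRole zabRole, int partitionId) {}

final class EndPointState {
    final HeartbeatState heartbeat = new HeartbeatState();
    ApplicationState application;          // current ZAB role and partition number
}

final class GossipView {
    // endpointMembers: this node's view of every cluster member's state.
    final Map<GossipMember, EndPointState> endpointMembers = new ConcurrentHashMap<>();
}
```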
As shown in fig. 3, each node in the workflow engine cluster performs the following SYN-sending process every 1 second (a code sketch of this loop is given after the steps):
Step S21, update the local node's heartbeat state.
Step S22, scan the Map endpointMembers and construct a SYN message that contains a list of GossipDigests.
Here, a GossipDigest is a digest of a node's state containing 3 fields: the node address, the heartbeatTime, and the version.
Step S23, randomly select a node from the online node list and send it the SYN message; if the online node list is empty, skip this step.
Step S24, randomly select a node from the offline node list and send it the SYN message; if the offline node list is empty, skip this step.
Step S25, if no message was sent in step S23, or the number of online nodes is less than the number of seed nodes, randomly select a node from the seed node list and send it the SYN message.
Step S26, scan the Map endpointMembers; for each node's information, compute the duration as the current system time minus that node's heartbeatTime, and if the duration exceeds a specified threshold, mark the corresponding GossipMember as offline.
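A minimal sketch of this once-per-second send loop (steps S21 to S26) is given below. It uses simplified in-memory membership records and a print statement in place of real message transport; the threshold value and type names are assumptions and are independent of the previous sketch.

```java
// Sketch of the periodic SYN round: refresh own heartbeat, build digests, gossip to
// one online peer, one offline peer, possibly a seed node, then expire stale members.
import java.util.*;
import java.util.concurrent.*;

final class SynLoop {
    record Digest(String address, long heartbeatTime, long version) {}
    record NodeInfo(long heartbeatTime, long version, boolean online) {}

    final Map<String, NodeInfo> endpointMembers = new ConcurrentHashMap<>();
    final List<String> seedNodes;
    final String self;
    final long offlineThresholdMs = 10_000;   // assumed threshold for step S26
    final Random random = new Random();

    SynLoop(String self, List<String> seedNodes) { this.self = self; this.seedNodes = seedNodes; }

    void start() {
        Executors.newSingleThreadScheduledExecutor()
                 .scheduleAtFixedRate(this::round, 1, 1, TimeUnit.SECONDS);
    }

    void round() {
        // S21: refresh the local node's heartbeat state.
        NodeInfo me = endpointMembers.getOrDefault(self, new NodeInfo(0, 0, true));
        endpointMembers.put(self, new NodeInfo(System.currentTimeMillis(), me.version() + 1, true));

        // S22: build the SYN message as a list of digests over endpointMembers.
        List<Digest> syn = endpointMembers.entrySet().stream()
                .map(e -> new Digest(e.getKey(), e.getValue().heartbeatTime(), e.getValue().version()))
                .toList();

        // S23 / S24: one random online peer and one random offline peer, if any.
        boolean sentToLive = sendToRandom(peers(true), syn);
        sendToRandom(peers(false), syn);

        // S25: fall back to a seed node when nothing was sent or the live set is small.
        if (!sentToLive || peers(true).size() < seedNodes.size()) sendToRandom(seedNodes, syn);

        // S26: mark members whose heartbeat is too old as offline.
        long now = System.currentTimeMillis();
        endpointMembers.replaceAll((addr, info) ->
                now - info.heartbeatTime() > offlineThresholdMs
                        ? new NodeInfo(info.heartbeatTime(), info.version(), false) : info);
    }

    private List<String> peers(boolean online) {
        return endpointMembers.entrySet().stream()
                .filter(e -> !e.getKey().equals(self) && e.getValue().online() == online)
                .map(Map.Entry::getKey).toList();
    }

    private boolean sendToRandom(List<String> candidates, List<Digest> syn) {
        if (candidates.isEmpty()) return false;
        String target = candidates.get(random.nextInt(candidates.size()));
        System.out.println("SYN -> " + target + " (" + syn.size() + " digests)");  // transport stub
        return true;
    }
}
```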
As shown in fig. 4, after receiving the SYN message sent by node A, node B replies with an ACK message as follows:
Step S31, compare the node information list sent by node A with node B's local Map endpointMembers, generating a list of GossipDigests named olders, and a Map named newers whose key is GossipMember and whose value is EndPointState.
Here, olders stores the GossipDigests for which node B's information is older than node A's, as well as the digests of nodes that node A reported but node B does not have; newers stores the EndPointState information for which node B's information is newer than node A's, as well as the EndPointState of nodes that node B has but node A does not.
Step S32, construct an ACK message and reply to node A; the ACK message contains the olders and newers data.
As shown in fig. 5, after receiving the ACK message sent by node B, node A replies with an ACK2 message as follows (a sketch of this digest reconciliation is given after the steps):
Step S41, update the locally stored node information according to the newers data carried by the ACK message.
Step S42, according to the olders data carried by the ACK message, query the local Map endpointMembers and construct a Map named deltaEndpoints whose key is GossipMember and whose value is EndPointState.
Step S43, construct an ACK2 message and reply to node B; the ACK2 message contains the deltaEndpoints data.
Step S44, node B updates its local node information according to the deltaEndpoints data carried by the ACK2 message.
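The digest reconciliation behind steps S31, S32 and S41 to S43 can be sketched as follows. The state shapes are simplified placeholders; only the comparison and the construction of olders, newers, and deltaEndpoints are illustrated.

```java
// Sketch of the SYN/ACK/ACK2 reconciliation: node B splits A's digests into "olders"
// (B's information is older or missing, so B asks A for it) and "newers" (B's
// information is newer, so B pushes it to A); node A applies the newers and answers
// the olders with deltaEndpoints.
import java.util.*;

final class GossipReconcile {
    record Digest(String address, long heartbeatTime, long version) {}
    record State(long heartbeatTime, long version, String payload) {
        int compareTo(Digest d) {
            int c = Long.compare(heartbeatTime, d.heartbeatTime());
            return c != 0 ? c : Long.compare(version, d.version());
        }
    }
    record Ack(List<Digest> olders, Map<String, State> newers) {}

    /** Node B handles a SYN (step S31): compare A's digests against B's endpointMembers. */
    static Ack onSyn(List<Digest> synDigests, Map<String, State> local) {
        List<Digest> olders = new ArrayList<>();
        Map<String, State> newers = new HashMap<>();
        Set<String> mentioned = new HashSet<>();
        for (Digest d : synDigests) {
            mentioned.add(d.address());
            State mine = local.get(d.address());
            int cmp = (mine == null) ? -1 : mine.compareTo(d);
            if (cmp < 0) olders.add(d);                      // B is behind or unaware: request from A
            else if (cmp > 0) newers.put(d.address(), mine); // B is ahead: push full state to A
        }
        // Nodes B knows about that A did not mention are also pushed to A.
        local.forEach((addr, st) -> { if (!mentioned.contains(addr)) newers.put(addr, st); });
        return new Ack(olders, newers);                      // step S32: reply with ACK(olders, newers)
    }

    /** Node A handles the ACK (steps S41-S43): apply newers, answer olders with deltaEndpoints. */
    static Map<String, State> onAck(Ack ack, Map<String, State> local) {
        local.putAll(ack.newers());                          // S41: adopt B's newer information
        Map<String, State> deltaEndpoints = new HashMap<>();
        for (Digest d : ack.olders()) {                      // S42: look up the full state B asked for
            State mine = local.get(d.address());
            if (mine != null) deltaEndpoints.put(d.address(), mine);
        }
        return deltaEndpoints;                               // S43: sent to B inside the ACK2
    }
}
```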
If, in each communication round, every node can select a node it has not yet contacted, the process of acquiring node (leader and follower) information through the Gossip protocol degenerates into a binary dissemination process: each round forms a balanced binary tree, so the information converges across the cluster with a time overhead of O(log n) rounds.
In an optional embodiment, step S110 specifically includes:
step S1103, the leader receives a command sent by the Worker in the same logical partition and puts the command into a queue of the memory.
And step S1104, the leader takes out the command from the queue, generates a transaction log, adds the transaction log to a log file of the same logic partition, and copies the transaction log to followers of the same logic partition.
In an optional embodiment, step S110 specifically includes:
step S1105, the leader calls the flow engine state machine to process the command sent by the Worker in the same logic partition, updates the flow instance data in the database of the same logic partition, and generates a task event to return to the Worker in the same logic partition;
step S1106, adding task events to a queue by the leader; and taking out the task event from the queue, adding the task event into a log file of the same logical partition, and copying the task event into followers of the same logical partition.
The follower backup can be realized through the steps S1103 to S1106, and the reliability of the workflow engine is ensured.
In an optional embodiment, the method further comprises:
Step S130, the leader periodically generates a snapshot file of the full data of the database instance of the same logical partition, records the current timestamp of the logical partition when the snapshot is created, and deletes the log files whose maximum timestamp is less than a preset time;
Step S140, the follower periodically pulls the latest database snapshot file from the leader and deletes all log files preceding the maximum timestamp recorded in the snapshot file.
Specifically, at a fixed interval (for example, 15 minutes), the leader generates a snapshot file of the full data of the database (for example, RocksDB) instance, records the current zxid of the logical partition (denoted s_zxid) when the snapshot is created, and deletes the WAL log files whose maximum zxid is smaller than s_zxid (because the modifications those log files made to RocksDB can be restored from the snapshot file). The followers of the logical partition also periodically try to pull the latest RocksDB snapshot file from the leader; after a pull succeeds, all local WAL log files before the maximum zxid recorded in the snapshot file are deleted, which prevents the number of WAL log files from growing without bound.
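A minimal sketch of this snapshot-driven WAL compaction is given below, assuming one WAL file per rolled segment and a file-naming convention that encodes the segment's largest zxid; both assumptions are made only for the example.

```java
// Sketch of snapshot-based WAL compaction: record the partition's current zxid
// (s_zxid) when a snapshot is taken, then delete WAL segments whose largest zxid
// is below it, since the snapshot already covers their effects.
import java.io.IOException;
import java.nio.file.*;
import java.util.stream.Stream;

final class WalCompactor {
    /** Assumed naming convention: wal-<maxZxidInHex>.log, one file per rolled segment. */
    static long maxZxidOf(Path walFile) {
        String name = walFile.getFileName().toString();       // e.g. wal-00000000000001f4.log
        return Long.parseLong(name.substring(4, name.length() - 4), 16);
    }

    /** Delete every WAL segment fully covered by the snapshot taken at sZxid. */
    static void compact(Path walDir, long sZxid) throws IOException {
        try (Stream<Path> files = Files.list(walDir)) {
            files.filter(p -> p.getFileName().toString().startsWith("wal-"))
                 .filter(p -> maxZxidOf(p) < sZxid)           // snapshot already covers it
                 .forEach(p -> { try { Files.delete(p); } catch (IOException ignored) {} });
        }
    }
}
```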
In an optional embodiment, the method further comprises:
s150, when the connection between the leader and the follower is abnormal, selecting a new leader in the same logic partition;
and step S160, the new leader reads the latest snapshot file in the database, restores the latest snapshot file to the database, reads and redos all transactions after the snapshot file from the log file, and restores the data in the database.
Specifically, when the original leader goes down and a new leader is generated by election, the new leader reads the latest snapshot file and restores the latest snapshot file to the database instance of the logical partition, reads and redoes all transactions after the snapshot file from the local log file and restores the database data without redoing all log files, so that the starting speed of the new leader is increased.
In addition, the maximum number of snapshot files reserved by the leader and the follower is also controlled by parameters, and the system only reserves the latest snapshot files designated by the user. By the aid of the method, the new leader does not need to redo all log files, and accordingly starting speed of the new leader is increased.
Example two
As shown in fig. 6, the distributed workflow engine system includes a workflow engine Broker cluster 11 and Workers 12. The Broker cluster 11 is divided into a plurality of logical partitions, and each logical partition includes a plurality of Brokers. When a process is deployed, all Brokers of the Broker cluster 11 load the process definitions; Brokers in the same logical partition store the same process instance data, Brokers in different logical partitions store different process instance data, and different logical partitions execute the scheduling tasks of different process instances. There may be multiple Workers 12.
The Broker is used for process scheduling and task publishing, so that each task in a process instance is executed by exactly one Worker 12;
The Worker 12 is used to pull executable tasks from the Broker cluster 11, execute them, return the execution results to the corresponding Broker, and drive the process to continue executing.
This embodiment supports horizontal scalability of performance and high reliability by separating workflow scheduling from the entities that execute work tasks and using a partitioned distributed workflow engine cluster architecture.
In an alternative embodiment, each logical partition contains a plurality of Brokers; all Brokers of the Broker cluster load the process definitions when a process is deployed, Brokers in the same logical partition store the same process instance data, and Brokers in different logical partitions store different process instance data.
In an alternative embodiment, each logical partition has its own log file and database. The log file may be a write-ahead log (WAL) file. The database may be the embedded key-value store RocksDB, which ensures reliable and efficient data transfer and storage.
In this embodiment, each logical partition has its own WAL log file and RocksDB instance. RocksDB stores only the definitions of deployed processes and the data of executing process instances, not the data of completed process instances, which reduces the amount of data stored in RocksDB and improves read performance.
In an alternative embodiment, one Broker in each logical partition is the leader and the remaining Brokers are followers; the followers are replicas of the leader, and the leader is communicatively coupled to the followers, for example through a heartbeat connection.
The leader is used for processing a command sent by a Worker in the same logic partition;
The follower in the same logical partition is used to receive command and/or event synchronization messages from the leader of the same logical partition, write them into the log file of the same logical partition, and back up the data of each process instance into the database of the same logical partition, thereby achieving high reliability of process scheduling. Applying the leader and follower pattern within each logical partition also guarantees the reliability of the workflow engine.
In an alternative embodiment, the leader is generated by dynamic election through the ZooKeeper Atomic Broadcast (ZAB) protocol.
Specifically, in this embodiment, the division of the workflow engine cluster into logical partitions is static and may not be modified after the cluster is built, but the leader within a logical partition is generated by dynamic election through the ZAB protocol while the cluster is running. If a leader finds that its heartbeat messages with more than half of the nodes are abnormal, it stops serving external requests but keeps trying to reconnect to the other nodes (Brokers) of the logical partition; once heartbeats with more than half of the nodes are restored, a new round of election is initiated to produce a new leader. Similarly, if a follower detects a heartbeat anomaly with the leader, a new round of election is initiated to elect a new leader.
In an optional embodiment, each Broker is configured with at least one seed node, and each Broker is connected with the corresponding seed node after being started to acquire leader and follower information of each logical partition.
Specifically, a Worker may connect to any Broker in the workflow engine cluster and still access the cluster service, so each Broker in the cluster needs to be able to obtain the leader and follower information of every logical partition. Therefore, this embodiment configures several seed nodes for each Broker (the seed nodes are themselves Brokers). After a Broker starts, it connects to its seed nodes and acquires the latest leader and follower information of each logical partition through the Gossip protocol.
In an optional embodiment, the leader is specifically configured to receive a command sent by a Worker in the same logical partition and place the command into an in-memory queue; and to take the command out of the queue, generate a transaction log entry, append it to the log file of the same logical partition, and replicate it to the followers of the same logical partition.
In an optional embodiment, the leader is specifically configured to invoke the process engine state machine to process the command, update process instance data in the database of the same logical partition, and generate a task event to return to the Worker in the same logical partition.
In an optional embodiment, the leader is specifically configured to add the task event to the queue; and taking out the task event from the queue, adding the task event into a log file of the same logical partition, and copying the task event into each follower of the same logical partition to realize backup of the followers.
In an optional embodiment, the leader is further configured to periodically generate a snapshot file of the full-volume data of the instances of the database of the same logical partition, record a current timestamp of the logical partition when the snapshot is created, and delete a log file whose maximum timestamp is less than a preset time;
and the follower is also used for periodically pulling the latest snapshot file in the database from the leader and deleting all log files before the maximum timestamp recorded in the snapshot file.
In an optional embodiment, when the connection between the leader and the followers is abnormal, a new leader is elected within the same logical partition. The new leader reads the latest snapshot file, restores it to the database, then reads and redoes from the log files all transactions that come after the snapshot, restoring the data in the database; this process does not require redoing all log files, which speeds up the startup of the new leader.
Example three
Fig. 7 is a schematic structural diagram illustrating a distributed workflow engine system according to a third embodiment of the present invention. This embodiment is a concrete embodiment. As shown in fig. 7, the execution entities are divided, according to their functions, into a workflow engine Broker and a Worker, and a plurality of workflow engine Brokers form a Broker cluster. The Broker is responsible only for process scheduling and task publishing, so that each task in a process instance is executed by exactly one Worker.
The Worker comprises two functional modules: a user-defined logic component and a distributed engine client extension. The distributed engine client extension implements the client functionality: the client registers the Worker with the Broker, pulls executable tasks from the Broker cluster, and returns the task execution results (task completion or task execution failure) to the Broker cluster. The user-defined logic component performs task execution by calling the client interface.
Because the Broker and the Worker are loosely coupled, the system is highly extensible.
Further, the Broker cluster is divided into a plurality of logical partitions. Each logical partition includes a plurality of Brokers, only one of which is the leader; the leader holds a connection to every follower of the logical partition, over which bidirectional heartbeat messages and other service messages are transmitted. Brokers in the same logical partition store the same process instance data. When a process is deployed, all leaders and followers of the workflow engine cluster load the process definition, and the Brokers of different logical partitions store different process instance data. In an alternative embodiment, the division of the cluster into logical partitions is static and may not be modified after the workflow engine cluster is built, but the leader within a logical partition is generated by dynamic election through the ZAB protocol while the cluster is running: if a leader finds that its heartbeat messages with more than half of the nodes are abnormal, it stops serving external requests but keeps trying to reconnect to the other nodes of the same logical partition, and once heartbeats with more than half of the nodes are restored, a new round of election is initiated to produce a new leader. Similarly, if a follower detects a heartbeat anomaly with the leader, a new round of election is initiated to elect a new leader.
The method achieves horizontal scalability of performance by distributing the scheduling tasks of different process instances across different logical partitions, and achieves high reliability of process scheduling by backing up the data of each process instance through multiple replicas (followers) within its logical partition.
In an alternative embodiment, a Worker may connect to any Broker in the workflow engine cluster and still access the cluster service, so each node in the workflow engine cluster needs to store the leader/follower information of every logical partition of the current cluster. For this purpose, each Broker is provided with a number of seed nodes, which are themselves Brokers. After a Broker starts, it connects to the seed nodes and obtains the latest leader and follower information of each logical partition through the Gossip protocol. In addition, the process instance ID is designed as a long integer that contains the logical partition number: when a client sends a command or event carrying a process instance ID to a Broker, the Broker extracts the logical partition number from the process instance ID and forwards the command to the leader of the corresponding logical partition.
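The following sketch shows one possible encoding of a process instance ID that carries its logical partition number in the high bits of a long. The exact bit split (13 bits of partition number here) is an assumption; the embodiment only states that the ID is a long integer containing the partition number.

```java
// Sketch: a 64-bit process instance ID with the logical partition number in the
// high bits and a per-partition counter in the low bits, so any Broker can route
// a command to the correct partition leader by looking at the ID alone.
public final class ProcessInstanceId {
    private static final int PARTITION_BITS = 13;
    private static final long COUNTER_MASK = (1L << (63 - PARTITION_BITS)) - 1;

    /** Build an instance ID from a partition number and a per-partition counter. */
    public static long encode(int partitionId, long counter) {
        return ((long) partitionId << (63 - PARTITION_BITS)) | (counter & COUNTER_MASK);
    }

    /** A Broker extracts the partition number to forward the command to that partition's leader. */
    public static int partitionOf(long processInstanceId) {
        return (int) (processInstanceId >>> (63 - PARTITION_BITS));
    }

    public static void main(String[] args) {
        long id = encode(3, 42);
        System.out.println("partition=" + partitionOf(id) + ", id=" + id);  // partition=3
    }
}
```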
In an optional embodiment, after the Broker starts, it connects to the seed nodes and acquires the latest leader and follower information of each logical partition through the Gossip protocol. The Gossip protocol messages comprise three types: SYN, ACK, and ACK2. The specific acquisition process is shown in fig. 2, 3, 4, and 5.
In an alternative embodiment, the leader event loop design is described by taking as an example the processing of a task-end command sent by a client to the leader; as shown in fig. 8, it includes the following steps:
Step S51, the client sends a command or event indicating that execution of a process task has ended.
Step S52, the leader receives the command or event and puts it into an in-memory queue.
Step S53, the leader takes the command or event out of the queue, generates a transaction ID (zxid) for it, appends the entry to the local WAL log file, and replicates it to each follower of the logical partition through the ZAB protocol. The zxid is a 64-bit integer whose upper 32 bits are the electionEpoch, incremented by 1 in each round of election, and whose lower 32 bits are the monotonically increasing transaction counter of the corresponding command or event (see the zxid sketch after these steps).
Step S54, once step S53 is completed, the command or event has been reliably stored on more than half of the nodes (Brokers) in the logical partition; the leader then calls the process engine state machine to process the command or event, updates the process instance data in RocksDB, and generates a new process task-end event whose zxid is the current zxid + 1.
Step S55, the generated task-execution-end event is returned to the client.
Step S56, the process task-end event is added to the queue.
Step S57, the task-end event is taken out of the queue, appended to the local WAL log file, and replicated to each follower of the logical partition through the ZAB protocol.
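The zxid layout used in step S53 can be sketched as follows; the helper method names are ours, while the bit layout (upper 32 bits electionEpoch, lower 32 bits transaction counter) follows the description.

```java
// Sketch of the zxid layout: a 64-bit value combining the election epoch and a
// monotonically increasing per-epoch counter for commands/events.
public final class Zxid {
    public static long make(int electionEpoch, long counter) {
        return ((long) electionEpoch << 32) | (counter & 0xFFFFFFFFL);
    }
    public static int epochOf(long zxid)    { return (int) (zxid >>> 32); }
    public static long counterOf(long zxid) { return zxid & 0xFFFFFFFFL; }

    public static void main(String[] args) {
        long zxid = make(2, 1001);                              // epoch 2, 1001st transaction
        long next = make(epochOf(zxid), counterOf(zxid) + 1);   // e.g. the task-end event in S54
        System.out.println(Long.toHexString(zxid) + " -> " + Long.toHexString(next));
    }
}
```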
The follower only receives command/event synchronization messages from the logical partition's leader and writes them to its local WAL log file; it does not start the partition's RocksDB database instance, issue workflow tasks, or respond to client commands.
Each logical partition has its own WAL log file and RocksDB database instance; transaction logs are generated by the leader and reliably replicated to the partition's followers via the ZAB protocol. To ensure ordered message processing, the leader of each logical partition allocates only one thread to run the event loop, and this single-threaded design also avoids complex multi-thread synchronization problems.
The queue used by the leader's event loop can be a ring queue (RingBuffer), an in-memory queue that supports parallel reading and writing by a producer and a consumer. With the ring queue in place, the leader's original event-loop work can be split across two threads: a producer thread reads network messages and puts them into the RingBuffer, and a consumer thread reads commands/events from the RingBuffer in batches and then submits them in batches through the ZAB protocol, which improves the performance of ZAB synchronization.
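A minimal sketch of this two-thread event loop is given below. A bounded ArrayBlockingQueue stands in for the RingBuffer (a real implementation might use a dedicated ring buffer such as the Disruptor), and the ZAB replication call is a placeholder.

```java
// Sketch of the leader event loop split into a producer and a consumer thread:
// the producer enqueues network messages, the consumer drains them in batches and
// hands each batch to ZAB replication.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

final class LeaderEventLoop {
    private final BlockingQueue<byte[]> ring = new ArrayBlockingQueue<>(64 * 1024);

    /** Producer thread: read a network message and publish it to the queue. */
    void onNetworkMessage(byte[] commandOrEvent) throws InterruptedException {
        ring.put(commandOrEvent);
    }

    /** Consumer thread: drain commands/events in batches and replicate each batch via ZAB. */
    void runConsumer() throws InterruptedException {
        List<byte[]> batch = new ArrayList<>(1024);
        while (!Thread.currentThread().isInterrupted()) {
            batch.clear();
            batch.add(ring.take());                    // block for at least one entry
            ring.drainTo(batch, 1023);                 // then grab whatever else is ready
            replicateBatch(batch);                     // batched ZAB proposal improves throughput
        }
    }

    private void replicateBatch(List<byte[]> batch) {
        // Placeholder for ZAB replication and subsequent state-machine application.
        System.out.println("replicating batch of " + batch.size());
    }
}
```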
Because every modification of the WAL log file appends content to its tail, the transaction log is written with sequential disk writes and performs well. RocksDB is an embedded key-value store based on a Log-Structured Merge-Tree, a combination of in-memory and on-disk data structures: it converts the user's random writes into sequential disk writes and exploits in-memory caching, so its read and write performance is good. Since it stores only the deployed process definitions and executing process instance data, and not completed process instance data, the amount of data kept in RocksDB is small and read performance is good.
In an optional embodiment, if the leader goes down after the first four or five steps of the above flow have completed but before the sixth step, the following processing is performed: since the commands or events have already been synchronously written to a majority of the followers of the logical partition, the newly elected leader, when it starts, compares whether the largest zxid in RocksDB lags behind the largest zxid recorded in the local WAL log file; if so, the unexecuted commands/events are taken out of the WAL log file and redone, and the data is applied to RocksDB. This redo process does not respond to the client and does not resynchronize the redone commands/events to other nodes within the partition.
In an alternative embodiment, as the workflow engine cluster keeps processing command/event messages, the WAL log files of the logical partitions keep growing and write performance degrades. Therefore, before writing to the WAL log file, the leader and followers check whether the size of the current WAL log file exceeds a preset threshold (for example, 512 MB); if so, the current WAL log file is closed and a new WAL log file is created, and subsequent transaction logs are appended to the new file, so that one logical partition may have multiple WAL log files.
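The rolling rule can be sketched as follows, assuming a simple per-partition directory of WAL segment files; the naming convention and threshold handling are illustrative assumptions.

```java
// Sketch of WAL segment rolling: before a write, check the current segment's size
// against a threshold (e.g. 512 MB) and start a new segment if it is exceeded.
import java.io.IOException;
import java.nio.file.*;

final class RollingWal {
    private static final long MAX_SEGMENT_BYTES = 512L * 1024 * 1024;   // 512 MB threshold
    private final Path dir;
    private Path current;

    RollingWal(Path dir) throws IOException {
        this.dir = Files.createDirectories(dir);
        this.current = dir.resolve("wal-0.log");
    }

    void append(long zxid, byte[] entry) throws IOException {
        if (Files.exists(current) && Files.size(current) >= MAX_SEGMENT_BYTES) {
            // Close the current segment and open a new one named after the next zxid.
            current = dir.resolve("wal-" + Long.toHexString(zxid) + ".log");
        }
        Files.write(current, entry, StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }
}
```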
In an alternative embodiment, at a fixed interval (for example, 15 minutes), the leader generates a snapshot file of the full data of the RocksDB database instance, records the current zxid of the logical partition (denoted s_zxid) when the snapshot is created, and deletes the WAL log files whose maximum zxid is smaller than s_zxid (because the modifications those log files made to RocksDB can be restored from the snapshot file). The followers of the logical partition also periodically try to pull the latest RocksDB snapshot file from the leader; after a pull succeeds, all local WAL log files before the maximum zxid recorded in the snapshot file are deleted, which prevents the number of WAL log files from growing without bound.
When the original leader goes down and a new leader is elected, the new leader reads the latest snapshot file and restores it to the RocksDB database instance of the logical partition, then reads and redoes from the local WAL log files only the transactions that come after the snapshot, restoring the RocksDB data without redoing all WAL log files; this speeds up the startup of the new leader.
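A minimal sketch of this recovery path is given below. The SnapshotStore, WalReader, and StateDb interfaces are assumed abstractions standing in for the RocksDB snapshot, the WAL files, and the RocksDB instance.

```java
// Sketch of new-leader recovery: restore the latest snapshot, then replay only the
// WAL transactions whose zxid is greater than the snapshot's s_zxid (and greater
// than whatever has already been applied to the database).
import java.util.List;

final class LeaderRecovery {
    interface SnapshotStore { long latestSnapshotZxid(); void restoreLatestInto(StateDb db); }
    interface StateDb       { long maxAppliedZxid(); void apply(long zxid, byte[] entry); }
    interface WalReader     { List<WalEntry> readAfter(long zxid); }
    record WalEntry(long zxid, byte[] payload) {}

    static void recover(SnapshotStore snapshots, WalReader wal, StateDb db) {
        snapshots.restoreLatestInto(db);                          // 1. load the newest snapshot
        long from = Math.max(snapshots.latestSnapshotZxid(), db.maxAppliedZxid());
        for (WalEntry e : wal.readAfter(from)) {                  // 2. redo only later transactions,
            db.apply(e.zxid(), e.payload());                      //    without responding to clients
        }
    }
}
```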
In addition, the maximum number of snapshot files retained by the leader and the followers is controlled by a parameter, and the system keeps only the most recent snapshots, up to the number specified by the user.
This embodiment separates workflow scheduling from the entities that execute work tasks and uses a distributed workflow engine cluster architecture that supports partitioning and replication, thereby supporting horizontal scalability of performance and high reliability. The ZAB protocol is used to write data efficiently and reliably across the replicas within a partition, and the local embedded key-value store RocksDB is used to provide efficient snapshot storage and immediate data query capability.
Furthermore, those skilled in the art will appreciate that while some embodiments herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functionality of some or all of the components according to embodiments of the present invention. Embodiments of the invention may also be implemented as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing embodiments of the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. Embodiments of the invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The usage of the words first, second and third, etcetera do not indicate any ordering. These words may be interpreted as names. The steps in the above embodiments should not be construed as limiting the order of execution unless specified otherwise.

Claims (10)

1. A distributed job scheduling method, comprising:
a Broker in the Broker cluster performs process scheduling and task publishing, so that each task in a process instance is executed by a unique Worker; the Broker cluster is divided into a plurality of logical partitions, and different logical partitions execute the scheduling tasks of different process instances;
and the Worker pulls an executable task from the Broker cluster, executes it, returns the execution result to the corresponding Broker, and drives the process to continue executing.
2. The method of claim 1, wherein one Broker in each logical partition is a leader, the remaining brokers are followers, and the followers are copies of the leader, the leader communicatively coupled to the followers;
correspondingly, the process scheduling and task publishing performed by the Broker in the Broker cluster includes:
the leader processes a command sent by a Worker in the same logical partition;
and the follower in the same logic partition receives the command and/or the event synchronization message of the leader in the same logic partition, writes the command and/or the event synchronization message into a log file in the same logic partition, and backs up the data of each process instance in a database in the same logic partition.
3. The method of claim 2, wherein the leader processes commands sent by a Worker in the same logical partition, comprising:
the leader receives a command sent by a Worker in the same logic partition and puts the command into a queue of a memory;
and the leader takes out the command from the queue, generates a transaction log, adds the transaction log to a log file of the same logic partition, and copies the transaction log to followers of the same logic partition.
4. The method of claim 2, wherein the leader processes commands sent by a Worker in the same logical partition, comprising:
the leader calls a flow engine state machine to process a command sent by a Worker in the same logic partition, updates flow instance data in a database of the same logic partition, generates a task event and returns the task event to the Worker in the same logic partition;
the leader adding the task event to the queue; and taking out the task event from the queue, adding the task event into a log file of the same logical partition, and copying the task event into followers of the same logical partition.
5. The method of claim 2, further comprising:
the leader periodically generates a snapshot file of the full data of the instance of the database of the same logical partition, records the current timestamp of the logical partition when creating the snapshot, and deletes the log file of which the maximum timestamp is less than the preset time;
and the follower periodically pulls the latest snapshot file in the database from the leader and deletes all log files before the maximum timestamp recorded in the snapshot file.
6. The method of claim 5, further comprising:
when the connection between the leader and the follower is abnormal, electing a new leader in the same logic partition;
and the new leader reads the latest snapshot file in the database, restores the latest snapshot file to the database, reads and redoes all transactions after the snapshot file from the log file, and restores the data in the database.
7. A distributed workflow engine system, comprising: a workflow engine Broker cluster and a Worker; the Broker cluster is divided into a plurality of logic partitions, and different logic partitions execute scheduling tasks of different process instances;
the Broker is used for scheduling the process and issuing the tasks, so that each task in the process instance is executed by a unique Worker;
the Worker is used for pulling the executable task from the Broker cluster, executing and returning the execution result to the corresponding Broker, and driving the process to continue executing.
8. The distributed workflow engine system of claim 7 wherein each logical partition contains a plurality of Brokers, wherein all Brokers of the Broker cluster load flow definitions at flow deployment, wherein Brokers in the same logical partition store the same flow instance data, and wherein Brokers in different logical partitions store different flow instance data.
9. The distributed workflow engine system of claim 7 wherein one Broker in each logical partition is a leader, the remaining brokers are followers and the followers are copies of the leader, the leader communicatively connected to the followers;
the leader is used for processing a command sent by a Worker in the same logic partition;
and the follower in the same logical partition is used for receiving the command and/or the event synchronization message of the leader in the same logical partition, writing the command and/or the event synchronization message into a log file in the same logical partition, and backing up the data of each process instance in a database in the same logical partition.
10. The distributed workflow engine system of claim 9, wherein the leader is specifically configured to receive a command sent by a Worker in the same logical partition and place the command into a queue of a memory; and taking out the command from the queue, generating a transaction log, adding the transaction log into a log file of the same logical partition, and copying the transaction log into followers of the same logical partition.
CN202111220515.9A 2021-10-20 2021-10-20 Distributed work scheduling method and distributed workflow engine system Pending CN113900788A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111220515.9A CN113900788A (en) 2021-10-20 2021-10-20 Distributed work scheduling method and distributed workflow engine system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111220515.9A CN113900788A (en) 2021-10-20 2021-10-20 Distributed work scheduling method and distributed workflow engine system

Publications (1)

Publication Number Publication Date
CN113900788A (en) 2022-01-07

Family

ID=79193084

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111220515.9A Pending CN113900788A (en) 2021-10-20 2021-10-20 Distributed work scheduling method and distributed workflow engine system

Country Status (1)

Country Link
CN (1) CN113900788A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116150263A (en) * 2022-10-11 2023-05-23 中国兵器工业计算机应用技术研究所 Distributed graph calculation engine
CN117667362A (en) * 2024-01-31 2024-03-08 上海朋熙半导体有限公司 Method, system, equipment and readable medium for scheduling process engine
CN117667362B (en) * 2024-01-31 2024-04-30 上海朋熙半导体有限公司 Method, system, equipment and readable medium for scheduling process engine

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination