CN105095008B - Distributed task fault redundancy method for cluster systems

Info

Publication number: CN105095008B
Application number: CN201510528462.5A
Authority: CN
Inventors: 苏大威; 高原; 徐春雷; 任升; 顾文杰; 方华建; 庄卫金; 孟勇亮; 余璟; 江叶峰; 仇晨光; 吴海伟; 孙名扬; 孙世明; 沙川; 沙一川
Original assignee: State Grid Corp of China SGCC; State Grid Jiangsu Electric Power Co Ltd; Nari Technology Co Ltd; NARI Nanjing Control System Co Ltd; Nanjing NARI Group Corp
Current assignee: State Grid Corp of China SGCC; State Grid Jiangsu Electric Power Co Ltd; Nari Technology Co Ltd; NARI Nanjing Control System Co Ltd; Nanjing NARI Group Corp
Priority date: 2015-08-25
Filing date: 2015-08-25
Publication date: 2018-04-17
Anticipated expiration: 2035-08-25
Also published as: CN105095008A

Abstract

The invention discloses a distributed task fault redundancy method for cluster systems. It provides two-level task fault redundancy (within a node and between nodes) in order to improve task reliability, system availability, and user-friendliness. Beneficial effects of the invention: (1) task reliability is improved: when a distributed task fails while running in the cluster, it can be recovered promptly within its node or on another node, which improves the reliability of distributed tasks in the cluster; (2) system availability is improved: the management program uses master-standby redundancy, and task fault redundancy management is transparent to the user, who does not notice it during use; (3) good portability: the method does not depend on software bundled with any operating system; (4) cross-platform capability: the service program can be deployed on servers running different operating systems; (5) ease of use: the user only needs to call a few interfaces to use fault redundancy.

Description

A distributed task fault redundancy method for cluster systems

Technical field

The present invention relates to a distributed task fault redundancy method for cluster systems, and belongs to the technical field of computer systems.

Background art

With the rapid development of cloud computing and distributed systems, distributed systems are being adopted by more and more users. A distributed system forms a cluster from a large number of computers to provide services to users, and as the number of computers grows, the probability of system errors rises sharply. Such errors may cause user tasks running on the nodes to fail and may bring users immeasurable losses. Distributed task fault redundancy is therefore indispensable for the high reliability, high availability, and user-friendliness of a distributed system.

Summary of the invention

To overcome the deficiencies of the prior art, the object of the invention is to provide a distributed task fault redundancy method for cluster systems that improves task reliability, system availability, and user-friendliness. The primary objective is to detect faults in distributed tasks quickly and accurately and to restart failed tasks on other nodes of the cluster in time; the secondary objective is to make the method simple to use and transparent to the user.

To achieve the above object, the invention adopts the following technical solution:

A distributed task fault redundancy method for cluster systems, characterized in that it comprises the following steps:

1) Entering fault redundancy management through an external interface: the external interface is provided for upper-layer tasks to call; it adds the task information to the fault redundancy task queue, whereby the upper-layer task comes under fault redundancy management;

2) Intra-node task fault redundancy: the management program on each node of the cluster maintains the task information of that node, performs fault redundancy for the tasks running on that node, and synchronizes the node's task information to the cluster management node;

3) Task information synchronization: each node in the cluster synchronizes its task information to the cluster management node, where it is aggregated;

4) Inter-node task fault redundancy: the management program on the cluster management node maintains the task information of the whole cluster, performs inter-node fault redundancy for failed tasks, and synchronizes the task information on the cluster management node to the standby nodes; after inter-node fault redundancy succeeds, the task recovery success information is synchronized immediately from the cluster management node to the failed node;

5) Cluster management node election: the cluster contains multiple standby cluster management nodes; when the cluster management node fails, an available node is immediately elected from the standby management nodes to provide the cluster management function, thereby achieving fault redundancy for the cluster management node;

6) Task information backup redundancy: the task information on the cluster management node is synchronized to the standby nodes;

7) Leaving fault redundancy management through an exit interface: the exit interface is provided for upper-layer tasks to call when they exit; it deletes the task information from the fault redundancy task queue, whereby the upper-layer task leaves fault redundancy management.
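Steps 1) and 7) describe a pair of interfaces around a task queue. The following is only a minimal sketch of that idea, not the patented implementation: the names `TaskInfo`, `FaultRedundancyQueue`, `register`, and `unregister`, and the fields of the task record, are assumptions introduced for illustration, since the text does not specify them.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class TaskInfo:
    """Hypothetical task record; the source does not list its fields."""
    task_id: str
    command: str
    # Optional set of nodes on which the task may be restarted (see step 4).
    allowed_nodes: Tuple[str, ...] = ()


class FaultRedundancyQueue:
    """In-memory stand-in for the fault redundancy task queue."""

    def __init__(self) -> None:
        self._tasks: Dict[str, TaskInfo] = {}

    def register(self, info: TaskInfo) -> None:
        # Step 1: the upper-layer task calls the external interface,
        # which adds its task information to the queue.
        self._tasks[info.task_id] = info

    def unregister(self, task_id: str) -> Optional[TaskInfo]:
        # Step 7: the exit interface deletes the task information.
        # Returns None if the task was not under management.
        return self._tasks.pop(task_id, None)

    def is_managed(self, task_id: str) -> bool:
        return task_id in self._tasks
```

In this sketch a task is under fault redundancy management exactly while its record is in the queue, which mirrors the enter and leave semantics of the two interfaces.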

The foregoing distributed task fault redundancy method for cluster systems, characterized in that, in step 2), maintaining the task information includes updating and deleting task information; polling is used to detect whether a task has failed; when a task fails, recovery of the task is attempted within the node, and if the number of recovery attempts exceeds the configured limit, intra-node task fault redundancy fails.
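The bounded intra-node recovery described here can be sketched as a small loop. This is an illustrative sketch under assumptions: `is_alive` and `restart` are hypothetical callbacks standing in for whatever health check and restart mechanism the management program uses, and `max_restarts` stands in for the configured limit.

```python
from typing import Callable


def recover_in_node(
    is_alive: Callable[[], bool],
    restart: Callable[[], None],
    max_restarts: int,
) -> bool:
    """Try to recover a failed task on its own node.

    Returns True as soon as the task is alive again. Returns False when
    the configured number of restart attempts is used up, which is the
    point at which intra-node fault redundancy is considered to have
    failed and the caller would escalate to inter-node recovery.
    """
    for _ in range(max_restarts):
        restart()
        if is_alive():
            return True
    return False
```

A polling loop in the management program would call this function only after a periodic check has found the task dead.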

The foregoing distributed task fault redundancy method for cluster systems, characterized in that, in step 3), task information is synchronized periodically from each node to the cluster management node; after intra-node fault redundancy fails, the task information is synchronized immediately from the failed node to the cluster management node.

The foregoing distributed task fault redundancy method for cluster systems, characterized in that, in step 4), maintaining the task information includes adding, updating, and deleting task information; after intra-node fault redundancy fails, inter-node fault redundancy is performed: the cluster management node first checks whether the task has a specified set of startup nodes; if so, it selects the least-loaded node in the specified set to recover the task; if no node set is specified, it selects the least-loaded node in the cluster to recover the task.
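The node selection rule can be illustrated as follows. This is a sketch under assumptions: the function name `pick_recovery_node`, the representation of load as a single number per node, and the exclusion of the failed node from the candidates are all choices made for the example and are not stated in the text.

```python
from typing import Dict, Iterable, Optional


def pick_recovery_node(
    loads: Dict[str, float],
    allowed_nodes: Iterable[str] = (),
    failed_node: Optional[str] = None,
) -> Optional[str]:
    """Pick the least-loaded node on which to restart a failed task.

    loads maps node name to a load figure (lower is better).
    allowed_nodes is the task's specified startup node set; when it is
    empty, every node in the cluster is a candidate.
    failed_node, if given, is skipped (an assumption of this sketch).
    """
    allowed = set(allowed_nodes)
    candidates = {
        node: load
        for node, load in loads.items()
        if node != failed_node and (not allowed or node in allowed)
    }
    if not candidates:
        return None  # nowhere to recover the task
    return min(candidates, key=candidates.get)
```

For example, with loads `{"a": 0.9, "b": 0.2, "c": 0.5}` and a specified set of `["a", "c"]`, the function picks `"c"`; with no specified set it picks `"b"`.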

The foregoing distributed task fault redundancy method for cluster systems, characterized in that, in step 5), the cluster management node periodically sends heartbeat multicast messages, and the standby nodes periodically receive them; when a standby node has received no heartbeat message for longer than a certain time, it judges that the cluster management node has failed and promotes itself to cluster management node; when a cluster management node receives a heartbeat message from another cluster management node, it compares the integer value of its own IP address with that of the node that sent the heartbeat: if the local value is larger, the local node demotes itself to standby node and stops sending heartbeat messages; if the local value is smaller, it continues sending heartbeat messages until it receives no heartbeat message from any other cluster management node within a certain time, whereupon it becomes the new cluster management node and notifies every node in the cluster of the cluster management node information.
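The tie-break between two nodes that both consider themselves cluster management node can be sketched with Python's standard `ipaddress` module. The function name `keeps_manager_role` is hypothetical, and the sketch assumes IPv4 addresses, which the text does not state.

```python
import ipaddress


def keeps_manager_role(local_ip: str, peer_ip: str) -> bool:
    """Decide whether the local node stays cluster management node.

    Called when the local node, acting as manager, receives a heartbeat
    from another manager at peer_ip. Following the rule in the text,
    the node whose IP address has the larger integer value demotes
    itself, so the local node keeps the role only if its value is
    smaller than the peer's.
    """
    local = int(ipaddress.IPv4Address(local_ip))
    peer = int(ipaddress.IPv4Address(peer_ip))
    return local < peer
```

Because IP addresses within a cluster are unique, exactly one of any two competing nodes keeps the role, so the comparison resolves every conflict deterministically.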

The foregoing distributed task fault redundancy method for cluster systems, characterized in that, in step 6), the cluster management node periodically synchronizes the task information to the standby nodes, and whenever a task's status changes through addition, exit, or failure, it immediately synchronizes the changed task information to the standby nodes.

Beneficial effects of the invention: (1) task reliability is improved: when a distributed task fails while running in the cluster, it can be recovered promptly within its node or on another node, which improves the reliability of distributed tasks in the cluster; (2) system availability is improved: the management program uses master-standby redundancy, and task fault redundancy management is transparent to the user, who does not notice it during use; (3) good portability: the method does not depend on software bundled with any operating system; (4) cross-platform capability: the service program can be deployed on servers running different operating systems; (5) ease of use: the user only needs to call a few interfaces to use fault redundancy; (6) simple deployment: only the management program and the dynamic library need to be deployed for the method to run.

Brief description of the drawings

Fig. 1 is the task state transition diagram of the invention;

Fig. 2 is a diagram of the communication between nodes in the invention;

Fig. 3 is a flow chart of the cluster management node election algorithm of the invention.

Detailed description of the embodiments

The invention is further described below with reference to the accompanying drawings. The following embodiments serve only to illustrate the technical solution of the invention more clearly and are not intended to limit its scope of protection.

The present invention relates to a distributed task fault redundancy method for cluster systems, comprising the following steps:

1) Entering fault redundancy management through an external interface: the external interface is provided for upper-layer tasks to call; it adds the task information to the fault redundancy task queue, whereby the upper-layer task comes under fault redundancy management.

2) Intra-node task fault redundancy: the management program on each node of the cluster maintains the task information of that node, performs fault redundancy for the tasks running on that node, and synchronizes the node's task information to the cluster management node. Maintaining the task information includes updating and deleting task information. Polling is used to detect whether a task has failed; when a task fails, recovery of the task is attempted within the node, and if the number of recovery attempts exceeds the configured limit, intra-node task fault redundancy fails.

3) Task information synchronization: each node in the cluster synchronizes its task information to the cluster management node, where it is aggregated. Task information is synchronized periodically from each node to the cluster management node; after intra-node fault redundancy fails, the task information is synchronized immediately from the failed node to the cluster management node.

4) Inter-node task fault redundancy: the management program on the cluster management node maintains the task information of the whole cluster, performs inter-node fault redundancy for failed tasks, and synchronizes the task information on the cluster management node to the standby nodes. After inter-node fault redundancy succeeds, the task recovery success information is synchronized immediately from the cluster management node to the failed node. Maintaining the task information includes adding, updating, and deleting task information. After intra-node fault redundancy fails, inter-node fault redundancy is performed: the cluster management node first checks whether the task has a specified set of startup nodes; if so, it selects the least-loaded node in the specified set to recover the task; if no node set is specified, it selects the least-loaded node in the cluster to recover the task.

5) Cluster management node election: the cluster contains multiple standby cluster management nodes; when the cluster management node fails, an available node is immediately elected from the standby management nodes to provide the cluster management function, thereby achieving fault redundancy for the cluster management node. The cluster management node periodically sends heartbeat multicast messages, and the standby nodes periodically receive them. When a standby node has received no heartbeat message for longer than a certain time, it judges that the cluster management node has failed and promotes itself to cluster management node. When a cluster management node receives a heartbeat message from another cluster management node, it compares the integer value of its own IP address with that of the node that sent the heartbeat: if the local value is larger, the local node demotes itself to standby node and stops sending heartbeat messages; if the local value is smaller, it continues sending heartbeat messages until it receives no heartbeat message from any other cluster management node within a certain time, whereupon it becomes the new cluster management node and notifies every node in the cluster of the cluster management node information.

6) Task information backup redundancy: the task information on the cluster management node is synchronized to the standby nodes. The cluster management node periodically synchronizes the task information to the standby nodes, and whenever a task's status changes through addition, exit, or failure, it immediately synchronizes the changed task information to the standby nodes.

As shown in Fig. 1, after a task starts and successfully registers by calling the registration interface, the task is in the normal state. The method periodically performs fault detection on normal tasks; if a task is normal, it waits for the next detection cycle. If a task is detected as faulty, it enters the fault state. The method performs intra-node fault redundancy on the faulty task: if this succeeds, the task returns to the normal state; if it fails, inter-node fault redundancy is performed on the task: if that succeeds, the task returns to the normal state; if it fails, the task has failed. When a task finishes running, it calls the exit interface and enters the exited state.
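The state transitions described for Fig. 1 can be written down as a small transition table. This is a sketch only: the state names, the event names (`check_ok`, `check_failed`, `recovered`, `unrecoverable`, `exit`), and the merging of intra-node and inter-node recovery success into a single `recovered` event are assumptions made for illustration.

```python
from enum import Enum


class TaskState(Enum):
    NORMAL = "normal"
    FAULT = "fault"
    FAILED = "failed"   # both recovery levels were unsuccessful
    EXITED = "exited"   # the task called the exit interface


# (current state, event) -> next state
TRANSITIONS = {
    (TaskState.NORMAL, "check_ok"): TaskState.NORMAL,
    (TaskState.NORMAL, "check_failed"): TaskState.FAULT,
    # Recovery may succeed within the node or on another node.
    (TaskState.FAULT, "recovered"): TaskState.NORMAL,
    # Intra-node and inter-node recovery both failed.
    (TaskState.FAULT, "unrecoverable"): TaskState.FAILED,
    (TaskState.NORMAL, "exit"): TaskState.EXITED,
}


def step(state: TaskState, event: str) -> TaskState:
    """Return the next state; raises KeyError for an undefined transition."""
    return TRANSITIONS[(state, event)]
```

Keeping the transitions in a table makes it easy to see that the failed and exited states have no outgoing transitions in this reading of the figure.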

As shown in Fig. 2, when the cluster has no management node, the standby management nodes send management node election messages to elect one. After a successful election, the management node periodically sends heartbeat messages, from which task-running nodes learn the management node's address and standby management nodes learn the management node's state. Each task-running node periodically sends task summary messages to the management node, and the management node sends cluster task backup messages carrying the received task information to the standby management nodes, so that the cluster task information is backed up redundantly on the standby management nodes. When a task's state changes, for example from normal to faulty or from normal to exited, the task-running node keeps sending a task urgent message to the management node and stops once it receives the task urgent acknowledgement message returned by the management node. The management node, on receiving a task urgent message, forwards it to the standby management nodes and stops forwarding once it receives the task urgent acknowledgement message returned by the standby management node.
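The "send until acknowledged" behaviour of task urgent messages can be sketched as a retransmission loop. The callbacks `send` and `wait_for_ack` are hypothetical placeholders for the real transport, and the optional `max_attempts` bound is an assumption added so the example can terminate; the text itself describes sending until the acknowledgement arrives.

```python
from typing import Callable, Optional


def send_until_acked(
    send: Callable[[], None],
    wait_for_ack: Callable[[float], bool],
    timeout_s: float = 1.0,
    max_attempts: Optional[int] = None,
) -> int:
    """Resend an urgent message until the receiver acknowledges it.

    send transmits one copy of the message. wait_for_ack blocks for up
    to timeout_s seconds and returns True if the acknowledgement
    arrived. Returns the number of copies sent. With max_attempts set,
    raises TimeoutError when that many copies went unacknowledged.
    """
    attempts = 0
    while max_attempts is None or attempts < max_attempts:
        send()
        attempts += 1
        if wait_for_ack(timeout_s):
            return attempts
    raise TimeoutError(f"no acknowledgement after {attempts} attempts")
```

The same loop would apply on both hops: from the task-running node to the management node, and from the management node to a standby management node.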

As shown in Fig. 3, after the management program starts, it listens for cluster management node heartbeat messages for 4T, where 1T is the heartbeat period of the cluster management node. If a heartbeat message is received within 4T, the local node takes the role of standby cluster management node. If none is received within 4T, the local node promotes itself to cluster management node and sends heartbeat messages continuously for 4T. If it receives a heartbeat message from another cluster management node, it decides according to a certain algorithm whether to demote itself to standby cluster management node; otherwise it continues sending heartbeat messages for the 4T period, until either no heartbeat message from another cluster management node is received within 4T or the local node is demoted to standby cluster management node.
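The flow of Fig. 3 can be approximated by an event-driven role tracker. This is a loose sketch, not the flow chart itself: the `CANDIDATE` role is a name invented here for a node that has promoted itself and is still in its 4T announcement window, the value of T is arbitrary, and `should_yield` is a hypothetical hook for the demotion algorithm, which the description leaves as "a certain algorithm".

```python
from enum import Enum
from typing import Callable

HEARTBEAT_PERIOD_S = 1.0                   # "1T"; the value is an assumption
LISTEN_WINDOW_S = 4 * HEARTBEAT_PERIOD_S   # the "4T" window from the text


class Role(Enum):
    STARTING = "starting"
    STANDBY = "standby"
    CANDIDATE = "candidate"  # promoted, still announcing for 4T
    MANAGER = "manager"


class ElectionNode:
    def __init__(self, should_yield: Callable[[str], bool]) -> None:
        self.role = Role.STARTING
        # Decides whether to demote on a heartbeat from the given peer IP.
        self._should_yield = should_yield

    def on_window_elapsed(self, heard_manager: bool) -> Role:
        """Call once per 4T window with whether a manager heartbeat was heard."""
        if self.role in (Role.STARTING, Role.STANDBY):
            # Silence for 4T means there is no manager: promote.
            self.role = Role.STANDBY if heard_manager else Role.CANDIDATE
        elif self.role is Role.CANDIDATE and not heard_manager:
            # No competing manager was heard for a full window.
            self.role = Role.MANAGER
        return self.role

    def on_peer_heartbeat(self, peer_ip: str) -> Role:
        """Call when a heartbeat from another manager arrives."""
        if self.role in (Role.CANDIDATE, Role.MANAGER) and self._should_yield(peer_ip):
            self.role = Role.STANDBY
        return self.role
```

A caller would drive `on_window_elapsed` from a timer of length `LISTEN_WINDOW_S` and `on_peer_heartbeat` from the multicast receive path.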

The above is only a preferred embodiment of the invention. It should be noted that a person of ordinary skill in the art may make improvements and variations without departing from the technical principles of the invention, and such improvements and variations shall also be regarded as falling within the scope of protection of the invention.

Claims

1. A distributed task fault redundancy method for cluster systems, characterized in that it comprises the following steps:

1) Entering fault redundancy management through an external interface: the external interface is provided for upper-layer tasks to call; it adds the task information to the fault redundancy task queue, whereby the upper-layer task comes under fault redundancy management;

2) Intra-node task fault redundancy: the management program on each node of the cluster maintains the task information of that node, performs fault redundancy for the tasks running on that node, and synchronizes the node's task information to the cluster management node; maintaining the task information includes updating and deleting task information; polling is used to detect whether a task has failed; when a task fails, recovery of the task is attempted within the node, and if the number of recovery attempts exceeds the configured limit, intra-node task fault redundancy fails;

3) Task information synchronization: each node in the cluster synchronizes its task information to the cluster management node, where it is aggregated;

4) Inter-node task fault redundancy: the management program on the cluster management node maintains the task information of the whole cluster, performs inter-node fault redundancy for failed tasks, and synchronizes the task information on the cluster management node to the standby nodes; after inter-node fault redundancy succeeds, the task recovery success information is synchronized immediately from the cluster management node to the failed node; maintaining the task information includes adding, updating, and deleting task information; after intra-node fault redundancy fails, inter-node fault redundancy is performed: the cluster management node first checks whether the task has a specified set of startup nodes; if so, it selects the least-loaded node in the specified set to recover the task; if no node set is specified, it selects the least-loaded node in the cluster to recover the task;

5) Cluster management node election: the cluster contains multiple standby cluster management nodes; when the cluster management node fails, an available node is immediately elected from the standby management nodes to provide the cluster management function, thereby achieving fault redundancy for the cluster management node;

6) Task information backup redundancy: the task information on the cluster management node is synchronized to the standby nodes;

7) Leaving fault redundancy management through an exit interface: the exit interface is provided for upper-layer tasks to call when they exit; it deletes the task information from the fault redundancy task queue, whereby the upper-layer task leaves fault redundancy management.

2. The distributed task fault redundancy method for cluster systems according to claim 1, characterized in that, in step 3), task information is synchronized periodically from each node to the cluster management node; after intra-node fault redundancy fails, the task information is synchronized immediately from the failed node to the cluster management node.

3. The distributed task fault redundancy method for cluster systems according to claim 1, characterized in that, in step 5), the cluster management node periodically sends heartbeat multicast messages, and the standby nodes periodically receive them; when a standby node has received no heartbeat message for longer than a certain time, it judges that the cluster management node has failed and promotes itself to cluster management node; when a cluster management node receives a heartbeat message from another cluster management node, it compares the integer value of its own IP address with that of the node that sent the heartbeat: if the local value is larger, the local node demotes itself to standby node and stops sending heartbeat messages; if the local value is smaller, it continues sending heartbeat messages until it receives no heartbeat message from any other cluster management node within a certain time, whereupon it becomes the new cluster management node and notifies every node in the cluster of the cluster management node information.

4. The distributed task fault redundancy method for cluster systems according to claim 1, characterized in that, in step 6), the cluster management node periodically synchronizes the task information to the standby nodes, and whenever a task's status changes through addition, exit, or failure, it immediately synchronizes the changed task information to the standby nodes.