CN105095008B - A distributed task fault redundancy method suitable for cluster systems - Google Patents

A distributed task fault redundancy method suitable for cluster systems

Info

Publication number
CN105095008B
Authority
CN
China
Prior art keywords
node
task
cluster
bit stream
management
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510528462.5A
Other languages
Chinese (zh)
Other versions
CN105095008A (en)
Inventor
苏大威
高原
徐春雷
任升
顾文杰
方华建
庄卫金
孟勇亮
余璟
江叶峰
仇晨光
吴海伟
孙名扬
孙世明
沙川
沙一川
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Corp of China SGCC
State Grid Jiangsu Electric Power Co Ltd
Nari Technology Co Ltd
NARI Nanjing Control System Co Ltd
Nanjing NARI Group Corp
Original Assignee
State Grid Corp of China SGCC
State Grid Jiangsu Electric Power Co Ltd
Nari Technology Co Ltd
NARI Nanjing Control System Co Ltd
Nanjing NARI Group Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Corp of China SGCC, State Grid Jiangsu Electric Power Co Ltd, Nari Technology Co Ltd, NARI Nanjing Control System Co Ltd, Nanjing NARI Group Corp
Priority to CN201510528462.5A
Publication of CN105095008A
Application granted
Publication of CN105095008B
Active legal status
Anticipated expiration legal status

Landscapes

  • Hardware Redundancy (AREA)

Abstract

The invention discloses a distributed task fault redundancy method suitable for cluster systems. It provides two-level task fault redundancy, within a node and between nodes, to improve task reliability, system availability, and user friendliness. The beneficial effects of the invention are: 1. Task reliability is improved: when a distributed task fails while running in the cluster, it can be recovered promptly within its node or on another node, which improves the reliability of distributed tasks in the cluster. 2. System availability is improved: the management program uses master-standby redundancy, and task fault redundancy management is transparent to users, who do not notice its presence during use. 3. Portability is good: the method does not rely on software bundled with any operating system. 4. The method is cross-platform: the service program can be deployed on servers running different operating systems. 5. The method is simple to use: a user only needs to call a few interfaces to use fault redundancy.

Description

A distributed task fault redundancy method suitable for cluster systems
Technical field
The present invention relates to a distributed task fault redundancy method suitable for cluster systems, and belongs to the technical field of computer systems.
Background
With the rapid growth of cloud computing and distributed systems, distributed systems are being used by more and more users. A distributed system provides services to users through a cluster made up of a large number of computers, and as the number of computers grows, the probability of a system error rises sharply. Such errors can cause user tasks running on the nodes to fail and may bring users immeasurable losses. Distributed task fault redundancy is therefore indispensable to the high reliability, high availability, and user friendliness of a distributed system.
Summary of the invention
To overcome the deficiencies of the prior art, the object of the invention is to provide a distributed task fault redundancy method suitable for cluster systems that improves task reliability, system availability, and user friendliness. Its primary objective is to detect faults in distributed tasks quickly and accurately and to restart a failed task on another node of the cluster in time; its secondary objective is to be simple and transparent for users.
To achieve the above object, the invention adopts the following technical solution:
A distributed task fault redundancy method suitable for cluster systems, characterized in that it comprises the following steps:
1) Entering fault redundancy management through an external interface: the external interface is provided for upper-layer tasks to call; it adds the task information to the fault redundancy task queue, and the upper-layer task thereby obtains fault redundancy management;
2) Intra-node task fault redundancy: the management program on each node of the cluster maintains the task information of its own node, performs fault redundancy for the tasks running on that node, and synchronizes the node's task information to the cluster management node;
3) Task information synchronization: each node in the cluster synchronizes its task information to the cluster management node, where it is aggregated;
4) Inter-node task fault redundancy: the management program on the cluster management node maintains the task information of the whole cluster, performs inter-node fault redundancy for failed tasks, and synchronizes the task information on the cluster management node to the standby nodes; after inter-node fault redundancy succeeds, the recovery-success information is synchronized immediately from the cluster management node to the failed node;
5) Cluster management node election: the cluster contains multiple standby cluster management nodes; when the cluster management node fails, an available node is immediately elected from the standby management nodes to provide the cluster management function, achieving fault redundancy for the cluster management node;
6) Task information backup redundancy: the task information on the cluster management node is synchronized to the standby nodes;
7) Leaving fault redundancy management through an exit interface: the exit interface is provided for an upper-layer task to call when it exits; it deletes the task information from the fault redundancy task queue, and the upper-layer task thereby leaves fault redundancy management.
The foregoing distributed task fault redundancy method suitable for cluster systems, characterized in that, in step 2), maintaining the task information includes update and delete operations on it; whether a task has failed is detected by polling; when a task fails, it is recovered within the node, and if the number of recoveries exceeds the configured limit, intra-node fault redundancy for the task fails.
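The polling and retry-limit rule of step 2) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `LocalTask`, `poll_once`, the `is_alive` and `restart` callbacks, and the returned status strings are hypothetical names introduced here, and the source does not say whether the recovery counter is ever reset.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class LocalTask:
    """A task supervised on its own node (all fields are illustrative)."""
    name: str
    is_alive: Callable[[], bool]   # hypothetical health probe
    restart: Callable[[], bool]    # hypothetical restart hook, True on success
    recoveries: int = 0            # intra-node recovery attempts so far


def poll_once(task: LocalTask, max_recoveries: int) -> str:
    """Run one polling cycle for a task and report what happened."""
    if task.is_alive():
        return "normal"
    if task.recoveries >= max_recoveries:
        # Retry budget exhausted: intra-node redundancy has failed and the
        # task should be handed to the cluster management node.
        return "escalate"
    task.recoveries += 1
    # A failed restart leaves the task faulty; the next poll tries again.
    return "recovered" if task.restart() else "faulty"
```

A management program would call `poll_once` for every registered task on each polling cycle and trigger the immediate synchronization of step 3) when it returns `"escalate"`.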
The foregoing distributed task fault redundancy method suitable for cluster systems, characterized in that, in step 3), task information is synchronized periodically from every node to the cluster management node; after intra-node fault redundancy fails, the task information is synchronized immediately from the failed node to the cluster management node.
The foregoing distributed task fault redundancy method suitable for cluster systems, characterized in that, in step 4), maintaining the task information includes add, update, and delete operations on it; after intra-node fault redundancy fails, inter-node fault redundancy is performed: the cluster management node first checks whether the task has a designated set of nodes on which it may be started; if so, it selects the node with the lowest load within the designated set to recover the task; if no set is designated, it selects the node with the lowest load in the cluster to recover the task.
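The node selection rule of step 4) can be sketched as below. The function name, the `loads` mapping (node name to a load figure, covering only nodes that are currently available), and the tie-break by node name are assumptions made for illustration; the source only says that the least-loaded node is chosen, from the designated set if one exists and otherwise from the whole cluster.

```python
from typing import Collection, Mapping, Optional


def select_recovery_node(loads: Mapping[str, float],
                         designated: Optional[Collection[str]] = None,
                         failed_node: Optional[str] = None) -> Optional[str]:
    """Pick the least-loaded node on which to restart a failed task."""
    if designated:
        # Only designated nodes that are actually available are candidates.
        candidates = set(designated) & set(loads)
    else:
        candidates = set(loads)
    # The task is restarted on another node, so skip the one that failed.
    candidates.discard(failed_node)
    if not candidates:
        return None
    # Sorting first makes ties resolve deterministically by node name.
    return min(sorted(candidates), key=lambda node: loads[node])
```

Returning `None` when no candidate remains leaves it to the caller to mark the task as failed, which matches the final failure branch of the task state diagram.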
The foregoing distributed task fault redundancy method suitable for cluster systems, characterized in that, in step 5), the cluster management node periodically sends heartbeat multicast messages and the standby nodes periodically receive them; when a standby node has not received a heartbeat message for longer than a set time, it judges that the cluster management node has failed and promotes itself to cluster management node; when a cluster management node receives a heartbeat message from another cluster management node, it compares the integer value of its own IP address with that of the node that sent the heartbeat: if the local value is larger, the local node demotes itself to standby node and stops sending heartbeat messages; if the local value is smaller, it continues sending heartbeat messages until it has received no heartbeat message from any other cluster management node for a set time, whereupon it becomes the new cluster management node and notifies every node in the cluster of the cluster management node information.
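The IP address comparison used to settle a conflict between two management nodes can be sketched with Python's standard `ipaddress` module. The function name and the returned role strings are hypothetical; treating equal addresses as "keep the manager role" is also an assumption, since the source does not cover that case.

```python
import ipaddress


def resolve_conflict(local_ip: str, peer_ip: str) -> str:
    """Decide the local role after hearing another manager's heartbeat."""
    local_value = int(ipaddress.ip_address(local_ip))
    peer_value = int(ipaddress.ip_address(peer_ip))
    # The node with the larger integer value steps down; the smaller one
    # keeps sending heartbeats and remains (or becomes) the manager.
    return "standby" if local_value > peer_value else "manager"
```

For example, `resolve_conflict("10.0.0.9", "10.0.0.2")` yields `"standby"`, so of two competing managers the one with the numerically smaller address prevails.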
The foregoing distributed task fault redundancy method suitable for cluster systems, characterized in that, in step 6), the cluster management node periodically synchronizes the task information to the standby nodes, and whenever a task's status changes (a task is added, exits, or fails), it immediately synchronizes the changed task information to the standby nodes.
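The two synchronization paths of step 6), a periodic full copy plus an immediate update on every status change, can be sketched as follows. `StandbySync`, the `push` callback that stands in for the network send, and the status strings are assumptions introduced for illustration only.

```python
from typing import Callable, Dict


class StandbySync:
    """Mirror the manager's task table to standby nodes (illustrative)."""

    def __init__(self, push: Callable[[Dict[str, str]], None],
                 period: float, now: float = 0.0) -> None:
        self._push = push            # hypothetical transport to the standbys
        self._period = period        # interval between full synchronizations
        self._last_full = now
        self._tasks: Dict[str, str] = {}

    def on_task_change(self, task_id: str, status: str) -> None:
        """A task was added, exited, or failed: send the change at once."""
        self._tasks[task_id] = status
        self._push({task_id: status})

    def on_tick(self, now: float) -> None:
        """Called regularly; sends the full table once per period."""
        if now - self._last_full >= self._period:
            self._push(dict(self._tasks))
            self._last_full = now
```

Passing the clock value in as `now` rather than reading it inside the class keeps the sketch easy to test; a real management program would drive `on_tick` from its own timer.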
The beneficial effects of the invention are: 1. Task reliability is improved: when a distributed task fails while running in the cluster, it can be recovered promptly within its node or on another node, which improves the reliability of distributed tasks in the cluster. 2. System availability is improved: the management program uses master-standby redundancy, and task fault redundancy management is transparent to users, who do not notice its presence during use. 3. Portability is good: the method does not rely on software bundled with any operating system. 4. The method is cross-platform: the service program can be deployed on servers running different operating systems. 5. The method is simple to use: a user only needs to call a few interfaces to use fault redundancy. 6. Deployment is simple: only the management program and the dynamic library need to be deployed for the system to run.
Brief description of the drawings
Fig. 1 is the task state transition diagram of the invention;
Fig. 2 is a schematic diagram of communication between nodes in the invention;
Fig. 3 is the flow chart of the cluster management node election algorithm in the invention.
Detailed description
The invention is further described below with reference to the drawings. The following embodiments serve only to illustrate the technical solution of the invention more clearly and do not limit its scope of protection.
The invention relates to a distributed task fault redundancy method suitable for cluster systems, comprising the following steps:
1) Entering fault redundancy management through an external interface: the external interface is provided for upper-layer tasks to call; it adds the task information to the fault redundancy task queue, and the upper-layer task thereby obtains fault redundancy management.
2) Intra-node task fault redundancy: the management program on each node of the cluster maintains the task information of its own node, performs fault redundancy for the tasks running on that node, and synchronizes the node's task information to the cluster management node. Maintaining the task information includes update and delete operations on it. Whether a task has failed is detected by polling; when a task fails, it is recovered within the node, and if the number of recoveries exceeds the configured limit, intra-node fault redundancy for the task fails.
3) Task information synchronization: each node in the cluster synchronizes its task information to the cluster management node, where it is aggregated. Task information is synchronized periodically from every node to the cluster management node; after intra-node fault redundancy fails, the task information is synchronized immediately from the failed node to the cluster management node.
4) Inter-node task fault redundancy: the management program on the cluster management node maintains the task information of the whole cluster, performs inter-node fault redundancy for failed tasks, and synchronizes the task information on the cluster management node to the standby nodes. After inter-node fault redundancy succeeds, the recovery-success information is synchronized immediately from the cluster management node to the failed node. Maintaining the task information includes add, update, and delete operations on it. After intra-node fault redundancy fails, inter-node fault redundancy is performed: the cluster management node first checks whether the task has a designated set of nodes on which it may be started; if so, it selects the node with the lowest load within the designated set to recover the task; if no set is designated, it selects the node with the lowest load in the cluster to recover the task.
5) Cluster management node election: the cluster contains multiple standby cluster management nodes; when the cluster management node fails, an available node is immediately elected from the standby management nodes to provide the cluster management function, achieving fault redundancy for the cluster management node. The cluster management node periodically sends heartbeat multicast messages and the standby nodes periodically receive them. When a standby node has not received a heartbeat message for longer than a set time, it judges that the cluster management node has failed and promotes itself to cluster management node. When a cluster management node receives a heartbeat message from another cluster management node, it compares the integer value of its own IP address with that of the node that sent the heartbeat: if the local value is larger, the local node demotes itself to standby node and stops sending heartbeat messages; if the local value is smaller, it continues sending heartbeat messages until it has received no heartbeat message from any other cluster management node for a set time, whereupon it becomes the new cluster management node and notifies every node in the cluster of the cluster management node information.
6) Task information backup redundancy: the task information on the cluster management node is synchronized to the standby nodes. The cluster management node periodically synchronizes the task information to the standby nodes, and whenever a task's status changes (a task is added, exits, or fails), it immediately synchronizes the changed task information to the standby nodes.
7) Leaving fault redundancy management through an exit interface: the exit interface is provided for an upper-layer task to call when it exits; it deletes the task information from the fault redundancy task queue, and the upper-layer task thereby leaves fault redundancy management.
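Steps 1) and 7) describe a pair of interfaces that add a task to, and remove it from, the fault redundancy task queue. A minimal sketch of such a queue is given below. The class and method names and the fields of `TaskInfo` are hypothetical, since the source does not publish the actual interface signatures or the contents of the task information.

```python
import threading
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class TaskInfo:
    """Task information kept by the management program (fields assumed)."""
    task_id: str
    command: str                          # how to restart the task
    allowed_nodes: Tuple[str, ...] = ()   # optional designated node set


class FaultRedundancyQueue:
    """Tasks currently under fault redundancy management on one node."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._tasks: Dict[str, TaskInfo] = {}

    def register(self, info: TaskInfo) -> bool:
        """External interface: place a task under management."""
        with self._lock:
            if info.task_id in self._tasks:
                return False              # already managed
            self._tasks[info.task_id] = info
            return True

    def unregister(self, task_id: str) -> bool:
        """Exit interface: remove a task when it exits normally."""
        with self._lock:
            return self._tasks.pop(task_id, None) is not None

    def managed(self) -> List[str]:
        """Identifiers of all tasks currently under management."""
        with self._lock:
            return sorted(self._tasks)
```

The lock reflects the assumption that upper-layer tasks and the polling loop of the management program touch the queue from different threads.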
As shown in Fig. 1, once a task has started and has registered successfully through the registration interface, it is in the normal state. The method periodically performs fault detection on normal tasks; if a task is healthy, it waits for the next detection cycle. If a task is detected as failed, it enters the fault state. The method then performs intra-node fault redundancy for the failed task: if this succeeds, the task returns to the normal state; if it fails, inter-node fault redundancy is performed for the task. If inter-node fault redundancy succeeds, the task returns to the normal state; if it fails, the task has failed. When a task finishes running, it calls the exit interface and enters the exited state.
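The state transitions described for Fig. 1 can be written as a small lookup table. The state and event names below are chosen here for illustration and do not come from the figure itself.

```python
from enum import Enum


class TaskState(Enum):
    NORMAL = "normal"
    FAULTY = "faulty"
    FAILED = "failed"
    EXITED = "exited"


# (current state, event) -> next state, following the Fig. 1 description.
_TRANSITIONS = {
    (TaskState.NORMAL, "check_ok"): TaskState.NORMAL,
    (TaskState.NORMAL, "check_failed"): TaskState.FAULTY,
    (TaskState.NORMAL, "exit"): TaskState.EXITED,
    (TaskState.FAULTY, "intra_node_recovered"): TaskState.NORMAL,
    # A failed intra-node attempt keeps the task faulty while
    # inter-node redundancy is tried.
    (TaskState.FAULTY, "intra_node_failed"): TaskState.FAULTY,
    (TaskState.FAULTY, "inter_node_recovered"): TaskState.NORMAL,
    (TaskState.FAULTY, "inter_node_failed"): TaskState.FAILED,
}


def next_state(state: TaskState, event: str) -> TaskState:
    """Return the state that follows `event`, or raise if it is not allowed."""
    try:
        return _TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(
            f"event {event!r} is not valid in state {state.value}"
        ) from None
```

Keeping the transitions in one table makes it easy to see that the failed and exited states are terminal: no entry leaves either of them.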
As shown in Fig. 2, when there is no management node in the cluster, the standby management nodes send management node election messages to elect one. After a successful election, the management node periodically sends heartbeat messages, so that task-running nodes learn the management node's address and standby management nodes learn the management node's state. Task-running nodes periodically send task summary messages to the management node, and the management node sends cluster task backup messages carrying the received task information to the standby management nodes, so that the cluster task information is backed up redundantly on the standby management nodes. When a task's state changes, for example from normal to faulty or from normal to exited, the task-running node keeps sending a task urgent message to the management node and stops once it receives the task urgent reply message returned by the management node. The management node, after receiving a task urgent message, forwards it to the standby management nodes and stops forwarding once it receives the task urgent reply message returned by a standby management node.
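The "keep sending until a reply arrives" behaviour of the task urgent message can be sketched as a simple retransmission loop. The `send` and `wait_for_ack` callbacks stand in for the real network transport and are hypothetical, and the optional `max_attempts` cap is a safety assumption added here; the source describes the sender as repeating without a stated limit.

```python
from typing import Callable, Optional


def send_until_acked(send: Callable[[], None],
                     wait_for_ack: Callable[[float], bool],
                     retry_interval: float = 1.0,
                     max_attempts: Optional[int] = None) -> int:
    """Resend an urgent message until its reply arrives.

    Returns the number of transmissions that were needed.
    """
    attempts = 0
    while max_attempts is None or attempts < max_attempts:
        send()
        attempts += 1
        # Block for up to retry_interval seconds waiting for the reply.
        if wait_for_ack(retry_interval):
            return attempts
    raise TimeoutError(f"no reply after {attempts} attempts")
```

The same loop serves both hops in the description: a task-running node uses it toward the management node, and the management node uses it when forwarding to a standby management node.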
As shown in Fig. 3, after the management program starts, it listens for cluster management node heartbeat messages for 4T, where T is the heartbeat period of the cluster management node. If a heartbeat message is received within 4T, the local node takes the role of standby cluster management node. If none is received within 4T, the local node promotes itself to cluster management node and sends heartbeat messages continuously for 4T. If during that time it receives a heartbeat message from another cluster management node, it decides according to a set algorithm whether to demote itself to standby cluster management node; otherwise it keeps sending heartbeat messages for the 4T period, until either no heartbeat message from another cluster management node has been received within 4T or the local node has been demoted to standby cluster management node.
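The election flow of Fig. 3 can be sketched as an event-driven state machine with an explicit clock. Several points are assumptions made for the sketch: the role names (in particular `"candidate"` for a node that has promoted itself but is still inside its 4T confirmation window), the use of the IP integer comparison from step 5) as the demotion rule, and restarting the 4T window whenever a competing heartbeat is heard.

```python
import ipaddress


class ElectionNode:
    """One management program taking part in the election (illustrative)."""

    def __init__(self, ip: str, period: float, now: float = 0.0) -> None:
        self.ip_value = int(ipaddress.ip_address(ip))
        self.window = 4 * period          # the 4T window from Fig. 3
        self.role = "listening"
        self.deadline = now + self.window

    @property
    def sends_heartbeat(self) -> bool:
        return self.role in ("candidate", "manager")

    def on_heartbeat(self, sender_ip: str, now: float) -> None:
        """Handle a heartbeat received from another management node."""
        sender_value = int(ipaddress.ip_address(sender_ip))
        if self.role in ("listening", "standby"):
            self.role = "standby"         # a manager exists: stay standby
        elif self.ip_value > sender_value:
            self.role = "standby"         # larger IP value steps down
        else:
            self.role = "candidate"       # keep sending, wait for silence
        self.deadline = now + self.window

    def on_tick(self, now: float) -> None:
        """Advance the clock; act only when the 4T window has elapsed."""
        if now < self.deadline:
            return
        if self.role in ("listening", "standby"):
            self.role = "candidate"       # no manager heard for 4T: promote
            self.deadline = now + self.window
        elif self.role == "candidate":
            self.role = "manager"         # no rival heard for 4T: confirmed
```

A node would call `on_tick` from its timer and `on_heartbeat` from its multicast receiver, sending its own heartbeat every period T while `sends_heartbeat` is true.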
The above is only a preferred embodiment of the invention. It should be noted that a person of ordinary skill in the art may make improvements and variations without departing from the technical principles of the invention, and such improvements and variations shall also be regarded as falling within the scope of protection of the invention.

Claims (4)

1. A distributed task fault redundancy method suitable for cluster systems, characterized in that it comprises the following steps:
1) Entering fault redundancy management through an external interface: the external interface is provided for upper-layer tasks to call; it adds the task information to the fault redundancy task queue, and the upper-layer task thereby obtains fault redundancy management;
2) Intra-node task fault redundancy: the management program on each node of the cluster maintains the task information of its own node, performs fault redundancy for the tasks running on that node, and synchronizes the node's task information to the cluster management node; maintaining the task information includes update and delete operations on it; whether a task has failed is detected by polling; when a task fails, it is recovered within the node, and if the number of recoveries exceeds the configured limit, intra-node fault redundancy for the task fails;
3) Task information synchronization: each node in the cluster synchronizes its task information to the cluster management node, where it is aggregated;
4) Inter-node task fault redundancy: the management program on the cluster management node maintains the task information of the whole cluster, performs inter-node fault redundancy for failed tasks, and synchronizes the task information on the cluster management node to the standby nodes; after inter-node fault redundancy succeeds, the recovery-success information is synchronized immediately from the cluster management node to the failed node; maintaining the task information includes add, update, and delete operations on it; after intra-node fault redundancy fails, inter-node fault redundancy is performed: the cluster management node first checks whether the task has a designated set of nodes on which it may be started; if so, it selects the node with the lowest load within the designated set to recover the task; if no set is designated, it selects the node with the lowest load in the cluster to recover the task;
5) Cluster management node election: the cluster contains multiple standby cluster management nodes; when the cluster management node fails, an available node is immediately elected from the standby management nodes to provide the cluster management function, achieving fault redundancy for the cluster management node;
6) Task information backup redundancy: the task information on the cluster management node is synchronized to the standby nodes;
7) Leaving fault redundancy management through an exit interface: the exit interface is provided for an upper-layer task to call when it exits; it deletes the task information from the fault redundancy task queue, and the upper-layer task thereby leaves fault redundancy management.
2. The distributed task fault redundancy method suitable for cluster systems according to claim 1, characterized in that, in step 3), task information is synchronized periodically from every node to the cluster management node; after intra-node fault redundancy fails, the task information is synchronized immediately from the failed node to the cluster management node.
3. The distributed task fault redundancy method suitable for cluster systems according to claim 1, characterized in that, in step 5), the cluster management node periodically sends heartbeat multicast messages and the standby nodes periodically receive them; when a standby node has not received a heartbeat message for longer than a set time, it judges that the cluster management node has failed and promotes itself to cluster management node; when a cluster management node receives a heartbeat message from another cluster management node, it compares the integer value of its own IP address with that of the node that sent the heartbeat: if the local value is larger, the local node demotes itself to standby node and stops sending heartbeat messages; if the local value is smaller, it continues sending heartbeat messages until it has received no heartbeat message from any other cluster management node for a set time, whereupon it becomes the new cluster management node and notifies every node in the cluster of the cluster management node information.
4. The distributed task fault redundancy method suitable for cluster systems according to claim 1, characterized in that, in step 6), the cluster management node periodically synchronizes the task information to the standby nodes, and whenever a task's status changes (a task is added, exits, or fails), it immediately synchronizes the changed task information to the standby nodes.
CN201510528462.5A 2015-08-25 2015-08-25 A distributed task fault redundancy method suitable for cluster systems Active CN105095008B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510528462.5A CN105095008B (en) 2015-08-25 2015-08-25 A distributed task fault redundancy method suitable for cluster systems

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510528462.5A CN105095008B (en) 2015-08-25 2015-08-25 A distributed task fault redundancy method suitable for cluster systems

Publications (2)

Publication Number Publication Date
CN105095008A CN105095008A (en) 2015-11-25
CN105095008B true CN105095008B (en) 2018-04-17

Family

ID=54575510

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510528462.5A Active CN105095008B (en) 2015-08-25 2015-08-25 A distributed task fault redundancy method suitable for cluster systems

Country Status (1)

Country Link
CN (1) CN105095008B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106789941B (en) * 2016-11-30 2019-12-03 国电南瑞科技股份有限公司 A kind of database and the implementation method of system application heartbeat unified management
CN108170375B (en) * 2017-12-21 2020-12-18 创新科技术有限公司 Overrun protection method and device in distributed storage system
CN110798339A (en) * 2019-10-09 2020-02-14 国电南瑞科技股份有限公司 Task disaster tolerance method based on distributed task scheduling framework
CN112346837A (en) * 2020-10-28 2021-02-09 常州微亿智造科技有限公司 Distributed timer system under industrial Internet of things
CN112838965B (en) * 2021-02-19 2023-03-28 浪潮云信息技术股份公司 Method for identifying and recovering strong synchronization role fault
CN113794595A (en) * 2021-09-15 2021-12-14 领云悠逸(北京)科技有限公司 IoT (Internet of things) equipment high-availability method based on industrial Internet
CN114039978B (en) * 2022-01-06 2022-03-25 天津大学四川创新研究院 Decentralized PoW computing power cluster deployment method
CN114968947B (en) * 2022-03-01 2023-05-09 华为技术有限公司 Fault file storage method and related device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1719831A (en) * 2005-07-15 2006-01-11 清华大学 High-available distributed boundary gateway protocol system based on cluster router structure
CN102073546A (en) * 2010-12-13 2011-05-25 北京航空航天大学 Task-dynamic dispatching method under distributed computation mode in cloud computing environment
CN103580915A (en) * 2013-09-26 2014-02-12 东软集团股份有限公司 Method and device for determining main control node of trunking system
CN104461752A (en) * 2014-11-21 2015-03-25 浙江宇视科技有限公司 Two-level fault-tolerant multimedia distributed task processing method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8621463B2 (en) * 2011-09-09 2013-12-31 GM Global Technology Operations LLC Distributed computing architecture with dynamically reconfigurable hypervisor nodes

Also Published As

Publication number Publication date
CN105095008A (en) 2015-11-25

Similar Documents

Publication Publication Date Title
CN105095008B (en) A distributed task fault redundancy method suitable for cluster systems
CN103744809B (en) Vehicle information management system double hot standby method based on VRRP
CN104506357B (en) A kind of high-availability cluster node administration method
CN101150430B (en) A method for realizing network interface board switching based heartbeat mechanism
CN108712501B (en) Information sending method and device, computing equipment and storage medium
US20150365320A1 (en) Method and device for dynamically switching gateway of distributed resilient network interconnect
CN101207517B (en) Method for reliability maintenance of distributed enterprise service bus node
CN102394914A (en) Cluster brain-split processing method and device
JP5343436B2 (en) Information management system
CN104320311A (en) Heartbeat detection method of SCADA distribution type platform
CN105630589A (en) Distributed process scheduling system and process scheduling and execution method
CN102025562A (en) Path detection method and device
CN110677282B (en) Hot backup method of distributed system and distributed system
CN104980693A (en) Media service backup method and system
CN104077181A (en) Status consistent maintaining method applicable to distributed task management system
CN101777020A (en) Fault tolerance method and system used for distributed program
WO2012012962A1 (en) Disaster tolerance service system and disaster tolerance method
CN101729426A (en) Method and system for quickly switching between master device and standby device of virtual router redundancy protocol (VRRP)
CN101610188A (en) Sip server restoring method of service process fault and sip server
CN107682169A (en) A kind of method and apparatus using Kafka collection pocket transmission message
CN105959078A (en) Cluster time synchronization method, cluster and time synchronization system
CN108197222A (en) A kind of restorative procedure, system and the relevant apparatus of exception flow data
CN104796283B (en) A kind of method of monitoring alarm
CN110753002B (en) Traffic scheduling method and device
US10205630B2 (en) Fault tolerance method for distributed stream processing system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C41 Transfer of patent application or patent right or utility model
TA01 Transfer of patent application right

Effective date of registration: 20161102

Address after: High road high tech Zone of Nanjing City, Jiangsu Province, No. 20 210032

Applicant after: NARI Technology Development Co., Ltd.

Applicant after: SGCC NARI Nanjing Control System Co., Ltd.

Applicant after: State Grid Corporation of China

Applicant after: STATE GRID JIANGSU ELECTRIC POWER COMPANY

Address before: High road high tech Zone of Nanjing City, Jiangsu Province, No. 20 210032

Applicant before: NARI Technology Development Co., Ltd.

Applicant before: SGCC NARI Nanjing Control System Co., Ltd.

C41 Transfer of patent application or patent right or utility model
TA01 Transfer of patent application right

Effective date of registration: 20161212

Address after: High road high tech Zone of Nanjing City, Jiangsu Province, No. 20 210032

Applicant after: NARI Technology Development Co., Ltd.

Applicant after: SGCC NARI Nanjing Control System Co., Ltd.

Applicant after: State Power Networks Co

Applicant after: STATE GRID JIANGSU ELECTRIC POWER COMPANY

Applicant after: Nanjing Nari Group Corporation

Address before: High road high tech Zone of Nanjing City, Jiangsu Province, No. 20 210032

Applicant before: NARI Technology Development Co., Ltd.

Applicant before: SGCC NARI Nanjing Control System Co., Ltd.

Applicant before: State Power Networks Co

Applicant before: STATE GRID JIANGSU ELECTRIC POWER COMPANY

GR01 Patent grant
GR01 Patent grant