CN104483828A - Distributed fault tolerance computer member consistency ensuring method - Google Patents

Distributed fault tolerance computer member consistency ensuring method

Info

Publication number
CN104483828A
Authority
CN
China
Prior art keywords
node
mem
members list
data
successful
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410734530.9A
Other languages
Chinese (zh)
Inventor
徐奡
刘帅
李鹏
郑久寿
马小博
程俊强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
AVIC No 631 Research Institute
Original Assignee
AVIC No 631 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by AVIC No 631 Research Institute
Priority to CN201410734530.9A
Publication of CN104483828A

Abstract

The invention provides a method for ensuring member consistency in a distributed fault-tolerant computer, comprising the following steps: (1) the node states of the fault-tolerant computer are initialized: the initial state of an N-node fault-tolerant computer is set to (Ai, N+1-i, 0, A1A2...AN), where Ai is the i-th node, N+1-i is the initial confirmation counter (AC) value, 0 is the initial failure counter (FC) value, and A1A2...AN is the initial member list, i.e., in the initial state the member list of every node contains all nodes of the system; (2) node Ai broadcasts data frames to all nodes in order, and the member list of Ai is denoted mem(Ai). The method targets fault-tolerance technology for distributed computer systems and addresses their redundancy management problem: the state of each node can be managed reliably and the formation of member cliques is effectively avoided, so that the system responds to faults consistently and in time, providing effective support for new fault-tolerance strategies in airborne distributed fault-tolerant computers.

Description

A method for ensuring member consistency in a distributed fault-tolerant computer
Technical field
The invention belongs to the technical field of distributed fault-tolerant computer system design and relates to a method for ensuring member consistency in a distributed fault-tolerant computer system.
Background art
The flight control computer system is the core component of a flight control system, and its safety and reliability directly affect the survivability of the aircraft. As a typical airborne fault-tolerant computer system, the flight control computer has evolved from centralized fly-by-wire flight control computer systems, through distributed flight control computer systems based on bus communication, to distributed flight control computer systems based on switched networks.
As the architecture of flight control computer systems has developed, so have their fault-tolerance strategies. Traditional centralized fault-tolerant computer systems rely on strategies such as voting and monitoring, fault masking, and resource switching, whereas new distributed fault-tolerant computer systems perform system management through high-integrity computing resources, fail-silent nodes, member consistency protocols, and functional backup.
Summary of the invention
To solve the technical problems described in the background art, the invention provides a method for ensuring member consistency between the nodes of a distributed fault-tolerant computer, suitable for redundancy management in new airborne distributed fault-tolerant computers.
The technical solution of the invention is a method for ensuring member consistency in a distributed fault-tolerant computer, characterized in that the method comprises the following steps:
1) Initialize the node states of the fault-tolerant computer: for an N-node fault-tolerant computer, the initial state is set to (Ai, N+1-i, 0, A1A2...AN), where Ai is the i-th node, N+1-i is the initial confirmation counter (AC) value, 0 is the initial failure counter (FC) value, and A1A2...AN is the initial member list; that is, in the initial state the member list of every node contains all nodes of the system;
2) Node Ai broadcasts its data frame to all nodes in order; the member list of Ai is denoted mem(Ai); Ai checks whether its local AC is greater than its FC; if so, Ai resets its local AC and FC, computes a CRC checksum over its local member list together with the data to be sent, combines the resulting CRC with the outgoing data to form a data frame, and broadcasts this frame to all nodes (including itself); if not, the node reports an error to the upper-layer application and enters the frozen state;
3) Node Ak receives and decodes the data frame; Ak receives the data frame sent by Ai and uses its local member list to decode the received frame and check its CRC; a node whose CRC check succeeds considers the data correctly received, and a node whose CRC check fails considers the frame reception failed; if Ak receives the data correctly, it adds Ai to its local member list and increments AC; if Ak fails to receive the data, it deletes Ai from its local member list and increments FC; if Ak does not receive the data frame from Ai at the expected time point, it deletes Ai from its local member list but does not increment either counter;
4) After sending its data, node Ai looks for its first successful node; Ai waits for a correct data frame sent by the next node Aj in the expected time slot and checks whether mem(Ai) = mem(Aj); if so, Aj is the first successful node of Ai, Ai is confirmed (i.e., Ai is correct), and the implicit confirmation algorithm terminates;
The method further comprises step 5): when node Ai has found its first successful node but has not been confirmed, it looks for a second successful node; if, in the expected time slot after Aj, Ai correctly receives a data frame sent by node Am, and exactly one of Ai ∈ mem(Am) and Aj ∈ mem(Am) is true, and mem(Am) - {Ai, Aj} = mem(Ai) - {Ai, Aj} is true, then Ai takes Am as its second successful node.
The method further comprises step 6): the correctness of the states of Ai and Aj is judged from the second successful node; if Ai ∈ mem(Am), then Ai is confirmed (i.e., Ai is correct) and Aj is faulty; Ai increments its AC, deletes Aj from its member list, and the implicit confirmation algorithm terminates; if Aj ∈ mem(Am), then Ai is faulty and Aj is correct; Ai deletes itself from its member list, adds Aj and Am to its local member list, and the implicit confirmation algorithm terminates.
In step 4) above, if mem(Aj) ∪ {Ai} = mem(Ai) is true, then Aj is the first successful node of Ai but Ai is not confirmed; in all other (non-null-frame) cases, FC is incremented.
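As an illustration of steps 1) and 2), the following Python sketch sets up the initial state tuple of each node and the AC > FC check a node performs before sending. The names NodeState, init_nodes and may_send are hypothetical, nodes are identified by their 1-based index, and resetting both counters to zero is an assumption: the text only says that AC and FC are reset.

```python
from dataclasses import dataclass, field


@dataclass
class NodeState:
    """Per-node state tuple (Ai, AC, FC, member list) from step 1)."""
    index: int                                  # i, the 1-based node index
    ac: int                                     # confirmation counter (AC)
    fc: int = 0                                 # failure counter (FC)
    members: set = field(default_factory=set)   # local member list mem(Ai)
    frozen: bool = False                        # True once the node has frozen


def init_nodes(n):
    """Initial state of an N-node system: AC = N + 1 - i, FC = 0, full member list."""
    everyone = set(range(1, n + 1))
    return [NodeState(index=i, ac=n + 1 - i, members=set(everyone))
            for i in range(1, n + 1)]


def may_send(node):
    """Step 2) gate: a node may send only if AC > FC.

    On success both counters are reset (to zero, by assumption).
    Otherwise the node freezes; reporting the error to the
    upper-layer application is left out of this sketch.
    """
    if node.ac > node.fc:
        node.ac = 0
        node.fc = 0
        return True
    node.frozen = True
    return False
```

With N = 4 the initial AC values are 4, 3, 2, 1, so every node passes the check in its first sending slot.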
The advantages of the invention are:
1) It implements the member management function of a distributed fault-tolerant computer; the algorithm is flexible and highly extensible, and provides effective technical support for redundancy management in distributed fault-tolerant computers with new open architectures.
2) It effectively avoids the member clique problem among nodes in a distributed architecture, and the algorithm offers high reliability and safety.
3) During member consistency assurance, the data is encoded with a CRC computed over the data frame together with the local member list, so nodes never need to exchange member lists explicitly; no additional bandwidth is consumed and only a small amount of computing time is required, so resource usage is low.
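A minimal sketch of how a CRC can carry the member list implicitly, as advantage 3) describes. It assumes CRC-32 (Python's zlib.crc32) and a bit-vector encoding of the member list; the text specifies neither the CRC polynomial nor the encoding, and the function names are hypothetical.

```python
import zlib


def membership_bytes(members, n):
    """Pack a member set into an N-bit vector (bit i-1 set if node i is a member)."""
    bits = 0
    for i in members:
        bits |= 1 << (i - 1)
    return bits.to_bytes((n + 7) // 8, "big")


def build_frame(payload, members, n):
    """Sender: the CRC covers member list + payload, but only payload + CRC are sent."""
    crc = zlib.crc32(membership_bytes(members, n) + payload)
    return payload + crc.to_bytes(4, "big")


def check_frame(frame, members, n):
    """Receiver: recompute the CRC using the local member list.

    A mismatch means either the frame was corrupted or the sender's
    member list differs from the receiver's.
    """
    payload, received = frame[:-4], int.from_bytes(frame[-4:], "big")
    return zlib.crc32(membership_bytes(members, n) + payload) == received
```

The member list itself never appears in the frame, so a receiver whose list differs from the sender's simply sees a CRC failure.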
Built on a high-integrity distributed computing platform with deterministic bus communication and using fail-silent node technology, the invention provides a method for ensuring member consistency in a distributed fault-tolerant computer. The method reliably manages the state of every node in the system, effectively prevents the formation of member cliques, and establishes deterministic replica consistency for all messages, so that the system responds to faults consistently and in time. It also offers considerable flexibility and extensibility, providing effective support for new fault-tolerance strategies in airborne distributed fault-tolerant computers.
Brief description of the drawings
Fig. 1 is a flowchart of the implicit confirmation algorithm of the invention.
Fig. 2 is a flowchart of the decision algorithm of the invention.
Detailed description of the embodiments
Each node of the distributed system keeps a member list that records the running state of all nodes. In every cycle, each node updates its local member list according to the messages it receives; this mutual member confirmation between nodes ensures that all nodes are consistent when a message is received, and thereby guarantees the validity of voting results. The method is implemented mainly by a decision algorithm and an implicit confirmation algorithm.
Decision algorithm: each node maintains a local member list. When a node is ready to send data, it adds itself to its local member list. When a node receives a correct data frame, it adds the sending node to its local member list. A transmission is correct when (1) it occurs at the expected time point, (2) it completes successfully, and (3) after the sender has been added to the receiver's member list, the member lists of the two sides are identical. When reception fails or no data frame is received, the receiving node deletes the node that was due to send in that slot from its member list.
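The receiver side of the decision algorithm, combined with the counter rules of step 3), can be sketched as a single update function. The Rx enum and the function name are hypothetical; the three outcomes and their effects follow the text.

```python
from enum import Enum


class Rx(Enum):
    OK = "ok"          # frame arrived in its slot and passed the CRC check
    FAILED = "failed"  # frame arrived but failed the CRC check
    NONE = "none"      # no frame arrived at the expected time point


def update_on_slot(members, ac, fc, sender, rx):
    """Return the receiver's new (member list, AC, FC) after the sender's slot."""
    members = set(members)  # work on a copy
    if rx is Rx.OK:
        members.add(sender)
        ac += 1
    elif rx is Rx.FAILED:
        members.discard(sender)
        fc += 1
    else:
        # Rx.NONE: drop the sender but leave both counters unchanged.
        members.discard(sender)
    return members, ac, fc
```

For example, update_on_slot({1, 2, 3}, 1, 0, 2, Rx.FAILED) removes node 2 from the list and raises FC from 0 to 1.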
Implicit confirmation algorithm: this algorithm introduces the concepts of the first successful node and the second successful node. After a node (say A) sends data, it waits for a correct message from the next node (say B) in the expected time slot. If A and B have identical member lists, or if B's member list is identical to A's except that it does not contain A, then A takes B as its first successful node. In the first case A is confirmed and the implicit confirmation algorithm terminates. In the second case either A's transmission or B's reception was faulty; to distinguish between these two possibilities, A waits for a second successful node. If a node C sends a correctly formatted frame in the expected time slot after B, and C's member list contains exactly one of A and B while agreeing with A's member list on every node other than A and B, then A takes C as its second successful node. If C's member list contains A, then A is confirmed, B is considered faulty, and A deletes B from its member list. If C's member list does not contain A, then A is faulty: A deletes itself from its member list and adds B and C to it. The implicit confirmation algorithm then terminates.
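The two membership tests used by the implicit confirmation algorithm can be written as set comparisons. This is a sketch under the assumption that member lists are modelled as Python sets of node identifiers; the function names and the returned labels are hypothetical.

```python
def classify_first_node(i, mem_i, mem_j):
    """Classify the next sender Aj from the point of view of sender Ai."""
    if mem_j == mem_i:
        return "confirmed"       # Aj agrees with Ai, so Ai is confirmed
    if mem_j | {i} == mem_i:
        return "unconfirmed"     # Aj's list equals Ai's except that Ai is missing
    return "not_first_node"      # any other non-null frame: FC is incremented


def is_second_successful_node(i, j, mem_i, mem_m):
    """Am qualifies if it lists exactly one of Ai and Aj and agrees with Ai on all other nodes."""
    exactly_one = (i in mem_m) != (j in mem_m)
    return exactly_one and (mem_m - {i, j}) == (mem_i - {i, j})
```

With mem(A1) = {1, 2, 3} and mem(A2) = {2, 3}, node A2 is a first successful node that leaves A1 unconfirmed, so A1 has to wait for a second successful node.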
The invention is described in further detail below with reference to Fig. 1 and Fig. 2.
(1) Initialize the node states of the fault-tolerant computer. For an N-node fault-tolerant computer, the initial state is set to (Ai, N+1-i, 0, A1A2...AN), where Ai is the i-th node, N+1-i is the initial confirmation counter (AC) value, 0 is the initial failure counter (FC) value, and A1A2...AN is the initial member list; that is, in the initial state the member list of every node contains all nodes of the system.
(2) Node Ai broadcasts its data frame to all nodes in order. The member list of Ai is denoted mem(Ai), and likewise for other nodes below. Ai checks whether its local AC is greater than its FC. If so, Ai resets its local AC and FC, computes a CRC checksum over its local member list together with the data to be sent, combines the resulting CRC with the outgoing data to form a data frame, and broadcasts this frame to all nodes (including itself). If not, the node reports an error to the upper-layer application and enters the frozen state.
(3) Node Ak receives and decodes the data frame. Ak receives the data frame sent by Ai and uses its local member list to decode the received frame and check its CRC. A node whose CRC check succeeds considers the data correctly received, and a node whose CRC check fails considers the frame reception failed. If Ak receives the data correctly, it adds Ai to its local member list and increments AC; if Ak fails to receive the data, it deletes Ai from its local member list and increments FC; if Ak does not receive the data frame from Ai at the expected time point, it deletes Ai from its local member list but does not increment either counter.
(4) After sending its data, node Ai looks for its first successful node. Ai waits for a correct data frame sent by the next node Aj in the expected time slot and checks whether mem(Ai) = mem(Aj). If so, Aj is the first successful node of Ai, Ai is confirmed (i.e., Ai is correct), and the implicit confirmation algorithm terminates. If mem(Aj) ∪ {Ai} = mem(Ai) is true, then Aj is the first successful node of Ai but Ai is not confirmed. In all other (non-null-frame) cases, FC is incremented.
(5) When node Ai has found its first successful node but has not been confirmed, it looks for a second successful node. If, in the expected time slot after Aj, Ai correctly receives a data frame sent by node Am, and exactly one of Ai ∈ mem(Am) and Aj ∈ mem(Am) is true, and mem(Am) - {Ai, Aj} = mem(Ai) - {Ai, Aj} is true, then Ai takes Am as its second successful node.
(6) The correctness of the states of Ai and Aj is judged from the second successful node. If Ai ∈ mem(Am), then Ai is confirmed (i.e., Ai is correct) and Aj is faulty; Ai increments its AC, deletes Aj from its member list, and the implicit confirmation algorithm terminates. If Aj ∈ mem(Am), then Ai is faulty and Aj is correct; Ai deletes itself from its member list, adds Aj and Am to its local member list, and the implicit confirmation algorithm terminates.
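The resolution in step (6) can be sketched as a function applied at node Ai once Am has qualified as its second successful node. The function name and the returned tuple are hypothetical, and member lists are again modelled as Python sets as an assumption of this sketch.

```python
def resolve_with_second_node(i, j, m, mem_i, mem_m, ac):
    """Apply step (6) at node Ai.

    Returns (new member list of Ai, new AC of Ai, whether Ai is confirmed).
    Assumes Am already passed the second-successful-node test of step (5),
    so exactly one of Ai and Aj is in mem(Am).
    """
    mem_i = set(mem_i)  # work on a copy
    if i in mem_m:
        # Am lists Ai: Ai sent correctly, so Aj was the faulty node.
        mem_i.discard(j)
        return mem_i, ac + 1, True
    # Otherwise Am lists Aj: the fault lies with Ai itself.
    mem_i.discard(i)
    mem_i.update({j, m})
    return mem_i, ac, False
```

For i = 1, j = 2, m = 3 and mem(A1) = {1, 2, 3, 4}, a frame with mem(A3) = {1, 3, 4} confirms A1 and removes A2, while mem(A3) = {2, 3, 4} makes A1 remove itself.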

Claims (4)

1. A method for ensuring member consistency in a distributed fault-tolerant computer, characterized in that the method comprises the following steps:
1) Initialize the node states of the fault-tolerant computer: for an N-node fault-tolerant computer, the initial state is set to (Ai, N+1-i, 0, A1A2...AN), where Ai is the i-th node, N+1-i is the initial confirmation counter (AC) value, 0 is the initial failure counter (FC) value, and A1A2...AN is the initial member list; that is, in the initial state the member list of every node contains all nodes of the system;
2) Node Ai broadcasts its data frame to all nodes in order; the member list of Ai is denoted mem(Ai); Ai checks whether its local AC is greater than its FC; if so, Ai resets its local AC and FC, computes a CRC checksum over its local member list together with the data to be sent, combines the resulting CRC with the outgoing data to form a data frame, and broadcasts this frame to all nodes (including itself); if not, the node reports an error to the upper-layer application and enters the frozen state;
3) Node Ak receives and decodes the data frame; Ak receives the data frame sent by Ai and uses its local member list to decode the received frame and check its CRC; a node whose CRC check succeeds considers the data correctly received, and a node whose CRC check fails considers the frame reception failed; if Ak receives the data correctly, it adds Ai to its local member list and increments AC; if Ak fails to receive the data, it deletes Ai from its local member list and increments FC; if Ak does not receive the data frame from Ai at the expected time point, it deletes Ai from its local member list but does not increment either counter;
4) After sending its data, node Ai looks for its first successful node; Ai waits for a correct data frame sent by the next node Aj in the expected time slot and checks whether mem(Ai) = mem(Aj); if so, Aj is the first successful node of Ai, Ai is confirmed (i.e., Ai is correct), and the implicit confirmation algorithm terminates.
2. The method for ensuring member consistency in a distributed fault-tolerant computer according to claim 1, characterized in that the method further comprises step 5): when node Ai has found its first successful node but has not been confirmed, it looks for a second successful node; if, in the expected time slot after Aj, Ai correctly receives a data frame sent by node Am, and exactly one of Ai ∈ mem(Am) and Aj ∈ mem(Am) is true, and mem(Am) - {Ai, Aj} = mem(Ai) - {Ai, Aj} is true, then Ai takes Am as its second successful node.
3. The method for ensuring member consistency in a distributed fault-tolerant computer according to claim 2, characterized in that the method further comprises step 6): the correctness of the states of Ai and Aj is judged from the second successful node; if Ai ∈ mem(Am), then Ai is confirmed (i.e., Ai is correct) and Aj is faulty; Ai increments its AC, deletes Aj from its member list, and the implicit confirmation algorithm terminates; if Aj ∈ mem(Am), then Ai is faulty and Aj is correct; Ai deletes itself from its member list, adds Aj and Am to its local member list, and the implicit confirmation algorithm terminates.
4. The method for ensuring member consistency in a distributed fault-tolerant computer according to claim 3, characterized in that in step 4), if mem(Aj) ∪ {Ai} = mem(Ai) is true, then Aj is the first successful node of Ai but Ai is not confirmed; in all other (non-null-frame) cases, FC is incremented.
CN201410734530.9A 2014-12-04 2014-12-04 Distributed fault tolerance computer member consistency ensuring method Pending CN104483828A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410734530.9A CN104483828A (en) 2014-12-04 2014-12-04 Distributed fault tolerance computer member consistency ensuring method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410734530.9A CN104483828A (en) 2014-12-04 2014-12-04 Distributed fault tolerance computer member consistency ensuring method

Publications (1)

Publication Number Publication Date
CN104483828A true CN104483828A (en) 2015-04-01

Family

ID=52758386

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410734530.9A Pending CN104483828A (en) 2014-12-04 2014-12-04 Distributed fault tolerance computer member consistency ensuring method

Country Status (1)

Country Link
CN (1) CN104483828A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107707595A (en) * 2017-03-17 2018-02-16 贵州白山云科技有限公司 A kind of member organizes variation and device
CN107707595B (en) * 2017-03-17 2018-06-15 贵州白山云科技有限公司 A kind of member organizes variation and device
CN108959140A (en) * 2017-05-25 2018-12-07 南京航空航天大学 TTP and AFDX adapter based on MPC555 and AN8202
CN109714198A (en) * 2018-12-14 2019-05-03 中国航空工业集团公司西安航空计算技术研究所 A kind of mixed structure network distribution type fault-tolerant computer system fault tolerance management method
CN109714198B (en) * 2018-12-14 2022-03-15 中国航空工业集团公司西安航空计算技术研究所 Fault-tolerant management method for distributed fault-tolerant computer system of mixed structure network

Similar Documents

Publication Publication Date Title
CN204859222U (en) With two high available systems that live of city data center
CN107729366A (en) A kind of pervasive multi-source heterogeneous large-scale data synchronization system
CN103647830B (en) The dynamic management approach of multi-level configuration file in a kind of cluster management system
CN103795754A (en) Method and system for data synchronization among multiple systems
CN103701913B (en) Data synchronization method and device
CN103778031A (en) Distributed system multilevel fault tolerance method under cloud environment
CN104320459A (en) Node management method and device
CN105069152B (en) data processing method and device
CN103944974B (en) A kind of protocol message processing method, controller failure processing method and relevant device
CN102427412A (en) Zero-delay disaster recovery switching method and system of active standby source based on content distribution network
JP6431197B2 (en) Snapshot processing methods and associated devices
CN103888277A (en) Gateway disaster recovery backup method, apparatus and system
CN103209210A (en) Method for improving erasure code based storage cluster recovery performance
CN105553682B (en) Event notification method and the system notified for event
WO2017097006A1 (en) Real-time data fault-tolerance processing method and system
CN104483828A (en) Distributed fault tolerance computer member consistency ensuring method
CN107682411A (en) A kind of extensive SDN controllers cluster and network system
CN105072021A (en) Cross-segment message forwarding method for dispatching automation systems
CN102984174B (en) Reliability guarantee method and system in a kind of distribution subscription system
Cheraghlou et al. A novel fault-tolerant leach clustering protocol for wireless sensor networks
CN107357800A (en) A kind of database High Availabitity zero loses solution method
CN108445857B (en) Design method for 1+ N redundancy mechanism of SCADA system
CN101902382B (en) Ethernet single ring network address refreshing method and system
CN102045187A (en) Method and equipment for realizing HA (high-availability) system with checkpoints
CN103559188B (en) Metadata management method and management system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20150401
