CN104483828A

CN104483828A - Method for ensuring member consistency in a distributed fault-tolerant computer

Info

Publication number: CN104483828A
Application number: CN201410734530.9A
Authority: CN
Inventors: 徐奡; 刘帅; 李鹏; 郑久寿; 马小博; 程俊强
Original assignee: AVIC No 631 Research Institute
Current assignee: AVIC No 631 Research Institute
Priority date: 2014-12-04
Filing date: 2014-12-04
Publication date: 2015-04-01

Abstract

The invention provides a method for ensuring member consistency in a distributed fault-tolerant computer, comprising the following steps: (1) the node state of the fault-tolerant computer is initialized: the initial state of an N-node fault-tolerant computer is set to (Ai, N+1-i, 0, A1A2 ... AN), where Ai is the i-th node, N+1-i is the initial value of the confirmation counter (AC), 0 is the initial value of the failure counter (FC), and A1A2 ... AN is the initial member list, i.e., in the initial state the member list of every node contains all nodes of the system; (2) node Ai broadcasts data frames to all nodes in sequence, and the member list of Ai is denoted mem(Ai). The method targets fault-tolerance technology for distributed computer systems and solves the redundancy management problem of distributed systems: the state of every node can be managed reliably and the formation of cliques among system members is effectively avoided, so that the system responds to faults consistently and in time, providing effective support for new fault-tolerance strategies in airborne distributed fault-tolerant computers.

Description

A method for ensuring member consistency in a distributed fault-tolerant computer

Technical field

The invention belongs to the technical field of distributed fault-tolerant computer system design and relates to a method for ensuring member consistency in a distributed fault-tolerant computer system.

Background

The flight control computer system is the core component of a flight control system, and its safety and reliability directly affect the survivability of the aircraft. As a typical airborne fault-tolerant computer system, the flight control computer system has evolved from centralized fly-by-wire flight control computer systems, through distributed flight control computer systems based on bus communication, to distributed flight control computer systems based on switched networks.

As the architecture of flight control computer systems has developed, so have their fault-tolerance strategies. Traditional centralized fault-tolerant computer systems rely on strategies such as voting and monitoring, fault masking, and resource switching, whereas new distributed fault-tolerant computer systems accomplish system management through high-integrity computing resources, fail-silent nodes, member consistency protocols, and function backup.

Summary of the invention

To solve the technical problems described in the background, the invention provides a method for ensuring member consistency among the nodes of a distributed fault-tolerant computer, suitable for the redundancy management of new airborne distributed fault-tolerant computers.

The technical solution of the invention is a method for ensuring member consistency in a distributed fault-tolerant computer, characterized in that the method comprises the following steps:

1) Initialization of the fault-tolerant computer node state: for a fault-tolerant computer with N nodes, the initial state is set to (Ai, N+1-i, 0, A1A2 ... AN), where Ai is the i-th node, N+1-i is the initial value of the confirmation counter (AC), 0 is the initial value of the failure counter (FC), and A1A2 ... AN is the initial member list; that is, in the initial state the member list of every node contains all nodes of the system;

2) Node Ai broadcasts data frames to all nodes in sequence; the member list of Ai is denoted mem(Ai); Ai checks whether its local AC is greater than its local FC; if so, node Ai resets its local AC and FC, computes a CRC checksum jointly over its local member list and the data to be sent, combines the resulting CRC with the outgoing data into a data frame, and broadcasts this data frame to all nodes (including itself); if not, the node reports an error to the upper-layer application and enters the frozen state;

3) Node Ak receives and decodes the data frame; node Ak receives the data frame sent by Ai and uses its local member list to decode the received frame and check its CRC; a node whose CRC check succeeds considers the data correctly received, and a node whose CRC check fails considers reception of the data frame failed; if node Ak receives the data correctly, Ak adds Ai to its local member list and increments AC; if node Ak fails to receive the data, Ak deletes Ai from its local member list and increments FC; if node Ak receives no data frame from Ai at the expected time point, it deletes Ai from its local member list but does not increment any counter;

4) After sending data, node Ai looks for its first successful node; after sending, node Ai waits for the correct data frame sent by the next node Aj in the expected time slot; it checks whether mem(Ai) = mem(Aj) holds; if so, Aj is the first successful node of Ai, Ai is confirmed (that is, Ai is correct), and the implicit confirmation algorithm terminates;

The method further comprises step 5): when node Ai has found its first successful node but has not been confirmed, it looks for a second successful node; if, in the expected time slot after Aj, Ai correctly receives a data frame sent by node Am, and exactly one of Ai ∈ mem(Am) and Aj ∈ mem(Am) is true, and {mem(Am) - Ai - Aj} = {mem(Ai) - Ai - Aj} holds, then Ai takes Am as its second successful node.

The method further comprises step 6): the correctness of the states of Ai and Aj is judged according to the second successful node; if Ai ∈ mem(Am), Ai is confirmed (that is, Ai is correct) and Aj is faulty; Ai increments its AC and deletes Aj from its member list, and the implicit confirmation algorithm terminates; if Aj ∈ mem(Am), Ai is faulty and Aj is correct; Ai deletes itself from its member list and adds Aj and Am to its local member list, and the implicit confirmation algorithm terminates.

In step 4) above, if {mem(Aj), Ai} = mem(Ai) holds, then Aj is the first successful node of Ai, but Ai is not yet confirmed; in all other (non-null-frame) cases FC is incremented.
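Steps 1) and 2) can be sketched as follows. This is a minimal Python illustration, not the patented implementation: the names `NodeState`, `initial_state` and `may_send` are hypothetical, node indices are assumed to be 1-based integers, and resetting both counters to zero is an assumption, since the text says only that AC and FC are reset.

```python
from dataclasses import dataclass, field


@dataclass
class NodeState:
    """Local state of node Ai: (index, AC, FC, member list)."""
    index: int                          # i, 1-based node index
    ac: int                             # confirmation counter (AC)
    fc: int                             # failure counter (FC)
    members: set = field(default_factory=set)
    frozen: bool = False


def initial_state(i: int, n: int) -> NodeState:
    """Step 1: state (Ai, N+1-i, 0, A1..AN)."""
    return NodeState(index=i, ac=n + 1 - i, fc=0,
                     members=set(range(1, n + 1)))


def may_send(node: NodeState) -> bool:
    """Step 2 gate: send only if AC > FC, otherwise freeze.

    Resetting both counters to 0 is an assumption of this sketch.
    """
    if node.ac > node.fc:
        node.ac = 0
        node.fc = 0
        return True
    node.frozen = True      # the error report to the application is omitted
    return False
```

With this initialization every node passes the AC > FC check the first time it sends, because N+1-i is at least 1 for every i from 1 to N.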

The advantages of the invention are:

1) It implements the member management function of a distributed fault-tolerant computer; the algorithm is flexible and highly extensible, and it provides effective technical support for the redundancy management of distributed fault-tolerant computers with a new open architecture.

2) It effectively avoids the clique problem among the member nodes of a distributed architecture; the algorithm is highly reliable and safe.

3) During member assurance, data are encoded with a CRC checksum computed over a data frame that includes the local member list; nodes do not need to actually exchange member lists when communicating, so no extra bandwidth is consumed and only a small amount of computing time is needed, keeping resource usage low.
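The idea in advantage 3), that the member list influences the CRC without being transmitted, can be illustrated with a short Python sketch. Everything here is an assumption made for illustration: the patent does not specify the CRC polynomial, the byte layout of the member list, or the frame format; `zlib.crc32` and the helper names `frame_crc`, `build_frame` and `accept_frame` are stand-ins.

```python
import zlib


def frame_crc(members: set, payload: bytes) -> int:
    """CRC-32 over the sorted member list followed by the payload."""
    member_bytes = bytes(sorted(members))   # assumes node ids in 0..255
    return zlib.crc32(member_bytes + payload)


def build_frame(members: set, payload: bytes) -> bytes:
    """Sender side: the frame carries payload + CRC, not the list itself."""
    return payload + frame_crc(members, payload).to_bytes(4, "big")


def accept_frame(local_members: set, frame: bytes) -> bool:
    """Receiver side: recompute the CRC using the local member list."""
    payload, received = frame[:-4], int.from_bytes(frame[-4:], "big")
    return frame_crc(local_members, payload) == received
```

A receiver whose member list matches the sender's accepts the frame; a receiver with a different list computes a different CRC (barring a checksum collision) and treats the reception as failed, which is how a disagreement about membership becomes visible without sending the list.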

Built on a high-integrity distributed computing platform with deterministic bus communication and using fail-silent node technology, the invention provides a method for ensuring member consistency in a distributed fault-tolerant computer. The method reliably manages the state of every node in the system, effectively avoids the formation of cliques among system members, and establishes deterministic replica consistency for all messages, so that the system responds to faults consistently and in time. It is also flexible and extensible, and provides effective support for new fault-tolerance strategies in airborne distributed fault-tolerant computers.

Brief description of the drawings

Fig. 1 is a flowchart of the implicit confirmation algorithm of the invention.

Fig. 2 is a flowchart of the decision algorithm of the invention.

Detailed description of the embodiment

Each node of the distributed system stores a member list that records the operating state of all nodes. In every cycle, each node updates its local member list according to the information it receives, and the mutual confirmation of members between nodes ensures that all nodes are consistent when a message is received, thereby guaranteeing the validity of voting results. The method is implemented mainly by a decision algorithm and an implicit confirmation algorithm.

Decision algorithm: each node maintains a local member list. When a node is ready to send data, it adds itself to its local member list. When a node receives a correct data frame, it adds the sending node to its local member list. A data transmission is correct when: (1) the transmission occurs at the expected time point; (2) the transmission completes successfully; and (3) after the sender has been added to the receiver's member list, the member lists of both sides are identical. When reception fails or no data frame is received, the receiving node deletes from its member list the node that was due to send in that time slot.
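The receiver side of the decision algorithm can be sketched as a small update function. This is an illustrative Python sketch under stated assumptions: the names `Reception` and `update_on_receive` are hypothetical, node identifiers are plain integers, and the counter handling follows step (3) of the method (AC on correct reception, FC on failed reception, neither when no frame arrives).

```python
from enum import Enum


class Reception(Enum):
    CORRECT = "correct"     # frame arrived in its slot and passed the check
    FAILED = "failed"       # frame arrived but the check failed
    NULL = "null"           # nothing arrived at the expected time point


def update_on_receive(members: set, ac: int, fc: int,
                      sender: int, result: Reception):
    """Return the new (members, ac, fc) after the slot of `sender`."""
    members = set(members)              # work on a copy
    if result is Reception.CORRECT:
        members.add(sender)
        ac += 1
    elif result is Reception.FAILED:
        members.discard(sender)
        fc += 1
    else:                               # null frame: drop sender only
        members.discard(sender)
    return members, ac, fc
```

Returning a new tuple rather than mutating shared state keeps the sketch easy to test; a real node would update its own state in place once per slot.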

Implicit confirmation algorithm: this algorithm introduces the concepts of a first successful node and a second successful node. After a node (say A) sends data, it waits for the correct message sent by the next node (say B) in the expected time slot. If A and B have identical member lists, or if B's member list is identical to A's except that it does not contain A, then A takes B as its first successful node. In the former case A is confirmed and the implicit confirmation algorithm terminates. In the latter case either A's transmission was faulty or B's reception was faulty; to decide between these two cases, A waits for a second successful node. If node C sends a correctly formatted frame in the expected time slot after B, and C's member list contains exactly one of A and B and agrees with A's member list on all nodes other than A and B, then A takes C as its second successful node. If C's member list contains A, A is confirmed, B is considered faulty, and A deletes B from its member list. If C's member list does not contain A, A is faulty, so A deletes itself from its member list and adds B and C to it. The implicit confirmation algorithm then terminates.
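The first-successful-node check can be written as a three-way classification. The sketch below is a hedged Python illustration: `check_first_successor` is a hypothetical name, and it takes B's member list as an explicit set for clarity, whereas in the method A would only infer it from whether B's frame passes the CRC check under each hypothesis.

```python
def check_first_successor(mem_a: frozenset, mem_b: frozenset, a: int) -> str:
    """Classify the frame of B, the next sender after A, from A's view.

    mem_a is A's own member list (it contains A, which added itself
    before sending); mem_b is the list B is taken to hold.
    """
    if mem_b == mem_a:
        return "confirmed"      # B received A's frame: A is confirmed
    if mem_b | {a} == mem_a:
        return "pending"        # B differs only by lacking A: await C
    return "other"              # any other non-null frame: FC is incremented
```

For example, with A = 1 and mem(A) = {1, 2, 3}, a frame from B built with {2, 3} leaves A pending, because adding A to B's list reproduces A's own list.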

The invention is described in further detail below with reference to Fig. 1 and Fig. 2.

(1) Initialization of the fault-tolerant computer node state. For a fault-tolerant computer with N nodes, the initial state is set to (Ai, N+1-i, 0, A1A2 ... AN), where Ai is the i-th node, N+1-i is the initial value of the confirmation counter (AC), 0 is the initial value of the failure counter (FC), and A1A2 ... AN is the initial member list; that is, in the initial state the member list of every node contains all nodes of the system.

(2) Node Ai broadcasts data frames to all nodes in sequence. The member list of Ai is denoted mem(Ai), and likewise for other nodes below. Ai checks whether its local AC is greater than its local FC. If so, node Ai resets its local AC and FC, computes a CRC checksum jointly over its local member list and the data to be sent, combines the resulting CRC with the outgoing data into a data frame, and broadcasts this data frame to all nodes (including itself). If not, the node reports an error to the upper-layer application and enters the frozen state.

(3) Node Ak receives and decodes the data frame. Node Ak receives the data frame sent by Ai and uses its local member list to decode the received frame and check its CRC. A node whose CRC check succeeds considers the data correctly received, and a node whose CRC check fails considers reception of the data frame failed. If node Ak receives the data correctly, Ak adds Ai to its local member list and increments AC. If node Ak fails to receive the data, Ak deletes Ai from its local member list and increments FC. If node Ak receives no data frame from Ai at the expected time point, it deletes Ai from its local member list but does not increment any counter.

(4) After sending data, node Ai looks for its first successful node. After sending, node Ai waits for the correct data frame sent by the next node Aj in the expected time slot. It checks whether mem(Ai) = mem(Aj) holds; if so, Aj is the first successful node of Ai, Ai is confirmed (that is, Ai is correct), and the implicit confirmation algorithm terminates. If {mem(Aj), Ai} = mem(Ai) holds, then Aj is the first successful node of Ai, but Ai is not yet confirmed. In all other (non-null-frame) cases FC is incremented.

(5) When node Ai has found its first successful node but has not been confirmed, it looks for a second successful node. If, in the expected time slot after Aj, Ai correctly receives a data frame sent by node Am, and exactly one of Ai ∈ mem(Am) and Aj ∈ mem(Am) is true, and {mem(Am) - Ai - Aj} = {mem(Ai) - Ai - Aj} holds, then Ai takes Am as its second successful node.

(6) The correctness of the states of Ai and Aj is judged according to the second successful node. If Ai ∈ mem(Am), Ai is confirmed (that is, Ai is correct) and Aj is faulty; Ai increments its AC and deletes Aj from its member list, and the implicit confirmation algorithm terminates. If Aj ∈ mem(Am), Ai is faulty and Aj is correct; Ai deletes itself from its member list and adds Aj and Am to its local member list, and the implicit confirmation algorithm terminates.
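Steps (5) and (6) together amount to letting Am arbitrate between Ai and Aj. The following Python sketch shows one way to express that rule; it is an illustration under assumptions, not the patented implementation. The function name `resolve_with_second_successor`, the verdict strings, and the use of integer node identifiers are all hypothetical, and mem(Am) is passed in as an explicit set for readability.

```python
def resolve_with_second_successor(mem_a: set, ac_a: int, mem_m: set,
                                  a: int, j: int, m: int):
    """Let Am (m) arbitrate between sender Ai (a) and Aj (j).

    Returns (verdict, new_mem_a, new_ac_a); verdict is None when Am
    does not qualify as the second successful node of Ai.
    """
    # {mem(Am) - Ai - Aj} = {mem(Ai) - Ai - Aj}
    others_match = (mem_m - {a, j}) == (mem_a - {a, j})
    # exactly one of Ai, Aj is in mem(Am)
    exactly_one = (a in mem_m) != (j in mem_m)
    if not (others_match and exactly_one):
        return None, set(mem_a), ac_a
    if a in mem_m:
        # Ai correct, Aj faulty: increment AC and delete Aj
        return "sender_correct", mem_a - {j}, ac_a + 1
    # Aj correct, Ai faulty: delete self, add Aj and Am
    return "sender_faulty", (mem_a - {a}) | {j, m}, ac_a
```

The inequality of the two membership tests is a compact way to express "exactly one is true"; if Am lists both or neither of Ai and Aj, it cannot break the tie and the function reports no verdict.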

Claims

1. A method for ensuring member consistency in a distributed fault-tolerant computer, characterized in that the method comprises the following steps:

1) initialization of the fault-tolerant computer node state: for a fault-tolerant computer with N nodes, the initial state is set to (Ai, N+1-i, 0, A1A2 ... AN), where Ai is the i-th node, N+1-i is the initial value of the confirmation counter (AC), 0 is the initial value of the failure counter (FC), and A1A2 ... AN is the initial member list; that is, in the initial state the member list of every node contains all nodes of the system;

2) node Ai broadcasts data frames to all nodes in sequence; the member list of Ai is denoted mem(Ai); Ai checks whether its local AC is greater than its local FC; if so, node Ai resets its local AC and FC, computes a CRC checksum jointly over its local member list and the data to be sent, combines the resulting CRC with the outgoing data into a data frame, and broadcasts this data frame to all nodes (including itself); if not, the node reports an error to the upper-layer application and enters the frozen state;

3) node Ak receives and decodes the data frame; node Ak receives the data frame sent by Ai and uses its local member list to decode the received frame and check its CRC; a node whose CRC check succeeds considers the data correctly received, and a node whose CRC check fails considers reception of the data frame failed; if node Ak receives the data correctly, Ak adds Ai to its local member list and increments AC; if node Ak fails to receive the data, Ak deletes Ai from its local member list and increments FC; if node Ak receives no data frame from Ai at the expected time point, it deletes Ai from its local member list but does not increment any counter;

4) after sending data, node Ai looks for its first successful node; after sending, node Ai waits for the correct data frame sent by the next node Aj in the expected time slot; it checks whether mem(Ai) = mem(Aj) holds; if so, Aj is the first successful node of Ai, Ai is confirmed (that is, Ai is correct), and the implicit confirmation algorithm terminates.

2. The method for ensuring member consistency in a distributed fault-tolerant computer according to claim 1, characterized in that the method further comprises step 5): when node Ai has found its first successful node but has not been confirmed, it looks for a second successful node; if, in the expected time slot after Aj, Ai correctly receives a data frame sent by node Am, and exactly one of Ai ∈ mem(Am) and Aj ∈ mem(Am) is true, and {mem(Am) - Ai - Aj} = {mem(Ai) - Ai - Aj} holds, then Ai takes Am as its second successful node.

3. The method for ensuring member consistency in a distributed fault-tolerant computer according to claim 2, characterized in that the method further comprises step 6): the correctness of the states of Ai and Aj is judged according to the second successful node; if Ai ∈ mem(Am), Ai is confirmed (that is, Ai is correct) and Aj is faulty; Ai increments its AC and deletes Aj from its member list, and the implicit confirmation algorithm terminates; if Aj ∈ mem(Am), Ai is faulty and Aj is correct; Ai deletes itself from its member list and adds Aj and Am to its local member list, and the implicit confirmation algorithm terminates.

4. The method for ensuring member consistency in a distributed fault-tolerant computer according to claim 3, characterized in that in step 4), if {mem(Aj), Ai} = mem(Ai) holds, then Aj is the first successful node of Ai, but Ai is not yet confirmed; in all other (non-null-frame) cases FC is incremented.