CN111338857A - Byzantine fault-tolerant consensus protocol - Google Patents


Info

Publication number: CN111338857A
Application number: CN202010087336.1A
Authority: CN (China)
Legal status: Pending (the listed status is an assumption, not a legal conclusion)
Prior art keywords: replica, message, nodes, messages, request
Other languages: Chinese (zh)
Inventor: 张晴
Current assignee: Anhui University of Science and Technology
Original assignee: Anhui University of Science and Technology
Application filed by Anhui University of Science and Technology

Classifications

    • G06F11/202 — Error detection or correction of the data by redundancy in hardware using active fault-masking, where processing functionality is redundant (G PHYSICS › G06 COMPUTING; CALCULATING OR COUNTING › G06F ELECTRIC DIGITAL DATA PROCESSING › G06F11/00 Error detection; error correction; monitoring › G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance › G06F11/16 › G06F11/20)
    • G06F11/0709 — Error or fault processing not based on redundancy, the processing taking place in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems (G PHYSICS › G06 COMPUTING; CALCULATING OR COUNTING › G06F ELECTRIC DIGITAL DATA PROCESSING › G06F11/00 › G06F11/07 › G06F11/0703 › G06F11/0706)

Abstract

The invention discloses a Byzantine fault-tolerant consensus protocol comprising three sub-protocols: a consensus protocol, a view change protocol, and a checkpoint protocol. The consensus protocol coordinates the replica nodes to reach agreement on the request execution order, and decides, according to how each request is committed, whether to automatically rotate in a new primary replica. When the replicas time out without reaching agreement, or the automatic primary rotation fails, the backup replicas trigger the view change protocol, elect a new primary replica, and execute the consensus protocol again, ensuring that consensus is eventually reached. After the distributed system completes a certain number of requests, the replicas' logs are cleaned and each replica updates its state. The method degrades gracefully when a faulty replica (primary or backup) is present in the system, solving the severe performance degradation that some BFT protocols suffer when faulty nodes exist.

Description

Byzantine fault-tolerant consensus protocol
Technical Field
The invention relates to replica management in the field of distributed systems, and in particular to a Byzantine fault-tolerant consensus protocol.
Background
Today, data centers grow ever larger, machines become more numerous, and the problems they may encounter multiply. News reports regularly describe data centers knocked out of service by lightning strikes, power outages, and similar disasters. Replication is currently the mainstream solution to the server-failure problem: the data of one server is copied to multiple replicas on different machines, so the system keeps operating normally as long as the data on some of the servers remain intact. State Machine Replication (SMR) ensures that every operation in the system is executed in the same order on the different replicas; because all operations are executed in one agreed order, the execution results of all replicas are consistent [1]. The system therefore provides a consistent service to its clients. However, a malicious node may exist among the replica nodes and prevent the system from providing normal service. To resist malicious replicas, the system must manage the replicas with a Byzantine fault-tolerant algorithm and guarantee the consistency of the data replicas on all correct nodes, so that the system achieves high availability and high reliability.
The consistency problem in computer networks was formalized by the computer scientist Leslie Lamport et al. in 1982 and named the Byzantine Generals Problem, or Byzantine failure [2]. The Byzantine Generals Problem describes how loyal generals can coordinate an attack or retreat while traitors lurk among the troops. The Byzantine assumption is a model of the real world: computers and networks can behave unpredictably because of hardware errors, network congestion or disconnection, and malicious attacks. Byzantine fault tolerance can tolerate software errors and security vulnerabilities of any form, and is a general scheme for solving the fault-tolerance problem of distributed systems [1]. A Byzantine Fault Tolerance (BFT) algorithm is mainly used to make the replica nodes agree on the sequence of requests to be executed, so that the system can still provide service when f replica nodes fail. The literature [3] proves that the system needs at least 3f+1 nodes to tolerate f faulty nodes.
Researchers later put forward the Practical Byzantine Fault Tolerance algorithm [4], which greatly reduced the running overhead of Byzantine agreement. Compared with the original BFT algorithm, its complexity drops from exponential to polynomial, making practical application of BFT possible. Current Byzantine fault-tolerant methods fall mainly into two categories: quorum-based and primary-based. In quorum-based BFT protocols [5, 6, 7], a replica directly processes the requests it receives and replies to the client, and the consistency check is performed by the client. Clearly, quorum-based BFT protocols perform well under low concurrency and poorly under high concurrency. In contrast, in primary-based BFT protocols [2, 8, 9], before the replicas execute a request, the primary replica assigns it a sequence number, the backup replicas agree on that sequence number, and finally the request is executed and the result returned to the client. The process by which the replicas agree on the sequence number is called consensus. Under high concurrency, such a consensus protocol effectively avoids conflicts and maintains good performance; but if the primary is a faulty node, performance drops dramatically.
To design more efficient and robust consensus algorithms, researchers have proposed various optimization techniques, such as optimistic speculative execution [9, 10], trusted components [11, 12], and periodically rotating the leader [13]. Efficiency means reaching consensus at low cost; robustness means keeping the consensus cost reasonable even when faulty nodes are present. However, many current BFT protocols either pursue efficiency while ignoring robustness [9, 10], or pursue robustness at the expense of efficiency [13, 14, 15].
Disclosure of Invention
The invention provides a Byzantine fault-tolerant consensus protocol aimed at the problem that existing Byzantine fault-tolerant algorithms suffer severe performance degradation when a faulty node exists or when conflicts are severe.
A Byzantine fault-tolerant consensus protocol works in a distributed system containing 3f+1 replica nodes, of which at most f are faulty; f is less than one third of the total number of replica nodes.
The method comprises three sub-protocols: a consensus protocol, a view change protocol, and a checkpoint protocol;
the consensus protocol coordinates the replica nodes to reach agreement on the request execution order, and decides, according to how each request is committed, whether to automatically rotate in a new primary replica;
when the replicas time out without reaching agreement, or the automatic primary rotation fails, the backup replicas trigger the view change protocol, elect a new primary replica, and execute the consensus protocol again, ensuring that consensus is eventually reached;
after the distributed system completes a certain number of requests, the replicas' logs are cleaned and each replica updates its state.
In the consensus protocol, the nodes agree on the request order through message exchange. The consensus protocol comprises the following steps:
L1. The client c sends a request to all replicas, i.e., sends a request message to the replica nodes.
L2. After receiving a valid client request message, each replica assigns the next available sequence number s to the request and sends a pre-prepare message to the primary replica.
L3. On receiving a pre-prepare message, the primary checks its validity and accepts it if the check passes. (1) If 2f+1 consistent pre-prepare messages are received, they form a cert message, which the primary sends to the backups; (2) if the primary cannot collect 2f+1 consistent pre-prepare messages, it constructs a cert message and a prepare message from the valid pre-prepare messages it has received, and sends them to the backups.
L4. The backups check the validity of the cert message:
L4-1. No conflict or light conflict: if the cert message contains 2f+1 consistent pre-prepare messages, the replica may execute the request and reply to the client with a reply message.
L4-2. Heavy conflict: if the cert message does not contain 2f+1 consistent pre-prepare messages, the replicas construct commit messages from the prepare messages, broadcast them to the other replicas, and proceed to L5.
L5. If a replica receives 2f+1 matching commit messages from the other replicas, it may execute the request and return a reply message to the client.
L6. Once the client receives 2f+1 consistent reply messages, the request is complete.
A replica thus has two opportunities to commit a request:
(1) With no conflict or only light conflict, the replicas can complete the commit of the request at stage L4 even if a faulty node exists. This means 2f+1 replicas commit the request, and therefore at least f+1 correct replicas commit it; the quorum size 2f+1 guarantees that the fact that the request is executed under that sequence number cannot be changed, even by faulty nodes. (2) Under heavy conflict, the method allows the replicas to attempt the commit a second time, through an all-to-all message exchange. A replica may commit the request once it receives 2f+1 consistent commit messages; the quorum size 2f+1 ensures that at least f+1 correct replicas commit the request and will not commit any other request under the same sequence number.
The system contains 3f+1 replica nodes, of which f may be faulty, and both commit opportunities require at least 2f+1 replicas to complete the local commit of the request. Any two quorums of 2f+1 replicas must intersect in at least one correct replica, and a correct replica never commits and executes two different requests at the same sequence number. The quorum size 2f+1 is thus a safety guarantee: once a request completes, it remains in the log, and this outcome cannot be changed even if faulty replicas exist. In terms of completion time and scalability, the mechanism ensures that, with no conflict or only light conflict, the consensus protocol completes a request within four message delays after the client sends it, while retaining good scalability; under heavy conflict, the consensus protocol completes a request within five message delays after the client sends it.
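The quorum arithmetic behind the 2f+1 threshold can be checked mechanically. The sketch below (illustrative Python, not part of the patent) computes, for n = 3f+1 replicas, the minimum overlap of any two commit quorums and the minimum number of correct replicas in that overlap:

```python
def quorum_properties(f: int) -> tuple[int, int]:
    """For n = 3f+1 replicas with quorums of size 2f+1, return
    (minimum overlap of any two quorums, minimum correct replicas in the overlap)."""
    n = 3 * f + 1
    quorum = 2 * f + 1
    min_overlap = 2 * quorum - n   # pigeonhole: |Q1| + |Q2| - n
    min_correct = min_overlap - f  # at most f replicas in the overlap can be faulty
    return min_overlap, min_correct

for f in range(1, 6):
    overlap, correct = quorum_properties(f)
    # Any two quorums share f+1 replicas, hence at least one correct replica,
    # which never commits two different requests at the same sequence number.
    assert overlap == f + 1 and correct >= 1
```

This is exactly why the intersection argument in the text holds for every value of f, not just small examples.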
The view change protocol is as follows. Replica nodes work in a series of views, where a view is the current system configuration. Each view contains one primary replica and 3f backup replicas. Views are numbered consecutively, and the primary of view v is the replica p = v mod (3f+1). When a backup finds that the primary is faulty, that the system runs too slowly, or that the automatic primary rotation has failed, the view change protocol is triggered. In the view change protocol, a replica needs three phases to start the new view v+1.
C1: A backup replica broadcasts a view_change message, telling the other replicas that it suspects the current primary and wants to elect a new primary through a view change; when another replica q receives f+1 view_change messages from other replicas, it broadcasts its own view_change message and commits to entering the view change phase.
C2: After receiving 2f+1 valid view_change messages, the new primary broadcasts a new_view message to the other replicas; each new_view message contains the 2f+1 view_change messages.
C3: After receiving the new_view message, a replica determines the starting state of the new view from the view_change messages it contains; having determined the new view state, the replica sends a view_confirm message to the other replicas. Once the replicas have received 2f+1 consistent view_confirm messages, they begin processing messages in the new view. At this point the view change has succeeded.
When a replica node processes requests, it records the related information, and this information forms a log. If the log is not cleaned in time, it occupies a large amount of storage and degrades system performance and availability. On the other hand, because Byzantine nodes exist, the consensus protocol cannot guarantee that every node executes the same requests, so the states of different replicas may diverge. Periodic checkpoints are therefore set in a Byzantine system to synchronize the replicas to one common state. The checkpoint protocol thus periodically processes the log, collects garbage, and synchronizes replica state.
Log processing must distinguish which log entries can be deleted and which must still be kept; a stable checkpoint serves this purpose. The checkpoint protocol comprises the following steps:
After executing a certain number of requests, a replica triggers the checkpoint protocol, places its own history information in a checkpoint message, and sends it to all other replicas.
When a replica receives 2f+1 matching checkpoint messages, the state they contain is consistent on at least f+1 correct nodes, and the corresponding checkpoint becomes a stable checkpoint. The replica then cleans its log up to the sequence number of the stable checkpoint and updates its own state.
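The stable-checkpoint rule can be sketched as follows (illustrative Python; tracking votes by a state digest and the log layout are my own assumptions, not details given in the patent):

```python
F = 1  # tolerated faults; stability needs 2f+1 matching checkpoint messages

class CheckpointTracker:
    def __init__(self):
        self.votes = {}      # (seq, state_digest) -> set of sender ids
        self.stable_seq = 0  # sequence number of the latest stable checkpoint
        self.log = {}        # seq -> recorded request information

    def record(self, seq: int, info: str) -> None:
        self.log[seq] = info

    def on_checkpoint(self, seq: int, digest: str, sender: int) -> None:
        """Count matching checkpoint messages; on 2f+1 votes the checkpoint
        becomes stable and older log entries are garbage-collected."""
        key = (seq, digest)
        self.votes.setdefault(key, set()).add(sender)
        if len(self.votes[key]) >= 2 * F + 1 and seq > self.stable_seq:
            self.stable_seq = seq
            # Entries at or below the stable checkpoint are no longer needed.
            self.log = {s: r for s, r in self.log.items() if s > seq}

t = CheckpointTracker()
for s in range(1, 6):
    t.record(s, f"req-{s}")
for sender in (0, 1, 2):               # 2f+1 = 3 matching messages
    t.on_checkpoint(3, "digest-at-3", sender)
assert t.stable_seq == 3
assert sorted(t.log) == [4, 5]         # entries 1..3 were cleaned
```

Keying votes on both the sequence number and a digest of the state captures the "matching" requirement: 2f+1 messages only make a checkpoint stable if they attest to the same state.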
Compared with the prior art, the technical scheme of the invention achieves the following improvements:
(1) The invention proposes a double-commit mechanism that gives a replica two opportunities to commit a request. The replica may complete the commit at stage L4; if it cannot, it has a second chance to commit the request at stage L5. The double-commit mechanism improves the scalability of the algorithm in the good case, and reduces the time the nodes need to reach consensus when a node (primary or backup) is faulty.
(2) The causal primary-change mechanism provided by the invention decides whether to rotate the primary automatically according to how each request commits. If the request committed successfully at stage L4, the primary is kept; if it committed at stage L5, the next replica becomes the new primary. This mechanism reduces the probability that a malicious replica acts as the primary, and avoids the sharp performance drop that a malicious primary would otherwise cause.
(3) Based on the above two mechanisms, we designed RBFT. While preserving correctness, the invention improves the performance and robustness of existing BFT algorithms in terms of latency, throughput, and scalability.
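The causal primary-change rule of item (2) reduces to a simple per-request decision. A sketch (function and constant names are my own, not from the patent):

```python
N = 4  # number of replicas, 3f+1 with f = 1

def next_primary(current_primary: int, committed_at: str) -> int:
    """Decide the primary for the next request based on how this one committed.
    "L4": fast-path commit, the primary behaved well, keep it.
    "L5": slow-path commit under heavy conflict, rotate to the next replica."""
    if committed_at == "L4":
        return current_primary
    if committed_at == "L5":
        return (current_primary + 1) % N
    # No commit at all is handled by the view change protocol, not here.
    raise ValueError("uncommitted request: trigger view change instead")

assert next_primary(0, "L4") == 0  # fast commit: no rotation
assert next_primary(0, "L5") == 1  # slow commit: automatic rotation
assert next_primary(3, "L5") == 0  # rotation wraps around the replica set
```

The point of the rule is that a slow-path commit is treated as circumstantial evidence against the current primary, so rotation happens without paying for a full view change.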
Drawings
FIG. 1 is a flow chart of the consensus protocol of the Byzantine consensus algorithm provided by the present invention;
FIG. 2 is a flow chart of the view change protocol of the Byzantine consensus algorithm provided by the present invention;
FIG. 3 is a flow chart of the checkpoint protocol of the Byzantine consensus algorithm provided by the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and are used for illustration only, and should not be construed as limiting the patent. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
As shown in FIG. 1, a Byzantine fault-tolerant consensus protocol works in a distributed system containing 3f+1 replica nodes, of which at most f are faulty; f is less than one third of the total number of replica nodes.
The method comprises three sub-protocols: a consensus protocol, a view change protocol, and a checkpoint protocol;
the consensus protocol coordinates the replica nodes to reach agreement on the request execution order, and decides, according to how each request is committed, whether to automatically rotate in a new primary replica;
when the replicas time out without reaching agreement, or the automatic primary rotation fails, the backup replicas trigger the view change protocol, elect a new primary replica, and execute the consensus protocol again, ensuring that consensus is eventually reached;
after the distributed system completes a certain number of requests, the replicas' logs are cleaned and each replica updates its state.
In the consensus protocol, the nodes agree on the request order through message exchange. The consensus protocol comprises the following steps:
L1. The client c sends a request to all replicas, i.e., sends a request message to the replica nodes.
L2. After receiving a valid client request message, each replica assigns the next available sequence number s to the request and sends a pre-prepare message to the primary replica.
L3. On receiving a pre-prepare message, the primary checks its validity and accepts it if the check passes. (1) If 2f+1 consistent pre-prepare messages are received, they form a cert message, which the primary sends to the backups; (2) if the primary cannot collect 2f+1 consistent pre-prepare messages, it constructs a cert message and a prepare message from the valid pre-prepare messages it has received, and sends them to the backups.
L4. The backups check the validity of the cert message:
L4-1. No conflict or light conflict: if the cert message contains 2f+1 consistent pre-prepare messages, the replica may execute the request and reply to the client with a reply message.
L4-2. Heavy conflict: if the cert message does not contain 2f+1 consistent pre-prepare messages, the replicas construct commit messages from the prepare messages, broadcast them to the other replicas, and proceed to L5.
L5. If a replica receives 2f+1 matching commit messages from the other replicas, it may execute the request and return a reply message to the client.
L6. Once the client receives 2f+1 consistent reply messages, the request is complete.
A replica thus has two opportunities to commit a request:
(1) With no conflict or only light conflict, the replicas can complete the commit of the request at stage L4 even if a faulty node exists. This means 2f+1 replicas commit the request, and therefore at least f+1 correct replicas commit it; the quorum size 2f+1 guarantees that the fact that the request is executed under that sequence number cannot be changed, even by faulty nodes. (2) Under heavy conflict, the method allows the replicas to attempt the commit a second time, through an all-to-all message exchange. A replica may commit the request once it receives 2f+1 consistent commit messages; the quorum size 2f+1 ensures that at least f+1 correct replicas commit the request and will not commit any other request under the same sequence number.
The system contains 3f+1 replica nodes, of which f may be faulty, and both commit opportunities require at least 2f+1 replicas to complete the local commit of the request. Any two quorums of 2f+1 replicas must intersect in at least one correct replica, and a correct replica never commits and executes two different requests at the same sequence number. The quorum size 2f+1 is thus a safety guarantee: once a request completes, it remains in the log, and this outcome cannot be changed even if faulty replicas exist. In terms of completion time and scalability, the mechanism ensures that, with no conflict or only light conflict, the consensus protocol completes a request within four message delays after the client sends it, while retaining good scalability; under heavy conflict, the consensus protocol completes a request within five message delays after the client sends it.
The view change protocol is as follows. Replica nodes work in a series of views, where a view is the current system configuration. Each view contains one primary replica and 3f backup replicas. Views are numbered consecutively, and the primary of view v is the replica p = v mod (3f+1). When a backup finds that the primary is faulty, that the system runs too slowly, or that the automatic primary rotation has failed, the view change protocol is triggered. In the view change protocol, a replica needs three phases to start the new view v+1.
C1: A backup replica broadcasts a view_change message, telling the other replicas that it suspects the current primary and wants to elect a new primary through a view change; when another replica q receives f+1 view_change messages from other replicas, it broadcasts its own view_change message and commits to entering the view change phase.
C2: After receiving 2f+1 valid view_change messages, the new primary broadcasts a new_view message to the other replicas; each new_view message contains the 2f+1 view_change messages.
C3: After receiving the new_view message, a replica determines the starting state of the new view from the view_change messages it contains; having determined the new view state, the replica sends a view_confirm message to the other replicas. Once the replicas have received 2f+1 consistent view_confirm messages, they begin processing messages in the new view. At this point the view change has succeeded.
When a replica node processes requests, it records the related information, and this information forms a log. If the log is not cleaned in time, it occupies a large amount of storage and degrades system performance and availability. On the other hand, because Byzantine nodes exist, the consensus protocol cannot guarantee that every node executes the same requests, so the states of different replicas may diverge. Periodic checkpoints are therefore set in a Byzantine system to synchronize the replicas to one common state. The checkpoint protocol thus periodically processes the log, collects garbage, and synchronizes replica state.
Log processing must distinguish which log entries can be deleted and which must still be kept; a stable checkpoint serves this purpose. The checkpoint protocol comprises the following steps:
After executing a certain number of requests, a replica triggers the checkpoint protocol, places its own history information in a checkpoint message, and sends it to all other replicas.
When a replica receives 2f+1 matching checkpoint messages, the state they contain is consistent on at least f+1 correct nodes, and the corresponding checkpoint becomes a stable checkpoint. The replica then cleans its log up to the sequence number of the stable checkpoint and updates its own state.
It should be understood that the above-described embodiments of the present invention are merely examples given to illustrate the present invention clearly, and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to enumerate all embodiments exhaustively. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the claims of the present invention.
Reference documents:
[1] Fan J, Yi L, Shu J. Survey of research on Byzantine system technologies [J]. Journal of Software, 2013(6): 1346-1360. (in Chinese)
[2] Lamport L, Shostak R, Pease M. The Byzantine Generals Problem [J]. ACM Transactions on Programming Languages and Systems, 1982, 4(3): 382-401.
[3] Dolev D, Lynch N A, Pinter S S, et al. Reaching approximate agreement in the presence of faults [J]. Journal of the ACM, 1986, 33(3): 499-516.
[4] Castro M, Liskov B. Practical Byzantine fault tolerance [C]. Symposium on Operating Systems Design & Implementation, 1999.
[5] Malkhi D, Reiter M K. Byzantine quorum systems [J]. Distributed Computing, 1998, 11(4): 203-213.
[6] Abd-El-Malek M, Ganger G R, Goodson G R, et al. Fault-scalable Byzantine fault-tolerant services [J]. Symposium on Operating Systems Principles, 2005, 39(5): 59-74.
[7] Cowling J A, Myers D S, Liskov B, et al. HQ replication: a hybrid quorum protocol for Byzantine fault tolerance [C]. Operating Systems Design and Implementation, 2006: 177-190.
[8] Yin J, Martin J, Venkataramani A, et al. Separating agreement from execution for Byzantine fault tolerant services [J]. Symposium on Operating Systems Principles, 2003, 37(5): 253-267.
[9] Kotla R, Alvisi L, Dahlin M, et al. Zyzzyva: speculative Byzantine fault tolerance [J]. Symposium on Operating Systems Principles, 2007, 41(6): 45-58.
[10] Duan S, Peisert S, Levitt K N, et al. hBFT: Speculative Byzantine Fault Tolerance with Minimum Cost [J]. IEEE Transactions on Dependable and Secure Computing, 2015, 12(1): 58-70.
[11] Veronese G S, Correia M, Bessani A, et al. Efficient Byzantine Fault-Tolerance [J]. IEEE Transactions on Computers, 2013, 62(1): 16-30.
[12] Liu J, Li W, Karame G O, et al. Scalable Byzantine Consensus via Hardware-Assisted Secret Sharing [J]. IEEE Transactions on Computers, 2019, 68(1): 139-151.
[13] Veronese G S, Correia M, Bessani A, et al. Spin One's Wheels? Byzantine Fault Tolerance with a Spinning Primary [C]. Symposium on Reliable Distributed Systems, 2009: 135-144.
[14] Amir Y, Coan B A, Kirsch J, et al. Byzantine replication under attack [C]. Dependable Systems and Networks, 2008: 197-206.
[15] Clement A, Wong E L, Alvisi L, et al. Making Byzantine fault tolerant systems tolerate Byzantine faults [C]. Networked Systems Design and Implementation, 2009: 153-168.

Claims (6)

1. A Byzantine fault-tolerant consensus protocol based on a primary-based BFT algorithm, working in a distributed system containing 3f+1 replica nodes, of which at most f are faulty, f being less than one third of the total number of replica nodes, the protocol comprising three sub-protocols: a consensus protocol, a view change protocol, and a checkpoint protocol; the consensus protocol coordinates the replica nodes to reach agreement on the request execution order, and decides, according to how each request is committed, whether to automatically rotate in a new primary replica; when the replicas time out without reaching agreement, or the automatic primary rotation fails, the backup replicas trigger the view change protocol, elect a new primary replica, and execute the consensus protocol again, ensuring that consensus is eventually reached; after the distributed system completes a certain number of requests, the replicas' logs are cleaned and each replica updates its state.
2. The Byzantine fault-tolerant consensus protocol according to claim 1, wherein the consensus protocol comprises the following steps:
L1. The client c sends a request message to all replica nodes;
L2. After receiving a valid client request message, each replica node assigns the next available sequence number s to the request and sends a pre-prepare message to the primary replica node;
L3. On receiving a pre-prepare message, the primary replica node checks its validity and accepts it if the check passes. (1) If 2f+1 consistent pre-prepare messages are received, they form a cert message, and the primary sends the cert message to the backup replicas; (2) if the primary cannot collect 2f+1 consistent pre-prepare messages, it constructs a cert message and a prepare message from the valid pre-prepare messages it has received, and sends them to the backup replicas;
L4. Each backup replica checks the validity of the cert message:
L4-1. No conflict or light conflict: if the cert message contains 2f+1 consistent pre-prepare messages, the replica may execute the request and send a reply message to the client;
L4-2. Heavy conflict: if the cert message does not contain 2f+1 consistent pre-prepare messages, each replica constructs a commit message from the prepare message, broadcasts it to the other replicas, and proceeds to L5;
L5. If a replica receives 2f+1 matching commit messages from the other replicas, it executes the request and sends a reply message to the client;
L6. When the client receives 2f+1 consistent replies, indicating that 2f+1 replicas have committed and executed the request, the client considers the request completed.
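The fast-path versus commit-round decision of steps L3–L5 can be sketched as follows (not part of the claims; function and message names are illustrative assumptions, with pre-prepare messages modeled as (sequence number, request digest) pairs):

```python
from collections import Counter

def cert_outcome(f: int, pre_prepares: list) -> str:
    """Decide a replica's action from a cert message (illustrative sketch).

    pre_prepares: (sequence_number, request_digest) pairs collected by
    the primary. If 2f + 1 of them agree, the request can be executed
    directly (no/light conflict, stage L4); otherwise the replicas must
    run an extra commit round (heavy conflict, stage L5).
    """
    quorum = 2 * f + 1
    counts = Counter(pre_prepares)
    if counts and max(counts.values()) >= quorum:
        return "execute"        # no/light conflict: execute and reply
    return "commit-round"       # heavy conflict: broadcast commit messages

# f = 1, so a quorum is 3 matching pre-prepare messages.
print(cert_outcome(1, [(5, "d1")] * 3 + [(5, "d2")]))       # execute
print(cert_outcome(1, [(5, "d1")] * 2 + [(5, "d2")] * 2))   # commit-round
```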
3. The Byzantine fault-tolerant consensus protocol according to claim 1, wherein the view-change protocol is implemented by the following steps:
C1. A backup replica node p broadcasts a view_change message, telling the other replica nodes that the current primary is suspect and that a new primary should be elected through a view change; when another replica node q has received f+1 view_change messages from other replicas, it also broadcasts a view_change message and enters the view-change phase;
C2. After receiving 2f+1 valid view_change messages, the new primary replica node broadcasts a new_view message to the other replica nodes, where each new_view message contains 2f+1 view_change messages;
C3. After receiving the new_view message, each backup replica node determines the starting state of the new view from the view_change messages it contains; once a replica node has determined the new view state, it sends a view_confirm message to the other replicas; after the replica nodes have each received 2f+1 consistent view_confirm messages, they start processing messages in the new view, and the view change is complete.
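The two thresholds in steps C1–C2 can be sketched as a small decision helper (not part of the claims; the function name and action strings are illustrative only):

```python
def view_change_actions(f: int, received: int) -> list:
    """What a replica may do after seeing `received` view_change messages.

    Thresholds follow claim 3: f + 1 messages let a replica join the
    view change even without its own timeout (at least one message came
    from a correct node); 2f + 1 messages let the new primary broadcast
    new_view.
    """
    actions = []
    if received >= f + 1:
        actions.append("broadcast view_change")            # step C1
    if received >= 2 * f + 1:
        actions.append("new primary may send new_view")    # step C2
    return actions

print(view_change_actions(1, 2))  # ['broadcast view_change']
```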
4. The Byzantine fault-tolerant consensus protocol according to claim 1, wherein the checkpoint protocol comprises the following steps:
After executing a certain number of requests, each replica node triggers the checkpoint protocol, includes its own history information in a checkpoint message, and sends the message to all other replica nodes;
When a replica node has received 2f+1 checkpoint messages, the state they contain is consistent on at least f+1 correct nodes. The replica node then cleans its log according to the sequence number in the checkpoint messages, that is, it deletes the message logs up to that sequence number and updates its own state.
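The log cleaning of claim 4 can be sketched as follows (not part of the claims; the function name and log representation are illustrative assumptions, with checkpoint messages reduced to the sequence numbers they report):

```python
from collections import Counter

def stable_checkpoint_gc(f: int, log: dict, checkpoint_seqs: list) -> dict:
    """Garbage-collect the message log at a stable checkpoint (sketch).

    checkpoint_seqs: sequence numbers reported in checkpoint messages
    from distinct replicas. Once 2f + 1 replicas report the same
    sequence number s, entries with sequence number <= s are deleted,
    as in claim 4; otherwise the log is kept unchanged.
    """
    quorum = 2 * f + 1
    counts = Counter(checkpoint_seqs)
    stable = [s for s, c in counts.items() if c >= quorum]
    if not stable:
        return log
    s = max(stable)
    return {seq: entry for seq, entry in log.items() if seq > s}

log = {98: "req-a", 99: "req-b", 100: "req-c", 101: "req-d"}
print(stable_checkpoint_gc(1, log, [100, 100, 100, 99]))  # {101: 'req-d'}
```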
5. The consensus protocol according to claim 2, wherein each replica has two opportunities to commit a request:
the replica may complete the commit of the request at stage L4; if the commit is not completed at stage L4, the replica has a second chance, namely completing the commit at stage L5.
6. The consensus protocol according to claim 2, wherein whether the primary replica node is rotated after the commit of a request is decided as follows:
if the request commits successfully at stage L4, the current primary continues to act as primary; if the request does not commit successfully at stage L4, the next backup replica after the current primary becomes the primary in the next round of consensus.
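The rotation rule of claim 6 can be sketched in a few lines (not part of the claims; replica ids 0..n-1 and round-robin succession are illustrative assumptions):

```python
def next_primary(current_primary: int, n: int, fast_path_committed: bool) -> int:
    """Primary rotation rule sketched from claim 6.

    If the request committed on the fast path (stage L4), the primary
    is retained; otherwise the next replica in id order becomes the
    primary for the next round of consensus.
    """
    if fast_path_committed:
        return current_primary
    return (current_primary + 1) % n

print(next_primary(0, 4, True))   # 0: primary retained
print(next_primary(0, 4, False))  # 1: next replica becomes primary
```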
CN202010087336.1A 2020-02-11 2020-02-11 Byzantine fault-tolerant consensus protocol Pending CN111338857A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010087336.1A CN111338857A (en) 2020-02-11 2020-02-11 Byzantine fault-tolerant consensus protocol


Publications (1)

Publication Number Publication Date
CN111338857A true CN111338857A (en) 2020-06-26

Family

ID=71183340

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010087336.1A Pending CN111338857A (en) 2020-02-11 2020-02-11 Byzantine fault-tolerant consensus protocol

Country Status (1)

Country Link
CN (1) CN111338857A (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120254412A1 (en) * 2011-03-28 2012-10-04 Siemens Corporation Replicated state machine utilizing view change protocol resilient to performance attacks
CN110535680A (en) * 2019-07-12 2019-12-03 中山大学 A kind of Byzantine failure tolerance method
US20190377645A1 (en) * 2018-06-11 2019-12-12 Vmware, Inc. Linear View-Change BFT with Optimistic Responsiveness


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Chen Liu; Zhou Wei: "A Byzantine fault-tolerant scheme for service-oriented computing and its correctness proof" (in Chinese) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112068978A (en) * 2020-08-27 2020-12-11 恒宝股份有限公司 Method for prolonging timing period of VIEW-CHANGE secondary start timer
CN112860482A (en) * 2021-01-27 2021-05-28 西南林业大学 Block chain consensus performance optimization method
CN113542285A (en) * 2021-07-19 2021-10-22 东南大学 Multi-stage automatic formal verification method for Terdermint consensus protocol
CN113542285B (en) * 2021-07-19 2022-06-28 东南大学 Multi-stage automatic formal verification method for Terdermint consensus protocol

Similar Documents

Publication Publication Date Title
US11899684B2 (en) System and method for maintaining a master replica for reads and writes in a data store
US11894972B2 (en) System and method for data replication using a single master failover protocol
CN110535680B (en) Byzantine fault-tolerant method
US10929240B2 (en) System and method for adjusting membership of a data replication group
US10248704B2 (en) System and method for log conflict detection and resolution in a data store
US9411873B2 (en) System and method for splitting a replicated data partition
US9489434B1 (en) System and method for replication log branching avoidance using post-failover rejoin
CA2657882C (en) Fault tolerance and failover using active copy-cat
US7434096B2 (en) Match server for a financial exchange having fault tolerant operation
CN111338857A (en) Byzantine fault-tolerant consensus protocol
EP2820531A1 (en) Interval-controlled replication
US20230110826A1 (en) Log execution method and apparatus, computer device and storage medium
van Renesse et al. Replication techniques for availability
CN116846888A (en) Consensus processing method, device, equipment and storage medium of block chain network
Rong et al. Combft: Conflicting-order-match based byzantine fault tolerance protocol with high efficiency and robustness
Jensen et al. Unanimous 2PC: Fault-tolerant Distributed Transactions Can be Fast and Simple
Schatzberg et al. Total order broadcast for fault tolerant exascale systems
Ezhilchelvan Building Responsive and Reliable Distributed Services: Models and Design Options
Yuanbo et al. A Novel, Adaptive Replica Control Protocol for Distributed Intrusion-tolerant Environments

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200626