CN115277719B - Priority-based consensus method and distributed system

Info

Publication number: CN115277719B
Application number: CN202210552297.7A
Authority: CN
Inventors: 杜志强; 屈直; 傅妍芳; 黄牧鸿; 马益帆; 张嘉恒
Original assignee: Xian Technological University
Current assignee: Xian Technological University
Priority date: 2022-05-20
Filing date: 2022-05-20
Publication date: 2024-05-31
Anticipated expiration: 2042-05-20
Also published as: CN115277719A

Abstract

The invention discloses a priority-based consensus method and a distributed system, wherein the method comprises the following steps: each node updates its own node state information after the update condition corresponding to each attribute in the node state information is reached; under a preset trigger condition, each follower node calculates its node priority according to its updated node state information; each follower node updates its own node election timeout duration according to its node priority and a preset mapping relation between node priority and node election timeout duration range, so that a leader node election process is started when the updated node election timeout duration of a follower node in the distributed system expires; in the mapping relation, the higher the node priority is, the shorter the node election timeout durations in the corresponding node election timeout duration range are. The invention can improve the safety, stability, activity and operation efficiency of the system and reduce the communication overhead of the distributed system.

Description

Priority-based consensus method and distributed system

Technical Field

The invention belongs to the field of distributed systems, and particularly relates to a priority-based consensus method and a distributed system.

Background

In a distributed system, ensuring that the data on all nodes in a cluster is identical and that the nodes can agree on a proposal is a core problem for the system to work properly, and a consensus algorithm is a method for ensuring the consistency of the distributed system. Raft is currently a commonly used consensus algorithm. A Raft cluster contains a plurality of servers as nodes, each of which is in one of three states at any given time: leader, follower, or candidate, and a node may switch between these states. The Raft algorithm decomposes the consistency problem into two sub-problems, leader election and state replication, and specifically uses a heartbeat mechanism to trigger leader election.

An analysis of the election process of the Raft algorithm shows that, in the leader election step, Raft relies on randomly selected candidate nodes and voting, which makes the election result random. For a strong-leader algorithm, however, the performance and stability of the leader node directly determine the running speed and safety of the whole cluster.

Specifically, as the number of nodes in the cluster increases, the random timeout mechanism used by the Raft algorithm during leader election can cause several nodes to time out and become candidates at nearly the same time (a further node times out and becomes a candidate before a previous candidate has completed its voting process). The votes of the other nodes in the cluster are then split, no candidate receives affirmative votes from more than half of the nodes, and the election falls into a deadlock. The Raft algorithm breaks the deadlock by waiting for the next timeout, which preserves the activity (liveness) of the algorithm but incurs extra waiting time, namely one or more additional timeout periods. If the number of nodes in the cluster is too large, this situation occurs more frequently, and the extra waiting time may exceed the timeout duration itself, which affects the operation efficiency of the cluster.
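The dependence on cluster size can be illustrated with a rough model. The following sketch assumes that every node draws its timeout independently and uniformly from one fixed window, and that a split vote becomes possible whenever the two earliest timeouts fall within `vote_window_ms` of each other. The model, the function name and the parameter names are illustrative assumptions and are not part of the Raft algorithm or of the method described here.

```python
def close_timeout_probability(num_nodes: int, window_ms: float, vote_window_ms: float) -> float:
    """Probability that the two earliest timeouts differ by less than vote_window_ms.

    Assumption: num_nodes timeouts are drawn independently and uniformly from an
    interval of width window_ms. For n such draws, the gap between the two
    smallest values exceeds d with probability (1 - d / W) ** n.
    """
    if num_nodes < 2:
        return 0.0  # a single node cannot collide with anyone
    ratio = min(vote_window_ms / window_ms, 1.0)
    return 1.0 - (1.0 - ratio) ** num_nodes


if __name__ == "__main__":
    # Hypothetical figures: a 150 ms wide window (150-300 ms) and a 10 ms voting round.
    for n in (3, 5, 9, 21):
        print(n, round(close_timeout_probability(n, 150.0, 10.0), 3))
```

Under these assumptions the probability grows steadily with the number of nodes, which matches the qualitative observation above.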

Disclosure of Invention

The embodiment of the invention aims to provide a priority-based consensus method and a distributed system so as to achieve the purpose of improving the safety, stability, activity and operation efficiency of the distributed system during operation. The specific technical scheme is as follows:

in a first aspect, an embodiment of the present invention provides a priority-based consensus method, applied to a distributed system, where the method includes:

after each node reaches the update condition corresponding to each attribute in the node state information, updating the node state information of the node; the updated node state information contains index values of a plurality of attributes representing the performance and the safety of the node;

under a preset triggering condition, each follower node calculates the corresponding node priority according to the updated node state information;

each follower node updates its own node election timeout duration according to its node priority and a preset mapping relation between node priority and node election timeout duration range, so that a leader node election process is started when the updated node election timeout duration of a follower node in the distributed system expires; in the mapping relation between node priority and node election timeout duration range, the higher the node priority is, the shorter the node election timeout durations in the corresponding node election timeout duration range are.

In one embodiment of the present invention, the updated node status information includes:

the throughput of the node, the number of times the node becomes a leader node, the number of times the follower node receives a write service request, the heartbeat message delay variation of the follower node and the consensus delay of the node.

In one embodiment of the present invention, the process of determining the change of the heartbeat message delay of the follower node includes:

For any follower node, the follower node calculates the difference between the currently obtained heartbeat message delay between itself and the leader node and the previously obtained heartbeat message delay between itself and the leader node, to obtain the heartbeat message delay change of the follower node.

In one embodiment of the present invention, the update condition corresponding to each attribute in the node status information includes:

for the throughput of the node, the corresponding update condition is that the leader node completes a write operation;

for the number of times the node becomes a leader node, the corresponding update condition is that the node identity is converted into the leader node;

for the number of times the follower node receives a write service request, the corresponding update condition is that the follower node receives a write service request;

for the heartbeat message delay change of the follower node, the corresponding update condition is that the follower node receives a heartbeat message from the leader node;

for the consensus delay of the node, the corresponding update condition is that the leader node completes a write operation.

In one embodiment of the present invention, for the first leader election of the distributed system, the preset trigger condition is the start-up of the distributed system; for each leader election of the distributed system after the first, the preset trigger condition is that the follower node receives a request message sent by the current leader node.

In one embodiment of the present invention, each follower node calculates a corresponding node priority according to the updated node status information, including:

for each follower node, determining the score value of each attribute in the updated node state information of the follower node in the attribute score relation preset by the distributed system; wherein the attribute score relation contains score values corresponding to all attributes in the updated node state information under different numerical ranges;

obtaining the updated node state information total score of the follower node by summing the score values of the attributes;

determining the node priority of the follower node in a node priority dividing relation preset by the distributed system by utilizing the updated node state information total score of the follower node; the node priority dividing relation contains the node priority corresponding to the updated node state information total score under different numerical ranges.
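The three steps above can be sketched as follows. All attribute names, numerical ranges, score values and priority thresholds in this sketch are hypothetical placeholders; the actual attribute score relation and node priority dividing relation are preset by the distributed system. The sketch assumes that priority level 1 is the highest and that the heartbeat delay change is stored as a magnitude.

```python
from typing import Dict, List, Tuple

# Hypothetical tables: (exclusive upper bound, result) pairs in ascending order.
RangeTable = List[Tuple[float, int]]

# Attribute score relation (illustrative values only).
ATTRIBUTE_SCORES: Dict[str, RangeTable] = {
    "throughput": [(100.0, 1), (500.0, 2), (float("inf"), 3)],
    "leader_count": [(2.0, 3), (5.0, 2), (float("inf"), 1)],
    "write_requests_received": [(10.0, 1), (50.0, 2), (float("inf"), 3)],
    "heartbeat_delay_change_ms": [(5.0, 3), (20.0, 2), (float("inf"), 1)],
    "consensus_delay_ms": [(50.0, 3), (200.0, 2), (float("inf"), 1)],
}

# Node priority dividing relation: total score range -> priority level (1 = highest).
PRIORITY_PARTITION: RangeTable = [(7.0, 3), (11.0, 2), (float("inf"), 1)]


def lookup(table: RangeTable, value: float) -> int:
    """Return the result of the first range whose upper bound exceeds value."""
    for upper_bound, result in table:
        if value < upper_bound:
            return result
    raise ValueError(f"value {value} is not covered by the table")


def total_score(state: Dict[str, float]) -> int:
    """Score every attribute, then sum the score values."""
    return sum(lookup(table, state[name]) for name, table in ATTRIBUTE_SCORES.items())


def node_priority(state: Dict[str, float]) -> int:
    """Map the total score to a node priority level."""
    return lookup(PRIORITY_PARTITION, total_score(state))
```

With these placeholder tables, a follower whose attributes all fall in the middle ranges receives a total score of 10 and therefore priority level 2.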

In one embodiment of the present invention, each follower node updating its own node election timeout duration according to its node priority and the preset mapping relation between node priority and node election timeout duration range includes:

each follower node determines its updated node election timeout duration range according to its node priority and the preset mapping relation between node priority and node election timeout duration range;

each follower node randomly selects a value within its updated node election timeout duration range as its updated node election timeout duration.
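A minimal sketch of these two steps is given below. The priority levels and the millisecond ranges in `TIMEOUT_RANGES_MS` are hypothetical values chosen for illustration, with level 1 assumed to be the highest priority; the text only requires that a higher priority map to a range of shorter timeout durations.

```python
import random
from typing import Dict, Optional, Tuple

# Hypothetical mapping relation: node priority -> election timeout range in ms.
# Level 1 is assumed to be the highest priority and gets the shortest range.
TIMEOUT_RANGES_MS: Dict[int, Tuple[float, float]] = {
    1: (150.0, 200.0),
    2: (200.0, 250.0),
    3: (250.0, 300.0),
}


def election_timeout_ms(priority: int, rng: Optional[random.Random] = None) -> float:
    """Pick a random election timeout inside the range mapped to the priority."""
    low, high = TIMEOUT_RANGES_MS[priority]       # step 1: determine the range
    generator = rng if rng is not None else random.Random()
    return generator.uniform(low, high)           # step 2: random value in the range


if __name__ == "__main__":
    for level in sorted(TIMEOUT_RANGES_MS):
        print(level, round(election_timeout_ms(level), 1))
```

Keeping a random choice inside each range means that followers sharing the same priority still tend to time out at different moments.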

In one embodiment of the present invention, if there is a leader node in the distributed system, when the updated node election timeout duration of a follower node in the distributed system expires, the method further includes:

The leader node is converted into a follower node, and the converted follower node calculates its node priority according to its previously determined updated node state information and updates its own node election timeout duration, so as to participate in the leader node election process.

In a second aspect, an embodiment of the present invention provides a distributed system, at least including a plurality of follower nodes, wherein:

Each node is used for updating the node state information of the node after the update conditions corresponding to the attributes in the node state information are reached; the updated node state information contains index values of a plurality of attributes representing the performance and the safety of the node;

Each follower node is used for calculating its node priority according to its updated node state information under the preset trigger condition, and for updating its own node election timeout duration according to its node priority and a preset mapping relation between node priority and node election timeout duration range, so that a leader node election process is started when the updated node election timeout duration of a follower node in the distributed system expires; in the mapping relation between node priority and node election timeout duration range, the higher the node priority is, the shorter the node election timeout durations in the corresponding node election timeout duration range are.

The invention has the beneficial effects that:

The priority-based consensus method provided by the embodiment of the invention provides a new priority election timeout mechanism, in which each node can update its own node state information in real time; under the preset trigger condition, each follower node calculates its node priority according to its updated node state information, and then updates its own node election timeout duration according to its node priority and the preset mapping relation between node priority and node election timeout duration range. The embodiment of the invention adds and calculates more node state information for the nodes, assigns different priorities to the follower nodes according to the node state information, and sets the mapping relation so that a higher node priority corresponds to shorter node election timeout durations. A follower node with a higher priority can therefore reach the timeout state sooner during an election and start the leader node election process, while the vote-splitting problem caused by random timeout durations is effectively avoided. As a result, system performance such as safety, stability, activity and operation efficiency can be improved, and the communication overhead of the system can be reduced compared with the prior art.

Drawings

FIG. 1 is a schematic flow chart of a priority-based consensus method according to an embodiment of the present invention;

FIG. 2 is a schematic flow chart of another method for priority-based consensus according to an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of a distributed system according to an embodiment of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

The Raft algorithm employs a simple mechanism of randomized election timeouts: each timeout is randomly selected from a fixed interval (e.g., 150-300 ms), so that only one node reaches the timeout state first, in order to ensure that there is only one leader node in the cluster. The node that reaches the timeout state first converts its identity to candidate, increments its term, and sends vote requests to the nodes in the cluster; after receiving affirmative votes from more than half of the nodes, it converts its identity to leader node and sends heartbeat messages to the other servers. When the number of nodes is small, this mechanism solves the vote-splitting problem simply and quickly.

However, random means unpredictable, so the node produced by an election may be any node in the cluster. Within a cluster, the performance and stability of the nodes differ to some extent. To make the leader node the optimal node in the cluster, the election mechanism can be modified so that a node meeting the requirements is elected as the leader node.

In the prior art, Zhang Sheng, in a study on data consistency of distributed systems based on the Raft algorithm, proposes a leader election scheme based on node priority. According to the priority order among the nodes in the cluster, the node that has the highest priority in the cluster and satisfies the leader election safety rules is preferentially elected as the leader node, so as to make the election result more reasonable.

Specifically, each node needs to maintain a new data structure called a qualification table, which records the qualification information of other nodes and whose contents are ordered according to a given priority table. In practical applications, the priority table is the priority order of the nodes provided by the cluster maintainer; it indicates the priority of the different nodes to serve as leader, and by convention a node with a smaller number has a higher priority. In the voting stage of the priority voting process, the qualification table is the reference basis for node voting: under the priority voting scheme, a node casts an affirmative vote only for the single node with the highest priority in its local qualification table and votes against all other nodes.

The leader election scheme based on node priority is divided into two stages, pre-voting and voting, and the pre-voting stage is divided into two sub-stages, qualification confirmation and qualification writing. In the qualification confirmation stage, a node confirms whether it is qualified to be elected as the leader node in the current round of leader election; the passing condition is that more than half of the nodes in the cluster agree to its becoming the leader node. If the node is qualified as a leader node, then after entering the qualification writing stage it needs to actively initiate qualification write requests to other nodes and respond to the qualification write requests from other nodes; otherwise, the node refrains from initiating qualification write requests in the qualification writing stage and only responds to the qualification write requests of other nodes.

A node that passes qualification confirmation needs, in the qualification writing stage, to write its own qualification information into its self-maintained qualification table and at the same time broadcast the qualification information to the cluster. A node receiving the information unconditionally writes it into its own qualification table before casting an affirmative vote in the formal voting stage, and responds with the write result. The qualification write requester then judges from the responses whether the qualification write succeeded, the criterion being whether the number of copies of the qualification information in the cluster exceeds half.

Nodes that pass the qualification write enter the voting stage after the pre-voting stage ends and prepare to initiate a vote, during which they also respond to the voting requests of other nodes. If a node has been determined to be unqualified in the qualification confirmation stage (fewer than half of the nodes agree to its becoming the leader node), it refrains from writing and broadcasting qualification information in the qualification writing stage and only responds to the qualification write requests of other nodes. If a node fails in the qualification writing stage, it does not initiate a voting request after entering the voting stage and only responds to the voting requests of other nodes.

The voting stage is basically consistent with the voting process of the original leader election scheme, except that a voting node refers to its self-maintained qualification table during voting to decide whether to vote for the requesting node; the content of the qualification table is the qualification information received from other nodes in the qualification writing stage.

However, the above solution does not give a specific and detailed priority policy, and it introduces two additional RPC requests, one for qualification confirmation and one for qualification writing, thereby increasing the communication overhead of the system.

In order to improve the safety, stability, activity and operation efficiency of a distributed system during operation and reduce the communication overhead of the distributed system, the embodiment of the invention provides a priority-based consensus method and the distributed system.

In a first aspect, an embodiment of the present invention provides a priority-based consensus method, applied to a distributed system, including the following steps:

S1, after each node reaches the update condition corresponding to each attribute in the node state information, updating the node state information of the node.

The distributed system according to the embodiment of the present invention at least includes a plurality of follower nodes, and of course, after the first election, there is a leader node in the distributed system. In the embodiment of the invention, all nodes existing in the distributed system can carry out the processing of the step S1.

In the embodiment of the invention, the updated node state information contains index values of a plurality of attributes representing the performance and the safety of the node, and the index values are used for measuring the computing capacity, the computing efficiency, the safety and the like of the node.

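For example, in an alternative embodiment, the updated node status information includes:

the throughput of the node, the number of times the node becomes a leader node, the number of times the follower node receives a write service request, the heartbeat message delay change of the follower node and the consensus delay of the node.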

Compared with the prior art, the embodiment of the invention adds various attributes in the original state information of the node server so as to calculate the node priority corresponding to the follower node according to the updated node state information.

The throughput of a node represents the amount of data successfully transferred per unit time. The Raft algorithm does not consider the load imbalance problem of the system during normal operation, so the embodiment of the invention calculates the throughput of the node in real time to address this problem.

The number of times a node becomes a leader node indicates the total number of times the node has served as the leader node up to the current time. This attribute addresses the centralization problem of the system: leaving node performance differences aside, each node should be guaranteed the opportunity to become the leader node.

The number of times the follower node receives a write service request is mainly used to count the number of times the follower node needs to forward a received write service request. In the Raft algorithm, log entries are only sent from the leader node to the other servers. When a write service request of a client is sent to a follower node, the follower node forwards it to the leader node. When a large number of write service requests are sent directly to follower nodes, the communication overhead of forwarding these requests to the leader node cannot be ignored, so the embodiment of the invention adds the number of times the follower node receives a write service request to the updated node state information.

In an actual production environment, the nodes of a cluster are distributed in diverse ways, and the delay between each node and the leader node differs. A leader node that is relatively far from the center of the cluster has a higher delay to most nodes in the cluster than a node closer to the center has to the other nodes. To ensure that the performance of the cluster is not limited by the performance of a single node, the delay between the leader node and the other nodes in the cluster should be taken into account. Specifically, the embodiment of the invention considers the heartbeat message delay change of the follower node and the consensus delay of the node.

The process for determining the heartbeat message delay change of the follower node comprises the following steps:

The heartbeat message delay between the follower node and the leader node is calculated by measuring the time consumed in transmitting the heartbeat message. A sending timestamp can be added to the request message sent by the leader node; when the follower node receives the request message, it can calculate the heartbeat message delay between itself and the leader node from the sending timestamp and the time at which the request message was received.

The heartbeat message delay change of the follower node is selected as the attribute finally added to the updated node state information because the delay between nodes in the cluster is often affected by various factors and is not constant, and the stability of the delay between nodes directly affects the performance of the system. Therefore, the change in network delay between nodes is calculated from the heartbeat message delay between the follower node and the leader node, so as to reflect the stability of the connection between the nodes.
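The calculation can be sketched as follows. The class and field names are hypothetical, and the sketch assumes that the leader and follower clocks are comparable so that the sending timestamp can be subtracted from the local receive time; it also assumes that the change stays at its initial value of 0 until a second heartbeat has been received.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HeartbeatDelayTracker:
    """Tracks the change in heartbeat delay between a follower and the leader."""

    last_delay_ms: Optional[float] = None  # delay of the previous heartbeat, if any
    delay_change_ms: float = 0.0           # current delay minus previous delay

    def on_heartbeat(self, sent_at_ms: float, received_at_ms: float) -> float:
        """Update the delay change from one heartbeat and return it."""
        delay_ms = received_at_ms - sent_at_ms
        if self.last_delay_ms is not None:
            self.delay_change_ms = delay_ms - self.last_delay_ms
        self.last_delay_ms = delay_ms
        return self.delay_change_ms


if __name__ == "__main__":
    tracker = HeartbeatDelayTracker()
    tracker.on_heartbeat(sent_at_ms=1000.0, received_at_ms=1012.0)          # first delay: 12 ms
    print(tracker.on_heartbeat(sent_at_ms=2000.0, received_at_ms=2020.0))   # delay rose by 8 ms
```

A positive result means the delay grew since the previous heartbeat and a negative result means it shrank; how the sign is scored is left to the attribute score relation.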

The consensus delay of the node is a technical index that directly reflects the performance of the storage system. It represents the total time consumed by the distributed system in processing one write service request, including the time consumed by the leader node in replicating the log to all the nodes in the cluster.

In the embodiment of the invention, the different attributes of the node state information are not all updated at the same time; each attribute has its own update time.

In an optional implementation manner, the updating condition corresponding to each attribute in the node state information includes:

(1) For the throughput of the node, the corresponding update condition is that the leader node completes the write operation.

Specifically, the leader node updates its own "throughput of the node" after each write operation is completed, so as to update the node state information. A follower node does not perform write operations, so it does not update this attribute and keeps the original value. For each node, the throughput of the node continuously retains its updated value and is never reset to zero.

(2) For the number of times that a node becomes a leader node, the corresponding update condition is that the node identity is converted into the leader node.

Specifically, a follower node updates its own "number of times the node becomes a leader node" only after it is converted into the leader node, i.e., it adds one to the value; otherwise it keeps the original value. Similarly, for each node, the number of times the node becomes a leader node continuously retains its updated value and is never reset to zero.

(3) For the number of times the follower node receives a write service request, the corresponding update condition is that the follower node receives a write service request.

Specifically, after a follower node receives a write service request, it updates its own "number of times the follower node receives a write service request", i.e., it adds one to the value. The value of this attribute is cleared when the follower node is converted into the leader node. The leader node does not update this attribute and keeps the value 0.

(4) For the heartbeat message delay change of the follower node, the corresponding update condition is that the follower node receives a heartbeat message from the leader node.

Specifically, after the follower node receives a heartbeat message from the leader node, it updates its own "heartbeat message delay change of the follower node" and continuously retains the updated value. The leader node does not update this attribute and keeps the original value: if it has never acted as a follower node, the value is still 0; if it previously acted as a follower node, it keeps the value last updated while it was a follower node. The leader node updates this attribute again only after it has been converted into a follower node and has received a heartbeat message from the new leader node.

(5) For the consensus delay of the node, the corresponding update condition is that the leader node completes a write operation.

Specifically, the leader node updates its own "consensus delay" each time a write operation is completed. A follower node does not perform write operations, so it does not update this attribute and keeps the original value. For each node, the consensus delay of the node continuously retains its updated value and is never reset to zero.

It can be understood that, for each node in the embodiment of the present invention, according to the processing manner of step S1, different attribute values in the node state information are actually updated under different time periods, so as to achieve the purpose of updating the own node state information in real time.
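The five update rules can be summarized in a small event-driven sketch. The class, field and method names are hypothetical, and the measured throughput, consensus delay and heartbeat delay change are assumed to be supplied by the caller, since how they are measured is outside this sketch.

```python
from dataclasses import dataclass


@dataclass
class NodeState:
    """Node state information; every attribute starts at the initial value 0."""

    throughput: float = 0.0
    leader_count: int = 0
    write_requests_received: int = 0
    heartbeat_delay_change_ms: float = 0.0
    consensus_delay_ms: float = 0.0
    is_leader: bool = False

    def on_become_leader(self) -> None:
        self.is_leader = True
        self.leader_count += 1              # rule (2): plus one on conversion to leader
        self.write_requests_received = 0    # rule (3): cleared on conversion to leader

    def on_step_down(self) -> None:
        self.is_leader = False              # accumulated values are kept

    def on_write_completed(self, throughput: float, consensus_delay_ms: float) -> None:
        if self.is_leader:                  # rules (1) and (5): leader only
            self.throughput = throughput
            self.consensus_delay_ms = consensus_delay_ms

    def on_write_request_received(self) -> None:
        if not self.is_leader:              # rule (3): follower only
            self.write_requests_received += 1

    def on_heartbeat_received(self, delay_change_ms: float) -> None:
        if not self.is_leader:              # rule (4): follower only
            self.heartbeat_delay_change_ms = delay_change_ms
```

Each handler touches only the attributes whose update condition it represents, so the state is refreshed piecemeal as events arrive rather than in one periodic pass.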

S2, under the preset triggering condition, each follower node calculates the corresponding node priority according to the updated node state information.

In an alternative embodiment, the preset trigger condition is a start-up of the distributed system for a first leader election of the distributed system.

Specifically, after the distributed system is started, all nodes are follower nodes and the updated node state information of each node defaults to the initial value 0, so the node priority calculation strategy of the embodiment of the invention yields the same node priority for all nodes. The node priority calculation strategy of the embodiment of the invention is described in detail later.

In an alternative embodiment, for each leader election of the distributed system after the first, the preset trigger condition is that the follower node receives a request message sent by the current leader node.

In such an embodiment, the distributed system includes one leader node and a plurality of follower nodes.

As known to those skilled in the art, after completing the first round of election and entering the normal running state, the Raft algorithm carries out the interaction between nodes through the AppendEntries protocol. The leader node periodically sends an AppendEntries RPC request message (referred to simply as a request message in the embodiment of the invention) to the follower nodes. On the one hand, the request message serves as heartbeat information that informs the follower nodes of the leader node's state and consolidates the leader node's identity, and the AppendEntries RPC response messages fed back by the follower nodes allow the leader node to keep track of the working condition of the follower nodes; on the other hand, the request message carries the log data to be replicated to the follower nodes in the cluster.

In this embodiment, after each follower node receives a request message sent by a current leader node, according to node state information updated by itself, the corresponding node priority is calculated by using the node priority calculation policy in the embodiment of the present invention.

In order to select, in the election stage, a node with better performance, faster communication and higher stability, the embodiment of the invention provides a new priority election timeout mechanism. Under the preset trigger condition, the node priority calculation strategy of the embodiment of the invention evaluates the updated node state information of a follower node to calculate its node priority, and the node election timeout duration is calculated from the node priority; both are updated synchronously whenever the node state information changes.

Specifically, in the embodiment of the invention, a plurality of node priority levels are defined in advance and can be distinguished by some kind of character, such as Arabic numerals, Chinese characters or other symbols. Taking Arabic numerals as an example, the node priority levels can be represented by 1, 2, 3 and so on, and the smaller the number corresponding to a node priority level, the higher the priority. Under the preset trigger condition, each follower node calculates its own node priority in real time according to its updated node state information, taking into account the influence of each attribute on the node priority.

In an optional implementation manner, each follower node calculates a corresponding node priority according to the node state information updated by itself, and the method includes:

S21, for each follower node, determining the score value of each attribute in the updated node state information of the follower node in the attribute score relation preset by the distributed system.

The embodiment of the invention uses a large amount of sample data in advance and constructs, through experiments, an attribute score relation suitable for the distributed system. The attribute score relation contains the score values corresponding to each attribute in the updated node state information under different numerical ranges, and may take the form of a table or the like. Taking one attribute in the updated node state information, the consensus delay of the node, as an example, the embodiment of the invention can use experimental data to determine in advance a sufficiently wide value interval for the consensus delay of the node and divide it into a plurality of sub-intervals, where, for any two adjacent sub-intervals, the upper limit of the preceding sub-interval is contiguous with the lower limit of the following sub-interval; a corresponding score value is set for each sub-interval, and the score values of different sub-intervals differ. The remaining attributes are handled similarly to the consensus delay of the node and are not described in detail here.
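A table of contiguous sub-intervals of this kind can be sketched with a sorted list of boundaries and a binary search. The boundaries and score values below are hypothetical; the real values would come from the experiments mentioned above. The sketch assumes half-open sub-intervals, so a value equal to a boundary belongs to the following sub-interval.

```python
from bisect import bisect_right
from typing import Callable, Sequence


def make_scorer(boundaries: Sequence[float], scores: Sequence[int]) -> Callable[[float], int]:
    """Build a lookup over contiguous sub-intervals.

    boundaries holds the shared limits between adjacent sub-intervals in
    ascending order; scores holds one score value per sub-interval, so it must
    be exactly one element longer than boundaries.
    """
    if len(scores) != len(boundaries) + 1:
        raise ValueError("need exactly one score per sub-interval")
    if list(boundaries) != sorted(boundaries):
        raise ValueError("boundaries must be in ascending order")
    limits = list(boundaries)

    def score(value: float) -> int:
        # bisect_right returns the index of the sub-interval containing value.
        return scores[bisect_right(limits, value)]

    return score


# Hypothetical score relation for the consensus delay of the node, in ms:
# below 50 -> 4, 50 up to 200 -> 3, 200 up to 1000 -> 2, 1000 and above -> 1.
consensus_delay_score = make_scorer([50.0, 200.0, 1000.0], [4, 3, 2, 1])
```

Because adjacent sub-intervals share their boundary, every value maps to exactly one score and no gaps or overlaps can occur.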

Specifically, for the throughput of a node, the higher the value, the more capable the node is of transmitting data, and the more suitable it is as a leader node. As for the number of times the follower node receives a write service request, it will be understood by those skilled in the art that a follower node cannot process a write service request and needs to forward it to the leader node, whereas the leader node can process the write service request directly. Forwarding the write service request consumes time. The embodiment of the invention counts the number of times the follower node receives a write service request as a measure of the number of times it forwards such requests; a node that forwards more often is more suitable to become the leader node, because the forwarding time can then be saved.

Therefore, these two attributes, namely the throughput of the node and the number of times the follower node receives a write service request, are positive correlation factors in the node priority calculation, and the embodiment of the invention sets a higher score value for a larger attribute value.

As for the number of times a node becomes a leader node, it will be understood by those skilled in the art that as the number of times a node acts as the leader node increases, the degree of centralization of the system increases, which affects the security of the system, and thus the node is less suitable to act as the leader node again. For the heartbeat message delay change of the follower node and the consensus delay of the node, it can be understood that the larger either value is for a node, the poorer its stability and processing efficiency, and the less suitable it is to act as the leader node.

Therefore, these three attributes, namely the number of times the node becomes the leader node, the heartbeat message delay change of the follower node and the consensus delay of the node, are negative correlation factors in the node priority calculation, and the embodiment of the invention sets a lower score value for a larger attribute value.

The different numerical ranges and corresponding score values for each attribute in the attribute score relationship are described in specific examples hereinafter.
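As a minimal sketch of step S21, the attribute score relation can be held as sorted cell boundaries with one score per cell and queried by binary search. All boundary values and scores below are hypothetical placeholders chosen for illustration only; they are not the values of Tables 2 and 3, and the function name attribute_score is likewise an assumption.

```python
from bisect import bisect_right

# Hypothetical attribute score relation (NOT the patent's Tables 2 and 3).
# Each entry holds sorted cell boundaries and one score per cell, so
# len(scores) == len(bounds) + 1. Adjacent cells share a boundary.
ATTRIBUTE_SCORES = {
    # Positive correlation factors: a larger value gives a higher score.
    "tps":          ([100, 500, 1000], [1, 2, 3, 4]),
    "requestCount": ([10, 50, 100],    [1, 2, 3, 4]),
    # Negative correlation factors: a larger value gives a lower score.
    "leaderCount":  ([1, 3, 5],        [4, 3, 2, 1]),
    "jitter":       ([5, 20, 50],      [4, 3, 2, 1]),
    "dealTime":     ([10, 50, 100],    [4, 3, 2, 1]),
}


def attribute_score(name, value):
    """Return the score of the cell that contains `value` for attribute `name`."""
    bounds, scores = ATTRIBUTE_SCORES[name]
    # bisect_right counts the boundaries that are <= value, which is
    # exactly the index of the cell the value falls into.
    return scores[bisect_right(bounds, value)]
```

In this sketch a value that equals a boundary is assigned to the later cell; the opposite convention would work equally well as long as it is applied consistently.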

S22, obtaining the updated node state information total score of the follower node based on summation of the score values of the various attributes.

In an alternative implementation manner, the score values of all the attributes can be directly added to obtain the total score of the updated node state information of the follower node, so that the purpose of simple, convenient and quick calculation is realized.

In an alternative embodiment, a large amount of experimental data and actual experience can be utilized in advance to determine the influence degree of each attribute in the updated node state information on the node priority, corresponding weight coefficients are configured for the score values of each attribute, and the weight summation of the score values of each attribute is realized by utilizing the corresponding weight coefficients, so that the node priority determination result which meets the actual requirements and has higher accuracy is obtained later.
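The two summation variants of step S22 can be sketched as follows. The function name total_score, the example scores and the weight value are assumptions for illustration; the embodiment does not specify concrete weight coefficients, and treating a missing weight as 1.0 is a choice made only for this sketch.

```python
def total_score(scores, weights=None):
    """Combine per-attribute scores into a total score.

    scores:  dict mapping attribute name -> score value
    weights: optional dict mapping attribute name -> weight coefficient;
             attributes without an entry default to 1.0 (an assumption
             of this sketch, not something the embodiment specifies).
    """
    if weights is None:
        # Direct summation: simple and fast.
        return sum(scores.values())
    # Weighted summation: each score is scaled by its weight coefficient.
    return sum(weights.get(name, 1.0) * s for name, s in scores.items())


# Hypothetical scores for one follower node.
example = {"tps": 3, "requestCount": 2, "leaderCount": 4, "jitter": 1, "dealTime": 2}
plain = total_score(example)                    # 3 + 2 + 4 + 1 + 2
weighted = total_score(example, {"tps": 2.0})   # throughput counted twice
```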

S23, determining the node priority of the follower node in a preset node priority dividing relation of the distributed system by utilizing the updated node state information total score of the follower node.

In the embodiment of the invention, a node priority division relation suitable for the distributed system is constructed in advance through experiments on a large amount of sample data. The node priority division relation contains the node priority corresponding to the updated node state information total score for different numerical ranges. It defines a plurality of node priorities, each of which corresponds to one numerical range (cell) of the updated node state information total score. The cells of two adjacent node priorities are contiguous, that is, the upper numerical limit of the former cell coincides with the lower numerical limit of the latter cell. The larger the total score values covered by a cell, the higher the corresponding node priority. Likewise, the node priority division relation may take the form of a table or the like.
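Step S23 can be sketched in the same way as the attribute lookup: the total score is located in a list of contiguous cells, and each cell carries a priority level. The thresholds 10 and 15 below are hypothetical and merely stand in for the experimentally determined division relation; priority 1 is the highest level, consistent with the numbering convention described above.

```python
from bisect import bisect_right

# Hypothetical node priority division relation:
#   total < 10        -> priority 3 (lowest)
#   10 <= total < 15  -> priority 2
#   total >= 15       -> priority 1 (highest)
SCORE_BOUNDS = [10, 15]
PRIORITIES = [3, 2, 1]


def node_priority(total):
    """Map a total score to a priority level; a larger score gives a higher priority."""
    return PRIORITIES[bisect_right(SCORE_BOUNDS, total)]
```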

It may be understood that in the embodiment of the present invention, the node priority calculation policy includes both positive correlation factors and negative correlation factors. For example, the node priority increases as the throughput of the node and the number of times the follower node receives a write service request in the updated node state information increase, and decreases as the number of times the node becomes the leader node, the heartbeat message delay change of the follower node and the consensus delay of the node increase. This ensures that the node priorities of all follower nodes change dynamically, giving stronger adaptability and flexibility.

And S3, updating the node election timeout duration of each follower node according to the mapping relation between the corresponding node priority and the preset node priority and the node election timeout duration range, so that when the updated node election timeout duration of the follower node in the distributed system arrives, the leader node election process is started.

In the embodiment of the invention, a mapping relation between the node priority and the node election timeout duration range suitable for the distributed system is established in advance through experiments on a large amount of sample data. In this mapping relation, the higher the node priority, the shorter the node election timeout durations in the corresponding node election timeout duration range.

Specifically, the node priorities in the mapping relation between the node priority and the node election timeout duration range correspond to the node priorities in the node priority division relation. In the mapping relation, each node priority corresponds to one node election timeout duration range. A preset node election timeout duration range, such as 150-300ms, can be divided into a plurality of timeout duration cells according to the number of node priorities. Similarly, adjacent timeout duration cells are contiguous, that is, the upper numerical limit of the former timeout duration cell coincides with the lower numerical limit of the latter timeout duration cell. The higher the node priority, the smaller the values in the corresponding timeout duration cell.
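The division of a preset range into contiguous timeout duration cells can be sketched as below. Equal-width cells are an assumption of this sketch (the description only requires contiguous cells ordered by priority), and the function name timeout_ranges is hypothetical.

```python
def timeout_ranges(low_ms, high_ms, num_priorities):
    """Split [low_ms, high_ms] into equal-width contiguous cells.

    Priority 1 (highest) receives the cell with the smallest values.
    Returns a dict mapping priority -> (cell_low_ms, cell_high_ms).
    """
    width = (high_ms - low_ms) / num_priorities
    return {
        p: (low_ms + (p - 1) * width, low_ms + p * width)
        for p in range(1, num_priorities + 1)
    }


# For the 150-300ms example with three priorities this yields
# 150-200ms, 200-250ms and 250-300ms.
ranges = timeout_ranges(150, 300, 3)
```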

Each follower node can determine a corresponding node election timeout duration range from the mapping relation between the node priority and the node election timeout duration range by using the node priority determined in the step S2, and then update the node election timeout duration of the follower node. It will be appreciated that under such a setting, a higher priority follower node can have a greater chance of becoming a candidate node and thus be expected to become a leader node because it has a shorter node election timeout period.

In an optional implementation manner, each follower node updates its own node election timeout period according to a mapping relationship between a corresponding node priority and a preset node priority and a node election timeout period range, including:

and each follower node determines the self-updated node election timeout duration range according to the mapping relation between the corresponding node priority and the preset node priority and the node election timeout duration range.

And each follower node determines a numerical value as the self-updated node election timeout duration according to a preset corresponding relation in the self-updated node election timeout duration range.

In this embodiment, after a follower node determines its updated node election timeout duration range, it selects a node election timeout duration within that range according to a predetermined correspondence. The predetermined correspondence may be related to the number of times the follower node has updated its node election timeout duration. For example, when the current update is the n-th update of the follower node, the number m of values in the updated node election timeout duration range can be determined, n is mapped to a value h in [1, m] by a preset normalization method, and the h-th value in the updated node election timeout duration range is then selected as the updated node election timeout duration. For example, when n=1000, the remainder of n/m can be calculated to obtain h, from which the updated node election timeout duration is obtained. Of course, the predetermined correspondence may also be related to the current time; for example, the current time is transformed into a numerical index, and the value corresponding to that index is selected in the updated node election timeout duration range of the follower node, which is also reasonable.
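One possible reading of the update-count correspondence is sketched below. Two points are assumptions of this sketch rather than statements of the embodiment: the range is enumerated at 1 ms granularity, so that m = high - low + 1, and the remainder is shifted so that h falls in [1, m] instead of [0, m-1].

```python
def pick_timeout(low_ms, high_ms, n):
    """Deterministically pick a timeout from [low_ms, high_ms] for the n-th update.

    Assumes integer millisecond values (1 ms granularity) and n >= 1.
    """
    m = high_ms - low_ms + 1   # number of candidate values in the range
    h = (n - 1) % m + 1        # 1-based index in [1, m] derived from n
    return low_ms + h - 1      # the h-th value of the range


# With the 150-200ms cell, m is 51. For n = 1000, (999 % 51) + 1 = 31,
# so the 31st value of the range, 180 ms, is selected.
timeout = pick_timeout(150, 200, 1000)
```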

In an alternative embodiment, each follower node updates its own node election timeout according to the mapping relationship between the corresponding node priority and the preset node priority and the node election timeout range, including:

And each follower node randomly determines a numerical value in the self-updated node election timeout duration range as the self-updated node election timeout duration.

In this embodiment, even if two follower nodes determine the same node election timeout duration range, each of them randomly selects a value from that range as its updated node election timeout duration, so the probability that both obtain the same shortest node election timeout duration is low. This helps ensure that a single follower node in the distributed system reaches its updated node election timeout duration first, and does so quickly.
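The random variant can be sketched with the standard library as follows, using the 150-200ms, 200-250ms and 250-300ms cells of the three-priority example. The names RANGES_MS and random_timeout are hypothetical, and drawing from a continuous uniform distribution is an assumption of this sketch.

```python
import random

# Timeout duration cells of the three-priority example, in milliseconds.
RANGES_MS = {1: (150, 200), 2: (200, 250), 3: (250, 300)}


def random_timeout(priority, rng=random):
    """Draw a random election timeout (ms) from the cell of the given priority."""
    low, high = RANGES_MS[priority]
    return rng.uniform(low, high)


# Two nodes of the same priority draw from the same cell, but the
# chance that they draw exactly the same value is very small.
rng = random.Random(42)
a, b = random_timeout(1, rng), random_timeout(1, rng)
```

Because the cells do not overlap except at their shared boundary, a node of a higher priority never draws a longer timeout than a node of a lower priority in this sketch.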

Meanwhile, in this embodiment, only the updated node election timeout durations of the subset of follower nodes with the highest priority are relatively close to one another, which is equivalent to indirectly reducing the number of nodes participating in the election while selecting nodes with better performance and stability.

It should be noted that, if there is a leader node in the distributed system, when there is a follower node in the distributed system that arrives at the node election timeout duration after updating, the method further includes:

That is, the leader node also updates its own node state information in real time, but it executes steps S2 to S3 of the embodiment of the present invention in the manner of a follower node only after it has stepped down to a follower node. This will not be described in detail.

The following describes a complete process of the priority-based consensus method according to an embodiment of the present invention with reference to FIG. 2. It should be noted that FIG. 2 illustrates a leader election of the distributed system after the first one. For the first leader election of the distributed system, please refer to FIG. 2 together with the related content below.

In the embodiment of the invention, the node priorities are denoted 1, 2 and 3 from high to low. Nodes with different priorities are allocated random election timeout duration ranges of different levels. The node election timeout duration range of 150-300ms can be divided into 3 subintervals corresponding to the three node priorities, specifically: nodes with node priority 1 use an election timeout duration range of 150-200ms; nodes with node priority 2 use an election timeout duration range of 200-250ms; nodes with node priority 3 use an election timeout duration range of 250-300ms. Of course, in the embodiment of the invention, a different number of node priorities can be defined as needed, and it is also reasonable to select a value range other than the node election timeout duration range of 150-300ms as required.

Specifically, after the distributed system is started, the node priority calculation policy of the embodiment of the invention determines that the node priority of every node is the initial node priority 2. Each node then randomly selects a value from the node election timeout duration range corresponding to node priority 2 in the preset mapping relation between the node priority and the node election timeout duration range, and uses it as its updated node election timeout duration; that is, a random election timeout duration corresponding to the lowest priority is allocated to each node. Then, according to the existing Raft algorithm, the follower node that first reaches its random election timeout duration is converted to a candidate node and initiates an election to select the leader node.

When the first leader node is elected, the count of the number of times that node has become the leader node is increased by one. After the leader node receives and processes a write service processing request, the throughput, the consensus delay, the number of times of becoming the leader node and the propagation delay stability (namely, the heartbeat message delay change of the follower nodes) are updated accordingly. During this period, each node updates its own node state information in real time, and each follower node waits to receive a request message sent by the leader node. After a follower node receives the request message sent by the leader node, it calculates its node priority according to the updated node state information, then updates its node election timeout duration according to the node priority, and finally resets its election timer. When the updated node election timeout duration of a follower node arrives first, that is, an election timeout occurs, the current leader node is down; the follower node whose election timer has expired becomes a candidate node and broadcasts a voting request in the cluster to trigger a new round of election. When the number of agreeing votes received by the candidate node exceeds half of the total number of nodes in the distributed system, the candidate node becomes the new leader node.
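The follower-side sequence described above (receive a request message from the leader, recompute the priority, update the timeout, reset the election timer) can be sketched as a small class. This is only an illustrative skeleton: the class name Follower, the injected callback priority_fn and the explicit now parameter are assumptions of the sketch, and networking, voting and state collection are omitted.

```python
import random
import time

# Timeout duration cells of the three-priority example, in milliseconds.
RANGES_MS = {1: (150, 200), 2: (200, 250), 3: (250, 300)}


class Follower:
    def __init__(self, priority_fn, rng=None):
        self.priority_fn = priority_fn     # hypothetical: state dict -> priority level
        self.rng = rng or random.Random()
        self.state = {}                    # node state information (tps, jitter, ...)
        self.priority = 2                  # initial node priority after start-up
        self.timeout_ms = None
        self.deadline = None               # monotonic time at which the timer fires

    def on_leader_message(self, now=None):
        """Handle a request message from the leader."""
        now = time.monotonic() if now is None else now
        # 1. Recompute the node priority from the updated node state information.
        self.priority = self.priority_fn(self.state)
        # 2. Update the election timeout from the cell of that priority.
        low, high = RANGES_MS[self.priority]
        self.timeout_ms = self.rng.uniform(low, high)
        # 3. Reset the election timer.
        self.deadline = now + self.timeout_ms / 1000.0

    def election_timed_out(self, now=None):
        """True once the election timer has expired without a new leader message."""
        now = time.monotonic() if now is None else now
        return self.deadline is not None and now >= self.deadline
```

Passing now explicitly makes the timer behaviour easy to check without sleeping; in a running node the default monotonic clock would be used.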

In the above process, the follower nodes have different node priorities because their node state information attributes, such as the number of times of receiving a write request and the propagation delay stability, differ, and they are accordingly configured with random election timeout durations of different levels. A node with a high priority becomes a candidate node more easily because it is allocated a shorter random election timeout duration.

The following takes as an example the first leader election performed after the leader node goes down unexpectedly. Assume that the updated node state information of each follower node is as shown in Table 1.

Table 1 updated node status information for each follower node

The distributed system comprises 5 nodes S1-S5. After the current leader node S1 turns into a follower node due to downtime, its node priority is set to the lowest level 3, and all five nodes are currently follower nodes. The throughput of the node, the number of times the node becomes the leader node, the number of times the follower node receives a write service request, the heartbeat message delay change of the follower node and the consensus delay of the node are represented in Table 1 by tps, leaderCount, requestCount, jitter and dealTime, respectively.

Tables 2 and 3 are respectively the attribute score relationship of two attributes as positive correlation factors and the attribute score relationship of three attributes as negative correlation factors in the embodiment of the present invention. Table 4 is a mapping relationship between node priority and node election timeout duration range in the embodiment of the present invention.

Table 2 Attribute score relationship for positive correlation factors

Table 3 Attribute score relationship for negative correlation factors

Table 4 mapping relationship between node priority and node election timeout period range

The embodiment of the invention can directly sum the score values of all the attributes to obtain the node state information total score updated by the follower node.

The remaining follower nodes S2 to S5 each determine their real-time node priority from Tables 1 to 4 according to the node priority calculation policy of the embodiment of the present invention. The node priorities determined for the five follower nodes are shown in Table 5.

Table 5 node priority for each follower node

In Table 5, the node priority is denoted by priority.

Then, each follower node determines the node election timeout duration range corresponding to its own node priority according to the preset mapping relation between the node priority and the node election timeout duration range, and randomly selects a value from that range as its updated node election timeout duration. According to the calculation, the node priority of the follower node S2 is 1, so by the mapping relation between the node priority and the node election timeout duration range, the random election timeout duration allocated to S2 is shorter than those of the follower nodes S3, S4 and S5, whose node priority is 2. Therefore, the follower node S2 reaches its election timeout first and converts to the candidate identity, sends RequestVote RPC messages to initiate a voting request, and converts to the leader state after obtaining agreeing votes from more than half of the nodes.
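The outcome of this example can be checked with a short simulation. The priorities used are those stated in the text (S1 at level 3, S2 at level 1, S3 to S5 at level 2) together with the 150-200ms, 200-250ms and 250-300ms cells; the function names and the use of a seeded random generator are assumptions of the sketch, and the exchange of votes is reduced to a simple majority check.

```python
import random

RANGES_MS = {1: (150, 200), 2: (200, 250), 3: (250, 300)}
# Node priorities as stated in the worked example.
PRIORITIES = {"S1": 3, "S2": 1, "S3": 2, "S4": 2, "S5": 2}


def first_to_time_out(priorities, rng):
    """Draw a timeout per node and return (node that times out first, all timeouts)."""
    timeouts = {node: rng.uniform(*RANGES_MS[p]) for node, p in priorities.items()}
    return min(timeouts, key=timeouts.get), timeouts


def wins_election(votes, cluster_size):
    """A candidate wins when its votes exceed half of the total number of nodes."""
    return 2 * votes > cluster_size


candidate, timeouts = first_to_time_out(PRIORITIES, random.Random(7))
# In a 5-node cluster, 3 agreeing votes are enough for the candidate to become leader.
elected = wins_election(3, len(PRIORITIES))
```

Since S2 is the only node drawing from the 150-200ms cell, it is the first to time out regardless of the seed, which matches the behaviour described in the example.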

In the practical application process, the specific values of tables 1 to 4 can be designed according to the conception of the node priority calculation strategy according to the embodiment of the invention in combination with scene demands to perform priority calculation, so as to determine the node election timeout duration after updating each follower node.

Therefore, the priority-based election timeout mechanism provided by the embodiment of the invention adds several priority influencing factors to the node state information of each node, namely the throughput of the node, the number of times the node becomes a leader node, the number of times the follower node receives a write service request, the heartbeat message delay change of the follower node and the consensus delay of the node, so that the node priority of each node can be calculated in real time. The priority-based random election timeout mechanism provided by the embodiment of the invention modifies the random election timeout mechanism of the Raft algorithm: nodes of different priorities are allocated random election timeout durations of different levels, with shorter timeout durations for high-priority nodes and longer timeout durations for low-priority nodes, so that high-priority nodes have a higher probability of becoming candidate nodes, thereby solving the problem of the random election timeout mechanism in the prior art. Meanwhile, the node priority calculation strategy of the embodiment of the invention contains both positive correlation factors and negative correlation factors, which ensures that the node priorities of all follower nodes change dynamically and that leader election follows the dynamically changing node priorities, giving stronger adaptability and flexibility. Therefore, the embodiment of the invention can effectively avoid the split-vote problem caused by random timeout durations, can improve system performance such as security, stability, liveness and operation efficiency, and can reduce the communication overhead of the system compared with the prior art.

In a second aspect, corresponding to the above embodiment of the method, as shown in fig. 3, an embodiment of the present invention further provides a distributed system, at least including a plurality of follower nodes, where:

Each follower node is used for calculating the corresponding node priority according to the node state information updated by the follower node under the preset trigger condition; updating the node election timeout duration of the node according to the mapping relation between the corresponding node priority and the preset node priority and the node election timeout duration range, so that when the updated node election timeout duration of the follower node in the distributed system arrives, a leader node election process is started; in the mapping relation between the node priority and the node election timeout duration range, the higher the node priority is, the shorter the node election timeout duration in the corresponding node election timeout duration range is.

The leader node and its communication relationship with each follower node are illustrated in fig. 3 by dashed boxes, indicating that the leader node is not necessarily present in the distributed system.

Optionally, the updated node status information includes:

Optionally, the determining process of the heartbeat message delay variation of the follower node includes:

Optionally, the update conditions corresponding to each attribute in the node state information include:

for the number of times the node becomes the leader node, the corresponding update condition is that the node identity is converted to the leader node;

for the consensus delay of the node, the corresponding update condition is that the leader node completes the write operation.

Optionally, for the first leader election of the distributed system, the preset trigger condition is the start of the distributed system; for each leader election of the distributed system after the first time, the preset trigger condition is that the follower node receives a request message sent by the leader node in the current time.

Optionally, when each follower node calculates the corresponding node priority according to the node state information updated by itself, the method is specifically used for:

Determining the score value of each attribute in the updated node state information of the follower node in the attribute score relation preset by the distributed system; the attribute score relation contains score values corresponding to all attributes in the updated node state information under different numerical ranges;

Obtaining the updated node state information total score of the follower node based on summation of the score values of all the attributes;

Determining the node priority of the follower node in a node priority dividing relation preset by a distributed system by utilizing the updated node state information total score of the follower node; the node priority dividing relation contains the node priority corresponding to the updated node state information total score under different numerical ranges.

Optionally, when updating the node election timeout duration according to the mapping relation between the corresponding node priority and the preset node priority and the node election timeout duration range, each follower node is specifically configured to:

Determining the self-updated node election timeout duration range according to the mapping relation between the corresponding node priority and the preset node priority and the node election timeout duration range;

And randomly determining a numerical value in the self-updated node election timeout duration range as the self-updated node election timeout duration.

Optionally, if there is a leader node in the distributed system, when the updated node election timeout duration of a follower node in the distributed system arrives, the leader node is further configured to step down to a follower node; the converted follower node calculates the corresponding node priority according to its previously determined updated node state information and updates its node election timeout duration, so as to participate in the leader node election process.

The specific content is described in the first aspect, and will not be described herein.

According to the distributed system provided by the embodiment of the invention, more node state information is added and calculated for the nodes, all follower nodes are assigned different priorities according to the node state information, and in the mapping relation a higher node priority corresponds to shorter node election timeout durations in the corresponding node election timeout duration range, so that a follower node with a higher priority reaches the timeout state sooner during election and restarts the leader node election process. Meanwhile, the split-vote problem caused by random timeout durations can be effectively avoided, so that system performance such as security, stability, liveness and operation efficiency can be improved and, compared with the prior art, the communication overhead of the system can be reduced.

In the description of the present specification, a description referring to the terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Further, one skilled in the art can join and combine the different embodiments or examples described in this specification. In this specification, the embodiments are described in a related manner; for identical and similar parts, the embodiments may be referred to one another, and each embodiment mainly describes its differences from the other embodiments. In particular, since the system embodiments are substantially similar to the method embodiments, their description is relatively brief; for relevant details, refer to the corresponding description of the method embodiments.

The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention are included in the protection scope of the present invention.

Claims

1. A priority-based consensus method for use in a distributed system, the method comprising:

Each follower node updates own node election timeout duration according to the mapping relation between the corresponding node priority and the preset node priority and node election timeout duration range, so that when the updated node election timeout duration of the follower node in the distributed system arrives, a leader node election process is started; in the mapping relation between the node priority and the node election timeout duration range, the higher the node priority is, the shorter the node election timeout duration in the corresponding node election timeout duration range is;

wherein the updated node status information includes:

The throughput of the node, the number of times the node becomes a leader node, the number of times the follower node receives a writing service request, the heartbeat message delay change of the follower node and the consensus delay of the node;

the update conditions corresponding to the attributes in the node state information comprise:

2. The priority-based consensus method according to claim 1, wherein the determining of the heartbeat message delay variation of the follower node comprises:

3. The priority-based consensus method according to claim 1, wherein the preset trigger condition is a start-up of the distributed system for a first leader election of the distributed system; and for each leader election of the distributed system after the first time, the preset trigger condition is that the follower node receives a request message sent by the leader node in the current time.

4. The priority-based consensus method according to claim 1, wherein each follower node calculates a corresponding node priority according to the self-updated node state information, comprising:

5. The priority-based consensus method according to claim 1 or 4, wherein each follower node updates its own node election timeout according to a mapping relationship between a corresponding node priority and a preset node priority and a node election timeout range, comprising:

6. The priority-based consensus method according to claim 1, wherein if there is a leader node in the distributed system, when there is a follower node in the distributed system that arrives at its updated node election timeout period, the method further comprises:

7. A distributed system comprising at least a plurality of follower nodes, wherein:

Each follower node is used for calculating the corresponding node priority according to the node state information updated by the follower node under the preset trigger condition; updating the node election timeout duration of the node according to the mapping relation between the corresponding node priority and the preset node priority and the node election timeout duration range, so that when the updated node election timeout duration of the follower node in the distributed system arrives, a leader node election flow is started; in the mapping relation between the node priority and the node election timeout duration range, the higher the node priority is, the shorter the node election timeout duration in the corresponding node election timeout duration range is;

wherein the updated node status information includes: