CN113259188A - Method for constructing large-scale redis cluster

Info

Publication number
CN113259188A
Authority
CN
China
Prior art keywords
node, nodes, management, slave, cluster
Prior art date
2021-07-15
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110798396.9A
Other languages
Chinese (zh)
Inventor
于光杰
刘启铨
曾力耕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Whale Cloud Technology Co Ltd
Original Assignee
Whale Cloud Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2021-07-15
Filing date
2021-07-15
Publication date
2021-08-13
Application filed by Whale Cloud Technology Co Ltd
Priority to CN202110798396.9A
Publication of CN113259188A

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L 41/06 Management of faults, events, alarms or notifications
    • H04L 41/0677 Localisation of faults
    • H04L 67/00 Network arrangements or protocols for supporting network services or applications
    • H04L 67/01 Protocols
    • H04L 67/10 Protocols in which an application is distributed across nodes in the network
    • H04L 67/14 Session management
    • H04L 67/141 Setup of application sessions

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a method for constructing a large-scale redis cluster, comprising the following steps: establishing communication connections with other remote dictionary service nodes according to the different attributes of the remote dictionary service nodes; performing information synchronization through the established communication connections and the communication connection module based on the cluster nodes; if a node in the cluster fails, each management node performs a negotiated fault judgment on that node; and if a slave node initiates an election vote, each management node votes to elect a new master node, which then replaces the failed master node. Advantages: the invention greatly reduces the number of network connections that must be established for communication among cluster nodes; it also effectively reduces the network bandwidth occupied by inter-node communication and thus the host resources consumed, so that the cluster can scale to thousands of nodes.

Description

Method for constructing large-scale redis cluster
Technical Field
The invention relates to the field of distributed cache, in particular to a method for constructing a large-scale redis cluster.
Background
The open-source Redis (remote dictionary service) product provides a Redis Cluster mode that adopts a cluster without a central architecture. To keep the cluster state information uniform, the nodes in the cluster must exchange information: every node in the cluster connects to every other node, and the message-exchange mode adopted is called Gossip. Each Redis node periodically exchanges information with other nodes and must complete an information exchange with all nodes within a certain time, so that all nodes finally reach a consistent view of the information.
As shown in fig. 5, because the cluster nodes in this construction are connected pairwise, the number of cluster network connections grows quadratically as the number of nodes increases; and as the node count grows, the network bandwidth occupied by cluster communication also rises sharply, consuming more host resources. This limits the scale at which a cluster can be built, and more than 200 nodes is not recommended.
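For a sense of scale, the following short sketch (illustrative only, not part of the patent) computes the pairwise link count for a few cluster sizes:

```python
# A sketch (not from the patent) of full-mesh link growth: with pairwise
# connections, the link count is n * (n - 1) / 2, i.e. quadratic in n.

def full_mesh_connections(n: int) -> int:
    """Links needed when every node connects to every other node."""
    return n * (n - 1) // 2

for n in (100, 200, 1000):
    print(f"{n:5d} nodes -> {full_mesh_connections(n):7d} connections")
# 100 nodes  ->    4950 connections
# 200 nodes  ->   19900 connections
# 1000 nodes ->  499500 connections
```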
An effective solution to the problems in the related art has not been proposed yet.
Disclosure of Invention
Aiming at the problems in the related art, the invention provides a method for constructing a large-scale redis cluster, so as to overcome the technical problems in the prior related art.
Therefore, the invention adopts the following specific technical scheme:
a method of constructing a large-scale redis cluster, the method comprising the steps of:
S1, establishing communication connections with other remote dictionary service nodes according to the different attributes of the remote dictionary service nodes;
S2, performing information synchronization through the established communication connections and the communication connection module based on the cluster nodes;
S3, if a node in the cluster fails, each management node performs a negotiated fault judgment on that node;
S4, if a slave node initiates an election vote, each management node votes to elect a new master node, and the new master node replaces the failed master node;
The cluster nodes are divided into management nodes and common nodes; the management nodes connect to and communicate with all other nodes and perform node fault judgment and master-slave failover election, while the common nodes connect to and communicate only with the management nodes and the nodes in a master-slave relationship with them.
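As an illustration of this division of roles, a minimal sketch follows (in Python; the Node type and its role/shard fields are hypothetical names introduced for illustration, not part of the invention):

```python
# A minimal sketch, assuming a hypothetical Node type, of the connection rule:
# management nodes connect to everyone; common nodes connect only to the
# management nodes and to the master/slave peers of their own data shard.
from dataclasses import dataclass

@dataclass
class Node:
    node_id: str
    role: str    # "management" or "common"
    shard: int   # id of the data shard this node serves

def connection_targets(node: Node, cluster: list[Node]) -> list[Node]:
    others = [p for p in cluster if p.node_id != node.node_id]
    if node.role == "management":
        return others  # management node: full connectivity
    return [p for p in others
            if p.role == "management" or p.shard == node.shard]
```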
Further, if a node in the cluster fails in S3, the negotiated fault judgment performed by each management node on that node further includes the following steps:
S31, the management node starts and establishes long-term communication connections with all other nodes;
S32, if the management node fails to establish a connection with a node, it retries periodically until the connection is established;
S33, the common node starts and establishes long-term communication connections with all management nodes and with its own master/slave nodes;
S34, if the common node fails to establish a connection with another node, it retries periodically until the connection is established;
S35, each node performs one heartbeat communication with all connected nodes within a timeout period t1 and records the initiation time of the heartbeat request packet of that communication;
S36, after receiving the heartbeat request packet of another node in step S35, each node checks the state of each node it contains and replies with a heartbeat response packet;
S37, after receiving the heartbeat response packet of another node in step S36, each node resets the initiation time of the heartbeat request packet to 0 and processes the state of each node carried in the packet.
Further, the total number of connections finally established by the management nodes is expressed as:
E(m, n) = (m - 1 + n) × m, where m is the number of management nodes and n is the number of common nodes.
Further, the total number of connections finally established by the common nodes is expressed as:
m × n ≤ F(m, n, r) ≤ (m + r - 1) × n, where m is the number of management nodes, n is the number of common nodes, and r is the number of data copies.
Further, the contents of the heartbeat request packet and the heartbeat response packet each include the node's shard information and the state information of 10% of the nodes.
Further, in S35, each node performs one heartbeat communication with all connected nodes within the timeout period t1 and records the initiation time of the heartbeat request packet; this further includes the following steps:
S351, if a node's heartbeat request packet from step S35 receives no heartbeat response packet of step S36 within half the timeout period (t1/2), the node disconnects the peer and reestablishes the connection as in steps S31-S34;
S352, if a node's heartbeat request packet from step S35 receives no heartbeat response packet of step S36 within the timeout period t1, the peer node is marked as being in a suspected fault state.
Further, after receiving the heartbeat request packet of another node in step S35, each node in S36 checks the state of each node contained in the packet and replies with a heartbeat response packet; this further includes the following steps:
S361, if the sending node is a management node and the heartbeat request packet contains a node marked as being in a suspected fault state, adding that node to the suspected-fault report list recorded for it on the receiving node;
S362, if the receiving node is a management node and the number of reports for a node in the suspected fault state exceeds half of the management nodes, changing that node's state from suspected fault to fault, broadcasting a notification to all other surviving nodes, and marking the failed node's state as fault;
S363, if the sending node is a management node and the heartbeat request packet contains a node marked as being in a fault state, marking the corresponding node state recorded on the receiving node as fault;
S364, if the sending node is a management node and the heartbeat request packet contains a node marked as being in a normal state, deleting that node from the suspected-fault report list recorded on the receiving node;
S365, if the receiving node is a common node and its own state is recorded as fault, its state needs to be restored to normal;
after receiving the heartbeat response packet of another node in step S36, each node in S37 resets the initiation time of the heartbeat request packet to 0 and processes the state of each node carried in the packet, with the same processing steps as S361-S365.
Further, if the number of reports of a suspected failed node recorded by a management node is represented as x, the condition for judging the node as failed is:
x > m/2
where m represents the number of management nodes.
Further, in S4, if a slave node initiates an election vote, each management node votes to elect a new master node, and the new master node replacing the failed master node further includes the following steps:
S41, when the state of a slave node's master node is fault, the slave node sends a master-slave failover voting request to all management nodes;
S42, after receiving the failover voting request of step S41, the management node replies with a response packet to the slave node that initiated the vote, provided it has not already voted for that failed master within the voting period 2 × t1;
S43, after receiving a voting response returned by a management node in step S42, the slave node adds 1 to its vote count;
S44, if the vote count exceeds half of the management nodes, the slave node switches itself to master, notifies all other connected nodes by broadcast to complete the master-slave switch of the data shard, and selects one management node to perform a secondary broadcast notification;
S45, after receiving the master-slave switch notification of step S44, each node updates the new master node of the corresponding data shard and updates the original master node to be a slave node; if the node is the management node designated for the secondary broadcast, it resends the master-slave switch notification of the data shard to all other connected nodes;
S46, after the failed node restarts, it obtains the cluster node shard information from other nodes, switches itself to a slave node according to that information, and rejoins the cluster.
Further, if the number of votes in S44 is represented as y, the condition under which the slave node may perform the master-slave switch is:
y > m/2
where m represents the number of management nodes.
The invention has the following beneficial effects:
The method for constructing a large-scale redis cluster divides the cluster nodes into management nodes and common nodes. The management nodes connect to and communicate with all other nodes and hold the rights of node fault judgment and failover voting; the common nodes connect to and communicate only with the management nodes and the nodes in a master-slave relationship with them. Because a small number of management nodes link all cluster nodes together, the number of network connections that must be established for inter-node communication is greatly reduced; and because the large number of common nodes exchange cluster traffic only with the management nodes, the network bandwidth occupied by inter-node communication, and hence the host resources consumed, are effectively reduced, so that a Redis Cluster of thousands of nodes can be supported.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. Obviously, the drawings described below show only some embodiments of the present invention, and other drawings can be obtained from them by those of ordinary skill in the art without creative effort.
FIG. 1 is a flow chart of a method of building a large-scale redis cluster according to an embodiment of the invention;
FIG. 2 is a network connection topology diagram of cluster nodes of a method for constructing a large-scale redis cluster according to an embodiment of the present invention;
FIG. 3 is a timing diagram of node failure determination of a method for constructing a large-scale redis cluster according to an embodiment of the present invention;
FIG. 4 is a timing diagram illustrating master-slave failure elections in a method for constructing a large-scale redis cluster according to an embodiment of the present invention;
fig. 5 is a diagram of an existing cluster construction method.
Detailed Description
To further explain the various embodiments, the accompanying drawings form a part of the disclosure and are incorporated in this specification. They illustrate the embodiments and, together with the description, serve to explain their principles of operation and to enable others of ordinary skill in the art to understand the various embodiments and the advantages of the invention. The figures are not to scale, and like reference numerals generally refer to like elements.
According to an embodiment of the invention, a method for constructing a large-scale redis cluster is provided; based on the gossip protocol, a cluster constructed by this method can reach a scale of thousands of nodes.
The invention will now be described with reference to the drawings and the detailed description. As shown in figs. 1-2, a method for constructing a large-scale redis cluster (redis being an open-source key-value caching system in very wide use) according to an embodiment of the present invention includes the following steps:
S1, selectively establishing communication connections with other remote dictionary service (redis) nodes according to the different attributes of those nodes; (cluster communication connection)
S2, synchronizing information among the cluster nodes through the established communication connections and the communication connection module based on the cluster nodes, so that the shard information held by every node in the cluster finally becomes consistent; (cluster shard information synchronization)
S3, if a node in the cluster fails, each management node performs a negotiated fault judgment on that node;
As shown in fig. 3, if a node in the cluster fails in S3, the negotiated fault judgment performed by each management node on that node further includes the following steps:
S31, the management node starts and establishes long-term communication connections with all other nodes;
S32, if the management node fails to establish a connection with a node, it retries periodically until the connection is established; (a management node needs to connect to all other nodes so that it can detect the survival status of each node)
The total number of connections finally established by the management nodes can be expressed as:
E(m, n) = (m - 1 + n) × m, where m is the number of management nodes and n is the number of common nodes.
S33, the common node starts and establishes long-term communication connections with all management nodes and with its own master/slave nodes;
S34, if the common node fails to establish a connection with another node, it retries periodically until the connection is established; (a common node only needs to establish connections with the management nodes and its master/slave nodes, which greatly reduces the cluster's connection count and network bandwidth usage)
The total number of connections finally established by the common nodes can be expressed as:
m × n ≤ F(m, n, r) ≤ (m + r - 1) × n, where m is the number of management nodes, n is the number of common nodes, and r is the number of data copies (i.e. the number of master and slave nodes contained in each data shard).
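As a rough illustration, the following sketch evaluates these expressions; the function names and the example figures (m = 10 management nodes, n = 990 common nodes, r = 2 copies) are assumptions chosen for illustration, not values from the patent:

```python
# A sketch evaluating the patent's connection-count expressions with
# illustrative (assumed) parameters.

def mgmt_connections(m: int, n: int) -> int:
    """E(m, n) = (m - 1 + n) * m: each management node links to the other
    m - 1 management nodes and to all n common nodes."""
    return (m - 1 + n) * m

def common_connection_bounds(m: int, n: int, r: int) -> tuple[int, int]:
    """m*n <= F(m, n, r) <= (m + r - 1)*n: each common node links to all m
    management nodes plus up to r - 1 master/slave peers of its shard."""
    return m * n, (m + r - 1) * n

m, n, r = 10, 990, 2
print(mgmt_connections(m, n))             # (10 - 1 + 990) * 10 = 9990
print(common_connection_bounds(m, n, r))  # (9900, 10890)
# A 1000-node full mesh would instead need 1000 * 999 // 2 = 499500 links.
```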
S35, each node performs one heartbeat communication with all connected nodes within the timeout period t1; the heartbeat request packet contains the node's shard information and the state information of 10% of the nodes, and the initiation time of the heartbeat request packet is recorded; (this applies the idea of the Gossip protocol: a node that wants to share information with the rest of the cluster periodically selects nodes at random and passes the information on)
In S35, each node performs one heartbeat communication with all connected nodes within the timeout period t1 and records the initiation time of the heartbeat request packet; this further includes the following steps:
S351, if a node's heartbeat request packet from step S35 receives no heartbeat response packet of step S36 within half the timeout period (t1/2), the node disconnects the peer and reestablishes the connection as in steps S31-S34; (reestablishing the connection when no response arrives within t1/2 avoids misjudging a node as abnormal because of poor network quality)
S352, if a node's heartbeat request packet from step S35 receives no heartbeat response packet of step S36 within the timeout period t1, the peer node is marked as being in a suspected fault (PFAIL) state.
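A minimal sketch of these two timeout rules might look as follows (the Peer class, the reconnect stub, and the value of t1 are assumptions introduced for illustration):

```python
# A minimal sketch of the S35/S351/S352 timeout rules; Peer, reconnect()
# and the value of T1 are illustrative assumptions.
import time

T1 = 15.0  # timeout period t1 in seconds (illustrative value)

class Peer:
    def __init__(self, node_id: str):
        self.node_id = node_id
        self.state = "NORMAL"         # NORMAL / PFAIL / FAIL
        self.heartbeat_sent_at = 0.0  # 0 means no heartbeat in flight (S37)

def reconnect(peer: Peer) -> None:
    """Stub: tear down and re-establish the connection (steps S31-S34)."""

def check_heartbeat_timeouts(peers: list[Peer]) -> None:
    now = time.monotonic()
    for peer in peers:
        if peer.heartbeat_sent_at == 0.0:
            continue  # response already received; nothing pending
        elapsed = now - peer.heartbeat_sent_at
        if elapsed > T1:
            peer.state = "PFAIL"  # S352: mark as suspected failure
        elif elapsed > T1 / 2:
            reconnect(peer)       # S351: reconnect at half the timeout
```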
S36, after receiving the heartbeat request packet of another node in step S35, each node checks the state of each node it contains and replies with a heartbeat response packet, whose content likewise includes the node's shard information and the state information of 10% of the nodes; (as in step S35, applying the idea of the Gossip protocol)
After receiving the heartbeat request packet of another node in step S35, each node in S36 checks the state of each node contained in the packet and replies with a heartbeat response packet; this further includes the following steps:
S361, if the sending node is a management node and the heartbeat request packet contains a node marked as being in a suspected fault (PFAIL) state, adding that node to the suspected-fault report list recorded for it on the receiving node;
S362, if the receiving node is a management node and the number of reports for a node in the suspected fault state exceeds half of the management nodes, changing that node's state from suspected fault (PFAIL) to fault (FAIL), broadcasting a notification to all other surviving nodes, and marking the failed node's state as fault (FAIL);
If the number of reports of a suspected failed node recorded by a management node is represented as x, the condition for judging the node as failed (FAIL) is:
x > m/2
where m represents the number of management nodes.
S363, if the sending node is a management node and the heartbeat request packet contains a node marked as being in a fault (FAIL) state, marking the corresponding node state recorded on the receiving node as fault (FAIL);
S364, if the sending node is a management node and the heartbeat request packet contains a node marked as being in a normal state, deleting that node from the suspected-fault report list recorded on the receiving node;
S365, if the receiving node is a common node and its own state is recorded as fault (FAIL), its state needs to be restored to normal;
S37, after receiving the heartbeat response packet of another node in step S36, each node resets the initiation time of the heartbeat request packet to 0 and processes the state of each node carried in the packet, with the same processing steps as S361-S365. (resetting the request-packet initiation time to 0 indicates that detection is complete and no timeout judgment is pending, so heartbeat detection can run again in the next cycle)
S4, if a slave node initiates an election vote, each management node votes to elect a new master node, and the new master node replaces the failed master node; (steps S3 and S4 together implement failure election for the cluster nodes)
As shown in fig. 4, if a slave node initiates an election vote in S4, each management node votes to elect a new master node, and replacing the failed master node with the new one further includes the following steps:
S41, when the state of a slave node's master node is fault (FAIL), the slave node sends a master-slave failover voting request to all management nodes;
S42, after receiving the failover voting request of step S41, the management node replies with a response packet to the slave node that initiated the vote, provided it has not already voted for that failed master within the voting period 2 × t1; (within each voting period, a management node may respond only once to the failover votes for a given master node, which ensures that at most one slave node can obtain enough votes)
S43, after receiving a voting response returned by a management node in step S42, the slave node adds 1 to its vote count;
S44, if the vote count exceeds half of the management nodes, the slave node switches itself to master, notifies all other connected nodes by broadcast to complete the master-slave switch of the data shard, and selects one management node to perform a secondary broadcast notification; (the master-slave switch may only proceed when more than half of the management nodes have voted for it, so only one slave node will become the new master)
If the number of votes is represented as y, the condition under which the slave node may perform the master-slave switch is:
y > m/2
where m represents the number of management nodes.
S45, after receiving the master-slave switch notification of step S44, each node updates the new master node of the corresponding data shard and updates the original master node to be a slave node; if the node is the management node designated for the secondary broadcast, it resends the master-slave switch notification of the data shard to all other connected nodes; (since a management node is connected to every other node in the cluster, performing the secondary broadcast through a management node synchronizes the master-slave switch information to all nodes faster than Gossip propagation would)
S46, after the failed node restarts, it obtains the cluster node shard information from other nodes, switches itself to a slave node according to that information, and rejoins the cluster.
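Both sides of the voting exchange in S41-S44 might be sketched as follows (all names and the vote bookkeeping structure are illustrative assumptions; only the 2 × t1 voting period and the y > m/2 majority rule come from the patent):

```python
# A minimal sketch of the failover vote (S41-S44) with assumed names.
import time

M = 10     # management node count (illustrative)
T1 = 15.0  # heartbeat timeout t1 (illustrative)

# --- management-node side (S42): one vote per failed master per 2*t1 ---
last_vote_at: dict[str, float] = {}  # failed master id -> time of last vote

def grant_vote(failed_master_id: str) -> bool:
    now = time.monotonic()
    last = last_vote_at.get(failed_master_id)
    if last is not None and now - last < 2 * T1:
        return False  # already voted for this master in the current period
    last_vote_at[failed_master_id] = now
    return True

# --- slave side (S43/S44): count responses, switch on a strict majority ---
def may_switch_to_master(votes_granted: int) -> bool:
    """True when the slave may promote itself (condition y > m/2)."""
    return votes_granted > M / 2
```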
In summary, in the method for constructing a large-scale redis cluster of the present invention, the cluster nodes are divided into management nodes and common nodes. The management nodes connect to and communicate with all other nodes and hold the rights of node fault judgment and failover voting; the common nodes connect to and communicate only with the management nodes and the nodes in a master-slave relationship with them. Because a small number of management nodes link all cluster nodes together, the number of network connections that must be established for inter-node communication is greatly reduced; and because the large number of common nodes exchange cluster traffic only with the management nodes, the network bandwidth occupied by inter-node communication, and hence the host resources consumed, are effectively reduced, so that a Redis Cluster of thousands of nodes can be supported.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (10)

1. A method of constructing a large-scale redis cluster, the method comprising the steps of:
S1, establishing communication connections with other remote dictionary service nodes according to the different attributes of the remote dictionary service nodes;
S2, performing information synchronization through the established communication connections and the communication connection module based on the cluster nodes;
S3, if a node in the cluster fails, each management node performs a negotiated fault judgment on that node;
S4, if a slave node initiates an election vote, each management node votes to elect a new master node, and the new master node replaces the failed master node;
wherein the cluster nodes are divided into management nodes and common nodes; the management nodes connect to and communicate with all other nodes and perform node fault judgment and master-slave failover election, while the common nodes connect to and communicate only with the management nodes and the nodes in a master-slave relationship with them.
2. The method for constructing a large-scale redis cluster according to claim 1, wherein if a node in the cluster fails in S3, the negotiated fault judgment performed by each management node on that node further comprises the following steps:
S31, the management node starts and establishes long-term communication connections with all other nodes;
S32, if the management node fails to establish a connection with a node, it retries periodically until the connection is established;
S33, the common node starts and establishes long-term communication connections with all management nodes and with its own master/slave nodes;
S34, if the common node fails to establish a connection with another node, it retries periodically until the connection is established;
S35, each node performs one heartbeat communication with all connected nodes within a timeout period t1 and records the initiation time of the heartbeat request packet of that communication;
S36, after receiving the heartbeat request packet of another node in step S35, each node checks the state of each node it contains and replies with a heartbeat response packet;
S37, after receiving the heartbeat response packet of another node in step S36, each node resets the initiation time of the heartbeat request packet to 0 and processes the state of each node carried in the packet.
3. The method for constructing a large-scale redis cluster according to claim 2, wherein the total number of connections finally established by the management nodes is expressed as:
E(m, n) = (m - 1 + n) × m, where m is the number of management nodes and n is the number of common nodes.
4. The method for constructing a large-scale redis cluster according to claim 3, wherein the total number of connections finally established by the common nodes is expressed as:
m × n ≤ F(m, n, r) ≤ (m + r - 1) × n,
where m is the number of management nodes, n is the number of common nodes, and r is the number of data copies.
5. The method for constructing a large-scale redis cluster according to claim 4, wherein the contents of the heartbeat request packet and the heartbeat response packet each include the node's shard information and the state information of 10% of the nodes.
6. The method for constructing a large-scale redis cluster according to claim 5, wherein in S35 each node performs one heartbeat communication with all connected nodes within the timeout period t1 and records the initiation time of the heartbeat request packet, further comprising the following steps:
S351, if a node's heartbeat request packet from step S35 receives no heartbeat response packet of step S36 within half the timeout period (t1/2), the node disconnects the peer and reestablishes the connection as in steps S31 to S34;
S352, if a node's heartbeat request packet from step S35 receives no heartbeat response packet of step S36 within the timeout period t1, the peer node is marked as being in a suspected fault state.
7. The method for constructing a large-scale redis cluster according to claim 6, wherein after receiving the heartbeat request packet of another node in step S35, each node in S36 checks the state of each node contained in the packet and replies with a heartbeat response packet, further comprising the following steps:
S361, if the sending node is a management node and the heartbeat request packet contains a node marked as being in a suspected fault state, adding that node to the suspected-fault report list recorded for it on the receiving node;
S362, if the receiving node is a management node and the number of reports for a node in the suspected fault state exceeds half of the management nodes, changing that node's state from suspected fault to fault, broadcasting a notification to all other surviving nodes, and marking the failed node's state as fault;
S363, if the sending node is a management node and the heartbeat request packet contains a node marked as being in a fault state, marking the corresponding node state recorded on the receiving node as fault;
S364, if the sending node is a management node and the heartbeat request packet contains a node marked as being in a normal state, deleting that node from the suspected-fault report list recorded on the receiving node;
S365, if the receiving node is a common node and its own state is recorded as fault, its state needs to be restored to normal;
wherein after receiving the heartbeat response packet of another node in step S36, each node in S37 resets the initiation time of the heartbeat request packet to 0 and processes the state of each node carried in the packet, with the same processing steps as S361-S365.
8. The method for constructing a large-scale redis cluster according to claim 7, wherein if the number of reports of a suspected failed node recorded by the management node is represented as x, the condition for judging the node as failed is:
x > m/2
where m represents the number of management nodes.
9. The method for constructing a large-scale redis cluster according to claim 1, wherein in S4, if a slave node initiates an election vote, each management node votes to elect a new master node, and the new master node replacing the failed master node further comprises:
S41, when the state of a slave node's master node is fault, the slave node sends a master-slave failover voting request to all management nodes;
S42, after receiving the failover voting request of step S41, the management node replies with a response packet to the slave node that initiated the vote, provided it has not already voted for that failed master within the voting period 2 × t1;
S43, after receiving a voting response returned by a management node in step S42, the slave node adds 1 to its vote count;
S44, if the vote count exceeds half of the management nodes, the slave node switches itself to master, notifies all other connected nodes by broadcast to complete the master-slave switch of the data shard, and selects one management node to perform a secondary broadcast notification;
S45, after receiving the master-slave switch notification of step S44, each node updates the new master node of the corresponding data shard and updates the original master node to be a slave node; if the node is the management node designated for the secondary broadcast, it resends the master-slave switch notification of the data shard to all other connected nodes;
S46, after the failed node restarts, it obtains the cluster node shard information from other nodes, switches itself to a slave node according to that information, and rejoins the cluster.
10. The method for constructing a large-scale redis cluster according to claim 9, wherein if the number of votes in S44 is represented as y, the condition under which the slave node may perform the master-slave switch is:
y > m/2
where m represents the number of management nodes.
CN202110798396.9A 2021-07-15 2021-07-15 Method for constructing large-scale redis cluster Pending CN113259188A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110798396.9A CN113259188A (en) 2021-07-15 2021-07-15 Method for constructing large-scale redis cluster

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110798396.9A CN113259188A (en) 2021-07-15 2021-07-15 Method for constructing large-scale redis cluster

Publications (1)

Publication Number Publication Date
CN113259188A (en) 2021-08-13

Family

ID=77180380

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110798396.9A Pending CN113259188A (en) 2021-07-15 2021-07-15 Method for constructing large-scale redis cluster

Country Status (1)

Country Link
CN (1) CN113259188A (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3422668A4 (en) * 2016-05-16 2019-09-25 Bai, Yang Bai yang messaging port switch service
CN106210151A (en) * 2016-09-27 2016-12-07 深圳市彬讯科技有限公司 A kind of zedis distributed caching and server cluster monitoring method
CN106656624A (en) * 2017-01-04 2017-05-10 合肥康捷信息科技有限公司 Optimization method based on Gossip communication protocol and Raft election algorithm
CN111506421A (en) * 2020-04-02 2020-08-07 浙江工业大学 Availability method for realizing Redis cluster

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113783735A (en) * 2021-09-24 2021-12-10 小红书科技有限公司 Method, device, equipment and medium for identifying fault node in Redis cluster
CN114363357A (en) * 2021-12-28 2022-04-15 山东浪潮科学研究院有限公司 Distributed database network connection management method based on Gossip
CN114363357B (en) * 2021-12-28 2024-01-19 上海沄熹科技有限公司 Distributed database network connection management method based on Gossip
CN114666202A (en) * 2022-03-18 2022-06-24 中国建设银行股份有限公司 Monitoring method and device for master-slave switching based on cloud database
CN114666202B (en) * 2022-03-18 2024-04-26 中国建设银行股份有限公司 Monitoring method and device for master-slave switching based on cloud database


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210813