WO2016150066A1 - Master node election method and apparatus, and storage system

Info

Publication number: WO2016150066A1
Application number: PCT/CN2015/086169
Authority: WO
Inventors: 陈正华; 郭斌; 陈典强; 韩银俊
Original assignee: 中兴通讯股份有限公司
Priority date: 2015-03-25
Filing date: 2015-08-05
Publication date: 2016-09-29
Also published as: CN106161495A

Abstract

The present invention provides a master node election method and apparatus, and a storage system. The master node election method comprises: a first node in a replication group establishes a connection with other nodes in the replication group, the first node determines whether a master node exists among the other nodes, if not, the first node sends an election request message to the other nodes, after receiving the election request message, the other nodes reply an election result according to an election policy, and the first node determines, according to the election result, whether to be switched to the master node. There is always only one master node in a replication group in a distributed storage system, thereby avoiding the problem of data inconsistency between nodes.

Description

Master node election method, apparatus, and storage system

Technical field

The present invention relates to the field of distributed storage, and in particular, to a primary node election method, apparatus, and storage system.

Background

Cloud computing is the product of the integration of traditional computer and network technologies such as grid computing, distributed computing, parallel computing, utility computing, network storage technologies, virtualization, and load balancing. It aims to integrate multiple relatively low-cost computing entities, through the network, into a system with powerful computing capability. Distributed storage is one area within cloud computing; its role is to provide distributed storage services for massive data, with high-speed read and write access.

Nodes in a distributed storage system are stateful: the data stored on each node may differ, so nodes cannot simply be substituted for one another, and disaster recovery is correspondingly more complicated. In a distributed storage system with a master-slave architecture, the master node usually provides the read and write services and synchronizes data to the slave nodes in real time. When the master node fails, reads and writes switch to a slave node, thereby achieving disaster recovery. The switchover must never produce two or more master nodes, in order to avoid data consistency problems. At the same time, the switchover should complete as quickly as possible to reduce system downtime.

With traditional system scales and network environments, this problem is relatively easy to solve. Usually the whole system sits behind the same switch, that is, the topology is a star network and each node has only one network exit. If the master node suffers a network failure, neither the slave node nor the application server can connect to it. Therefore, the slave node only needs to monitor the status of the master node; when the master node becomes unavailable, the slave node automatically switches state and takes over the read and write requests. Correspondingly, the application server can also detect the master node failure and switch to the slave node for reading and writing.

In a distributed storage system, the system is usually much larger. Multiple writers access the data storage system through different networks, the reliability of the network connections is greatly reduced, and network partitioning may occur. As shown in FIG. 1, when the slave node detects that the master node is unreachable, the master node may still be working normally. If the slave node then automatically switches to the master state, two or more master nodes will exist. When the application server detects multiple master nodes, it stops writing, so the related service is interrupted, or the data on the nodes becomes inconsistent.

Summary of the invention

In order to solve the problem in the related art that data consistency between nodes is difficult to guarantee, the present invention provides a master node election method, apparatus, and storage system that solve at least the above problem.

According to an aspect of the embodiments of the present invention, a master node election method is provided, including: a first node in a replication group establishes a connection with other nodes in the replication group; the first node determines whether a master node exists among the other nodes; if not, the first node sends an election request message to the other nodes, where the election request message is used by the other nodes to reply with an election result according to an election policy; and the first node determines, according to the election result, whether to switch to the master node.

Optionally, the other nodes replying with the election result according to the election policy includes at least one of the following: if the election request message received by the other node is the first request message within a preset time period, the other node replies with a consent message, where the first node switches to the master node if the number of consent messages reaches a preset threshold; the other node replies with a weight value according to the data information carried in the election request message and the election policy, where the first node switches to the master node if the sum of its weight values is the largest among all nodes; if a master node exists among the other nodes, the other node replies with a rejection message.

Optionally, before the first node sends the election request message to the other nodes, the method includes: sending the election request message to the other nodes if the priority of the first node meets an election condition, where the priority is preset.

Optionally, the first node establishing a connection with other nodes in the replication group includes: the number of nodes in the replication group that have established a connection with the first node reaches a preset threshold.

Optionally, there are at least two replication groups, all disposed on the same physical server.

Optionally, the number of nodes in the replication group is at least three.

According to another aspect of the embodiments of the present invention, a master node election apparatus is provided, which is disposed on the first node and includes: a connection module, configured to establish a connection with other nodes in the replication group; a query module, configured to determine whether a master node exists among the other nodes; an election request module, configured to send an election request message to the other nodes if no master node exists among them, where the election request message is used by the other nodes to reply with an election result according to an election policy; and a switching module, configured to determine, according to the election result, whether to switch the first node to the master node.

Optionally, the switching module includes at least one of the following: a first triggering unit, configured to switch the first node to the master node if the number of consent messages reaches a preset threshold, where a consent message is replied by the other node if the election request message it received is the first request message within a preset time period; and a second triggering unit, configured to switch the first node to the master node if the sum of the weight values of the first node is the largest among all nodes, where a weight value is replied by the other node according to the data information carried in the election request message and the election policy. If a master node exists among the other nodes, the other node replies with a rejection message.

Optionally, the election request module includes: a priority unit, configured to trigger the election request module to send an election request message to the other nodes if the priority of the first node meets an election condition, where the priority is preset.

Optionally, the number of nodes in the replication group is at least three.

According to still another aspect of the embodiments of the present invention, a storage system is provided, where the storage system includes at least one replication group, and the replication group includes:

The first node, including:

a connection module, configured to establish a connection with other nodes in the replication group; a query module, configured to determine whether a master node exists among the other nodes; an election request module, configured to send an election request message to the other nodes if no master node exists among them, where the election request message is used by the other nodes to reply with an election result according to an election policy; and a switching module, configured to determine, according to the election result, whether to switch to the master node;

The other node is configured to reply to the election result according to the election request message and the election policy.

According to the embodiments of the present invention, a first node in a replication group establishes a connection with other nodes in the replication group; the first node determines whether a master node exists among the other nodes; if not, the first node sends an election request message to the other nodes, where the election request message is used by the other nodes to reply with an election result according to an election policy; and the first node determines, according to the election result, whether to switch to the master node. As a result, a replication group in the distributed storage system always has only one master node, which avoids data inconsistency between nodes.

Brief description of the drawings

The drawings described herein provide a further understanding of the invention and form a part of this application. In the drawings:

FIG. 1 is a schematic diagram of a network partition in a distributed system according to an embodiment of the present invention;

FIG. 2 is a first flowchart of a master node election method according to an embodiment of the present invention;

FIG. 3 is a second flowchart of a master node election method according to an embodiment of the present invention;

FIG. 4 is a first structural block diagram of a master node election apparatus according to an embodiment of the present invention;

FIG. 5 is a second structural block diagram of a master node election apparatus according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of a storage system application according to an embodiment of the present invention.

Detailed description

The invention will be described in detail below with reference to the drawings in conjunction with the embodiments. It should be noted that the embodiments in the present application and the features in the embodiments may be combined with each other without conflict.

The embodiment of the present invention provides a master node election method. FIG. 2 is a first flowchart of a master node election method according to an embodiment of the present invention. As shown in FIG. 2, the method includes:

S202. The first node in the replication group establishes a connection with other nodes in the replication group. The nodes in the replication group are connected to one another and store the same data, and the replication group guarantees the availability and consistency of the stored data through node redundancy and related scheduling algorithms. A distributed storage system may consist of one or more such replication groups and provides a data storage service to the application server.

S204. The first node determines whether a master node exists among the other nodes in the replication group; it may do so by polling the identities of all nodes.

S206. If the primary node does not exist, the first node sends an election request message to other nodes in the replication group, and the other nodes reply to the election result according to the corresponding election policy after receiving the election request message.

S208. The first node determines, according to the election result replied by the other node, whether it switches to the master node.

If a master node is found to already exist during the determination, the first node keeps its slave node identity. It can then start the real-time data synchronization process and monitor the status of the master node, re-executing step S204 when the master node is found to have failed.

According to the embodiment of the present invention, the first node in the replication group establishes a connection with other nodes in the replication group, and the first node determines whether a master node exists among the other nodes. If not, the first node sends an election request message to the other nodes. After receiving the election request message, the other nodes reply with an election result according to the election policy, and the first node determines, according to the election result, whether to switch to the master node. As a result, a replication group in the distributed storage system always has only one master node, which avoids data inconsistency between nodes.
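Steps S204 to S208 can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Peer` class, the `decide_role` function, and the boolean `agrees` field standing in for a peer's election policy are hypothetical names, and counting the candidate's own consent is an assumption that the text does not state explicitly.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Peer:
    """Hypothetical view of one connected node in the replication group."""
    name: str
    is_master: bool = False
    agrees: bool = True  # stand-in for the peer's election policy


def decide_role(peers: List[Peer], threshold: int) -> str:
    """Return 'master' or 'slave' for the first node after one round."""
    # S204: poll the identities of the connected peers to look for a master.
    if any(p.is_master for p in peers):
        return "slave"  # keep the slave identity and synchronize from the master
    # S206: no master found, so collect every peer's election result.
    consents = sum(1 for p in peers if p.agrees)
    consents += 1  # assumption: the candidate also counts its own consent
    # S208: switch only if the consent count reaches the preset threshold.
    return "master" if consents >= threshold else "slave"
```

For a three-node group with a threshold of two, one consenting peer is enough for the first node to switch, while an existing master among the peers always keeps it a slave.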

In an embodiment of the present invention, the other node may reply to the election result according to the election policy after receiving the election request message, and may include one of the following methods:

If the election request message received by the other node is the first request message within the preset time period, the other node replies with a consent message; when the number of consent messages returned reaches a preset threshold, the first node switches to the master node.

The other node replies with a weight value according to the data information carried in the election request message and the election policy, where the data information includes the basis on which the other node determines the weight value, and the basis may be preset operation logic. When the sum of all the weight values received by the first node is the largest among all nodes, the first node switches to the master node.
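The weight-based manner can be illustrated with a small sketch. The text does not say how weight values are computed or how a node learns the totals of other candidates, so the function below simply assumes the replied weights for every candidate are already known to the caller; `weighted_winner` is a hypothetical name, and treating a tie as "no winner" is likewise an assumption.

```python
from typing import Dict, List, Optional


def weighted_winner(replies: Dict[str, List[int]]) -> Optional[str]:
    """Pick the candidate whose replied weight values sum to the largest total.

    `replies` maps a candidate name to the weight values it received.
    Returns None when there are no candidates or the top total is tied
    (tie handling is an assumption, not taken from the text).
    """
    totals = {candidate: sum(weights) for candidate, weights in replies.items()}
    if not totals:
        return None
    best = max(totals.values())
    leaders = [c for c, total in totals.items() if total == best]
    return leaders[0] if len(leaders) == 1 else None
```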

In the first manner described above, at initial startup or when the master node fails, a slave node in the replication group cannot connect to a master node, which triggers an election process that continues until the node either becomes the new master node or finds the master node. After the master node is elected, it is responsible for processing the data read and write requests of the application server. FIG. 3 is a second flowchart of a master node election method according to an embodiment of the present invention. As shown in FIG. 3, the method includes:

S302. Slave node A can be regarded as the first node in the present invention, and slave nodes B and C correspond to the other nodes mentioned above. In this embodiment, slave node A cannot connect to an existing master node, so it broadcasts an election request within the replication group, asking slave nodes B and C to agree to it becoming the new master node;

S304. Slave node B performs a similar operation, sending an election request to slave nodes A and C to request to become the master node;

S306. Within the preset time period T, slave node C first receives the election request of slave node A; since it cannot connect to a known master node itself, it agrees to the election request of slave node A;

S308. Slave node C receives the election request of slave node B; because it has already agreed to the election request of slave node A, slave node C replies with a rejection message, rejecting the election request of slave node B;

S310. Slave node B receives the election request of slave node A and rejects it, because it has already sent an election request of its own (that is, it has agreed to itself becoming the new master node);

S312. Slave node A receives the election request of slave node B and rejects it, because it has already sent an election request of its own (that is, it has agreed to itself becoming the new master node);

S314. With the preset threshold set to more than half, slave node A collects more than half of the election consent feedback, becomes the new master node, and completes the switch of its own service state;

S316. Slave node B fails to collect more than half of the election consent feedback, so its election fails and it re-executes the master node search process;

S318. Slave nodes B and C discover and connect to master node A, the initialization of the replication group is complete, and the election ends.

The election requests described in steps S302 and S304 have a timeout period t. If the sender does not receive a reply from a node within time t, that node is treated as having cast a rejection vote. Setting the timeout period t guards against requests or replies that are unreachable or lost for network reasons.
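The timeout rule amounts to counting a missing reply as a rejection. A minimal sketch, assuming the sender keeps a list of the peers it asked and a dictionary of the replies that arrived within time t (both names are hypothetical):

```python
from typing import Dict, Iterable


def count_consents(asked: Iterable[str], received: Dict[str, bool]) -> int:
    """Count consent replies among the peers that were asked.

    `received` holds only replies that arrived within the timeout t;
    a peer that is absent from it is treated as having rejected.
    """
    return sum(1 for peer in asked if received.get(peer, False))
```

If nodes B and C were asked and only C answered with consent before the timeout, the count is one, exactly as if B had replied with a rejection.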

As described in steps S306, S308, S310, and S312, when a slave node receives an election request sent by another node, it should reject the request if it is already connected to a master node; otherwise, it should ensure that within the timeout period t it agrees to the election request of only one slave node.
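The receiving side of this rule can be sketched as a small state machine. The `Voter` class and its fields are hypothetical; the current time is passed in explicitly so the behavior is easy to follow, and answering a repeated request from the same candidate with consent again is an assumption rather than something the text specifies.

```python
from typing import Optional


class Voter:
    """Hypothetical slave node that answers election requests."""

    def __init__(self, timeout_t: float) -> None:
        self.timeout_t = timeout_t
        self.connected_to_master = False
        self._granted_to: Optional[str] = None
        self._granted_at = float("-inf")

    def on_election_request(self, candidate: str, now: float) -> bool:
        """Return True for a consent message, False for a rejection message."""
        if self.connected_to_master:
            return False  # a master already exists: always reject
        if now - self._granted_at < self.timeout_t:
            # A consent is still in force within the period t; only the
            # same candidate is answered with consent again (assumption).
            return candidate == self._granted_to
        # First request in this period: consent and remember the candidate.
        self._granted_to = candidate
        self._granted_at = now
        return True
```

A node that has sent its own election request can be modeled by having it consent to itself first, after which requests from other candidates within t are rejected, matching steps S310 and S312.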

In actual situations, since the request transmission of each node is parallel, the specific timing may be different from the above steps, but the processing manner is the same. In the embodiment of the present invention, if more than half of the nodes in the replication group can communicate with each other, after one or more elections, one master node can always be selected to complete the initialization process of the replication group.

In an embodiment of the present invention, to avoid conflicts caused by multiple nodes initiating elections at the same time, a default election initiation priority may be configured for each slave node. For example, when the first node is pre-configured with the second priority, the election condition it must meet before initiating the election process is that the slave node with the first priority has already initiated the election process or cannot be connected.
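The election condition can be expressed as a simple check. In this sketch a smaller number means a higher priority, and the rule from the example is generalized to every higher-priority node; both choices are assumptions, and `may_initiate` is a hypothetical name.

```python
from typing import Dict, Set


def may_initiate(
    node: str,
    priorities: Dict[str, int],
    already_initiated: Set[str],
    unreachable: Set[str],
) -> bool:
    """Return True if `node` meets the election condition.

    Every node with a higher priority (smaller number, by assumption)
    must either have initiated an election already or be unreachable.
    """
    mine = priorities[node]
    return all(
        peer in already_initiated or peer in unreachable
        for peer, priority in priorities.items()
        if priority < mine
    )
```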

In an embodiment of the present invention, since the master and slave nodes in a replication group carry different loads, the nodes in a replication group may be designed as virtual nodes, and two or more replication groups may be deployed on one physical server, so that each physical server simultaneously acts as a master node or a slave node in these replication groups, which balances the load. Of course, the election processes of different replication groups are independent of each other. The number of nodes in a replication group is generally three or more.
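As an illustration of the load-balancing idea, consider a hypothetical layout in which three physical servers each host one virtual node of three replication groups. The names and the layout below are invented for the example; the master of each group is still decided by that group's own independent election.

```python
from collections import Counter
from typing import Dict

# Hypothetical outcome of three independent elections: each replication
# group has its master on a different physical server, and the other two
# servers hold that group's slave nodes.
master_of_group: Dict[str, str] = {
    "group-1": "server-1",
    "group-2": "server-2",
    "group-3": "server-3",
}


def masters_per_server(masters: Dict[str, str]) -> Counter:
    """Count how many replication groups each server is master for."""
    return Counter(masters.values())
```

With this layout every server carries one master role and two slave roles, so the heavier master load is spread evenly.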

The embodiment of the present invention further provides a master node election apparatus, which is disposed on the first node. FIG. 4 is a first structural block diagram of a master node election apparatus according to an embodiment of the present invention. As shown in FIG. 4, the apparatus includes: a connection module 402, configured to establish a connection with other nodes in the replication group; a query module 404, configured to determine whether a master node exists among the other nodes; an election request module 406, configured to send an election request message to the other nodes if no master node exists among them, where the election request message is used by the other nodes to reply with an election result according to an election policy; and a switching module 408, configured to determine, according to the election result, whether to switch the first node to the master node.

If the query module 404 finds during the determination that a master node already exists, the first node keeps its slave node identity. It can then start the real-time data synchronization process while monitoring the status of the master node, and the query module 404 re-executes the query function when the master node is found to have failed.

With the embodiment of the present invention, the connection module 402 in the first node establishes a connection with other nodes in the replication group, and the query module 404 determines whether a master node exists among the other nodes. If not, the election request module 406 sends an election request message to the other nodes. After receiving the election request message, the other nodes reply with an election result according to the election policy, and the switching module 408 determines, according to the election result, whether to switch to the master node. As a result, a replication group in the distributed storage system always has only one master node, which avoids data inconsistency between nodes.

FIG. 5 is a second structural block diagram of a master node election apparatus according to an embodiment of the present invention. In an embodiment of the present invention, as shown in FIG. 5, the switching module 508 includes:

The first triggering unit 5004 is configured to switch the first node to the master node if the number of consent messages reaches the preset threshold, where a consent message is replied by the other node if the election request message it received is the first request message within the preset time period;

The second triggering unit 5006 is configured to switch the first node to the master node when the sum of the weight values of the first node is the largest among all nodes, where a weight value is replied by the other node according to the data information carried in the election request message and the election policy;

If a master node exists among the other nodes, the other nodes reply with a rejection message.

In an embodiment of the present invention, as shown in FIG. 5, the election request module 506 includes:

The priority unit 5002 is configured to trigger the election request module 506 to send an election request message to other nodes when the priority of the first node meets the election condition, wherein the priority is preset.

The embodiment of the present invention further provides a storage system. FIG. 6 is a schematic diagram of a storage system application according to an embodiment of the present invention. As shown in FIG. 6, the storage system includes at least one replication group. A replication group consists of one master node and two or more slave nodes; the nodes in the group store the same data, and the availability and consistency of the stored data are guaranteed by node redundancy and related scheduling algorithms. The entire storage system consists of one or more replication groups and provides data storage services to the application server.

The first node in the embodiment of the present invention becomes a master node after being elected, and is a node that provides data read/write service in the replication group, and is responsible for processing the read and write request sent by the application server, and synchronizing the stored data to the slave node.

The other nodes in the embodiment of the present invention correspond to the slave nodes in FIG. 6. A slave node is a backup of the master node: it provides the same data access capability as the master node and synchronizes data from the master node to keep its data state consistent. Like the first node, when a slave node detects that the master node is unavailable, it can also initiate an election and take part in selecting the new master node. According to the election result, it either switches to the master node or starts data synchronization with the new master node.

The application server in Figure 6 is a node that deploys a user-specific application that uses the data read and write services provided by the storage system. Typically, the program reads and writes data through an interface library issued by the storage system, which is not itself part of the data storage system. To simplify the description, the present invention assumes that the interface library has the ability to distinguish between the master and slave nodes and automatically select the master node for data reading and writing. In fact, this feature can be implemented outside the system, ie by the user program.
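The master-selection capability assumed for the interface library can be sketched as follows. The `pick_master` function and the role strings are hypothetical; refusing to choose when zero or several masters are reported mirrors the behavior described in the background, where the application server stops writing on detecting multiple master nodes.

```python
from typing import Dict, Optional


def pick_master(roles: Dict[str, str]) -> Optional[str]:
    """Return the node to use for reads and writes, or None if unsafe.

    `roles` maps a node name to the role it reports ('master' or 'slave').
    Exactly one master must be visible; otherwise no node is selected.
    """
    masters = [node for node, role in roles.items() if role == "master"]
    return masters[0] if len(masters) == 1 else None
```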

The application server in the embodiment of the present invention is connected to all nodes in the replication group, and all nodes in the replication group are also connected to each other.

According to the embodiment of the present invention, the connection module of the first node in the storage system establishes a connection with other nodes in the replication group, and the query module determines whether a master node exists among the other nodes. If not, the election request module sends an election request message to the other nodes. After receiving the election request message, the other nodes reply with an election result according to the election policy, and the switching module determines, according to the election result, whether to switch to the master node. As a result, a replication group in the distributed storage system always has only one master node, which avoids data inconsistency between nodes.

It will be apparent to those skilled in the art that the modules or steps of the present invention described above can be implemented by a general-purpose computing device, and can be centralized on a single computing device or distributed across a network of multiple computing devices. Optionally, they may be implemented by program code executable by a computing device, so that they may be stored in a storage device and executed by the computing device; in some cases, the steps shown or described may be performed in an order different from that described herein. Alternatively, they may be fabricated separately as individual integrated circuit modules, or several of the modules or steps may be fabricated as a single integrated circuit module. Thus, the invention is not limited to any specific combination of hardware and software.

The above description is only the preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes can be made to the present invention. Any modifications, equivalent substitutions, improvements, etc. made within the spirit and scope of the present invention are intended to be included within the scope of the present invention.

Industrial applicability

According to the embodiments of the present invention, a first node in a replication group establishes a connection with other nodes in the replication group; the first node determines whether a master node exists among the other nodes; if not, the first node sends an election request message to the other nodes, where the election request message is used by the other nodes to reply with an election result according to an election policy; and the first node determines, according to the election result, whether to switch to the master node. As a result, a replication group in the distributed storage system always has only one master node, which avoids data inconsistency between nodes.

Claims

A method for electing a primary node includes: establishing a connection between a first node in a replication group and other nodes in a replication group;

Determining, by the first node, whether a primary node exists in the other nodes;

If not, the first node sends an election request message to the other node, where the election request message is used by the other node to reply to the election result according to the election policy;

The first node determines whether to switch to the primary node according to the election result.
The method of claim 1, wherein the other nodes replying with the election result according to the election policy includes at least one of the following:

If the election request message received by the other node is the first request message within a preset time period, the other node replies with a consent message, and if the number of consent messages reaches a preset threshold, the first node switches to the master node;

The other node replies with a weight value according to the data information carried in the election request message and the election policy, and if the sum of the weight values of the first node is the largest among all nodes, the first node switches to the master node;

If a master node exists among the other nodes, the other nodes reply with a rejection message.
The method of claim 1, wherein before the first node sends the election request message to the other nodes, the method includes:

If the priority of the first node meets an election condition, sending the election request message to the other nodes, where the priority is preset.
The method according to any one of claims 1 to 3, wherein the first node in the replication group establishes a connection with other nodes in the replication group, including:

The number of nodes in the replication group that establish a connection with the first node reaches a preset threshold.
The method according to claim 4, wherein there are at least two replication groups, all disposed on the same physical server.
The method of claim 5 wherein the number of nodes in the replication group is at least three.
A master node election apparatus, disposed on a first node, including:

a connection module, configured to establish a connection with other nodes in the replication group;

a query module, configured to determine whether a master node exists among the other nodes;

an election request module, configured to send an election request message to the other nodes if no master node exists among them, where the election request message is used by the other nodes to reply with an election result according to an election policy;

and a switching module, configured to determine, according to the election result, whether to switch the first node to the master node.
The apparatus of claim 7, wherein the switching module comprises at least one of:

a first triggering unit, configured to switch the first node to the master node if the number of consent messages reaches a preset threshold, where a consent message is replied by the other node if the election request message it received is the first request message within a preset time period;

a second triggering unit, configured to switch the first node to the master node when the sum of the weight values of the first node is the largest among all nodes, where a weight value is replied by the other node according to the data information carried in the election request message and the election policy;

where, if a master node exists among the other nodes, the other node replies with a rejection message.
The apparatus of claim 7, wherein the election request module comprises:

a priority unit, configured to trigger the election request module to send an election request message to the other nodes if the priority of the first node meets an election condition, where the priority is preset.
The apparatus according to any one of claims 7 to 9, wherein the first node in the replication group establishes a connection with other nodes in the replication group, including:

The number of nodes in the replication group that establish a connection with the first node reaches a preset threshold.
The apparatus according to claim 10, wherein there are at least two replication groups, all disposed on the same physical server.
The apparatus of claim 11 wherein the number of nodes in the replication group is at least three.
A storage system, the storage system comprising at least one replication group, the replication group comprising:

The first node, including:

a connection module, configured to establish a connection with other nodes in the replication group; a query module, configured to determine whether a master node exists among the other nodes; an election request module, configured to send an election request message to the other nodes if no master node exists among them, where the election request message is used by the other nodes to reply with an election result according to an election policy; and a switching module, configured to determine, according to the election result, whether to switch to the master node;

The other node is configured to reply to the election result according to the election request message and the election policy.