WO2016150066A1 - Master node election method and apparatus, and storage system - Google Patents

Master node election method and apparatus, and storage system

Publication number
WO2016150066A1
Authority
WO
WIPO (PCT)
Prior art keywords
node
election
nodes
request message
replication group
Prior art date
Application number
PCT/CN2015/086169
Other languages
French (fr)
Chinese (zh)
Inventor
陈正华
郭斌
陈典强
韩银俊
Original Assignee
中兴通讯股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司
Publication of WO2016150066A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00: Digital computers in general; Data processing equipment in general
    • G06F15/16: Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs

Definitions

  • The present invention relates to the field of distributed storage, and in particular to a master node election method, an apparatus, and a storage system.
  • Cloud computing is the product of the development and convergence of traditional computer and network technologies such as grid computing, distributed computing, parallel computing, utility computing, network storage, virtualization, and load balancing. It aims to integrate multiple relatively low-cost computing entities, over a network, into one system with powerful computing capability.
  • Distributed storage is one area within cloud computing. Its role is to provide distributed data storage services for massive data sets together with high-speed read and write access.
  • Nodes in a distributed storage system are stateful: the data stored on each node may differ, so nodes cannot simply be substituted for one another. Disaster recovery is therefore more complicated.
  • In a distributed storage system with a master-slave architecture, the master node usually provides the read and write services and synchronizes data to the slave nodes in real time. When the master node fails, reads and writes switch to a slave node, which achieves disaster recovery.
  • The switchover must never produce two or more master nodes, to avoid data consistency problems. It should also complete as quickly as possible to reduce system downtime.
  • At traditional system scales and in traditional network environments, this problem is relatively easy to solve.
  • Usually the system sits inside a single switch, that is, the topology is a star network and each node has only one network exit. If the master node suffers a network failure, neither the slave nodes nor the application server can connect to it. It is therefore enough for a slave node to monitor the status of the master node; when the master node becomes unavailable, the slave node automatically switches state and takes over the read and write requests.
  • Correspondingly, the application server can also detect the master node failure and redirect its reads and writes to the slave node.
  • In a distributed storage system, the system is usually much larger. Multiple writers may access the data storage system through different networks, the reliability of the network connections is much lower, and network partitions may occur.
  • As shown in FIG. 1, when a slave node detects that the master node is unreachable, the master node may still be working normally. If the slave node then automatically switches to the active state, two or more master nodes result. When the application server detects multiple master nodes it stops writing, so the related service is interrupted, or data consistency between the nodes is compromised.
  • To address the difficulty of guaranteeing data consistency between nodes in the related art, the present invention provides a master node election method, an apparatus, and a storage system that solve at least the above problem.
  • According to one aspect of the embodiments of the present invention, a master node election method is provided, including: a first node in a replication group establishes connections with the other nodes in the replication group; the first node determines whether a master node exists among the other nodes; if not, the first node sends an election request message to the other nodes, to which the other nodes reply with an election result according to an election policy; and the first node determines, according to the election result, whether to switch to the master node.
  • Optionally, the other nodes reply with the election result according to the election policy in at least one of the following ways: if the election request message received by another node is the first request message within a preset time period, that node replies with a consent message, and the first node switches to the master node when the number of consent messages reaches a preset threshold; the other node replies with a weight value according to the data information carried in the election request message and the election policy, and the first node switches to the master node when the sum of its weight values is the largest among all nodes; if a master node exists among the other nodes, the other nodes reply with a rejection message.
  • Optionally, before the first node sends the election request message to the other nodes, the method includes: sending the election request message to the other nodes only if the priority of the first node meets an election condition, where the priority is preset.
  • Optionally, the first node establishing connections with the other nodes in the replication group includes: the number of nodes in the replication group that have established a connection with the first node reaches a preset threshold.
  • Optionally, there are at least two replication groups, all deployed on the same physical server.
  • Optionally, the replication group contains at least three nodes.
  • According to another aspect of the embodiments of the present invention, a master node election apparatus is provided. It is disposed on the first node and includes: a connection module configured to establish connections with the other nodes in the replication group; a query module configured to determine whether a master node exists among the other nodes; an election request module configured to send an election request message to the other nodes if no master node exists among them, to which the other nodes reply with an election result according to an election policy; and a switching module configured to determine, according to the election result, whether to switch the first node to the master node.
  • Optionally, the switching module includes at least one of the following: a first trigger unit configured to switch the first node to the master node when the number of consent messages reaches a preset threshold, where a consent message is replied by another node when the election request message it received is the first request message within a preset time period; and a second trigger unit configured to switch the first node to the master node when the sum of the first node's weight values is the largest among all nodes, where the weight values are replied by the other nodes according to the data information carried in the election request message and the election policy. If a master node exists among the other nodes, the other nodes reply with a rejection message.
  • Optionally, the election request module includes: a priority unit configured to trigger the election request module to send the election request message to the other nodes if the priority of the first node meets an election condition, where the priority is preset.
  • Optionally, the first node establishing connections with the other nodes in the replication group includes: the number of nodes in the replication group that have established a connection with the first node reaches a preset threshold.
  • Optionally, there are at least two replication groups, all deployed on the same physical server.
  • Optionally, the replication group contains at least three nodes.
  • a storage system where the storage system includes at least one replication group, and the replication group includes:
  • the first node including:
  • connection module configured to establish a connection with other nodes in the replication group;
  • query module configured to determine whether the other node has a primary node;
  • election request module configured to not have a primary node in the other node, Sending an election request message to the other node, where the election request message is used by the other node to reply to the election result according to the election policy; and the switching module is configured to determine whether to switch to the primary node according to the election result;
  • the other node is configured to reply to the election result according to the election request message and the election policy.
  • In the present invention, the first node in the replication group establishes connections with the other nodes in the replication group; the first node determines whether a master node exists among the other nodes; if not, the first node sends an election request message to the other nodes, to which the other nodes reply with an election result according to an election policy; and the first node determines, according to the election result, whether to switch to the master node.
  • As a result, a replication group in the distributed storage system always has only one master node at any time, which avoids data inconsistency between nodes.
  • FIG. 1 is a schematic diagram of a network partition in a distributed system according to an embodiment of the present invention.
  • FIG. 2 is a first flowchart of a master node election method according to an embodiment of the present invention.
  • FIG. 3 is a second flowchart of a master node election method according to an embodiment of the present invention.
  • FIG. 4 is a first structural block diagram of a master node election apparatus according to an embodiment of the present invention.
  • FIG. 5 is a second structural block diagram of a master node election apparatus according to an embodiment of the present invention.
  • FIG. 6 is a schematic diagram of a storage system application according to an embodiment of the present invention.
  • FIG. 2 is a first flowchart of a master node election method according to an embodiment of the present invention. As shown in FIG. 2, the method includes:
  • The first node in the replication group establishes connections with the other nodes in the replication group. The nodes in the replication group are all connected to one another and store the same data; the replication group guarantees the availability and consistency of the stored data through node redundancy and the associated scheduling algorithms.
  • A distributed storage system can be composed of one or more such replication groups and provides a data storage service to the application server.
  • The first node determines whether a master node exists among the other nodes in the replication group; it may do so by polling the identities of all nodes.
  • The first node sends an election request message to the other nodes in the replication group; after receiving it, the other nodes reply with an election result according to the corresponding election policy.
  • The first node determines, according to the election results replied by the other nodes, whether it switches to the master node.
  • Step S204 is re-executed when the master node is found to have failed.
  • In summary, the first node in the replication group establishes connections with the other nodes in the replication group and determines whether a master node exists among them. If not, the first node sends an election request message to the other nodes. After receiving the election request message, the other nodes reply with an election result according to the election policy, and the first node determines, according to the election result, whether to switch to the master node.
  • A replication group in the distributed storage system therefore always has only one master node at any time, which avoids data inconsistency between the nodes.
  • After receiving the election request message, the other nodes may reply with the election result according to the election policy in one of the following ways:
  • If the election request message received by another node is the first request message within the preset time period, that node replies with a consent message; when the number of consent messages returned reaches a preset threshold, the first node switches to the master node.
  • The other node returns a weight value according to the data information carried in the election request message and the election policy, where the data information includes the basis on which the other node determines the weight value, and that basis may be preset operation logic; when the sum of the first node's weight values is the largest among all nodes, the first node switches to the master node.
  • FIG. 3 is a second flowchart of a master node election method according to an embodiment of the present invention. As shown in FIG. 3, the method includes:
  • Slave node A can be regarded as the first node of the present invention, and slave nodes B and C correspond to the other nodes mentioned above.
  • Slave node A cannot connect to an existing master node in the replication group. It broadcasts an election request within the group, asking nodes B and C to agree to its becoming the new master node;
  • slave node C receives the election request from node A first and, being unable to connect to a known master node itself, agrees to the election request from node A;
  • node B receives the election request from node A and rejects it, because node B has already sent an election request of its own (that is, it has already agreed to itself becoming the new master node);
  • node A receives the election request from node B and rejects it, because node A has already sent an election request of its own (that is, it has already agreed to itself becoming the new master node);
  • node A collects consent from more than half of the votes, becomes the new master node, and completes the switch of its own service state;
  • node B fails to collect consent from more than half of the votes, so its election fails and it re-executes the master node search process.
  • The election requests described in steps S302 and S304 have a timeout period t. If the sender receives no reply from a node within time t, that node is treated as having rejected the request. Setting the timeout period t guards against requests or replies that are unreachable or lost because of network problems.
  • In steps S306, S308, S310, and S312, when a slave node receives an election request sent by another node, it should reject the request if it is already connected to a master node; otherwise it must ensure that, within the timeout period t, it agrees to an election request from only one slave node.
  • The specific timing may differ from the steps above, but the processing is the same.
  • One master node can always be selected to complete the initialization process of the replication group.
  • A default election initiation priority may be configured for the slave nodes. For example, when the first node is preconfigured with the second priority, the election condition it must meet before initiating the election process is: the slave node with the first priority has already initiated the election process or cannot be connected.
  • The nodes in the replication group may be virtual, so that two or more replication groups can be deployed on one physical server and each physical server simultaneously acts as a master node or a slave node in these replication groups, which balances the load.
  • The election processes of different replication groups are independent of each other.
  • The number of nodes in the replication group is generally three or more.
  • FIG. 4 is a first structural block diagram of a master node election apparatus according to an embodiment of the present invention.
  • The apparatus includes: a connection module 402 configured to establish connections with the other nodes in the replication group; a query module 404 configured to determine whether a master node exists among the other nodes; an election request module 406 configured to send an election request message to the other nodes if no master node exists among them, to which the other nodes reply with an election result according to an election policy; and a switching module 408 configured to determine, according to the election result, whether to switch the first node to the master node.
  • If the query module 404 finds during the determination that a master node already exists, the first node keeps its slave node identity. It can then start the real-time data synchronization process and monitor the status of the master node at the same time; when the master node is found to have failed, the query module 404 re-executes the query function.
  • In summary, the connection module 402 in the first node establishes connections with the other nodes in the replication group, and the query module 404 determines whether a master node exists among them. If not, the election request module 406 sends an election request message to the other nodes. After receiving the election request message, the other nodes reply with an election result according to the election policy, and the switching module 408 determines, according to the election result, whether to switch to the master node.
  • A replication group in the distributed storage system therefore always has only one master node at any time, which avoids data inconsistency between the nodes.
  • FIG. 5 is a second structural block diagram of a master node election apparatus according to an embodiment of the present invention.
  • The switching module 508 includes:
  • a first trigger unit 5004 configured to switch the first node to the master node when the number of consent messages reaches the preset threshold, where a consent message is replied by another node when the election request message it received is the first request message within the preset time period;
  • a second trigger unit 5006 configured to switch the first node to the master node when the sum of the first node's weight values is the largest among all nodes, where the weight values are replied by the other nodes according to the data information carried in the election request message and the election policy;
  • if a master node exists among the other nodes, the other nodes reply with a rejection message.
  • The election request module 506 includes:
  • a priority unit 5002 configured to trigger the election request module 506 to send the election request message to the other nodes when the priority of the first node meets the election condition, where the priority is preset.
  • FIG. 6 is a schematic diagram of a storage system application according to an embodiment of the present invention.
  • The storage system includes at least one replication group, which consists of one master node and two or more slave nodes. The nodes in the group store the same data, and the availability and consistency of the stored data are guaranteed by node redundancy and the associated scheduling algorithms.
  • The entire storage system consists of one or more replication groups that provide data storage services to the application server.
  • The first node in the embodiment of the present invention becomes the master node after being elected. It is the node that provides the data read and write service in the replication group: it processes the read and write requests sent by the application server and synchronizes the stored data to the slave nodes.
  • The other nodes in the embodiment of the present invention correspond to the slave nodes in FIG. 6. A slave node is a backup of the master node: it provides the same data access capability as the master node and synchronizes data from the master node to keep its data state consistent.
  • Like the first node, these slave nodes can also initiate an election when they detect that the master node is unavailable, and take part in selecting the new master node. Depending on the election result, a slave node either switches to the master node or starts synchronizing data with the new master node.
  • The application server in FIG. 6 is a node on which a user-specific application is deployed; the application uses the data read and write services provided by the storage system.
  • The application reads and writes data through an interface library released with the storage system; the application itself is not part of the data storage system.
  • The interface library can distinguish between the master and slave nodes and automatically selects the master node for data reads and writes. This capability can also be implemented outside the system, that is, by the user program.
  • The application server in the embodiment of the present invention is connected to all nodes in the replication group, and all nodes in the replication group are also connected to one another.
  • In summary, the connection module of the first node in the storage system establishes connections with the other nodes in the replication group, and the query module determines whether a master node exists among them. If not, the election request module sends an election request message to the other nodes. After receiving the election request message, the other nodes reply with an election result according to the election policy, and the switching module determines, according to the election result, whether to switch to the master node.
  • A replication group in the distributed storage system therefore always has only one master node at any time, which avoids data inconsistency between the nodes.
  • The modules or steps of the present invention described above can be implemented by general-purpose computing devices; they can be concentrated on a single computing device or distributed across a network of multiple computing devices. Alternatively, they can be implemented by program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device. In some cases the steps shown or described can be performed in an order different from the one given here, or the modules or steps can each be made into individual integrated circuit modules, or several of them can be made into a single integrated circuit module.
  • The invention is not limited to any specific combination of hardware and software.
  • In the present invention, the first node in the replication group establishes connections with the other nodes in the replication group; the first node determines whether a master node exists among the other nodes; if not, the first node sends an election request message to the other nodes, to which the other nodes reply with an election result according to an election policy; and the first node determines, according to the election result, whether to switch to the master node.
  • A replication group in the distributed storage system therefore always has only one master node at any time, which avoids data inconsistency between the nodes.
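The three-node exchange of FIG. 3 (slave nodes A, B and C) and the timeout rule can be sketched in Python as follows. This is an illustrative sketch only: the function name `tally` and the reply encoding are hypothetical, and counting the candidate's own vote toward "more than half" is an assumption, since the text does not say how the majority is counted.

```python
from typing import Dict, Optional


def tally(replies: Dict[str, Optional[bool]], group_size: int) -> bool:
    """Decide whether a candidate wins one election round.

    replies maps each peer name to True (consent), False (reject) or
    None (no reply within the timeout period t). A missing reply is
    treated exactly like a rejection.
    """
    consents = sum(1 for reply in replies.values() if reply is True)
    # Assumption: the candidate counts its own vote, so it needs
    # (consents + 1) to exceed half of the replication group.
    return (consents + 1) * 2 > group_size


# FIG. 3 scenario: C consents to A first, A and B reject each other.
a_wins = tally({"B": False, "C": True}, group_size=3)
b_wins = tally({"A": False, "C": False}, group_size=3)
```

With these inputs node A wins and node B does not, which matches the outcome described for FIG. 3; if node C's reply to A were lost (`None`), A would not win either and would have to search for a master node again.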

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a master node election method and apparatus, and a storage system. In the master node election method, a first node in a replication group establishes connections with the other nodes in the replication group and determines whether a master node exists among them. If not, the first node sends an election request message to the other nodes; after receiving it, the other nodes reply with an election result according to an election policy, and the first node determines, according to the election result, whether to switch to the master node. A replication group in a distributed storage system therefore always has only one master node, which avoids data inconsistency between nodes.

Description

Master node election method, apparatus, and storage system
Technical field
The present invention relates to the field of distributed storage, and in particular to a master node election method, an apparatus, and a storage system.
Background
Cloud computing is the product of the development and convergence of traditional computer and network technologies such as grid computing, distributed computing, parallel computing, utility computing, network storage, virtualization, and load balancing. It aims to integrate multiple relatively low-cost computing entities, over a network, into one system with powerful computing capability. Distributed storage is one area within cloud computing. Its role is to provide distributed data storage services for massive data sets together with high-speed read and write access.
Nodes in a distributed storage system are stateful: the data stored on each node may differ, so nodes cannot simply be substituted for one another. Disaster recovery is therefore more complicated. In a distributed storage system with a master-slave architecture, the master node usually provides the read and write services and synchronizes data to the slave nodes in real time. When the master node fails, reads and writes switch to a slave node, which achieves disaster recovery. The switchover must never produce two or more master nodes, to avoid data consistency problems. It should also complete as quickly as possible to reduce system downtime.
At traditional system scales and in traditional network environments, this problem is relatively easy to solve. Usually the system sits inside a single switch, that is, the topology is a star network and each node has only one network exit. If the master node suffers a network failure, neither the slave nodes nor the application server can connect to it. It is therefore enough for a slave node to monitor the status of the master node; when the master node becomes unavailable, the slave node automatically switches state and takes over the read and write requests. Correspondingly, the application server can also detect the master node failure and redirect its reads and writes to the slave node.
In a distributed storage system, the system is usually much larger. Multiple writers may access the data storage system through different networks, the reliability of the network connections is much lower, and network partitions may occur. As shown in FIG. 1, when a slave node detects that the master node is unreachable, the master node may still be working normally. If the slave node then automatically switches to the active state, two or more master nodes result. When the application server detects multiple master nodes it stops writing, so the related service is interrupted, or data consistency between the nodes is compromised.
Summary of the invention
To address the difficulty of guaranteeing data consistency between nodes in the related art, the present invention provides a master node election method, an apparatus, and a storage system that solve at least the above problem.
According to one aspect of the embodiments of the present invention, a master node election method is provided, including: a first node in a replication group establishes connections with the other nodes in the replication group; the first node determines whether a master node exists among the other nodes; if not, the first node sends an election request message to the other nodes, to which the other nodes reply with an election result according to an election policy; and the first node determines, according to the election result, whether to switch to the master node.
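The four steps of the method can be sketched from the first node's point of view. This is a minimal Python sketch under stated assumptions, not the patented implementation: `Peer`, `Role`, `run_election` and `consent_threshold` are hypothetical names, each peer's answer is modelled as a precomputed flag rather than a network reply, and the threshold is assumed to count peer consents only.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence


class Role(Enum):
    SLAVE = "slave"
    MASTER = "master"


@dataclass
class Peer:
    """Hypothetical view of another node the first node has connected to."""
    name: str
    is_master: bool    # what the peer reports about its own role
    grants_vote: bool  # how the peer would answer an election request


def run_election(peers: Sequence[Peer], consent_threshold: int) -> Role:
    """Return the role the first node takes after one election round."""
    # Determine whether a master node exists among the other nodes.
    if any(peer.is_master for peer in peers):
        return Role.SLAVE
    # No master found: send the election request and count the consents.
    consents = sum(1 for peer in peers if peer.grants_vote)
    # Switch to master only if the election result allows it.
    return Role.MASTER if consents >= consent_threshold else Role.SLAVE


node_b = Peer("B", is_master=False, grants_vote=False)
node_c = Peer("C", is_master=False, grants_vote=True)
result = run_election([node_b, node_c], consent_threshold=1)
```

A real node would replace the flags with request and reply messages over its connections; the point of the sketch is only the order of the checks.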
Optionally, the other nodes reply with the election result according to the election policy in at least one of the following ways: if the election request message received by another node is the first request message within a preset time period, that node replies with a consent message, and the first node switches to the master node when the number of consent messages reaches a preset threshold; the other node replies with a weight value according to the data information carried in the election request message and the election policy, and the first node switches to the master node when the sum of its weight values is the largest among all nodes; if a master node exists among the other nodes, the other nodes reply with a rejection message.
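The consent and rejection rules on the receiving side can be sketched as a small Python class. The class name `Voter`, its parameters, and the `"consent"`/`"reject"` strings are hypothetical; the sketch also assumes that a repeated request inside the time window is rejected regardless of who sent it, which the text does not spell out.

```python
import time
from typing import Optional


class Voter:
    """Reply policy of a node that receives election requests."""

    def __init__(self, window_seconds: float, knows_master: bool = False):
        self.window_seconds = window_seconds  # the preset time period
        self.knows_master = knows_master      # True if a master node is known
        self._granted_at: Optional[float] = None  # time of the last consent

    def handle_request(self, candidate: str, now: Optional[float] = None) -> str:
        now = time.monotonic() if now is None else now
        # A node that already knows a master rejects every request.
        if self.knows_master:
            return "reject"
        # Only the first request within the preset time period gets consent.
        in_window = (
            self._granted_at is not None
            and now - self._granted_at < self.window_seconds
        )
        if in_window:
            return "reject"
        self._granted_at = now
        return "consent"


voter = Voter(window_seconds=5.0)
first = voter.handle_request("A", now=0.0)
second = voter.handle_request("B", now=1.0)
later = voter.handle_request("B", now=6.0)
```

Passing `now` explicitly keeps the example deterministic; in a running node the monotonic clock default would be used instead.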
Optionally, before the first node sends the election request message to the other nodes, the method includes: sending the election request message to the other nodes only if the priority of the first node meets an election condition, where the priority is preset.
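One way to express the priority check is sketched below in Python. The names `PeerState` and `may_initiate` are hypothetical. The condition used, that a node with higher priority has already initiated an election or cannot be connected, comes from the second-priority example given elsewhere in this document; extending it to every higher priority level, and using smaller numbers for higher priority, are assumptions of this sketch.

```python
from enum import Enum
from typing import Dict


class PeerState(Enum):
    IDLE = "idle"                # reachable, has not started an election
    ELECTING = "electing"        # reachable, has already started an election
    UNREACHABLE = "unreachable"  # cannot be connected


def may_initiate(own_priority: int, states_by_priority: Dict[int, PeerState]) -> bool:
    """Return True if no higher-priority node is still idle.

    states_by_priority maps a preset priority (1 is highest) to the
    observed state of the node holding that priority.
    """
    return all(
        state is not PeerState.IDLE
        for priority, state in states_by_priority.items()
        if priority < own_priority
    )


blocked = may_initiate(2, {1: PeerState.IDLE})
allowed = may_initiate(2, {1: PeerState.UNREACHABLE})
```

Here a second-priority node holds back while the first-priority node is reachable and idle, and goes ahead once that node is unreachable or already electing.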
可选地,所述复制组中的第一节点与复制组中其他节点建立连接,包括:所述复制组中与所述第一节点建立连接的节点数量达到预设阈值。Optionally, the first node in the replication group establishes a connection with other nodes in the replication group, and the number of nodes in the replication group that establish a connection with the first node reaches a preset threshold.
可选地,所述复制组为至少两个,且均设置在同一个物理服务器上。Optionally, the replication groups are at least two and are all disposed on the same physical server.
可选地,所述复制组中的节点数量为至少三个。Optionally, the number of nodes in the replication group is at least three.
According to another aspect of the embodiments of the present invention, a master node election apparatus is provided, which is disposed on the first node and includes: a connection module, configured to establish connections with the other nodes in the replication group; a query module, configured to determine whether a master node exists among the other nodes; an election request module, configured to send an election request message to the other nodes when no master node exists among them, where the election request message is used by the other nodes to reply with an election result according to an election policy; and a switching module, configured to determine, according to the election result, whether to switch the first node to being the master node.
Optionally, the switching module includes at least one of the following: a first trigger unit, configured to switch the first node to being the master node when the number of agreement messages reaches a preset threshold, where an agreement message is replied by one of the other nodes when the election request message it receives is the first request message within a preset time period; a second trigger unit, configured to switch the first node to being the master node when the sum of the weight values of the first node is the largest among all nodes, where the weight values are replied by the other nodes according to the data information carried in the election request message and the election policy; where, when a master node exists among the other nodes, the other nodes reply with a rejection message.
Optionally, the election request module includes: a priority unit, configured to trigger the election request module to send the election request message to the other nodes when the priority of the first node satisfies an election condition, where the priority is preset.
Optionally, the first node in the replication group establishing connections with the other nodes in the replication group includes: the number of nodes in the replication group that have established a connection with the first node reaches a preset threshold.
Optionally, there are at least two replication groups, and they are all deployed on the same physical server.
Optionally, the number of nodes in the replication group is at least three.
According to yet another aspect of the embodiments of the present invention, a storage system is provided. The storage system includes at least one replication group, and the replication group includes:
a first node, including:
a connection module, configured to establish connections with the other nodes in the replication group; a query module, configured to determine whether a master node exists among the other nodes; an election request module, configured to send an election request message to the other nodes when no master node exists among them, where the election request message is used by the other nodes to reply with an election result according to an election policy; and a switching module, configured to determine, according to the election result, whether to switch to being the master node;
other nodes, configured to reply with an election result according to the election request message and the election policy.
In the embodiments of the present invention, a first node in a replication group establishes connections with the other nodes in the replication group; the first node determines whether a master node exists among the other nodes; if no master node exists, the first node sends an election request message to the other nodes, where the election request message is used by the other nodes to reply with an election result according to an election policy; and the first node determines, according to the election result, whether to switch to being the master node. In this way, a replication group in the distributed storage system always has exactly one master node, which avoids data consistency problems between nodes.
DRAWINGS
The drawings described here provide a further understanding of the present invention and form a part of this application. The illustrative embodiments of the present invention and their descriptions are used to explain the invention and do not unduly limit it. In the drawings:
FIG. 1 is a schematic diagram of a network partition in a distributed system according to an embodiment of the present invention;
FIG. 2 is a first flowchart of a master node election method according to an embodiment of the present invention;
FIG. 3 is a second flowchart of a master node election method according to an embodiment of the present invention;
FIG. 4 is a first structural block diagram of a master node election apparatus according to an embodiment of the present invention;
FIG. 5 is a second structural block diagram of a master node election apparatus according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a storage system application according to an embodiment of the present invention.
DETAILED DESCRIPTION
The present invention is described in detail below with reference to the drawings and in conjunction with the embodiments. It should be noted that, provided they do not conflict, the embodiments in this application and the features in those embodiments may be combined with one another.
An embodiment of the present invention provides a master node election method. FIG. 2 is a first flowchart of the master node election method according to an embodiment of the present invention. As shown in FIG. 2, the method includes:
S202: A first node in a replication group establishes connections with the other nodes in the replication group. The nodes in the replication group are connected to one another and store the same data, and the replication group guarantees the availability and consistency of the stored data through node redundancy and related scheduling algorithms. A distributed storage system may consist of one or more such replication groups and provides data storage services to application servers.
S204: The first node determines whether a master node exists among the other nodes in the replication group. One possible approach is to poll the identity of every node, and thereby find out whether a master node exists.
S206: If no master node exists, the first node sends an election request message to the other nodes in the replication group. After receiving the election request message, each of the other nodes replies with its election result according to the corresponding election policy.
S208: The first node determines, according to the election results replied by the other nodes, whether it switches to being the master node.
If the first node finds during the determination that a master node already exists, it keeps its slave node identity. It may then start the real-time data synchronization process while monitoring the state of the master node, and re-execute step S204 when it finds that the master node has failed.
In this embodiment of the present invention, the first node in the replication group establishes connections with the other nodes in the replication group and determines whether a master node exists among them. If no master node exists, the first node sends an election request message to the other nodes, which reply with election results according to the election policy, and the first node determines from the election results whether to switch to being the master node. As a result, a replication group in the distributed storage system always has exactly one master node, which avoids data consistency problems between nodes.
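Steps S202 to S208 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation described in the application: the names `Node`, `Role`, `elect`, and `on_election_request` are hypothetical, the nodes live in one process instead of communicating over a network, and the threshold is assumed to be a majority of the whole group with the candidate's own vote counted.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Role(Enum):
    SLAVE = "slave"
    MASTER = "master"


@dataclass(eq=False)  # identity comparison; nodes reference each other
class Node:
    name: str
    role: Role = Role.SLAVE
    peers: List["Node"] = field(default_factory=list)
    agreed_to: Optional[str] = None  # candidate this node has agreed to

    def connect(self, group: List["Node"]) -> None:
        # S202: connect to every other node in the replication group.
        self.peers = [n for n in group if n is not self]

    def find_master(self) -> Optional["Node"]:
        # S204: poll the peers' identities and look for a master.
        return next((p for p in self.peers if p.role is Role.MASTER), None)

    def on_election_request(self, candidate: "Node") -> bool:
        # Reply side: reject if this node is the master or has already
        # agreed to a different candidate; otherwise agree.
        if self.role is Role.MASTER or self.agreed_to not in (None, candidate.name):
            return False
        self.agreed_to = candidate.name
        return True

    def elect(self) -> Optional["Node"]:
        master = self.find_master()
        if master is not None:
            return master  # a master exists: stay a slave
        # S206: send the election request to all peers and collect replies.
        agreements = sum(p.on_election_request(self) for p in self.peers)
        # S208: assumed threshold is a majority of the group, own vote included.
        if agreements + 1 > (len(self.peers) + 1) / 2:
            self.role = Role.MASTER
            return self
        return None  # election failed; the caller may retry from S204
```

In a three-node group, the first node to call `elect()` becomes the master, and a later call on another node simply returns that master and leaves the caller as a slave.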
In one implementation of the present invention, the other nodes may reply with election results according to the election policy, after receiving the election request message, in one of the following ways:
When the election request message received by one of the other nodes is the first request message within a preset time period, that node replies with an agreement message. When the number of agreement messages replied reaches a preset threshold, the first node switches to being the master node;
The other nodes reply with weight values according to the data information carried in the election request message and the election policy. The data information contains the basis on which the other nodes compute the weight values they reply with, and this basis may be preset operation logic. When the sum of all the weight values received by the first node is the largest among all nodes, the first node switches to being the master node.
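The two decision rules above can be written as small Python functions. This is a hedged sketch: the function names are hypothetical, `weight_for` assumes that the data information is a `data_version` number (the application does not specify the weighting logic), and requiring a unique maximum in `wins_by_weight` is an added assumption for breaking ties.

```python
from typing import Dict, List, Optional


def wins_by_agreement(replies: List[Optional[bool]], threshold: int) -> bool:
    """First way: count agreement messages against a preset threshold.

    A reply is True (agree), False (reject), or None (no reply).
    """
    agreements = sum(1 for r in replies if r is True)
    return agreements >= threshold


def weight_for(request: Dict[str, int]) -> int:
    """Hypothetical weighting logic: prefer candidates with newer data."""
    return request["data_version"]


def wins_by_weight(weights: Dict[str, List[int]], candidate: str) -> bool:
    """Second way: the candidate wins if its weight sum is the largest.

    `weights` maps each candidate name to the weight values it received.
    A tie for the largest sum elects nobody (assumption).
    """
    totals = {name: sum(ws) for name, ws in weights.items()}
    best = max(totals.values())
    leaders = [name for name, total in totals.items() if total == best]
    return leaders == [candidate]
```

For example, if two peers each weigh candidate A at 7 and candidate B at 5, `wins_by_weight` selects A and not B.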
Taking the first way above as an example: at initial startup, or when the master node fails, a slave node in the replication group cannot connect to a master node. This triggers an election process that continues until the slave node itself becomes the new master node or finds a master node. Once a master node has been elected, it is responsible for handling the data read and write requests of the application server. FIG. 3 is a second flowchart of the master node election method according to an embodiment of the present invention. As shown in FIG. 3, the method includes:
S302: Slave node A can be regarded as the first node of the present invention, and slave nodes B and C correspond to the other nodes described above. In this embodiment, because slave node A cannot connect to an existing master node, it broadcasts an election request within the replication group, asking slave nodes B and C to agree to it becoming the new master node;
S304: Slave node B performs a similar operation, sending an election request to slave nodes A and C and asking to become the master node;
S306: Within the preset time period T, slave node C receives the election request of slave node A first. Because slave node C itself cannot connect to a known master node, it agrees to the election request of slave node A;
S308: Slave node C receives the election request of slave node B. Because it has already agreed to the election request of slave node A, slave node C replies with a rejection message, rejecting the election request of slave node B;
S310: Slave node B receives the election request of node A. Because it has already sent an election request itself (that is, it has agreed to itself becoming the new master node), it rejects the election request of slave node A;
S312: Slave node A receives the election request of node B. Because it has already sent an election request itself (that is, it has agreed to itself becoming the new master node), it rejects the election request of slave node B;
S314: Assuming that the preset threshold is more than half, slave node A collects agreement from more than half of the nodes, becomes the new master node, and completes the switch of its own service state;
S316: Slave node B fails to collect agreement from more than half of the nodes, so its election fails and it re-executes the master node search process;
S318: Slave nodes B and C discover and connect to master node A, the initialization of the replication group is completed, and the election ends.
The election requests described in steps S302 and S304 have a timeout t. If the sender does not receive a reply from a node within the time t, this is treated as that node casting a rejection vote. Setting the timeout t handles requests that cannot be delivered, or replies that are lost, because of network problems.
As described in steps S306, S308, S310, and S312, when a slave node receives an election request sent by another node, it should reject the request if it is already connected to a master node; otherwise, it should ensure that it agrees to the election request of only one slave node within the timeout t.
In practice, because the nodes send their requests in parallel, the actual ordering may differ from the steps above, but the handling is the same. In this embodiment of the present invention, if more than half of the nodes in the replication group can communicate with one another, one or more rounds of election will always select a master node and complete the initialization of the replication group.
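The voting rule and the timeout handling described above can be sketched as follows. This is an illustrative sketch under assumptions: `Voter` and `elected` are hypothetical names, time is passed in explicitly so the example stays deterministic, a reply of `None` stands for a reply that did not arrive within the timeout t, and the candidate's own vote is counted toward the majority.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Voter:
    window: float                      # period in which only one candidate is agreed to
    master: Optional[str] = None       # known master, if any
    granted_to: Optional[str] = None   # candidate agreed to in the current window
    granted_at: float = float("-inf")  # time of that agreement

    def on_request(self, candidate: str, now: float) -> bool:
        # Already connected to a master: always reject.
        if self.master is not None:
            return False
        # Inside the window: agree only with the candidate already chosen.
        if now - self.granted_at < self.window:
            return self.granted_to == candidate
        # First request of a new window: agree and remember the choice.
        self.granted_to, self.granted_at = candidate, now
        return True


def elected(replies: Dict[str, Optional[bool]], group_size: int) -> bool:
    """True if the candidate collected agreement from more than half the group.

    A missing reply (None, timed out) counts as a rejection; the
    candidate's own vote is counted as one agreement (assumption).
    """
    agreements = 1 + sum(1 for r in replies.values() if r is True)
    return agreements > group_size / 2
```

Replaying steps S306 to S316 with a three-node group: node C agrees to A and rejects B, so `elected({"B": False, "C": True}, 3)` holds for A, while `elected({"A": False, "C": False}, 3)` does not hold for B.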
In one implementation of the present invention, to avoid conflicts caused by several nodes initiating elections at the same time, a default election initiation priority may be configured for each slave node. For example, if the first node is preconfigured with the second priority, the election condition it must satisfy before initiating an election is that the slave node with the first priority has already initiated an election or cannot be connected to.
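The priority condition can be expressed as a small predicate. This is a sketch under assumptions: the status labels and the function `may_initiate` are hypothetical, a lower number is assumed to mean a higher priority, and the rule stated above for the second priority is generalized to every node with a higher priority.

```python
from typing import Dict

# Hypothetical status labels for higher-priority slave nodes.
STARTED = "started"          # has already initiated an election
UNREACHABLE = "unreachable"  # cannot be connected to
IDLE = "idle"                # reachable and has not initiated an election


def may_initiate(my_priority: int, peer_status: Dict[int, str]) -> bool:
    """Return True if a node with `my_priority` may start an election.

    `peer_status` maps the priority of each other slave node to its
    observed status. Every node with a higher priority (smaller number)
    must have started its election or be unreachable.
    """
    return all(
        status in (STARTED, UNREACHABLE)
        for priority, status in peer_status.items()
        if priority < my_priority
    )
```

With this rule the first-priority node may always initiate, while a second-priority node waits as long as the first-priority node is reachable and idle.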
In one implementation of the present invention, because master and slave nodes in a replication group carry different loads, the nodes of a replication group may be designed as virtual nodes. Two or more replication groups are deployed on one physical server, and each physical server acts as a master node or a slave node in each of these replication groups, which balances the load. The election processes of different replication groups are independent of one another. A replication group generally contains three or more nodes.
An embodiment of the present invention further provides a master node election apparatus, disposed on the first node. FIG. 4 is a first structural block diagram of the master node election apparatus according to an embodiment of the present invention. As shown in FIG. 4, the apparatus includes: a connection module 402, configured to establish connections with the other nodes in the replication group; a query module 404, configured to determine whether a master node exists among the other nodes; an election request module 406, configured to send an election request message to the other nodes when no master node exists among them, where the election request message is used by the other nodes to reply with an election result according to an election policy; and a switching module 408, configured to determine, according to the election result, whether to switch the first node to being the master node.
If the query module 404 finds during the determination that a master node already exists, the first node keeps its slave node identity. It may then start the real-time data synchronization process while monitoring the state of the master node, and the query module 404 re-executes the query function when the master node is found to have failed.
In this embodiment of the present invention, the connection module 402 in the first node establishes connections with the other nodes in the replication group, and the query module 404 determines whether a master node exists among the other nodes. If no master node exists, the election request module 406 sends an election request message to the other nodes, which reply with election results according to the election policy, and the switching module 408 determines from the election results whether to switch to being the master node. As a result, a replication group in the distributed storage system always has exactly one master node, which avoids data consistency problems between nodes.
FIG. 5 is a second structural block diagram of the master node election apparatus according to an embodiment of the present invention. In one implementation of the present invention, as shown in FIG. 5, the switching module 508 includes:
a first trigger unit 5004, configured to switch the first node to being the master node when the number of agreement messages reaches a preset threshold, where an agreement message is replied by one of the other nodes when the election request message it receives is the first request message within a preset time period;
a second trigger unit 5006, configured to switch the first node to being the master node when the sum of the weight values of the first node is the largest among all nodes, where the weight values are replied by the other nodes according to the data information carried in the election request message and the election policy;
where, when a master node exists among the other nodes, the other nodes reply with a rejection message.
In one implementation of the present invention, as shown in FIG. 5, the election request module 506 includes:
a priority unit 5002, configured to trigger the election request module 506 to send an election request message to the other nodes when the priority of the first node satisfies the election condition, where the priority is preset.
An embodiment of the present invention further provides a storage system. FIG. 6 is a schematic diagram of a storage system application according to an embodiment of the present invention. As shown in FIG. 6, the storage system includes at least one replication group, which consists of one master node and two or more slave nodes. The nodes in the group store the same data, and the availability and consistency of the stored data are guaranteed through node redundancy and related scheduling algorithms. The whole storage system consists of one or more replication groups and provides data storage services to application servers.
The first node in this embodiment of the present invention becomes the master node after being elected. The master node is the node in the replication group that provides the data read and write service. It handles the read and write requests sent by the application server and synchronizes its stored data to the slave nodes.
The other nodes in this embodiment of the present invention correspond to the slave nodes in FIG. 6. They are backup nodes of the master node: they provide the same data access capability as the master node and synchronize data from the master node to keep the data state consistent. Like the first node, when these slave nodes detect that the master node is unavailable, they can also initiate an election and take part in selecting the new master node. Depending on the election result, a slave node either switches to being the master node or starts data synchronization with the new master node.
The application server in FIG. 6 is the node on which a user-specific application is deployed. The application uses the data read and write service provided by the storage system. Typically, the application reads and writes data through an interface library released with the storage system, and it is not itself part of the data storage system. To simplify the description, the present invention assumes that the interface library can distinguish master nodes from slave nodes and automatically selects the master node for data reads and writes. This capability can in fact also be implemented outside the system, that is, by the user program.
In this embodiment of the present invention, the application server is connected to all nodes in the replication group, and all nodes in the replication group are also connected to one another.
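The behavior assumed of the interface library, namely telling master and slave nodes apart and sending reads and writes to the master, might look like the following Python sketch. All names here (`StorageNode`, `GroupClient`, `put`, `get`) are hypothetical, the nodes are plain in-memory objects, and the synchronization from master to slaves is performed inline only to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class StorageNode:
    name: str
    is_master: bool = False
    data: Dict[str, str] = field(default_factory=dict)


class GroupClient:
    """Hypothetical interface library for one replication group."""

    def __init__(self, nodes: Iterable[StorageNode]) -> None:
        # The application server is connected to all nodes in the group.
        self.nodes: List[StorageNode] = list(nodes)

    def _master(self) -> StorageNode:
        # Distinguish master from slaves on every call, so that a newly
        # elected master is picked up automatically.
        for node in self.nodes:
            if node.is_master:
                return node
        raise LookupError("no master node elected yet")

    def put(self, key: str, value: str) -> None:
        master = self._master()
        master.data[key] = value
        # The master synchronizes its data to the slaves (inline here).
        for node in self.nodes:
            if node is not master:
                node.data[key] = value

    def get(self, key: str) -> str:
        return self._master().data[key]
```

Because the client looks up the master on each request, a read issued after a failover is served by the new master from the data that was synchronized to it earlier.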
In this embodiment of the present invention, the connection module of the first node in the storage system establishes connections with the other nodes in the replication group, and the query module determines whether a master node exists among the other nodes. If no master node exists, the election request module sends an election request message to the other nodes, which reply with election results according to the election policy, and the switching module determines from the election results whether to switch to being the master node. As a result, a replication group in the distributed storage system always has exactly one master node, which avoids data consistency problems between nodes.
Those skilled in the art will understand that the modules or steps of the present invention described above can be implemented with general-purpose computing devices. They can be concentrated on a single computing device or distributed over a network of multiple computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device. In some cases the steps shown or described can be performed in an order different from the one given here, or the modules or steps can be made into individual integrated circuit modules, or several of them can be made into a single integrated circuit module. The present invention is therefore not limited to any specific combination of hardware and software.
The above describes only preferred embodiments of the present invention and is not intended to limit it. For those skilled in the art, the present invention may be modified and varied in many ways. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
INDUSTRIAL APPLICABILITY
In the embodiments of the present invention, a first node in a replication group establishes connections with the other nodes in the replication group; the first node determines whether a master node exists among the other nodes; if no master node exists, the first node sends an election request message to the other nodes, where the election request message is used by the other nodes to reply with an election result according to an election policy; and the first node determines, according to the election result, whether to switch to being the master node. As a result, a replication group in the distributed storage system always has exactly one master node, which avoids data consistency problems between nodes.

Claims (13)

  1. A master node election method, including: a first node in a replication group establishes connections with the other nodes in the replication group;
    the first node determines whether a master node exists among the other nodes;
    if no master node exists, the first node sends an election request message to the other nodes, where the election request message is used by the other nodes to reply with an election result according to an election policy;
    the first node determines, according to the election result, whether to switch to being the master node.
  2. The method according to claim 1, wherein the other nodes replying with an election result according to the election policy includes at least one of the following:
    when the election request message received by one of the other nodes is the first request message within a preset time period, that node replies with an agreement message, and when the number of agreement messages reaches a preset threshold, the first node switches to being the master node;
    the other nodes reply with weight values according to the data information carried in the election request message and the election policy, and when the sum of the weight values of the first node is the largest among all nodes, the first node switches to being the master node;
    when a master node exists among the other nodes, the other nodes reply with a rejection message.
  3. The method according to claim 1, wherein before the first node sends the election request message to the other nodes, the method includes:
    when the priority of the first node satisfies an election condition, sending the election request message to the other nodes, wherein the priority is preset.
  4. The method according to any one of claims 1 to 3, wherein the first node in the replication group establishing connections with the other nodes in the replication group includes:
    the number of nodes in the replication group that have established a connection with the first node reaches a preset threshold.
  5. The method according to claim 4, wherein there are at least two replication groups, and they are all deployed on the same physical server.
  6. The method according to claim 5, wherein the number of nodes in the replication group is at least three.
  7. 一种主节点选举装置,设置于所述第一节点上,包括:A primary node election device is disposed on the first node, and includes:
    连接模块,设置为与复制组中其他节点建立连接;a connection module, set to establish a connection with other nodes in the replication group;
    查询模块,设置为判断所述其他节点中是否存在主节点;a query module, configured to determine whether a primary node exists in the other nodes;
    选举请求模块,设置为在所述其他节点中不存在主节点的情况下,向所述其他节点发送选举请求消息,所述选举请求消息用于所述其他节点根据选举策略回复选举结果; An election requesting module is configured to send an election request message to the other node if the primary node does not exist in the other node, where the election request message is used by the other node to reply to the election result according to the election policy;
    切换模块,设置为根据所述选举结果确定是否将所述第一节点切换为主节点。And a switching module, configured to determine, according to the election result, whether to switch the first node to a primary node.
  8. 根据权利要求7所述的装置,其中,所述切换模块包括以下至少之一:The apparatus of claim 7, wherein the switching module comprises at least one of:
    第一触发单元,设置为在所述同意消息的数量达到预设阈值的情况下,将所述第一节点切换为主节点,其中,所述同意消息是所述其他节点在接收的所述选举请求消息是预设时间段内第一个请求消息的情况下回复的;a first triggering unit, configured to switch the first node to a primary node if the number of the consent messages reaches a preset threshold, where the consent message is the election that the other node is receiving The request message is replied in the case of the first request message within the preset time period;
    第二触发单元,设置为所述第一节点的权重值总和在所有节点中最大的情况下,所述第一节点切换为主节点,其中,所述权重值是由所述其他节点根据所述选举请求消息中携带的数据信息以及所述选举策略回复的;a second triggering unit, configured to: when the sum of weight values of the first node is the largest among all nodes, the first node switches to a master node, where the weight value is determined by the other node according to the The data information carried in the election request message and the reply of the election policy;
    其中,所述其他节点中存在主节点的情况下,所述其他节点回复拒绝消息。Where the other node has a primary node, the other node replies to the rejection message.
  9. The apparatus of claim 7, wherein the election request module comprises:
    a priority unit, configured to trigger the election request module to send the election request message to the other nodes when the priority of the first node meets an election condition, wherein the priority is preset.
  10. The apparatus of any one of claims 7 to 9, wherein the first node in the replication group establishing connections with the other nodes in the replication group comprises:
    the number of nodes in the replication group that have established a connection with the first node reaching a preset threshold.
  11. The apparatus of claim 10, wherein there are at least two replication groups, all disposed on the same physical server.
  12. The apparatus of claim 11, wherein the number of nodes in the replication group is at least three.
  13. A storage system, comprising at least one replication group, the replication group comprising:
    a first node, comprising:
    a connection module, configured to establish connections with the other nodes in the replication group; a query module, configured to determine whether a primary node exists among the other nodes; an election request module, configured to send an election request message to the other nodes when no primary node exists among the other nodes, wherein the election request message is used by the other nodes to reply with an election result according to an election policy; and a switching module, configured to determine, according to the election result, whether to switch to the primary node; and
    other nodes, configured to reply with the election result according to the election request message and the election policy.
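
The election flow recited in the apparatus and system claims above (query for an existing primary, send an election request, count consent messages against a preset threshold, then switch) can be illustrated with a minimal Python sketch. This is not the applicant's implementation. The names `Node`, `voted_for`, `handle_election_request`, and `run_election` are hypothetical, the preset time period is modeled as a single window with no expiry, and the connection-count precondition, the priority check, and the weight-based variant are omitted for brevity.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    is_primary: bool = False
    # Candidate this node consented to in the current window (hypothetical field).
    voted_for: Optional[str] = None

    def handle_election_request(self, candidate: str) -> str:
        """Reply to an election request under a first-request-wins policy."""
        if self.is_primary:
            return "reject"  # a primary node already exists
        if self.voted_for is None:
            self.voted_for = candidate
            return "agree"  # first request seen in this window
        return "agree" if self.voted_for == candidate else "reject"


def run_election(first: Node, others: List[Node], threshold: int) -> bool:
    """Return True if `first` switches to primary."""
    # Query step: stand down if any other node is already primary.
    if any(n.is_primary for n in others):
        return False
    # Election request step: collect replies from the other nodes.
    replies = [n.handle_election_request(first.name) for n in others]
    # Switching step (consent-count variant): compare against the threshold.
    if replies.count("agree") >= threshold:
        first.is_primary = True
    return first.is_primary


if __name__ == "__main__":
    a, b, c = Node("a"), Node("b"), Node("c")
    print(run_election(a, [b, c], threshold=2))  # True: both peers consent
    print(run_election(b, [a, c], threshold=2))  # False: a is already primary
```

The threshold is left as a parameter because the claims only call it "preset"; a majority of the replication group would be one plausible choice, but that is an assumption rather than something the claims state.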
PCT/CN2015/086169 2015-03-25 2015-08-05 Master node election method and apparatus, and storage system WO2016150066A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510133374.5 2015-03-25
CN201510133374.5A CN106161495A (en) 2015-03-25 2015-03-25 A kind of host node electoral machinery, device and storage system

Publications (1)

Publication Number Publication Date
WO2016150066A1 (en) 2016-09-29

Family

ID=56976948

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/086169 WO2016150066A1 (en) 2015-03-25 2015-08-05 Master node election method and apparatus, and storage system

Country Status (2)

Country Link
CN (1) CN106161495A (en)
WO (1) WO2016150066A1 (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106657117A (en) * 2016-12-31 2017-05-10 广州佳都信息技术研发有限公司 Method and device for managing subway integrated monitoring authority
CN112104727A (en) * 2020-09-10 2020-12-18 华云数据控股集团有限公司 Method and system for deploying simplified high-availability Zookeeper cluster
CN112533304A (en) * 2020-11-24 2021-03-19 锐捷网络股份有限公司 Ad hoc network management method, device, system, electronic device and storage medium
CN112835748A (en) * 2019-11-22 2021-05-25 上海宝信软件股份有限公司 Multi-center redundancy arbitration method and system based on scada system
CN113297236A (en) * 2020-11-10 2021-08-24 阿里巴巴集团控股有限公司 Method, device and system for electing main node in distributed consistency system
CN113489601A (en) * 2021-06-11 2021-10-08 海南视联通信技术有限公司 Anti-destruction method and device based on video networking autonomous cloud network architecture
CN114448769A (en) * 2022-04-02 2022-05-06 支付宝(杭州)信息技术有限公司 Node election voting method and device based on consensus system

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107391320B (en) * 2017-03-10 2020-07-10 创新先进技术有限公司 Consensus method and device
CN106911524B (en) * 2017-04-27 2020-07-07 新华三信息技术有限公司 HA implementation method and device
CN107832138B (en) * 2017-09-21 2021-09-14 南京邮电大学 Method for realizing flattened high-availability namenode model
CN107995029B (en) * 2017-11-28 2019-12-13 新华三信息技术有限公司 Election control method and device and election method and device
CN110417842B (en) 2018-04-28 2022-04-12 北京京东尚科信息技术有限公司 Fault processing method and device for gateway server
CN108810100B (en) * 2018-05-22 2021-06-29 郑州云海信息技术有限公司 Method, device and equipment for electing master node
CN109040184B (en) * 2018-06-28 2021-09-07 武汉船舶通信研究所(中国船舶重工集团公司第七二二研究所) Host node election method and server
CN110764690B (en) * 2018-07-28 2023-04-14 阿里云计算有限公司 Distributed storage system and leader node election method and device thereof
CN110784331B (en) 2018-07-30 2022-05-13 华为技术有限公司 Consensus process recovery method and related nodes
CN109379238B (en) * 2018-12-14 2022-06-17 郑州云海信息技术有限公司 CTDB main node election method, device and system of distributed cluster
CN112398664B (en) * 2019-08-13 2023-08-08 中兴通讯股份有限公司 Main device selection method, device management method, electronic device and storage medium
CN111093249B (en) * 2019-12-05 2022-06-21 合肥中感微电子有限公司 Wireless local area network communication method, system and wireless transceiving equipment
CN112988882B (en) * 2019-12-12 2024-01-23 阿里巴巴集团控股有限公司 System, method and device for preparing data from different places and computing equipment
CN113742417A (en) * 2020-05-29 2021-12-03 同方威视技术股份有限公司 Multi-level distributed consensus method and system, electronic device and computer readable medium
CN112000285A (en) * 2020-08-12 2020-11-27 广州市百果园信息技术有限公司 Strong consistent storage system, strong consistent data storage method, server and medium
CN113596093A (en) * 2021-06-28 2021-11-02 青岛海尔科技有限公司 Device set control method and device, storage medium and electronic device
CN113489149B (en) * 2021-07-01 2023-07-28 广东电网有限责任公司 Power grid monitoring system service master node selection method based on real-time state sensing
CN116107828A (en) * 2021-11-11 2023-05-12 中兴通讯股份有限公司 Main node selection method, distributed database and storage medium
CN116910158A (en) * 2023-08-17 2023-10-20 深圳计算科学研究院 Data processing and inquiring method, device, equipment and medium based on copy group

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101043398A (en) * 2006-06-28 2007-09-26 华为技术有限公司 Method and system for sharing connection dynamically
US20080281938A1 (en) * 2007-05-09 2008-11-13 Oracle International Corporation Selecting a master node in a multi-node computer system
CN101661408A (en) * 2009-09-14 2010-03-03 四川川大智胜软件股份有限公司 Distributed real-time data replication synchronizing method
CN101702721A (en) * 2009-10-26 2010-05-05 北京航空航天大学 Reconfigurable method of multi-cluster system
CN103118084A (en) * 2013-01-21 2013-05-22 浪潮(北京)电子信息产业有限公司 Host node election method and node

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101217402B (en) * 2008-01-15 2012-01-04 杭州华三通信技术有限公司 A method to enhance the reliability of the cluster and a high reliability communication node
CN102843259A (en) * 2012-08-21 2012-12-26 武汉达梦数据库有限公司 Middleware self-management hot backup method and middleware self-management hot backup system in cluster
CN102904752B (en) * 2012-09-25 2016-06-29 新浪网技术(中国)有限公司 A kind of node electoral machinery, node device and system
CN103491168A (en) * 2013-09-24 2014-01-01 浪潮电子信息产业股份有限公司 Cluster election design method
CN103634375B (en) * 2013-11-07 2017-01-11 华为技术有限公司 Method, device and equipment for cluster node expansion

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106657117A (en) * 2016-12-31 2017-05-10 广州佳都信息技术研发有限公司 Method and device for managing subway integrated monitoring authority
CN112835748A (en) * 2019-11-22 2021-05-25 上海宝信软件股份有限公司 Multi-center redundancy arbitration method and system based on scada system
CN112104727A (en) * 2020-09-10 2020-12-18 华云数据控股集团有限公司 Method and system for deploying simplified high-availability Zookeeper cluster
CN112104727B (en) * 2020-09-10 2021-11-30 华云数据控股集团有限公司 Method and system for deploying simplified high-availability Zookeeper cluster
CN113297236A (en) * 2020-11-10 2021-08-24 阿里巴巴集团控股有限公司 Method, device and system for electing main node in distributed consistency system
CN112533304A (en) * 2020-11-24 2021-03-19 锐捷网络股份有限公司 Ad hoc network management method, device, system, electronic device and storage medium
CN112533304B (en) * 2020-11-24 2023-10-20 锐捷网络股份有限公司 Ad hoc network management method, device, system, electronic equipment and storage medium
CN113489601A (en) * 2021-06-11 2021-10-08 海南视联通信技术有限公司 Anti-destruction method and device based on video networking autonomous cloud network architecture
CN113489601B (en) * 2021-06-11 2024-05-14 海南视联通信技术有限公司 Anti-destruction method and device based on autonomous cloud network architecture of video networking
CN114448769A (en) * 2022-04-02 2022-05-06 支付宝(杭州)信息技术有限公司 Node election voting method and device based on consensus system

Also Published As

Publication number Publication date
CN106161495A (en) 2016-11-23

Similar Documents

Publication Publication Date Title
WO2016150066A1 (en) Master node election method and apparatus, and storage system
JP6382454B2 (en) Distributed storage and replication system and method
CN102404390B (en) Intelligent dynamic load balancing method for high-speed real-time database
US7225356B2 (en) System for managing operational failure occurrences in processing devices
WO2019119212A1 (en) Method and device for identifying osd sub-health, and data storage system
US11106556B2 (en) Data service failover in shared storage clusters
CN108551765A (en) input/output isolation optimization
EP3461065B1 (en) Cluster arbitration method and multi-cluster cooperation system
US10652100B2 (en) Computer system and method for dynamically adapting a software-defined network
CN109802986B (en) Equipment management method, system, device and server
CN109040184A (en) A kind of electoral machinery and server of host node
CN114265753A (en) Management method and management system of message queue and electronic equipment
EP3813335B1 (en) Service processing methods and systems based on a consortium blockchain network
CN114124650A (en) Master-slave deployment method of SPTN (shortest Path bridging) network controller
CN105323271B (en) Cloud computing system and processing method and device thereof
CN114124803B (en) Device management method and device, electronic device and storage medium
CN111510336B (en) Network equipment state management method and device
US20210406141A1 (en) Computer cluster with adaptive quorum rules
CN113794765A (en) Gate load balancing method and device based on file transmission
US9798633B2 (en) Access point controller failover system
CN113923222A (en) Data processing method and device
CN110266795A (en) One kind being based on Openstack platform courses method
CN108959170B (en) Virtual device management method, device, stacking system and readable storage medium
CN114490158A (en) Distributed disaster recovery system, server node processing method, device and equipment
JP6100135B2 (en) Fault tolerant system and fault tolerant system control method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15885998

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15885998

Country of ref document: EP

Kind code of ref document: A1