WO2016050074A1 - Cluster split brain processing method and apparatus - Google Patents

Cluster split brain processing method and apparatus

Info

Publication number
WO2016050074A1
Authority
WO
WIPO (PCT)
Prior art keywords
subset
node
cluster
nodes
primary
Prior art date
Application number
PCT/CN2015/079096
Other languages
French (fr)
Chinese (zh)
Inventor
胡智江
Original Assignee
中兴通讯股份有限公司
Priority date
Filing date
Publication date
Application filed by 中兴通讯股份有限公司
Publication of WO2016050074A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L12/00 Data switching networks
    • H04L12/02 Details
    • H04L12/08 Allotting numbers to messages; Counting characters, words or messages

Definitions

  • This document relates to the field of computer applications, and in particular to a cluster split-brain processing method and apparatus.
  • A high-availability cluster is a server clustering technology designed to reduce service downtime.
  • The node that is running the service is called the primary machine.
  • A node that is not running the service, but may later take over the service running on the primary machine, is called a standby machine. When the primary machine fails, the standby machine takes over and continues to run the service, so that service is provided continuously.
  • The network interconnecting the nodes is called a heartbeat line.
  • Each node in the cluster can communicate with any other node.
  • Through the communication protocol, a node can also learn which nodes are currently in the cluster (the module that provides this communication function is referred to below as the "heartbeat communication module").
  • Take a cluster formed by two nodes, A and B, as an example: a service runs on A, and B acts as the standby machine. When node B finds that it cannot communicate with A and guesses that the cause is a network failure, B keeps its standby role. If the cause is actually a failure of node A, the cluster loses its primary machine and the upper-layer application cannot continue to run. Conversely, if node B guesses that node A has failed, B takes over the service from A. If the cause is only a network failure and A is still running normally, the cluster then has two primary machines, A and B. A cluster with multiple primary machines must also be avoided, because the primary machines compete with each other for resources and, in severe cases, may corrupt data.
  • This document provides a cluster split-brain processing method and apparatus, which solve the problem of controlling the cluster after a split brain occurs.
  • A cluster split-brain processing method includes:
  • when a cluster splits into a plurality of subsets, selecting, from the plurality of subsets, the only subset that is allowed to continue to provide service;
  • Selecting, from the plurality of subsets, the only subset that is allowed to continue to provide service includes:
  • the method further includes:
  • A disk space is set aside on a shared medium as a decision disk, the decision disk is partitioned, and each node in the cluster corresponds to a unique partition of the decision disk;
  • Each node in the cluster writes the current timestamp to its corresponding partition in the decision disk through a disk input/output (I/O) operation;
  • One of the nodes whose number of timestamp updates within a time range is greater than a threshold is selected as the primary node.
  • the method further includes:
  • Under normal conditions, when no split brain has occurred, each node in the cluster periodically broadcasts or multicasts a KeepAlive message over an additional Ethernet network;
  • One of the nodes that sent the KeepAlive message more times than a threshold within a time range is selected as the primary node.
  • the method further includes:
  • A representative node is designated from the expected primary subset, and the representative node is instructed to notify, after a first delay time, all nodes of every subset other than the expected primary subset to stop working.
  • Selecting, from the expected primary subset and the unique large subset, the only subset that is allowed to continue to provide service includes:
  • when the unique large subset does not exist, selecting the expected primary subset as the only subset that is allowed to continue to provide service;
  • when the expected primary subset and the unique large subset are the same subset, using that subset as the only subset that is allowed to continue to provide service;
  • when the expected primary subset and the unique large subset are different subsets, using the unique large subset as the only subset that is allowed to continue to provide service.
  • The method further includes:
  • When a node detects that heartbeat communication is interrupted, communication between the node's underlying heartbeat communication and its upper-layer service control logic is interrupted; after a first time length has elapsed, it is determined that a split brain has occurred, and communication between the underlying heartbeat communication and the upper-layer service control logic is restored.
  • the method further includes:
  • A new primary node is elected; the time taken by the election, counted from when the split brain is determined, is a second time length, and the second time length is less than the first time length.
  • the method further includes:
  • the current cluster member list, the number of members, and the cluster member change notification information are maintained at each node of the cluster.
  • A cluster split-brain processing apparatus includes:
  • a continuing service subset selection module, configured to: when the cluster splits into a plurality of subsets, select, from the plurality of subsets, the only subset that is allowed to continue to provide service;
  • a node shutdown control module, configured to: control the nodes in every subset other than the only subset that is allowed to continue to provide service to stop working.
  • The continuing service subset selection module includes:
  • an expected primary subset selection unit, configured to: select the subset containing the primary node before the split brain occurred as the expected primary subset;
  • a unique large subset selection unit, configured to: select the subset whose number of nodes is greater than half the number of cluster nodes before the split brain occurred as the unique large subset;
  • a continuing service subset selection unit, configured to: select, from the expected primary subset and the unique large subset, the only subset that is allowed to continue to provide service.
  • The continuing service subset selection module further includes:
  • a representative node selection unit, configured to: designate a representative node from the expected primary subset, and instruct the representative node to notify, after a first delay time, all nodes of every subset other than the expected primary subset to stop working.
  • The continuing service subset selection unit includes:
  • a first selection subunit, configured to: when the unique large subset does not exist, select the expected primary subset as the only subset that is allowed to continue to provide service;
  • a second selection subunit, configured to: when the expected primary subset and the unique large subset are the same subset, use that subset as the only subset that is allowed to continue to provide service;
  • a third selection subunit, configured to: when the expected primary subset and the unique large subset are different subsets, use the unique large subset as the only subset that is allowed to continue to provide service.
  • The continuing service subset selection module further includes:
  • a unique large subset representative selection unit, configured to: select a node from the unique large subset as the unique large subset representative, and instruct the unique large subset representative, upon determining that the unique large subset and the expected primary subset are different subsets, to notify, after zero delay or a second delay time, all nodes of every subset other than the unique large subset to stop working, the second delay time being less than the first delay time.
  • The apparatus further includes:
  • an internal communication management module, configured to: when a node detects that heartbeat communication is interrupted, interrupt communication between the node's underlying heartbeat communication and its upper-layer service control logic, and, after a first time length has elapsed, determine that a split brain has occurred and restore communication between the underlying heartbeat communication and the upper-layer service control logic.
  • The apparatus further includes:
  • a primary node election module, configured to: elect a node from the unique large subset as the new primary node, where the time taken by the election, counted from when the split brain is determined, is a second time length, and the second time length is less than the first time length.
  • The apparatus further includes:
  • a storage module, configured to: maintain the current cluster member list, the number of members, and the cluster member change notification information.
  • A computer-readable storage medium stores program instructions that implement the above method when executed.
  • This document provides a method and apparatus for processing a cluster split brain.
  • When a cluster has a split brain, the only subset of the cluster that is allowed to continue to provide service is selected, and the nodes in all other subsets are controlled to stop working.
  • Orderly management of the cluster under split-brain conditions is thereby achieved, and the problem of control after a cluster split brain is solved.
  • FIG. 1 is a schematic diagram of a cluster split-brain processing system according to Embodiment 1 of the present invention;
  • FIG. 2 is a schematic diagram of module cooperation and timing relationships for the first-step decision method when the primary node fails;
  • FIG. 3 is a schematic diagram of module cooperation and timing relationships for the first-step decision method when a non-primary node fails or a heartbeat line breaks;
  • FIG. 4 is a schematic diagram of module cooperation and timing relationships for the second-step decision method when, after a heartbeat line break, a unique large subset is found and the unique large subset is not the expected primary subset;
  • FIG. 5 is a schematic diagram of module cooperation and timing relationships for the second-step decision method when, after a heartbeat line break, a unique large subset is found and the unique large subset is identical to the expected primary subset;
  • FIG. 6 is a flowchart of a cluster split-brain processing method according to Embodiment 2 of the present invention;
  • FIG. 7 is a detailed flowchart of step 601 of FIG. 6;
  • FIG. 8 is a schematic structural diagram of a cluster split-brain processing apparatus according to Embodiment 3 of the present invention;
  • FIG. 9 is a schematic structural diagram of the continuing service subset selection module 801 of FIG. 8;
  • FIG. 10 is a schematic structural diagram of the continuation service subset selection unit 8013 in FIG.
  • This embodiment of the present invention provides a cluster split-brain processing system.
  • The structure of the system is shown in FIG. 1. It includes an underlying heartbeat communication module, an upper-layer service control logic module, and, between the service control logic module and the heartbeat communication module, a split-brain decision module.
  • The heartbeat communication module provides the split-brain decision module with information such as the current cluster member list, the number of members, and cluster member change notifications (also called split-brain events).
  • The split-brain decision module uses this information to determine which subset should continue to run the service and reports the result to the service control logic module, which performs the necessary service control operations, such as active/standby switchover, based on the result.
  • First, the split-brain decision module is responsible for determining the only subset that is allowed to continue to run the service. This subset is called the "primary subset". Services on the other subsets must be stopped (an operation called Fence), and those subsets are called "secondary subsets". In this embodiment, a node that is fenced is immediately powered off or restarted and stops working. Second, the primary subset should be, as far as possible, the subset with the largest number of nodes among all the subsets produced by the split, so that most nodes can continue to work after the split. Third, after a split-brain event, the split-brain decision module should determine the primary subset immediately from the information available when the event occurs, so that the upper-layer service control logic can perform the active/standby switchover as soon as possible.
  • The split-brain decision module in this embodiment implements a two-step decision method. Step 1: when a split-brain event occurs, the "expected primary subset" is first determined through an additional information channel. The additional information channel refers to any channel, other than the heartbeat line, over which cluster nodes can exchange information.
  • Step 2: if there is a subset whose number of nodes is greater than 50% of the number of cluster nodes before the split, it must be the largest of all subsets (hereinafter the "unique large subset").
  • If the unique large subset is not the expected primary subset determined in the first step, the unique large subset immediately replaces it and is judged to be the final subset that may continue to run the service (hereinafter the "primary subset"). If the second step finds no unique large subset, or the unique large subset is the expected primary subset found in the first step, the second step has no effect, and the expected primary subset is judged to be the final primary subset.
  • The first-step decision is implemented by the first-step decision submodule.
  • The second-step decision is implemented by the second-step decision submodule.
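  • The two-step decision described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the function names, the use of node-name strings, and the representation of subsets as frozensets are all hypothetical.

```python
from typing import FrozenSet, Iterable, Optional

Subset = FrozenSet[str]  # assumed representation: a subset is a set of node names


def expected_primary_subset(subsets: Iterable[Subset], primary: str) -> Optional[Subset]:
    """Step 1: the subset that contains the pre-split primary node."""
    for s in subsets:
        if primary in s:
            return s
    return None


def unique_large_subset(subsets: Iterable[Subset], original_size: int) -> Optional[Subset]:
    """Step 2: the subset holding more than half of the pre-split nodes, if any."""
    for s in subsets:
        if 2 * len(s) > original_size:
            return s
    return None


def final_primary_subset(subsets, primary, original_size):
    """Combine both steps: a unique large subset wins, else fall back to step 1."""
    subsets = list(subsets)
    large = unique_large_subset(subsets, original_size)
    if large is not None:
        return large
    return expected_primary_subset(subsets, primary)
```

  • When the unique large subset and the expected primary subset coincide, both branches return the same subset, which matches the case in which the second step has no effect.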
  • This embodiment requires the additional information channel to provide the first-step decision submodule with the information exchange capability described in 1.1) to 1.3):
  • The underlying heartbeat communication protocol module specifies a minimum time interval T1 from the interruption of communication to the reporting of the split-brain event to the split-brain decision module. The maximum time the additional information channel needs, from losing the primary node to re-electing a new primary node, is T2.
  • With T2 less than T1, when the split-brain event is caused by a failure of the primary node, a new primary node can be re-elected before the split-brain event is reported, which ensures the correctness of the first-step decision; when the split brain is caused by a failure of a non-primary node or of a heartbeat line, no re-election of the primary node is involved.
  • The following 1.4) to 1.7) describe how the first-step decision method makes its decision after a split-brain event, using the already established information about the current primary node:
  • The split-brain decision module of each node in the expected primary subset obtains the new member list, that is, the member list of the expected primary subset, from the split-brain event message reported by the underlying heartbeat communication module.
  • The expected primary subset designates a representative node from the new member list, which performs a delayed Fence operation on the other, secondary subsets, that is, stops the other subsets from working to avoid multiple primaries.
  • The Fence operation is preceded by a delay (the delay time is denoted Td) so that it happens later than the zero-delay Fence that may be issued in the second step below, which gives each subset time to perform the second-step decision. Td is therefore greater than the time taken by the second-step decision.
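  • The timing relationships among T1, T2, and Td can be checked mechanically when a cluster is configured. The sketch below is only an illustration: the class name, the field names, and the `step2_cost` value (an assumed upper bound on the duration of the second-step decision) are hypothetical and do not come from the embodiment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SplitBrainTimings:
    t1: float          # min time from heartbeat loss to the split-brain event report
    t2: float          # max time for the additional channel to elect a new primary
    td: float          # delay before the expected primary subset fences the others
    step2_cost: float  # assumed upper bound on the second-step decision time

    def violations(self) -> list:
        """Return a description of every timing rule that is not satisfied."""
        problems = []
        if not self.t2 < self.t1:
            problems.append("T2 must be less than T1")
        if not self.td > self.step2_cost:
            problems.append("Td must exceed the second-step decision time")
        return problems
```

  • For example, `SplitBrainTimings(t1=5.0, t2=2.0, td=3.0, step2_cost=1.0).violations()` returns an empty list, while swapping the values of `t1` and `t2` reports the first rule as violated.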
  • Figure 2 depicts the module cooperation and timing relationships for the first-step decision method in the event of a primary node failure.
  • Figure 3 depicts the module cooperation and timing relationships for the first-step decision method in the event of a non-primary node failure or heartbeat line break.
  • For example, suppose a cluster of four nodes A, B, C, and D splits into two subsets, {A, B, C} and {D}. If D happens to be the primary node, the first-step decision judges {D} to be the primary subset and {A, B, C} to be a secondary subset. In the end, the 3/4 of the computing power represented by the still-working {A, B, C} subset is excluded from the cluster, which wastes a large amount of computing power.
  • The second-step decision method begins work after the first-step decision. Its purpose is to make the subset with the largest number of nodes, where possible, replace the expected primary subset of the first-step decision as the final primary subset.
  • The second-step decision uses the latest membership information, which is the information available at that point. The method is as follows:
  • In the normal case, when no split brain has occurred, each node records the member list and the number of members of the current cluster. This information was provided by the underlying heartbeat communication module at the time of the last split-brain event.
  • The second-step decision submodule of each node in each subset also obtains the membership and the number of nodes of its subset from the split-brain event message reported by the underlying heartbeat communication module. If the number of nodes in a subset exceeds 50% of the original cluster, all nodes in that subset can immediately determine that it is the only subset with the largest number of nodes, that is, the "unique large subset".
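  • Why a subset larger than 50% of the original cluster must be unique follows from a short counting argument. The symbols N (the number of cluster nodes before the split) and S_k (the subsets after the split) are notation introduced here for illustration only.

```latex
\[
\sum_{k} |S_k| \le N
\qquad \text{(the subsets are pairwise disjoint)}
\]
\[
|S_i| > \tfrac{N}{2} \ \text{and}\ |S_j| > \tfrac{N}{2},\ i \ne j
\;\Longrightarrow\;
|S_i| + |S_j| > N
\]
```

  • The second line contradicts the first, so at most one subset can exceed N/2, and each node can reach this conclusion from its own subset size alone. In the four-node example, the subset {A, B, C} has 3 nodes, which is greater than 4/2 = 2.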
  • The unique large subset selects a representative node from the new member list, called the unique large subset representative.
  • This node immediately performs a zero-delay Fence operation, making all nodes outside the unique large subset stop working.
  • The zero-delay Fence operation necessarily happens earlier than the delayed Fence of the expected primary subset representative in 1.6), so the zero-delay Fence successfully stops the expected primary subset. Step 1.7) above is therefore not executed.
  • Figure 4 depicts the module cooperation and timing relationships for the second-step decision method when, after a heartbeat line break, a unique large subset is found and it is not the expected primary subset.
  • Figure 5 depicts the module cooperation and timing relationships for the second-step decision method when, after a heartbeat line break, a unique large subset is found and it is identical to the expected primary subset.
  • In the four-node example, the {A, B, C} subset, which represents 3/4 of the computing power, is the unique large subset, so it replaces {D} as the final primary subset and becomes a new cluster that can continue to work. Assuming A is the unique large subset representative, node D is fenced by A before it has had time to fence {A, B, C}.
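  • The ordering between the zero-delay Fence and the delayed Fence can be illustrated with a small simulation. This is a hedged sketch: the function name, the tuple layout `(time, issuer, targets)`, and the concrete times are assumptions made for the example, and real fencing would of course act on hardware rather than on a Python set.

```python
def simulate_fencing(fence_requests, all_nodes):
    """Apply fence requests in time order and return the nodes still running.

    fence_requests: iterable of (time, issuer, targets) tuples.
    A request is ignored when its issuer has already been fenced, which is
    how the earlier zero-delay Fence cancels the later delayed Fence.
    """
    alive = set(all_nodes)
    for _, issuer, targets in sorted(fence_requests, key=lambda r: r[0]):
        if issuer in alive:
            alive -= set(targets)
    return alive


# Four-node example: D is the old primary and schedules a delayed Fence
# (Td assumed to be 3.0), while A, representing {A, B, C}, fences D at once.
requests = [
    (3.0, "D", ("A", "B", "C")),  # delayed Fence from the expected primary subset
    (0.0, "A", ("D",)),           # zero-delay Fence from the unique large subset
]
survivors = simulate_fencing(requests, ["A", "B", "C", "D"])
```

  • In this run `survivors` contains A, B, and C, because D is fenced before its own delayed request comes due.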
  • The Fence mechanism can be a node-level Fence based on power management, in which the fenced node stops running the service because it loses power.
  • The Fence may also be a node-level Fence based on a kernel panic, in which the fenced node stops running the service because its CPU stops working.
  • The Fence mechanism is not limited to these two mechanisms; any technical means by which a node in the cluster can fence any other node falls within the scope of the present invention.
  • The additional information channel may be implemented using a decision disk based on a shared storage medium.
  • The interactions between the decision disk and the first-step decision submodule are as follows:
  • A disk space is set aside on the shared medium (such as iSCSI, AoE, or SAN) as a decision disk.
  • The decision disk is divided into several blocks.
  • Each node of the cluster is assigned a node ID that increments from zero. With this ID as an index, each node corresponds to a unique block (the blocks are also indexed from zero).
  • All nodes that can access the decision disk normally write the current timestamp to their corresponding block in the decision disk through a disk I/O operation.
  • Other nodes in the cluster determine whether a node is healthy based on whether its timestamp changes. A node that cannot update its timestamp for a long time is considered unhealthy.
  • A node failure or heartbeat line break event occurs. If the primary node has failed, a new primary node is selected within T2.
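  • The block layout of the decision disk can be sketched as follows. This is a minimal illustration under several assumptions: a regular file stands in for the shared disk, the 512-byte block size and the 8-byte timestamp encoding are arbitrary choices, all function names are hypothetical, and health is judged by timestamp age as a simplification of the "timestamp keeps changing" rule.

```python
import struct
import time

BLOCK_SIZE = 512  # assumed block size; the embodiment does not specify one


def format_disk(disk_path, node_count):
    """Create a zero-filled decision disk with one block per node."""
    with open(disk_path, "wb") as disk:
        disk.write(b"\x00" * (node_count * BLOCK_SIZE))


def write_timestamp(disk_path, node_id, now=None):
    """Write the node's current timestamp into the block indexed by its ID."""
    stamp = time.time() if now is None else now
    with open(disk_path, "r+b") as disk:
        disk.seek(node_id * BLOCK_SIZE)
        disk.write(struct.pack("<d", stamp))


def read_timestamp(disk_path, node_id):
    """Read the timestamp stored in the block of the given node."""
    with open(disk_path, "rb") as disk:
        disk.seek(node_id * BLOCK_SIZE)
        return struct.unpack("<d", disk.read(8))[0]


def healthy_nodes(disk_path, node_count, now, max_age):
    """Node IDs whose timestamp was refreshed within max_age seconds."""
    return [n for n in range(node_count)
            if now - read_timestamp(disk_path, n) <= max_age]
```

  • A node that has never written its block reads back as timestamp 0.0 and is therefore treated as unhealthy, which is a deliberate choice of this sketch.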
  • The additional information channel may also be implemented over an additional Ethernet network (not the heartbeat line).
  • A node failure or heartbeat line break event occurs. If the primary node has failed, a new primary node is selected within T2.
  • The additional information channel is not limited to the above two implementations; any implementation that makes it possible to determine whether a node is healthy falls within the scope of the present invention, and the embodiments of the present invention place no limitation on this.
  • the heartbeat communication module can use, but is not limited to, the Totem multicast communication protocol.
  • the service control logic module may use, but is not limited to, Pacemaker or AMF of OpenAIS.
  • The following a.1) to a.4) describe how the first-step decision method makes its decision after a split-brain event, using the already established information about the current primary node:
  • The subset containing the primary node is immediately judged to be the expected primary subset, and the other subsets are judged to be secondary subsets.
  • The split-brain decision module of each node in the expected primary subset obtains the new member list, that is, the member list of the expected primary subset, from the split-brain event message reported by the underlying heartbeat communication module.
  • The expected primary subset designates the node with the lowest IP address in its member list as the expected primary subset representative, which performs a delayed Fence operation on the other, secondary subsets.
  • The delay time is Td.
  • After the time Td has elapsed, the expected primary subset representative performs the Fence operation, and the expected primary subset is finally judged to be the primary subset.
  • Each node of the expected primary subset reports this decision result to its own service control logic module.
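  • Choosing the node with the lowest IP address needs a numeric comparison, because comparing dotted strings character by character gives a different order. The sketch below uses Python's standard `ipaddress` module; the function name and the sample addresses are hypothetical, and the member list is assumed to contain addresses of a single IP version.

```python
import ipaddress


def choose_representative(member_ips):
    """Return the member with the numerically lowest IP address.

    Every node that holds the same member list computes the same answer,
    so no extra message exchange is needed to agree on the representative.
    """
    return min(member_ips, key=ipaddress.ip_address)


members = ["10.0.0.12", "10.0.0.3", "10.0.0.100"]
representative = choose_representative(members)
```

  • Here `representative` is "10.0.0.3", whereas a plain string `min` over the same list would pick "10.0.0.100".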
  • The following b.1) to b.6) describe the second-step decision method, which starts working after the first-step decision.
  • Each subset compares its number of members with the number of members of the original cluster: if the number of members of the subset is greater than 50% of the number of members of the original cluster, the subset considers itself the unique large subset.
  • If no subset is the unique large subset, the second-step decision ends immediately. Go to step a.4).
  • If a subset finds that it is the unique large subset and is not the expected primary subset, the unique large subset designates the node with the lowest IP address in its member list to immediately perform a zero-delay node-level Fence operation on all nodes outside the unique large subset, stopping them from working. Because the representative of the expected primary subset has been fenced, step a.4) above is not executed.
  • This embodiment of the invention provides a cluster split-brain processing method, which can be applied to a node as shown in FIG. 1 and is carried out by the split-brain decision module. The flow for managing and controlling the cluster during a cluster split brain using this method is shown in FIG. 6 and includes:
  • Step 601: When a cluster split brain occurs, select the only subset of the cluster that is allowed to continue to provide service;
  • The current cluster member list, the number of members, and the cluster member change notification information are maintained at each node of the cluster.
  • This information can be maintained by the heartbeat communication module of FIG. 1.
  • Step 6011: Select the subset containing the primary node before the split brain occurred as the expected primary subset;
  • Each node can know the previous primary node.
  • The subset containing the primary node is selected as the expected primary subset.
  • A disk space is set aside on the shared medium as a decision disk, the decision disk is partitioned, and each node in the cluster corresponds to a unique partition of the decision disk. Each node in the cluster writes the current timestamp to its corresponding partition in the decision disk through a disk I/O operation.
  • A node that continuously updates its timestamp is connected properly and is a healthy node; one of the healthy nodes can be selected as the primary node.
  • The selection rule can be configured as needed, with the same rule configured on all nodes in the cluster.
  • Continuously updating the timestamp may mean that the number of timestamp updates within a time range is greater than a threshold.
  • A new primary node is then selected again from the remaining healthy nodes.
  • Under normal conditions, when no split brain has occurred, each node in the cluster periodically broadcasts or multicasts a KeepAlive message over an additional Ethernet network.
  • A node that continuously sends KeepAlive messages is connected properly and is a healthy node; one of the healthy nodes can be selected as the primary node.
  • The selection rule can be configured as needed, with the same rule configured on all nodes in the cluster.
  • Continuously sending the KeepAlive message may mean that the number of KeepAlive messages sent within a time range is greater than a threshold.
  • A new primary node is then selected again from the remaining healthy nodes.
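  • The "more than a threshold within a time range" rule for KeepAlive messages can be sketched with a sliding window. This is an illustration only: the class and method names are hypothetical, timestamps are passed in explicitly, and the actual broadcast or multicast transport is omitted.

```python
from collections import defaultdict, deque


class KeepAliveMonitor:
    """Counts KeepAlive messages per node inside a sliding time window."""

    def __init__(self, window, threshold):
        self.window = window        # length of the time range, in seconds
        self.threshold = threshold  # a node needs MORE than this many messages
        self._seen = defaultdict(deque)

    def record(self, node, when):
        """Note that a KeepAlive from `node` arrived at time `when`."""
        self._seen[node].append(when)

    def healthy(self, now):
        """Return the sorted nodes whose message count in the window exceeds the threshold."""
        result = []
        for node, times in self._seen.items():
            # Drop arrivals that have fallen out of the window.
            while times and times[0] <= now - self.window:
                times.popleft()
            if len(times) > self.threshold:
                result.append(node)
        return sorted(result)
```

  • A primary node can then be picked from `healthy(now)` by whatever rule is configured identically on all nodes; this sketch does not fix that rule.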
  • Step 6012: Designate a representative node from the expected primary subset, and instruct the representative node to notify, after the first delay time, all nodes of every subset other than the expected primary subset to stop working.
  • Step 6013: Select the subset whose number of nodes is greater than half of the number of cluster nodes before the split brain occurred as the unique large subset;
  • This step is optional.
  • If such a subset exists, it is used as the unique large subset.
  • Step 6014: Select a node from the unique large subset as the unique large subset representative;
  • This step is optional. When step 6013 determines that a unique large subset exists, this step selects one node in that subset as the unique large subset representative.
  • Step 6015: Instruct the unique large subset representative, upon determining that the unique large subset is different from the expected primary subset, to notify, after zero delay or a second delay time, all nodes of the subsets other than the unique large subset to stop working, the second delay time being less than the first delay time;
  • This step is optional and is performed when a unique large subset exists.
  • The second delay time is less than the first delay time, which ensures that the expected primary subset can issue its notification asking the other subsets to stop working only after the check for a unique large subset has completed. It therefore cannot happen that a unique large subset exists but, before it has been selected, the expected primary subset notifies the nodes in the unique large subset to stop working, causing a loss of processing capacity.
  • Step 6016: Select, from the expected primary subset and the unique large subset, the only subset that is allowed to continue to provide service;
  • This step covers the following cases:
  • when the unique large subset does not exist, the expected primary subset is selected as the only subset that is allowed to continue to provide service;
  • when the expected primary subset and the unique large subset are the same subset, that subset is used as the only subset that is allowed to continue to provide service;
  • when the expected primary subset and the unique large subset are different subsets, the unique large subset is used as the only subset that is allowed to continue to provide service.
  • When the unique large subset is used as the only subset that is allowed to continue to provide service, a new primary node must also be selected.
  • A node is elected from the unique large subset as the new primary node; the time taken by the election, counted from when the split brain is determined, is the second time length, and the second time length is less than the first time length.
  • Step 602: Control the nodes in every subset other than the only subset that is allowed to continue to provide service to stop working;
  • A node in the subset that is allowed to continue to provide service can notify the nodes in the other subsets to stop working.
  • the embodiment of the present invention provides a cluster splitting processing device.
  • the structure of the device is as shown in FIG. 8 and includes:
  • the continuation service subset selection module 801 is configured to select a subset of the cluster that is allowed to continue to serve when the cluster has a brain split;
  • the node downtime control module 802 is configured to control the nodes in the other subsets except the subset that is only allowed to continue to service to stop working.
  • the structure of the continuation service subset selection module 801 is as shown in FIG. 9, and includes:
  • the main subset selection unit 8011 is configured to select a subset of the main nodes before the occurrence of the brain splitting as a desired main subset;
  • the only large subset selection unit 8012 is set to select the number of nodes larger than the cluster section before the occurrence of the brain splitting A subset of half the number of points as the only large subset;
  • the continuation service subset selection unit 8013 is arranged to select a subset that is uniquely allowed to continue from the desired primary subset and the unique large subset.
  • the continuation service subset selection module 801 further includes:
  • Representative node selection unit 8014 configured to assign a representative node from the desired primary subset, instructing the representative node to notify all nodes of the subset other than the desired primary subset to stop after the first delay time jobs.
  • the structure of the continuation service subset selection unit 8013 is as shown in FIG. 10, and includes:
  • a first selection sub-unit 1001 configured to select the desired primary subset as the only subset allowed to continue service when there is no unique large subset;
  • a second selection sub-unit 1002 configured to use that subset as the only subset allowed to continue service when the desired primary subset and the unique large subset are the same subset;
  • a third selection sub-unit 1003 configured to use the unique large subset as the only subset allowed to continue service when the desired primary subset and the unique large subset are different subsets.
  • the continuation service subset selection module 801 further includes:
  • a unique large subset representative selection unit 8015 configured to select a node from the unique large subset as the unique large subset representative, and to instruct that representative, upon determining that the unique large subset and the desired primary subset are different subsets, to notify all nodes of the subsets other than the unique large subset to stop working after zero delay or after a second delay time, the second delay time being less than the first delay time.
  • the device further includes:
  • the internal communication management module 803 is configured to interrupt communication between the node's underlying heartbeat communication and its upper-layer service control logic when the node detects that heartbeat line communication has been interrupted and, once the first time length has elapsed, to determine that a split brain has occurred and restore communication between the underlying heartbeat communication and the upper-layer service control logic.
  • the apparatus further includes:
  • the primary node election module 804 is configured to elect a node from the unique large subset as the new primary node; the election takes at most a second time length, counted from the moment the split brain is determined, and the second time length is less than the first time length.
  • the device further includes:
  • the storage module 805 is configured to maintain the current cluster member list, the number of members, and the cluster member change notification information.
  • the cluster split brain processing device can be integrated into the nodes of the cluster, where it sits between the underlying heartbeat communication and the upper-layer service control logic and carries out the cluster split brain processing method provided by the embodiment of the present invention.
  • Embodiments of the present invention provide a cluster split brain processing method and apparatus.
  • when a split brain occurs in the cluster, the only subset of the cluster allowed to continue service is selected, and the nodes in every other subset are controlled to stop working.
  • the orderly management of the cluster under split brain conditions is realized, and the problem of controlling the cluster after a split brain is solved.
  • all or part of the steps of the above embodiments may also be implemented by using integrated circuits. These steps may be fabricated separately into individual integrated circuit modules, or multiple modules or steps may be fabricated into a single integrated circuit module.
  • the devices/function modules/functional units in the above embodiments may be implemented by a general-purpose computing device, which may be centralized on a single computing device or distributed over a network of multiple computing devices.
  • when the device/function module/functional unit in the above embodiment is implemented in the form of a software function module and sold or used as a stand-alone product, it can be stored in a computer readable storage medium.
  • the above mentioned computer readable storage medium may be a read only memory, a magnetic disk or an optical disk or the like.
  • the embodiment of the invention realizes the orderly management of the cluster under split brain conditions and solves the problem of controlling the cluster after a split brain.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Hardware Redundancy (AREA)

Abstract

A cluster split brain processing method and apparatus relate to the field of computer application. When a cluster is split into a plurality of subsets due to the occurrence of split brain, the only subset allowing service continuation is chosen; and nodes in the subsets, except the only subset allowing service continuation, are controlled to stop working.

Description

Cluster split brain processing method and apparatus
Technical field
This document relates to the field of computer applications, and in particular to a cluster split brain processing method and apparatus.
Background
A high-availability cluster is a server clustering technology designed to reduce service downtime. A node that is running the service is called the active machine. A node that is not running the service but may later take over the service from the active machine is called a standby machine. When the active machine fails, a standby machine takes over and continues running the service, so that service is provided continuously.
The network interconnecting the nodes is called the heartbeat line. Over the heartbeat line, every node in the cluster can communicate with any other node, and the communication protocol also lets a node learn which nodes are currently in the cluster (the module that provides this communication function is referred to below as the "heartbeat communication module"). When a node finds that communication with another node has failed, the cause may be a heartbeat line fault or a fault of the peer node. Either way, the cluster may split into multiple subsets, a situation the industry calls "split brain". When the nodes in one subset cannot know why contact with the other subsets was lost, they must not guess the cause, let alone decide whether to run the service based on that guess (the module that controls starting or stopping the service is referred to below as the "service control logic module"); otherwise the cluster may lose its active machine or end up with multiple active machines.
Take a cluster of two nodes, A and B, as an example: A is running the service and B is the standby. When node B finds that it cannot communicate with A and guesses that the network has failed, B keeps its standby role. But if node A has actually failed, the cluster loses its active machine and the upper-layer application cannot continue to run. Conversely, if node B guesses that node A has failed, B takes over from A and runs the service. But if only the network has failed and A is still running normally, the cluster now has two active machines, A and B. Multiple active machines are something a cluster must also avoid at all costs, because they compete with each other for resources and, in severe cases, may corrupt data.
In summary, how to keep controlling the cluster when a split brain occurs remains a problem.
Summary of the invention
This document provides a cluster split brain processing method and apparatus that solve the problem of controlling a cluster after a split brain.
A cluster split brain processing method includes:
when a cluster splits into multiple subsets because of a split brain, selecting from the multiple subsets the only subset allowed to continue service;
controlling the nodes in every subset other than the only subset allowed to continue service to stop working.
Optionally, when the cluster splits into multiple subsets because of a split brain, selecting from the multiple subsets the only subset allowed to continue service includes:
selecting the subset that contains the node that was the primary node before the split brain occurred as the desired primary subset;
selecting the subset whose number of nodes is greater than half the number of cluster nodes before the split brain occurred as the unique large subset;
selecting, from the desired primary subset and the unique large subset, the only subset allowed to continue service.
Optionally, the method further includes:
when the cluster is initialized, setting aside a region of disk space on a shared medium as a decision disk, partitioning the decision disk, and mapping each node in the cluster to exactly one partition of the decision disk;
each node in the cluster writing the current timestamp into its corresponding partition of the decision disk through disk input/output (I/O) operations;
selecting, as the primary node, one of the nodes whose number of timestamp updates within a time range is greater than a threshold.
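The timestamp-based selection described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name pick_primary_node, the in-memory dictionary standing in for the decision disk partitions, and the lowest-name tie-break are all hypothetical, since the text only says that one of the qualifying nodes is selected.

```python
def pick_primary_node(partition_stamps, window_start, window_end, threshold):
    """Pick one node whose timestamp updates in the window exceed the threshold.

    partition_stamps: hypothetical stand-in for the decision disk, mapping each
    node name to the list of timestamps it wrote to its own partition.
    """
    qualifying = []
    for node, stamps in partition_stamps.items():
        updates = sum(1 for t in stamps if window_start <= t <= window_end)
        if updates > threshold:
            qualifying.append(node)
    # Assumption: break ties by lowest node name; the source does not say how.
    return min(qualifying) if qualifying else None


stamps = {"A": [1, 2, 3, 4], "B": [1, 4], "C": [2, 3, 4, 5]}
primary = pick_primary_node(stamps, window_start=0, window_end=5, threshold=2)
```

A node that stops writing timestamps, for example because it has crashed or lost access to the shared medium, falls below the threshold and is no longer a candidate.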
Optionally, the method further includes:
each node in the cluster, under normal conditions in which no split brain has occurred, periodically broadcasting or multicasting KeepAlive messages over an additional Ethernet network;
selecting, as the primary node, one of the nodes that sent the KeepAlive message more than a threshold number of times within a time range.
Optionally, after the step of selecting the subset that contains the node that was the primary node before the split brain occurred as the desired primary subset, the method further includes:
assigning a representative node from the desired primary subset, and instructing the representative node to notify, after a first delay time, all nodes of every subset other than the desired primary subset to stop working.
Optionally, selecting, from the desired primary subset and the unique large subset, the only subset allowed to continue service includes:
when there is no unique large subset, selecting the desired primary subset as the only subset allowed to continue service;
when the desired primary subset and the unique large subset are the same subset, using that subset as the only subset allowed to continue service;
when the desired primary subset and the unique large subset are different subsets, using the unique large subset as the only subset allowed to continue service.
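The three cases above can be sketched as a single function. The sketch below is illustrative only; the name select_serving_subset and the representation of subsets as frozensets of node names are assumptions, not part of the source.

```python
def select_serving_subset(subsets, primary_node, cluster_size):
    """Return the one subset allowed to continue service after a split brain.

    subsets: disjoint frozensets of node names, one per partition.
    primary_node: the primary node elected before the split brain.
    cluster_size: number of cluster nodes before the split brain.
    """
    desired = next((s for s in subsets if primary_node in s), None)
    large = next((s for s in subsets if 2 * len(s) > cluster_size), None)
    if large is None:
        return desired  # case 1: there is no unique large subset
    if large == desired:
        return desired  # case 2: both are the same subset
    return large        # case 3: the unique large subset wins


partitions = [frozenset("ABC"), frozenset("D")]
serving = select_serving_subset(partitions, primary_node="D", cluster_size=4)
```

Here the primary node D is isolated, so the three-node subset is chosen. With an even two-by-two split no subset holds more than half of the nodes, and the subset containing the primary node is chosen instead.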
Optionally, after the step of selecting the subset whose number of nodes is greater than half the number of cluster nodes before the split brain occurred as the unique large subset, the method further includes:
selecting a node from the unique large subset as the unique large subset representative;
instructing the unique large subset representative, upon determining that the unique large subset and the desired primary subset are different subsets, to notify all nodes of the subsets other than the unique large subset to stop working after zero delay or after a second delay time, the second delay time being less than the first delay time.
Optionally, the method further includes:
when a node detects that heartbeat line communication has been interrupted, interrupting communication between that node's underlying heartbeat communication and its upper-layer service control logic; once a first time length has elapsed, determining that a split brain has occurred and restoring communication between the underlying heartbeat communication and the upper-layer service control logic.
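One way to picture this gating step is a small state holder that closes the path to the service control logic when the heartbeat is lost and reopens it after the first time length. The class below is a hedged sketch: InternalCommGate, its method names, and the explicit `now` argument (used instead of a real clock so the behavior is deterministic) are all assumptions made for illustration.

```python
class InternalCommGate:
    """Gate between heartbeat communication and service control logic (sketch)."""

    def __init__(self, first_time_length):
        self.first_time_length = first_time_length
        self.is_open = True        # path to the service control logic
        self.split_brain = False   # set once the first time length has elapsed
        self._lost_at = None

    def heartbeat_lost(self, now):
        # Close the path on the first detected interruption only.
        if self._lost_at is None:
            self._lost_at = now
            self.is_open = False

    def tick(self, now):
        # After the first time length, declare split brain and reopen the path.
        if self._lost_at is not None and now - self._lost_at >= self.first_time_length:
            self.split_brain = True
            self.is_open = True
            self._lost_at = None
        return self.split_brain


gate = InternalCommGate(first_time_length=3.0)
gate.heartbeat_lost(now=10.0)
early = gate.tick(now=12.0)   # still inside the first time length
late = gate.tick(now=13.0)    # first time length reached
```

Holding the path closed for the first time length keeps the service control logic from reacting to the interruption before the split brain has been determined.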
Optionally, when the unique large subset is used as the only subset allowed to continue service, the method further includes:
electing a node from the unique large subset as the new primary node, wherein electing the new primary node takes at most a second time length, counted from the moment the split brain is determined, the second time length being less than the first time length.
Optionally, the method further includes:
maintaining, at each node of the cluster, the current cluster member list, the number of members, and cluster member change notification information.
A cluster split brain processing apparatus includes:
a continuation service subset selection module, configured to: when a cluster splits into multiple subsets because of a split brain, select from the multiple subsets the only subset allowed to continue service; and
a node shutdown control module, configured to: control the nodes in every subset other than the only subset allowed to continue service to stop working.
Optionally, the continuation service subset selection module includes:
a desired primary subset selection unit, configured to: select the subset that contains the node that was the primary node before the split brain occurred as the desired primary subset;
a unique large subset selection unit, configured to: select the subset whose number of nodes is greater than half the number of cluster nodes before the split brain occurred as the unique large subset; and
a continuation service subset selection unit, configured to: select, from the desired primary subset and the unique large subset, the only subset allowed to continue service.
Optionally, the continuation service subset selection module further includes:
a representative node selection unit, configured to: assign a representative node from the desired primary subset, and instruct the representative node to notify, after a first delay time, all nodes of every subset other than the desired primary subset to stop working.
Optionally, the continuation service subset selection unit includes:
a first selection sub-unit, configured to: when there is no unique large subset, select the desired primary subset as the only subset allowed to continue service;
a second selection sub-unit, configured to: when the desired primary subset and the unique large subset are the same subset, use that subset as the only subset allowed to continue service; and
a third selection sub-unit, configured to: when the desired primary subset and the unique large subset are different subsets, use the unique large subset as the only subset allowed to continue service.
Optionally, the continuation service subset selection module further includes:
a unique large subset representative selection unit, configured to: select a node from the unique large subset as the unique large subset representative, and instruct that representative, upon determining that the unique large subset and the desired primary subset are different subsets, to notify all nodes of the subsets other than the unique large subset to stop working after zero delay or after a second delay time, the second delay time being less than the first delay time.
Optionally, the apparatus further includes:
an internal communication management module, configured to: when a node detects that heartbeat line communication has been interrupted, interrupt communication between that node's underlying heartbeat communication and its upper-layer service control logic; once a first time length has elapsed, determine that a split brain has occurred and restore communication between the underlying heartbeat communication and the upper-layer service control logic.
Optionally, when the unique large subset is used as the only subset allowed to continue service, the apparatus further includes:
a primary node election module, configured to: elect a node from the unique large subset as the new primary node, wherein electing the new primary node takes at most a second time length, counted from the moment the split brain is determined, the second time length being less than the first time length.
Optionally, the apparatus further includes:
a storage module, configured to: maintain the current cluster member list, the number of members, and cluster member change notification information.
A computer readable storage medium stores program instructions which, when executed, implement the above method.
This document provides a cluster split brain processing method and apparatus. When a split brain occurs in a cluster, the only subset of the cluster allowed to continue service is selected, and the nodes in every other subset are controlled to stop working. This realizes the orderly management of the cluster under split brain conditions and solves the problem of controlling the cluster after a split brain.
Brief description of the drawings
FIG. 1 is a schematic diagram of a cluster split brain processing system according to Embodiment 1 of the present invention;
FIG. 2 is a schematic diagram of module cooperation and timing for the first-step decision method when the primary node fails;
FIG. 3 is a schematic diagram of module cooperation and timing for the first-step decision method when a non-primary node fails or the heartbeat line is broken;
FIG. 4 is a schematic diagram of module cooperation and timing for the second-step decision method when, after the heartbeat line is broken, a unique large subset is found and it is not the desired primary subset;
FIG. 5 is a schematic diagram of module cooperation and timing for the two-step decision method when, after the heartbeat line is broken, a unique large subset is found and it is the same as the desired primary subset;
FIG. 6 is a flowchart of a cluster split brain processing method according to Embodiment 2 of the present invention;
FIG. 7 is a detailed flowchart of step 601 in FIG. 6;
FIG. 8 is a schematic structural diagram of a cluster split brain processing apparatus according to Embodiment 3 of the present invention;
FIG. 9 is a schematic structural diagram of the continuation service subset selection module 801 in FIG. 8;
FIG. 10 is a schematic structural diagram of the continuation service subset selection unit 8013 in FIG. 9.
Embodiments of the invention
Take a cluster of two nodes, A and B, as an example: A is running the service and B is the standby. When node B finds that it cannot communicate with A and guesses that the network has failed, B keeps its standby role. But if node A has actually failed, the cluster loses its active machine and the upper-layer application cannot continue to run. Conversely, if node B guesses that node A has failed, B takes over from A and runs the service. But if only the network has failed and A is still running normally, the cluster now has two active machines, A and B. Multiple active machines are something a cluster must also avoid at all costs, because they compete with each other for resources and, in severe cases, may corrupt data.
To solve the above problem, embodiments of the present invention provide a cluster split brain processing method and apparatus. The embodiments are described in detail below with reference to the accompanying drawings. It should be noted that, where no conflict arises, the embodiments of this application and the features in the embodiments may be combined with one another arbitrarily.
Embodiment 1 of the present invention is described below with reference to the accompanying drawings.
This embodiment of the present invention provides a cluster split brain processing system. As shown in FIG. 1, the system includes an underlying heartbeat communication module, an upper-layer service control logic module, and a split brain decision module located between the two. The heartbeat communication module provides the split brain decision module with information such as the current cluster member list, the number of members, and cluster member change notifications (also called split brain events). From this information the split brain decision module determines which subset should continue running the service and reports the result to the service control logic module, which then performs the necessary service control operations, such as active/standby switchover.
First, after the cluster is divided into multiple subsets by a split brain, the split brain decision module is responsible for determining the one and only subset allowed to continue running the service, which is called the "primary subset". Services on the other subsets must be stopped (also referred to as being fenced); these subsets are called "secondary subsets". In this embodiment, a fenced node is immediately powered off or restarted so that it stops working. Second, the primary subset should, as far as possible, be the subset with the most nodes among all the split subsets, so that most nodes can keep working after the split brain. Third, after a split brain event occurs, the split brain decision module can determine the primary subset immediately from the information already available when the event occurs, so that the upper-layer service control logic can perform the active/standby switchover as soon as possible.
To achieve the above characteristics, the split brain decision module in this embodiment implements a two-step decision method. Step one: when a split brain event occurs, the "desired primary subset" is first determined immediately through an additional information channel. An additional information channel is any channel other than the heartbeat line over which cluster nodes can exchange information. Step two: if there is a subset whose number of nodes is greater than 50% of the number of cluster nodes before the split brain, it must be the largest of all the subsets (hereinafter the "unique large subset"). If the unique large subset is not the desired primary subset determined in step one, the unique large subset immediately replaces it and is determined to be the final subset that may continue running the service (hereinafter the "primary subset"). If step two finds no unique large subset, or the unique large subset is the very desired primary subset found in step one, the step-two decision takes no action and the desired primary subset is determined to be the final primary subset.
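The claim that a subset holding more than 50% of the pre-split nodes must be the largest, and that there can be only one such subset, can be checked with a short sketch. The function name find_unique_large_subset and the frozenset representation are assumptions for illustration.

```python
def find_unique_large_subset(subsets, cluster_size):
    """Return the subset holding more than half of the pre-split nodes, or None."""
    large = [s for s in subsets if 2 * len(s) > cluster_size]
    # Disjoint subsets cannot both exceed half of the cluster,
    # so more than one match means the input was not a valid partition.
    if len(large) > 1:
        raise ValueError("subsets must be disjoint partitions of the cluster")
    return large[0] if large else None


majority = find_unique_large_subset([frozenset("ABC"), frozenset("D")], 4)
no_majority = find_unique_large_subset([frozenset("AB"), frozenset("CD")], 4)
```

Because the test uses only the size of a node's own subset and the cluster size known before the split, a node can evaluate it without any information from the other subsets.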
The first-step decision method is implemented by a first-step decision sub-module, and the second-step decision by a second-step decision sub-module. To ensure that the first-step decision works correctly, this embodiment requires the additional information channel to provide the first-step decision sub-module with the information exchange capabilities described in 1.1) to 1.3) below:
1.1) Under normal conditions in which no split brain has occurred, every node that can access the additional information channel normally uses that channel to show the other nodes in the cluster that it is in a normal state. A node confirmed through the additional information channel to be in a normal state is called a "healthy" node.
1.2) Under normal conditions in which no split brain has occurred, all healthy nodes use the additional information channel to elect exactly one node of the cluster as the "primary node". The primary node is necessarily a healthy node, but a healthy node is not necessarily the primary node.
1.3) Re-election of the primary node: if the split brain is caused by a failure of the primary node and a new primary node has not yet been elected when the first-step split brain decision is made, the first-step decision cannot find a primary node and loses its ability to decide. To avoid this, the underlying heartbeat communication module guarantees a minimum time interval T1 between the communication interruption and the reporting of the split brain event to the split brain decision module. Let T2 be the maximum time the additional information channel needs from losing the primary node to electing a new one. As long as T2 < T1 is guaranteed, the new primary node has already been re-elected before the split brain event caused by the primary node failure is reported, which ensures the correctness of the first-step split brain decision. If the split brain is caused by a non-primary node failure or a broken heartbeat line, no re-election of the primary node is involved.
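The timing argument in 1.3) can be written out as a small check. This is only a worked illustration: the function and parameter names are hypothetical, and the failure time t_fail is introduced here purely to place T1 and T2 on a common timeline.

```python
def primary_known_at_report(t_fail, t1_min_report_delay, t2_max_election_time):
    """Check that re-election finishes before the split brain event is reported.

    t_fail: time at which the primary node fails (illustrative reference point).
    t1_min_report_delay: T1, minimum delay from interruption to event report.
    t2_max_election_time: T2, maximum time to elect a new primary node.
    """
    earliest_report = t_fail + t1_min_report_delay
    latest_election_done = t_fail + t2_max_election_time
    return latest_election_done < earliest_report


ok = primary_known_at_report(t_fail=100.0, t1_min_report_delay=5.0, t2_max_election_time=2.0)
bad = primary_known_at_report(t_fail=100.0, t1_min_report_delay=2.0, t2_max_election_time=5.0)
```

The result does not depend on t_fail; it reduces to the condition T2 < T1 stated in the text.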
Items 1.4) to 1.7) below describe how, after a split brain event occurs, the first-step decision method reaches its decision using the already established information about the current primary node:
1.4) When a split brain occurs, the subset containing the primary node is immediately determined to be the "desired primary subset", and the other subsets are determined to be secondary subsets. This result is not yet reported to the service control logic.
1.5) The split brain decision module of each node in the desired primary subset obtains the new member list, that is, the member list of the desired primary subset, from the split brain event message reported by the underlying heartbeat communication module.
1.6) The desired primary subset assigns a representative node from the new member list. This node performs a delayed Fence operation on the other, secondary subsets, that is, it makes the other subsets stop working so that multiple active machines cannot arise. The Fence operation is preceded by a delay (delay time Td) so that it is certain to be slower than the zero-delay Fence that may occur in the second-step decision below, which gives the subsets time to carry out the second-step decision. Td must therefore be greater than the time the second-step decision takes.
1.7) If the second-step decision takes no effect, then after the time Td the representative of the desired primary subset performs the Fence operation, and the desired primary subset is finally determined to be the primary subset. The result is reported to the service control logic.
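The interplay between the delayed Fence of 1.6) and the zero-delay Fence of the second step can be sketched as a race between two timers. The sketch is an assumption-laden model, not the source's implementation: surviving_subset and its parameters are hypothetical names, and real fencing would power off or restart nodes rather than return a value.

```python
def surviving_subset(desired, large, t_d, second_step_delay=0.0):
    """Return the subset left running after the Fence timers race.

    desired: the desired primary subset from the first-step decision.
    large: the unique large subset, or None if no subset holds a majority.
    t_d: delay Td before the desired primary subset's representative fences.
    second_step_delay: delay used by the unique large subset's representative.
    """
    if t_d <= second_step_delay:
        raise ValueError("Td must exceed the second-step delay")
    timers = [(t_d, desired)]
    if large is not None and large != desired:
        timers.append((second_step_delay, large))
    # The earliest timer fires first; its Fence stops every other subset,
    # so the later timer never gets to run.
    _, winner = min(timers, key=lambda timer: timer[0])
    return winner


winner = surviving_subset(frozenset("D"), frozenset("ABC"), t_d=5.0)
fallback = surviving_subset(frozenset("D"), None, t_d=5.0)
```

When no second-step timer exists, the delayed Fence of the desired primary subset is the only one that fires, which corresponds to item 1.7).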
FIG. 2 shows the module cooperation and timing of the first-step decision method when the primary node fails. FIG. 3 shows the module cooperation and timing of the first-step decision method when a non-primary node fails or the heartbeat line is broken.
The first-step split brain decision described above is already sufficient to prevent both multiple active machines and the loss of the active machine. However, another problem may still arise: most of the nodes that are able to work lose the active role, so the cluster loses most of its computing capacity. Consider a split brain in a cluster of more than two nodes. Suppose four nodes A, B, C and D form the cluster {A, B, C, D}, configured with an additional information channel and a Fence function that meet the requirements of the first-step decision method. Suppose further that a heartbeat line fault causes a split brain while all nodes are still working normally. As a result, A, B and C form one subset and D forms another, so the cluster splits into the two subsets {A, B, C} and {D}. If D happens to be the primary node, the first-step decision determines {D} to be the primary subset and {A, B, C} to be a secondary subset. In the end, the 3/4 of the computing capacity represented by the subset {A, B, C}, which could have kept working, is excluded from the cluster, which is a large waste of computing capacity.
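The 3/4 figure in this example can be reproduced directly. The sketch below is illustrative; capacity_lost is a hypothetical helper that assumes every node contributes equal computing capacity, which the example implies but does not state.

```python
from fractions import Fraction


def capacity_lost(subsets, serving):
    """Fraction of nodes fenced when only `serving` keeps running."""
    total = sum(len(s) for s in subsets)
    return Fraction(total - len(serving), total)


partitions = [frozenset("ABC"), frozenset("D")]
# First-step decision alone: the subset containing primary node D is kept.
first_step_choice = next(s for s in partitions if "D" in s)
lost_first_step = capacity_lost(partitions, first_step_choice)
# Keeping the three-node subset instead would fence only node D.
lost_if_majority_kept = capacity_lost(partitions, frozenset("ABC"))
```

Comparing the two values shows why a further decision step that prefers the larger subset is worthwhile.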
The second-step decision method is described in detail below. It starts immediately after the first-step decision, and its purpose is, wherever possible, to let the subset with the most nodes replace the expected primary subset chosen in the first step as the final primary subset. The second-step decision relies on information that is already available, namely the latest membership, and proceeds as follows:
2.1) Under normal conditions, with no split brain, every node records the member list and member count of the current cluster. This information was already provided by the underlying heartbeat communication module when the previous split-brain event occurred.
2.2) After a split brain occurs, the second-step decision module on every node of every subset also obtains the membership and node count of its own subset from the split-brain event message reported by the underlying heartbeat communication module. If the node count of a subset exceeds 50% of the original cluster, all nodes in that subset can immediately conclude that it is the single subset with the most nodes, the "unique large subset".
2.3) The unique large subset selects one representative node from the new member list, called the unique-large-subset representative.
2.4) If the unique-large-subset representative finds that its subset is not the expected primary subset chosen in the first step, it immediately performs a zero-delay Fence operation that stops every node outside the unique large subset. The zero-delay Fence necessarily precedes the delayed Fence issued by the expected-primary-subset representative in 1.6), so it is certain to stop the expected primary subset. Step 1.7) above is therefore never executed.
2.5) The unique large subset is finally determined to be the primary subset. The result is reported to the service control logic.
2.6) Because the other subsets have stopped working and are no longer healthy, they lose the ability to compete for the primary node. After time T2, a new primary node will necessarily have been re-elected within the unique large subset, in preparation for the next split-brain decision. Given the time constraint in 1.3) above, this re-election of the primary node will not be interrupted by a new split-brain event.
2.7) If no unique large subset is found in step 2.2), or if step 2.4) finds that the unique large subset is the expected primary subset, the second-step split-brain decision does not need to act: steps 2.5) and 2.6) are skipped, and step 1.7) of the first-step decision is executed.
Figure 4 shows the module cooperation and timing of the second-step decision method when, after the heartbeat line is broken, a unique large subset is found and it is not the expected primary subset. Figure 5 shows the module cooperation and timing of the second-step decision method when, after the heartbeat line is broken, a unique large subset is found and it is the same as the expected primary subset.
Applying the second-step decision method to the four-node cluster example above: {A,B,C}, which represents 3/4 of the computing capacity, is the unique large subset, so it replaces {D} as the final primary subset and becomes the new cluster that continues to work. Assuming A is the unique-large-subset representative, node D is fenced by A before it has a chance to fence {A,B,C}.
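The two-step decision can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the code takes a global view of all subsets for clarity, whereas in the method each node sees only its own subset and the original member count.

```python
from typing import FrozenSet, Iterable, List, Optional

Subset = FrozenSet[str]


def find_unique_large_subset(subsets: Iterable[Subset],
                             original_size: int) -> Optional[Subset]:
    """Return the subset holding more than 50% of the original nodes, if any."""
    for subset in subsets:
        # Strict majority: at most one subset can satisfy this test.
        if 2 * len(subset) > original_size:
            return subset
    return None


def final_primary_subset(subsets: List[Subset], original_size: int,
                         primary_node: str) -> Subset:
    """Combine the first-step and second-step decisions."""
    # First step: the subset containing the pre-split primary node.
    expected = next(s for s in subsets if primary_node in s)
    # Second step: a different majority subset overrides the first step.
    large = find_unique_large_subset(subsets, original_size)
    if large is not None and large != expected:
        return large
    return expected


# The example from the text: {A,B,C} and {D}, with D as the primary node.
subsets = [frozenset("ABC"), frozenset("D")]
winner = final_primary_subset(subsets, original_size=4, primary_node="D")
print(sorted(winner))  # ['A', 'B', 'C']
```

With an even split such as {A,B} and {C,D}, no subset exceeds 50%, so the sketch falls back to the first-step result.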
The Fence mechanism may be a node-level Fence based on power management: a fenced node stops running services because it loses power.
In a specific embodiment, the Fence may also be a node-level Fence based on a kernel panic: a fenced node stops running services because its CPU stops working.
In short, the Fence mechanism is not limited to the two mechanisms above; any technical means by which every node in the cluster can fence any other node falls within the protection scope of the present invention.
In a specific embodiment, the additional information channel may be implemented with a decision disk on a shared storage medium. To determine the primary node, the decision disk and the first-step decision submodule interact as follows:
1) When the cluster is initialized, a region of disk space is allocated on a shared medium (for example iSCSI, AOE or SAN) as the decision disk. The decision disk is divided into a number of blocks. Every node in the cluster is assigned a node ID that increases from zero. Using this ID as an index, each node maps to exactly one block (blocks are also indexed from zero).
2) Under normal conditions, with no split brain, every node that can access the decision disk writes the current timestamp into its own block through disk I/O operations. The other nodes in the cluster judge whether a node is healthy by whether its timestamp keeps changing. A node that fails to update its timestamp for a long time is considered unhealthy.
3) Every node in the cluster is configured with the same primary-node selection rule. For example, under normal conditions with no split brain, all healthy nodes regard the healthy node with the smallest index as the sole primary node; the node with the largest index could be chosen instead. The present invention places no limitation on this: any implementation that selects a single healthy node as the primary node falls within the protection scope of the present invention.
4) When a node failure or heartbeat-line break occurs and the failed node happens to be the primary node, a new primary node is selected within time T2.
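The decision-disk interaction above can be illustrated with a small in-memory model. This is a sketch, not the patented implementation: the dictionary stands in for the disk blocks, and `max_age` is an assumed way of expressing "has not updated its timestamp for a long time".

```python
from typing import Dict, List, Optional


def healthy_nodes(blocks: Dict[int, float], now: float,
                  max_age: float) -> List[int]:
    """Node IDs whose block timestamp was refreshed within max_age seconds."""
    return sorted(nid for nid, ts in blocks.items() if now - ts <= max_age)


def select_primary(blocks: Dict[int, float], now: float,
                   max_age: float) -> Optional[int]:
    """Pick the healthy node with the smallest ID (one rule the text allows)."""
    healthy = healthy_nodes(blocks, now, max_age)
    return healthy[0] if healthy else None


# Block i of the decision disk holds the last timestamp written by node i.
# Node 0 stopped updating at t=100, so it is treated as unhealthy at t=110.
blocks = {0: 100.0, 1: 109.5, 2: 110.0}
print(select_primary(blocks, now=110.0, max_age=3.0))  # 1
```

Because every node applies the same rule to the same disk contents, all healthy nodes arrive at the same primary node without exchanging messages over the heartbeat line.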
In a specific embodiment, the additional information channel may also be implemented with an additional Ethernet network (not the heartbeat line). To determine the primary node, the additional Ethernet network and the first-step decision submodule interact as follows:
2) Every node in the cluster is configured with the same primary-node selection rule. For example, under normal conditions with no split brain, all healthy nodes regard the healthy node with the smallest MAC address or IP address as the sole primary node. The present invention places no limitation on this: any implementation that selects a single healthy node as the primary node falls within the protection scope of the present invention.
3) When a node failure or heartbeat-line break occurs and the failed node happens to be the primary node, a new primary node is selected within time T2.
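The "smallest IP address" rule needs a numeric comparison rather than a string comparison. A minimal sketch using Python's standard `ipaddress` module follows; the function name is hypothetical, and all addresses are assumed to be of the same IP version.

```python
import ipaddress
from typing import Iterable, Optional


def select_primary_by_ip(healthy_ips: Iterable[str]) -> Optional[str]:
    """Return the numerically smallest address among the healthy nodes."""
    ips = list(healthy_ips)
    if not ips:
        return None
    # Compare parsed addresses; plain string comparison would rank
    # "10.0.0.10" before "10.0.0.9".
    return min(ips, key=ipaddress.ip_address)


print(select_primary_by_ip(["10.0.0.10", "10.0.0.9"]))  # 10.0.0.9
```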
In short, the additional information channel is not limited to the two implementations above; any implementation that can determine whether a node is healthy and can select a primary node falls within the protection scope of the present invention, and the embodiments of the present invention place no limitation on this.
In a specific embodiment, the heartbeat communication module may use, but is not limited to, the Totem multicast communication protocol.
In a specific embodiment, the service control logic module may use, but is not limited to, Pacemaker or the AMF of OpenAIS.
Steps a.1) to a.4) below are the decision made by the first-step decision method after a split-brain event, using the already-established information of which node is currently the primary node:
a.1) After a split brain occurs, the subset containing the primary node is immediately determined to be the expected primary subset, and the other subsets are determined to be secondary subsets.
a.2) The split-brain decision module on every node of the expected primary subset obtains the new member list, that is, the member list of the expected primary subset, from the split-brain event message reported by the underlying heartbeat communication module.
a.3) The expected primary subset designates the node with the smallest IP address in its member list as the expected-primary-subset representative, which performs a delayed Fence operation on the other, secondary subsets. The delay is Td.
a.4) After time Td, the expected-primary-subset representative performs the Fence operation, and the expected primary subset is finally determined to be the primary subset. Every node of the expected primary subset reports this decision result to its own service control logic module.
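Steps a.1) to a.3) amount to building a delayed Fence plan. The sketch below assumes nodes are identified by their IP addresses; `FencePlan` and `first_step_plan` are hypothetical names, and the value of Td is a placeholder.

```python
import ipaddress
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class FencePlan:
    representative: str      # node that will issue the Fence
    targets: FrozenSet[str]  # nodes outside the subset, to be fenced
    delay: float             # wait before fencing (Td)


def first_step_plan(my_subset: Iterable[str], original_members: Iterable[str],
                    primary: str, td: float) -> Optional[FencePlan]:
    """Return the delayed-Fence plan if my_subset is the expected primary subset."""
    mine = frozenset(my_subset)
    if primary not in mine:
        return None  # secondary subset: issues no Fence in the first step
    representative = min(mine, key=ipaddress.ip_address)
    return FencePlan(representative, frozenset(original_members) - mine, td)


members = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
plan = first_step_plan({"10.0.0.3", "10.0.0.4"}, members,
                       primary="10.0.0.4", td=5.0)
print(plan.representative, sorted(plan.targets))
# 10.0.0.3 ['10.0.0.1', '10.0.0.2']
```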
Steps b.1) to b.6) below are the second-step decision method, which starts immediately after the first-step decision.
b.1) Every subset compares its member count with the member count of the original cluster: if the subset has more than 50% of the original cluster's members, it regards itself as the unique large subset.
b.2) If no subset is the unique large subset, the second-step decision ends immediately. Go to step a.4).
b.3) If a subset finds that it is the unique large subset but it is also the expected primary subset, the second-step decision ends immediately. Go to step a.4).
b.4) If a subset finds that it is the unique large subset and is not the expected primary subset, the unique large subset designates the node with the smallest IP address in its member list to immediately perform a zero-delay node-level Fence operation on all nodes outside the unique large subset, stopping them. Because the representative of the expected primary subset has been fenced, step a.4) above is not executed.
b.5) The unique large subset is finally determined to be the primary subset after the cluster split brain. The result is reported to the service control logic module of every node in the unique large subset.
b.6) Because the other subsets have stopped working and can no longer access the decision disk, the additional information channel sees them as unhealthy, so they lose the ability to compete for the primary node. After time T2, a new primary node will necessarily have been re-elected within the unique large subset, in preparation for the next split-brain decision.
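The reason step a.4) never runs in case b.4) is a timing race: the zero-delay Fence fires before the delayed one, and a fenced node cannot issue its pending Fence. The following event simulation illustrates that ordering; `run_fences` is a hypothetical helper and the delay of 5.0 is an arbitrary stand-in for Td.

```python
import heapq
from typing import Iterable, Set, Tuple

FenceRequest = Tuple[float, str, Tuple[str, ...]]  # (fire_time, issuer, targets)


def run_fences(requests: Iterable[FenceRequest]) -> Set[str]:
    """Process Fence requests in time order and return the stopped nodes.

    A request is dropped if its issuer has already been stopped.
    """
    queue = list(requests)
    heapq.heapify(queue)
    stopped: Set[str] = set()
    while queue:
        _, issuer, targets = heapq.heappop(queue)
        if issuer in stopped:
            continue  # the issuer was fenced first; its Fence never runs
        stopped.update(targets)
    return stopped


requests = [
    (5.0, "D", ("A", "B", "C")),  # delayed Fence from the expected primary subset
    (0.0, "A", ("D",)),           # zero-delay Fence from the unique large subset
]
print(run_fences(requests))  # {'D'}
```

If the zero-delay request is absent (no unique large subset exists), the delayed Fence runs at Td and the expected primary subset survives, which matches steps b.2) and b.3).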
Embodiment 2 of the present invention is described below with reference to the accompanying drawings.
This embodiment of the present invention provides a cluster split-brain processing method. The method can be applied in the node shown in Figure 1, where it is carried out by the split-brain decision module. The flow for managing and controlling the cluster during a split brain using this method is shown in Figure 6 and includes:
Step 601: when a split brain occurs in the cluster, select the only subset in the cluster that is allowed to continue serving;
In this embodiment of the present invention, every node of the cluster maintains the current cluster member list, the member count and cluster membership change notification information. Optionally, this information may be maintained by the heartbeat communication module in Figure 1.
After a split brain occurs in the cluster, multiple subsets are formed. One of them must be selected as the only subset allowed to continue serving, and the other subsets must stop working. This step is shown in Figure 7 and includes:
Step 6011: select the subset containing the node that was the primary node before the split brain as the expected primary subset;
In this step, every node knows the previous primary node from the heartbeat-line communication between nodes before the split brain; when the split brain occurs, the subset containing that primary node is selected as the expected primary subset.
The primary node is selected in one of the following ways:
1. When the cluster is initialized, a region of disk space is allocated on a shared medium as the decision disk and the decision disk is partitioned; every node in the cluster is mapped to exactly one partition of the decision disk and writes the current timestamp into its partition through disk I/O operations.
Then one of the nodes that keep updating their timestamps is selected as the primary node. A node that keeps updating its timestamp has a working connection and is a healthy node, so the primary node can be chosen from among such nodes. The selection rule can be configured as required, with the same rule configured on all nodes in the cluster. Here, keeping the timestamp updated may mean that the number of timestamp updates within a time range exceeds a threshold.
If the primary node fails, the failed node is excluded and a new primary node is selected from the remaining healthy nodes.
2. Under normal conditions, with no split brain, every node in the cluster periodically broadcasts or multicasts KeepAlive messages over an additional Ethernet network.
Then one of the nodes that keep sending KeepAlive messages is selected as the primary node. A node that keeps sending KeepAlive messages has a working connection and is a healthy node, so the primary node can be chosen from among such nodes. The selection rule can be configured as required, with the same rule configured on all nodes in the cluster. Here, keeping the KeepAlive messages flowing may mean that the number of KeepAlive messages sent within a time range exceeds a threshold.
If the primary node fails, the failed node is excluded and a new primary node is selected from the remaining healthy nodes.
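Both options use the same health test: the number of observed updates (timestamp writes or KeepAlive messages) within a time range must exceed a threshold. A minimal sketch, with hypothetical parameter names and arbitrary example values:

```python
from typing import Iterable


def is_healthy(event_times: Iterable[float], now: float,
               window: float, threshold: int) -> bool:
    """True if more than `threshold` events fall within (now - window, now]."""
    recent = [t for t in event_times if now - window < t <= now]
    return len(recent) > threshold


# A node that sent five KeepAlive messages in the last 5 seconds is healthy;
# one whose messages mostly predate the window is not.
print(is_healthy([6, 7, 8, 9, 10], now=10, window=5, threshold=3))    # True
print(is_healthy([1, 2, 3, 4, 5, 9.5], now=10, window=5, threshold=3))  # False
```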
Step 6012: designate a representative node from the expected primary subset and instruct the representative node to notify, after a first delay time, all nodes of every subset other than the expected primary subset to stop working.
Step 6013: select the subset whose node count is greater than half of the cluster's node count before the split brain as the unique large subset;
This step is optional. When a subset exists whose node count is greater than half of the total node count of the cluster before the split brain, that subset is taken as the unique large subset.
Step 6014: select one node from the unique large subset as the unique-large-subset representative;
This step is optional: when step 6013 determines that a unique large subset exists, this step selects one node of that subset as the unique-large-subset representative.
Step 6015: instruct the unique-large-subset representative, when it determines that the unique large subset and the expected primary subset are different subsets, to notify all nodes of the subsets other than the unique large subset to stop working, either with zero delay or after a second delay time, the second delay time being shorter than the first delay time;
This step is optional and is performed when a unique large subset exists.
In this step, the second delay time is shorter than the first delay time. This ensures that the expected primary subset can issue its notification telling the other subsets to stop working only after the check for a unique large subset has completed. It therefore cannot happen that a unique large subset exists but, before it has been identified, the expected primary subset tells its nodes to stop working and processing capacity is lost.
Step 6016: from the expected primary subset and the unique large subset, select the only subset allowed to continue serving;
This step covers the following cases:
when no unique large subset exists, the expected primary subset is selected as the only subset allowed to continue serving;
when the expected primary subset and the unique large subset are the same subset, that subset is the only subset allowed to continue serving;
when the expected primary subset and the unique large subset are different subsets, the unique large subset is the only subset allowed to continue serving.
In addition, when a split brain is determined to have occurred, communication between the node's underlying heartbeat communication and the upper-layer service control logic must be interrupted, and restored once a first time length has elapsed. The purpose is to ensure that the upper-layer service control logic does not respond to the cluster split-brain event during the first time length, which gains time, between the lower layer and the upper layer, to complete the selection of the only subset allowed to continue serving.
While communication between the underlying heartbeat communication and the upper-layer service control logic is interrupted, the selection of the only subset allowed to continue serving is completed. When the unique large subset is taken as the only subset allowed to continue serving, a new primary node must also be selected. Optionally, one node is elected from the unique large subset as the new primary node; the election takes from the moment the split brain is determined until a second time length has elapsed, and the second time length is shorter than the first time length. In this way, communication between the underlying heartbeat communication and the upper-layer service control logic is restored only after the new primary node has been selected, so the upper-layer service control logic obtains the new primary node information directly. This avoids the management confusion that would arise if each node judged its own running state independently.
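One way to picture the interruption is as a gate that buffers membership events until the first time length has elapsed. The class below is only a sketch under that reading; `EventGate`, its method names, and the use of explicit `now` values instead of a real clock are all assumptions made for illustration.

```python
from typing import Any, List, Optional


class EventGate:
    """Holds events from the heartbeat layer until the hold time has elapsed."""

    def __init__(self, hold_time: float, election_time: float) -> None:
        # The text requires the second time length (primary re-election)
        # to be shorter than the first time length (hold time).
        if not election_time < hold_time:
            raise ValueError("election_time must be shorter than hold_time")
        self.hold_time = hold_time
        self._opens_at: Optional[float] = None
        self._pending: List[Any] = []

    def on_split_detected(self, now: float) -> None:
        """Close the gate for hold_time starting at `now`."""
        self._opens_at = now + self.hold_time

    def submit(self, event: Any, now: float) -> List[Any]:
        """Queue an event and return whatever may be delivered upward now."""
        self._pending.append(event)
        return self.flush(now)

    def flush(self, now: float) -> List[Any]:
        """Return buffered events if the gate is open, else an empty list."""
        if self._opens_at is not None and now < self._opens_at:
            return []
        ready, self._pending = self._pending, []
        return ready


gate = EventGate(hold_time=10.0, election_time=4.0)
gate.on_split_detected(now=0.0)
print(gate.submit("membership-changed", now=1.0))  # []
print(gate.flush(now=10.0))                        # ['membership-changed']
```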
Step 602: control the nodes in the subsets other than the only subset allowed to continue serving to stop working;
In this step, a node in the only subset allowed to continue serving may notify the nodes in the other subsets to stop working.
Embodiment 3 of the present invention is described below with reference to the accompanying drawings.
This embodiment of the present invention provides a cluster split-brain processing apparatus. The structure of the apparatus is shown in Figure 8 and includes:
a continuing-service subset selection module 801, configured to select, when a split brain occurs in the cluster, the only subset in the cluster that is allowed to continue serving;
a node stop control module 802, configured to control the nodes in the subsets other than the only subset allowed to continue serving to stop working.
Optionally, the structure of the continuing-service subset selection module 801 is shown in Figure 9 and includes:
an expected primary subset selection unit 8011, configured to select the subset containing the node that was the primary node before the split brain as the expected primary subset;
a unique large subset selection unit 8012, configured to select the subset whose node count is greater than half of the cluster's node count before the split brain as the unique large subset;
a continuing-service subset selection unit 8013, configured to select, from the expected primary subset and the unique large subset, the only subset allowed to continue serving.
Optionally, the continuing-service subset selection module 801 further includes:
a representative node selection unit 8014, configured to designate a representative node from the expected primary subset and instruct the representative node to notify, after a first delay time, all nodes of the subsets other than the expected primary subset to stop working.
Optionally, the structure of the continuing-service subset selection unit 8013 is shown in Figure 10 and includes:
a first selection subunit 1001, configured to select the expected primary subset as the only subset allowed to continue serving when no unique large subset exists;
a second selection subunit 1002, configured to take the subset as the only subset allowed to continue serving when the expected primary subset and the unique large subset are the same subset;
a third selection subunit 1003, configured to take the unique large subset as the only subset allowed to continue serving when the expected primary subset and the unique large subset are different subsets.
Optionally, the continuing-service subset selection module 801 further includes:
a unique-large-subset representative selection unit 8015, configured to select one node from the unique large subset as the unique-large-subset representative and to instruct the unique-large-subset representative, when it determines that the unique large subset and the expected primary subset are different subsets, to notify all nodes of the subsets other than the unique large subset to stop working, either with zero delay or after a second delay time, the second delay time being shorter than the first delay time.
Optionally, the apparatus further includes:
an internal communication management module 803, configured to interrupt, when the node detects that heartbeat-line communication has been interrupted, communication between the node's underlying heartbeat communication and the upper-layer service control logic, and, once a first time length has elapsed, to determine that a split brain has occurred and restore communication between the underlying heartbeat communication and the upper-layer service control logic.
Optionally, when the unique large subset is taken as the only subset allowed to continue serving, the apparatus further includes:
a primary node election module 804, configured to elect one node from the unique large subset as the new primary node, where the election takes from the moment the split brain is determined until a second time length has elapsed, and the second time length is shorter than the first time length.
Optionally, the apparatus further includes:
a storage module 805, configured to maintain the current cluster member list, the member count and cluster membership change notification information.
The cluster split-brain processing apparatus described above may be integrated into the nodes of the cluster, between the underlying heartbeat communication and the upper-layer service control logic, with the node performing the corresponding functions in accordance with the cluster split-brain processing method provided by the embodiments of the present invention.
The embodiments of the present invention provide a cluster split-brain processing method and apparatus. When a split brain occurs in a cluster, the only subset in the cluster that is allowed to continue serving is selected, and the nodes in the other subsets are controlled to stop working. This achieves orderly management of the cluster during a split brain and solves the problem of controlling the cluster after a split brain.
A person of ordinary skill in the art will understand that all or some of the steps of the above embodiments may be implemented as a computer program flow. The computer program may be stored in a computer-readable storage medium and executed on a corresponding hardware platform (such as a system, equipment, apparatus or device); when executed, it performs one or a combination of the steps of the method embodiments.
Optionally, all or some of the steps of the above embodiments may also be implemented with integrated circuits. The steps may each be made into a separate integrated circuit module, or several of the modules or steps may be made into a single integrated circuit module. The apparatuses, functional modules and functional units in the above embodiments may be implemented with general-purpose computing devices; they may be concentrated on a single computing device or distributed over a network of multiple computing devices.
When the apparatuses, functional modules and functional units in the above embodiments are implemented as software functional modules and sold or used as independent products, they may be stored in a computer-readable storage medium. The computer-readable storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc or the like.
The protection scope of the present invention shall be subject to the protection scope of the claims.
Industrial applicability
The embodiments of the present invention achieve orderly management of a cluster during a split brain and solve the problem of controlling the cluster after a split brain.

Claims (19)

  1. A cluster split-brain processing method, comprising:
    when a split brain occurs in a cluster and the cluster splits into a plurality of subsets, selecting from the plurality of subsets the only subset allowed to continue serving;
    controlling the nodes in the subsets other than the only subset allowed to continue serving to stop working.
  2. The cluster split-brain processing method according to claim 1, wherein, when a split brain occurs in the cluster and the cluster splits into a plurality of subsets, selecting from the plurality of subsets the only subset allowed to continue serving comprises:
    selecting the subset containing the node that was the primary node before the split brain as the expected primary subset;
    selecting the subset whose node count is greater than half of the cluster's node count before the split brain as the unique large subset;
    selecting, from the expected primary subset and the unique large subset, the only subset allowed to continue serving.
  3. The cluster split brain processing method according to claim 2, further comprising:
    when the cluster is initialized, allocating a block of disk space on a shared medium as a decision disk, partitioning the decision disk, and mapping each node in the cluster uniquely to one partition of the decision disk;
    each node in the cluster writing a current timestamp into its corresponding partition of the decision disk through disk input/output (I/O) operations; and
    selecting, as the primary node, one of the nodes whose number of timestamp updates within a time range is greater than a threshold.
  4. The cluster split brain method according to claim 2, further comprising:
    under normal conditions in which no split brain has occurred, each node in the cluster periodically broadcasting or multicasting a KeepAlive message over an additional Ethernet network; and
    selecting, as the primary node, one of the nodes whose number of KeepAlive messages sent within a time range is greater than a threshold.
  5. The cluster split brain processing method according to claim 2, wherein, after the step of selecting the subset that contains the primary node from before the split brain occurred as the expected primary subset, the method further comprises:
    designating a representative node from the expected primary subset, and instructing the representative node to notify, after a first delay time, all nodes of every subset other than the expected primary subset to stop working.
  6. The cluster split brain processing method according to claim 5, wherein selecting, from the expected primary subset and the unique large subset, the only subset allowed to continue serving comprises:
    when no unique large subset exists, selecting the expected primary subset as the only subset allowed to continue serving;
    when the expected primary subset and the unique large subset are the same subset, taking that subset as the only subset allowed to continue serving; and
    when the expected primary subset and the unique large subset are different subsets, taking the unique large subset as the only subset allowed to continue serving.
  7. The cluster split brain processing method according to claim 6, wherein, after the step of selecting a subset whose number of nodes is greater than half the number of cluster nodes before the split brain occurred as the unique large subset, the method further comprises:
    selecting one node from the unique large subset as the unique large subset representative; and
    instructing the unique large subset representative, upon determining that the unique large subset and the expected primary subset are different subsets, to notify, with zero delay or after a second delay time, all nodes of the subsets other than the unique large subset to stop working, the second delay time being less than the first delay time.
  8. The cluster split brain processing method according to claim 6, further comprising:
    when a node detects that heartbeat line communication has been interrupted, interrupting the communication between the node's underlying heartbeat communication and its upper-layer service control logic; and, once a first time length has elapsed, determining that a split brain has occurred and restoring the communication between the underlying heartbeat communication and the upper-layer service control logic.
  9. The cluster split brain processing method according to claim 8, wherein, when the unique large subset is taken as the only subset allowed to continue serving, the method further comprises:
    electing one node from the unique large subset as a new primary node, wherein the time taken to elect the new primary node runs from the moment the split brain is determined to have occurred until a second time length has elapsed, the second time length being less than the first time length.
  10. The cluster split brain processing method according to claim 1, further comprising:
    maintaining, at each node of the cluster, a current cluster member list, a member count, and cluster member change notification information.
  11. A cluster split brain processing apparatus, comprising:
    a continuing-service subset selection module, configured to: when a split brain occurs in a cluster and the cluster splits into a plurality of subsets, select from the plurality of subsets the only subset allowed to continue serving; and
    a node stop control module, configured to: control the nodes in all subsets other than the only subset allowed to continue serving to stop working.
  12. The cluster split brain processing apparatus according to claim 11, wherein the continuing-service subset selection module comprises:
    an expected primary subset selection unit, configured to: select the subset that contains the primary node from before the split brain occurred as the expected primary subset;
    a unique large subset selection unit, configured to: select a subset whose number of nodes is greater than half the number of cluster nodes before the split brain occurred as the unique large subset; and
    a continuing-service subset selection unit, configured to: select, from the expected primary subset and the unique large subset, the only subset allowed to continue serving.
  13. The cluster split brain processing apparatus according to claim 12, wherein the continuing-service subset selection module further comprises:
    a representative node selection unit, configured to: designate a representative node from the expected primary subset, and instruct the representative node to notify, after a first delay time, all nodes of every subset other than the expected primary subset to stop working.
  14. The cluster split brain processing apparatus according to claim 13, wherein the continuing-service subset selection unit comprises:
    a first selection subunit, configured to: when no unique large subset exists, select the expected primary subset as the only subset allowed to continue serving;
    a second selection subunit, configured to: when the expected primary subset and the unique large subset are the same subset, take that subset as the only subset allowed to continue serving; and
    a third selection subunit, configured to: when the expected primary subset and the unique large subset are different subsets, take the unique large subset as the only subset allowed to continue serving.
  15. The cluster split brain processing apparatus according to claim 14, wherein the continuing-service subset selection module further comprises:
    a unique large subset representative selection unit, configured to: select one node from the unique large subset as the unique large subset representative, and instruct the unique large subset representative, upon determining that the unique large subset and the expected primary subset are different subsets, to notify, with zero delay or after a second delay time, all nodes of the subsets other than the unique large subset to stop working, the second delay time being less than the first delay time.
  16. The cluster split brain processing apparatus according to claim 14, further comprising:
    an internal communication management module, configured to: when a node detects that heartbeat line communication has been interrupted, interrupt the communication between the node's underlying heartbeat communication and its upper-layer service control logic; and, once a first time length has elapsed, determine that a split brain has occurred and restore the communication between the underlying heartbeat communication and the upper-layer service control logic.
  17. The cluster split brain processing apparatus according to claim 16, wherein, when the unique large subset is taken as the only subset allowed to continue serving, the apparatus further comprises:
    a primary node election module, configured to: elect one node from the unique large subset as a new primary node, wherein the time taken to elect the new primary node runs from the moment the split brain is determined to have occurred until a second time length has elapsed, the second time length being less than the first time length.
  18. The cluster split brain processing apparatus according to claim 11, further comprising:
    a storage module, configured to: maintain a current cluster member list, a member count, and cluster member change notification information.
  19. A computer-readable storage medium storing program instructions which, when executed, implement the method according to any one of claims 1 to 10.
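The two delay times in claims 5 and 7 interact in a way that is easy to miss: because the second delay time is shorter than the first, the stop notice from the unique large subset reaches the expected primary subset before that subset's own representative would have sent its notice. The following is a minimal Python sketch of this ordering, written as a small simulation rather than real cluster code. The function name `run_stop_notices`, the representation of subsets as `frozenset` values, and the assumption that a stopped representative sends nothing are illustrative choices, not details taken from the claims.

```python
from typing import FrozenSet, List, Optional, Set, Tuple

Subset = FrozenSet[str]


def run_stop_notices(
    subsets: List[Subset],
    expected_primary: Optional[Subset],
    large: Optional[Subset],
    first_delay: float,
    second_delay: float = 0.0,
) -> Tuple[Set[str], List[Tuple[float, Subset]]]:
    """Simulate the stop notices; return (stopped nodes, notices sent)."""
    if not 0 <= second_delay < first_delay:
        raise ValueError("second delay must be >= 0 and less than first delay")

    # Each pending notice is (time at which it would be sent, sending subset).
    pending: List[Tuple[float, Subset]] = []
    if expected_primary is not None:
        # Representative of the expected primary subset waits the first delay.
        pending.append((first_delay, expected_primary))
    if large is not None and large != expected_primary:
        # Representative of the unique large subset acts sooner, and only
        # when its subset differs from the expected primary subset.
        pending.append((second_delay, large))

    stopped: Set[str] = set()
    sent: List[Tuple[float, Subset]] = []
    for when, sender in sorted(pending, key=lambda p: p[0]):
        if sender & stopped:
            # Assumption: a subset already told to stop sends no notice.
            continue
        for s in subsets:
            if s != sender:
                stopped |= s
        sent.append((when, sender))
    return stopped, sent
```

With subsets `{"n1", "n2"}` (holding the old primary node) and `{"n3", "n4", "n5"}`, a first delay of 5 and a second delay of 1, only the large subset's notice is sent, at time 1, and `n1` and `n2` stop. When no unique large subset exists, the expected primary subset's notice goes out after the first delay.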
PCT/CN2015/079096 2014-09-29 2015-05-15 Cluster split brain processing method and apparatus WO2016050074A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201410515113.5A CN105450717A (en) 2014-09-29 2014-09-29 Method and device for processing brain split in cluster
CN201410515113.5 2014-09-29

Publications (1)

Publication Number Publication Date
WO2016050074A1 (en) 2016-04-07

Family

ID=55560485

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/079096 WO2016050074A1 (en) 2014-09-29 2015-05-15 Cluster split brain processing method and apparatus

Country Status (2)

Country Link
CN (1) CN105450717A (en)
WO (1) WO2016050074A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109684032A (en) * 2018-12-04 2019-04-26 武汉烽火信息集成技术有限公司 The OpenStack virtual machine High Availabitity calculate node device and management method of anti-fissure
US11544228B2 (en) 2020-05-07 2023-01-03 Hewlett Packard Enterprise Development Lp Assignment of quora values to nodes based on importance of the nodes

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107508694B (en) * 2016-06-14 2021-11-16 中兴通讯股份有限公司 Node management method and node equipment in cluster
CN109257195B (en) 2017-07-12 2021-01-15 华为技术有限公司 Fault processing method and equipment for nodes in cluster
CN111835534B (en) * 2019-04-15 2022-05-06 华为技术有限公司 Method for cluster control, network device, master control node device and computer readable storage medium
CN112181305B (en) * 2020-09-30 2024-06-07 北京人大金仓信息技术股份有限公司 Database cluster network partition selection method and device
CN114374707B (en) * 2022-03-22 2022-06-21 联想凌拓科技有限公司 Management method, device, equipment and medium for storage cluster
CN114756410B (en) * 2022-06-15 2022-11-11 苏州浪潮智能科技有限公司 Data recovery method, device and medium for dual-computer hot standby system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004031979A2 (en) * 2002-10-07 2004-04-15 Fujitsu Siemens Computers, Inc. Method of solving a split-brain condition
US8024432B1 (en) * 2008-06-27 2011-09-20 Symantec Corporation Method and apparatus for partitioning a computer cluster through coordination point devices
CN102308559A (en) * 2011-07-26 2012-01-04 华为技术有限公司 Voting arbitration method and apparatus for cluster computer system
CN102394914A (en) * 2011-09-22 2012-03-28 浪潮(北京)电子信息产业有限公司 Cluster brain-split processing method and device
CN103209095A (en) * 2013-03-13 2013-07-17 广东新支点技术服务有限公司 Method and device for preventing split brain on basis of disk service lock
CN103684941A (en) * 2013-11-23 2014-03-26 广东新支点技术服务有限公司 Arbitration server based cluster split-brain prevent method and device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101291243B (en) * 2007-04-16 2012-10-10 广东新支点技术服务有限公司 Split brain preventing method for highly available cluster system
KR101042908B1 (en) * 2009-02-12 2011-06-21 엔에이치엔(주) Method, system, and computer-readable recording medium for determining major group under split-brain network problem
CN102402395B (en) * 2010-09-16 2014-07-16 中标软件有限公司 Quorum disk-based non-interrupted operation method for high availability system
US9146705B2 (en) * 2012-04-09 2015-09-29 Microsoft Technology, LLC Split brain protection in computer clusters

Also Published As

Publication number Publication date
CN105450717A (en) 2016-03-30

Similar Documents

Publication Publication Date Title
WO2016050074A1 (en) Cluster split brain processing method and apparatus
US11360854B2 (en) Storage cluster configuration change method, storage cluster, and computer system
WO2016150066A1 (en) Master node election method and apparatus, and storage system
US9983957B2 (en) Failover mechanism in a distributed computing system
RU2507703C2 (en) Resource pooling in electronic board cluster switching centre server
JP4505763B2 (en) Managing node clusters
US11416359B2 (en) Hot standby method, apparatus, and system
US10177994B2 (en) Fault tolerant federation of computing clusters
WO2016070375A1 (en) Distributed storage replication system and method
EP3016316A1 (en) Network control method and apparatus
JP2007528557A (en) Quorum architecture based on scalable software
US20210216417A1 (en) Hot-standby redundancy control system, method, control apparatus, and computer readable storage medium
EP2954424A1 (en) Method, device, and system for peer-to-peer data replication and method, device, and system for master node switching
US9807051B1 (en) Systems and methods for detecting and resolving split-controller or split-stack conditions in port-extended networks
CN106230622B (en) Cluster implementation method and device
CN114124650A (en) Master-slave deployment method of SPTN (shortest Path bridging) network controller
CN108173971A (en) A kind of MooseFS high availability methods and system based on active-standby switch
CN105790825A (en) Method and apparatus for carrying out hot backup on controllers in distributed protection
WO2024008156A1 (en) Database system, and master database election method and apparatus
CN110971662A (en) Two-node high-availability implementation method and device based on Ceph
CN114531373A (en) Node state detection method, node state detection device, equipment and medium
CN107071189B (en) Connection method of communication equipment physical interface
CN104518995B (en) Interchanger virtualization system based on distributed structure/architecture
CN108763312B (en) Slave data node screening method based on load
Aly A novel fault tolerance mechanism for software defined networking

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 15847589; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 15847589; Country of ref document: EP; Kind code of ref document: A1)