WO2004031979A2 - Method of solving a split-brain condition - Google Patents

Method of solving a split-brain condition

Info

Publication number
WO2004031979A2
Authority
WO
WIPO (PCT)
Prior art keywords
node
weight
shutdown
cluster
nodes
Prior art date
Application number
PCT/EP2003/010985
Other languages
French (fr)
Other versions
WO2004031979A3 (en
Inventor
Joseph W. Armstrong
Shu-Ching Hsu
Miriyala Sreekanth
Original Assignee
Fujitsu Siemens Computers, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujitsu Siemens Computers, Inc. filed Critical Fujitsu Siemens Computers, Inc.
Priority to EP03798928A priority Critical patent/EP1550036B1/en
Priority to DE60318468T priority patent/DE60318468T2/en
Priority to AU2003276045A priority patent/AU2003276045A1/en
Publication of WO2004031979A2 publication Critical patent/WO2004031979A2/en
Publication of WO2004031979A3 publication Critical patent/WO2004031979A3/en
Priority to US11/101,720 priority patent/US7843811B2/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0709Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793Remedial or corrective actions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14Error detection or correction of the data by redundancy in operation
    • G06F11/1402Saving, restoring, recovering or retrying
    • G06F11/1415Saving, restoring, recovering or retrying at system level
    • G06F11/142Reconfiguring to eliminate the error
    • G06F11/1425Reconfiguring to eliminate the error by reconfiguration of node membership
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14Error detection or correction of the data by redundancy in operation
    • G06F11/1479Generic software techniques for error detection or fault masking
    • G06F11/1482Generic software techniques for error detection or fault masking by means of middleware or OS functionality
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00Arrangements for monitoring or testing data switching networks
    • H04L43/08Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L43/0805Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability
    • H04L43/0811Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability by checking connectivity
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00Arrangements for monitoring or testing data switching networks
    • H04L43/10Active monitoring, e.g. heartbeat, ping or trace-route

Definitions

  • The invention relates to a method in a cluster system comprising a first and at least one second node, said nodes being connected to a communication network and each having a name and a host weight assigned to it, the method being implemented in at least one of the first and the at least one second node.
  • Computer cluster systems consist of individual computers called nodes which are connected via a communication network.
  • The communication network allows them to establish a communication link or channel between two nodes.
  • Computer clusters also include a shared storage device which is connected to each of the nodes of the computer cluster. On such shared devices some data is stored which is used by more than one node in the cluster.
  • Means for data transmission between the nodes and the shared devices are required. For example, if one node in the cluster writes data to a file on the shared storage device, a second node is not allowed to read that file until the first node has finished the writing process. Under normal conditions the first node writing the data to the file on the shared device will inform the second node of the writing process, thereby preventing the second node from reading the now outdated file. This task is done via the working communication channel between the two nodes.
  • If one node in the computer cluster breaks down, it will normally stop using the shared device. Other nodes in the computer cluster can use the data on the shared device without the risk of data corruption.
  • A breakdown of the communication channel is called a split-brain condition. In this case a node in one of the resulting subclusters might write data to the file on a shared device while a second node in the other resulting subcluster reads or writes the file at the same time.
  • A broken communication channel might thus lead to uncoordinated activities on shared devices. Therefore, it is necessary to shut down one of the resulting subclusters completely.
  • A shutdown of a subcluster is normally performed by the nodes of one subcluster sending shutdown commands to the nodes of the second subcluster.
  • A node of one subcluster can thereby become the target of multiple shutdown requests, which may cause panic and undesired crashes among the nodes receiving those requests.
  • The members of the surviving subcluster might not be known prior to the beginning of the shutdown attempts. This might lead to a situation in which a non-optimal subcluster survives, one which is not able to handle all necessary applications running on the cluster system.
  • The task of this invention is to provide a method in a cluster system for a shutdown process in a split-brain condition.
  • The method should resolve a split-brain condition so that a defined and determined subcluster remains.
  • The task is solved using the method with the features of the present invention.
  • The inventive method is implemented in a cluster system comprising a first and at least one second node.
  • Said nodes are connected to a communication network and are able to communicate with each other via the communication network.
  • Each of said nodes has a name and a host weight assigned to it.
  • The method comprises the steps of: a) inspecting a communication link via the computer network between the first and the at least one second node; b) determining which of the at least one second node is to be shut down after a failure message of the communication link via the communication network between the first and the at least one second node is received; c) creating an advertisement report for the at least one second node determined to be shut down; d) sending the advertisement report to at least one node of the cluster system comprising the first and the at least one second node; e) calculating a delay time depending at least on the weight of the first node; f) sending a shutdown command to the at least one second node after the expiry of the calculated delay time.
  • This method introduces a particular order in which the shutdown commands are sent.
  • Each shutdown command is sent after the expiry of a calculated delay time, which depends at least on the weight of the first node. Calculating the delay time ensures that in a split-brain condition the subcluster with the greatest weight will automatically survive. The advertisement reports sent by the first node also determine which subcluster is the optimum one.
  • Calculating a delay time depending on the weight of the first node includes calculating a delay time depending on the weight of the subcluster defined by or in the advertisement reports.
  • Inspecting the connection comprises the steps of listening for a heartbeat message sent by the at least one second node over the communication link and setting a failure indicator if the heartbeat message of the at least one second node is not received within a specific amount of time.
  • The heartbeat message, which in a preferred embodiment is a periodic signal, is sent over the communication network which connects the first node and the at least one second node. It is also sent over the communication link in said communication network, over which the first node and the at least one second node also communicate with each other. If the heartbeat message is not received within a specific amount of time, the first node assumes that the communication link or the at least one second node is broken or down.
  • The failure indicator set by the first node identifies the at least one second node which is to be shut down to prevent data corruption in the cluster and especially on a shared device.
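The heartbeat inspection described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the class name `HeartbeatMonitor`, the `timeout_s` parameter and the injectable `clock` are assumptions introduced here, and the source does not state a concrete timeout value.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat seen from each peer node (hypothetical helper)."""

    def __init__(self, peers, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s      # assumed "specific amount of time"
        self.clock = clock              # injectable so the logic can be tested
        now = clock()
        self.last_seen = {name: now for name in peers}

    def on_heartbeat(self, peer):
        # Called whenever a heartbeat message from `peer` arrives.
        self.last_seen[peer] = self.clock()

    def failed_peers(self):
        # A peer gets a failure indicator once its heartbeat is overdue.
        now = self.clock()
        return sorted(p for p, seen in self.last_seen.items()
                      if now - seen > self.timeout_s)
```

A caller would feed `on_heartbeat` from the receive loop of the cluster network and poll `failed_peers` periodically; the returned names correspond to the nodes for which a failure indicator is set.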
  • Step b) of the method comprises the steps of waiting a specific amount of time after a failure of a communication link is received, so that an additional failure of a communication link between the first node and another of the at least one second node can be received, and then determining the at least one second node to be shut down. Waiting for further failure indicators prevents a wrong error indicator caused by an overloaded communication link. It further allows all failure indicators to be consumed before the nodes to be shut down are determined.
  • Creating an advertisement report also comprises determining, as a master node, a node for which no failure of communication is reported.
  • The first node will define a node of the computer cluster with a working channel as a master node. Defining a master node allows all nodes of a subcluster to be specified and identified easily and dynamically.
  • The master node is the node with the lowest alphanumeric name for which no failure of communication is reported or received. Therefore the first node will declare a working node with the lowest alphanumeric name as the master node.
  • This embodiment of the invention is an easy way to define and identify nodes in a subcluster. Furthermore, it allows dynamic changes of the total cluster.
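The master selection rule can be illustrated with a short sketch. The function name `elect_master` and its parameters are hypothetical; "lowest alphanumeric name" is interpreted here as plain string ordering, which is an assumption (with that ordering "N10" sorts before "N2", so real node names would need a consistent naming scheme).

```python
def elect_master(local_name, all_nodes, failed):
    """Return the lowest-named node that is still reachable from `local_name`.

    `failed` holds the nodes for which a communication failure was reported.
    The local node always counts as reachable from its own point of view.
    """
    reachable = {n for n in all_nodes if n not in failed}
    reachable.add(local_name)
    return min(reachable)   # plain string comparison (assumed ordering)
```

For a five-node cluster N1 to N5 split into {N1, N2} and {N3, N4, N5}, node N4 would call `elect_master("N4", nodes, {"N1", "N2"})` and obtain N3, while node N2 would obtain N1.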
  • A further embodiment of the invention comprises the step of creating at least one list including the name of the first node, the name of the master node and the name of the at least one second node to be shut down.
  • This list is preferably part of the advertisement reports. Therefore the first node creates a report comprising a list including its name, the name of the master node determined in the previous step and the name of the node for which a failure message is received. It is preferred to create an advertisement report for each of the second nodes to be shut down. Such an embodiment is preferable if there is more than one second node for which a failure message is received.
  • The list of the advertisement report also includes the host weight of the first node.
  • The host weight might include a machine weight and an individual application weight of the applications executed on the first node.
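One possible in-memory shape for such an advertisement report is sketched below. The field names (`sender`, `master`, `target`, `weight`) and the helper `build_advertisements` are assumptions for illustration; the source only lists which pieces of information the report contains.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Advertisement:
    sender: str   # name of the local (first) node
    master: str   # master node determined by the sender
    target: str   # node the sender wants shut down
    weight: int   # host weight of the sender


def build_advertisements(sender, master, targets, weight):
    # One report per node to be shut down, as preferred in the text.
    return [Advertisement(sender, master, t, weight) for t in sorted(targets)]
```

A node N2 with master N1 and host weight 1 that lost contact with N3, N4 and N5 would produce three reports, one per target.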
  • Sending the advertisement report comprises sending the advertisement report to each of the first and the at least one second node of the cluster.
  • The advertisement report will be received by each node in the cluster. This allows each node to determine its own subcluster and also to calculate the total subcluster weight compared to the total cluster weight.
  • The calculated delay time in step e) can be set to zero if the host weight assigned to the first node is greater than 50% of the total weight of the first and the at least one second node. Since the first node has the greatest weight of the total cluster system, the first node can automatically begin to send shutdown commands to the at least one second node determined to be shut down. The surviving subcluster, which will include the first node, will be the optimal subcluster.
  • The delay time calculated in step e) of the inventive method is set to zero if the sum of the weight of the first node and of the nodes for which no failure of communication is received is greater than 50% of the total weight of the first node and the at least one second node.
  • The delay time is set to zero if the weight of the nodes which belong to the same subcluster exceeds 50% of the total cluster weight. Nodes which belong to the same subcluster have the same node declared as master node. In other words, if the weight of this subcluster exceeds 50%, then one of those nodes can immediately start sending shutdown commands to the other nodes for which a failure report is received.
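The zero-delay condition is a strict majority test, which can be written as a one-line predicate. The function name `delay_is_zero` is hypothetical and integer weights are assumed.

```python
def delay_is_zero(subcluster_weight, total_cluster_weight):
    """True if the subcluster holds strictly more than 50% of the total weight.

    Comparing 2 * w with the total avoids floating-point rounding
    exactly at the 50% boundary.
    """
    return 2 * subcluster_weight > total_cluster_weight
```

With a total weight of 5, a subcluster of weight 3 qualifies and one of weight 2 does not; an exact 50% split (for example 2 of 4) also does not qualify, because the text requires the weight to exceed 50%.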
  • The shutdown command is sent to the at least one second node if an indicator signal from the at least one second node indicating a shutdown process is lacking.
  • A node will send a second shutdown command if another node of the same subcluster which has a shorter delay time has not sent a successful shutdown command to the at least one second node. If an indicator signal is lacking, the node must therefore assume that a problem occurred and that the at least one second node has not yet performed a shutdown process.
  • The advertisement reports are sent via the UDP protocol.
  • This protocol has less overhead than TCP.
  • All communication regarding the method is sent over an administrative network, which connects every node in the cluster system.
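Sending an advertisement report as a single UDP datagram could look like the sketch below. The JSON wire format, the function names and the `(host, port)` peer addresses are all assumptions made for illustration; the source specifies only that UDP is used and does not define a message encoding.

```python
import json
import socket


def encode_advertisement(sender, master, target, weight):
    # Hypothetical wire format: one small JSON object per datagram.
    report = {"sender": sender, "master": master,
              "target": target, "weight": weight}
    return json.dumps(report).encode("utf-8")


def send_advertisement(payload, peers):
    """Send `payload` to every (host, port) address in `peers` via UDP."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        for address in peers:
            sock.sendto(payload, address)   # connectionless, no handshake
    finally:
        sock.close()
```

UDP gives no delivery guarantee, which fits the design described here: a node that misses reports simply works with a smaller known subcluster weight and falls back to a delay.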
  • Figure 1 shows a cluster system with five nodes, each including the inventive method
  • Figure 2 gives an overview over the logical level structure of one node of this cluster
  • Figure 3 shows the cluster structure with a broken communication link between nodes
  • Figure 4 shows a cluster system consisting of three nodes with a split brain condition
  • Figure 5 shows the method steps in each node of the cluster in the previous figure
  • Figure 6 illustrates the same cluster in a different split brain condition
  • Figure 7 shows the method steps for each node in the cluster of the previous figure
  • Figure 8 shows a detailed illustration of the inventive method.
  • FIG. 1 illustrates a typical cluster system with five separate cluster nodes N1 to N5 in which the inventive method is implemented.
  • The cluster nodes N1 to N5 are normal computers having a central processor unit, a memory and at least one network card as well as a connection to a storage device SD.
  • The network cards in each of the nodes N1 to N5 connect the nodes N1 to N5 to each other via a communication network.
  • The communication network establishes a communication link CN between the nodes.
  • The communication link is also called the cluster network; it allows the nodes N1 to N5 of the computer cluster to communicate with each other.
  • The communication link CN is also connected to a software program called Cluster Foundation in each node, which works as a logical layer in each node and provides all basic cluster functionality.
  • The Cluster Foundation software runs on each of those nodes and controls and monitors the communication between the nodes N1 to N5. The communication itself is performed over the Cluster Foundation IP protocol CFIP. Cluster software running on the nodes communicates with other nodes over the CFIP and the communication link CN.
  • The communication network includes an administrative network AN which is also connected to each of those nodes. Commands not concerning the cluster software are sent over the administrative network AN. For example, a system administrator can send shutdown commands to one of those nodes over the administrative network AN.
  • Each node of the cluster is connected to a shared storage device SD. Data is loaded from the storage device SD or written to the storage device SD via the storage device communication SDC.
  • Cluster software running on one of those nodes and communicating with other cluster software on other nodes controls the reading and writing on the storage device in such a way that no data inconsistency or data corruption occurs. For example, if cluster software running on node N2 tries to read a file from the shared storage device SD over the communication network SDC, other cluster software on node N3 will not write data to that file until the read process is finished.
  • The access to the shared storage device SD might also be controlled by the cluster foundation CF and the cluster foundation IP protocol CFIP on each of the cluster nodes.
  • Each node in the cluster system has a name as well as a weight assigned to it.
  • Each of the nodes N1 to N5 has the weight 1.
  • The weight is a measure of the importance of a specific node in the total cluster system.
  • The total cluster weight is calculated by adding the host weights of all nodes.
  • The node N1 has the host weight 1, which is one-fifth of the total cluster weight.
  • The host weight of N1 consists of the local machine weight, which is given in this embodiment by the physical parameters of the host, for example CPU type, memory size and so on.
  • The host weight also includes a user application weight.
  • The user application weight gives information about the applications executed on that specific node and also includes a user-definable weight.
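The weight arithmetic of this example can be sketched as follows. The helper names are hypothetical, integer weights are assumed, and the split of N1's host weight into a machine weight of 1 and no application weight is an assumption chosen only to reproduce the host weight of 1 stated in the text.

```python
from fractions import Fraction


def host_weight(machine_weight, application_weights=()):
    # Host weight = local machine weight + weights of the running applications.
    return machine_weight + sum(application_weights)


def cluster_share(node, host_weights):
    # Fraction of the total cluster weight held by one node.
    return Fraction(host_weights[node], sum(host_weights.values()))


# Five nodes, each with host weight 1 (assumed: machine weight 1, no apps).
host_weights = {name: host_weight(1) for name in ("N1", "N2", "N3", "N4", "N5")}
```

Here the total cluster weight is 5 and `cluster_share("N1", host_weights)` evaluates to one-fifth, matching the description.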
  • FIG. 2 shows a sketch of the logical layer structure of the node N1.
  • The other nodes N2 to N5 in the cluster of Figure 1 have the same layer structure.
  • The node N1 includes the cluster foundation layer CF including the cluster foundation IP protocol CFIP, which allows the cluster software to communicate with each node.
  • The cluster foundation CF also controls, maintains and monitors the cluster foundation IP protocol and the communication between the nodes. If a node shuts down, for example due to an administrative shutdown command, the cluster foundation CF of this node sends out a signal NODE_DOWN to indicate to all other nodes that the node N1 will shut down immediately.
  • Cluster foundations CF on other nodes will receive that NODE_DOWN signal and change their priority and, for example, take over the programs executed on said node.
  • This task is also performed by the next layer, comprising a reliant monitoring system RMS.
  • The RMS is responsible for the high availability of user applications.
  • The reliant monitoring system starts and stops cluster software and also monitors the user application weight.
  • The layer also includes the shutdown facility SF.
  • The shutdown facility SF receives failure messages from the cluster foundation CF if a communication link between one node and another node is broken. A broken communication link is assumed to be a split-brain condition. Therefore, the shutdown facility SF has to send a shutdown command over the administrative network AN to the node to be shut down. It also sends out a signal NODE_LEFT_DOWN_COMBO to inform all remaining cluster nodes of the status change. Furthermore, it receives a signal indicating the shutdown progress from the node to be shut down.
  • An example of a possible split-brain condition is shown in Figure 3.
  • The cluster network has been split between node N2 and node N3.
  • The shared device communication SDC is not split, nor is the administrative network AN.
  • This situation results in two subclusters which still share the same storage device SD.
  • One subcluster SC1 consists of the two nodes N1 and N2.
  • The second subcluster SC2 includes the nodes N3, N4 and N5.
  • Cluster communication over the cluster foundation IP protocol CFIP between node N1 and node N2, as well as among nodes N3, N4 and N5, is still possible.
  • Using the same shared device from both subclusters leads to inconsistencies. Therefore one of those subclusters has to be shut down.
  • The method used by the cluster foundation CF and the shutdown facility SF in the nodes is shown in Figure 8.
  • The inventive method is implemented in the shutdown facility and is performed by a special program which takes care of the shutdown process. It can also be split into different programs or implemented in a different way. However, the method steps will be similar.
  • The method of Figure 8 is shown for node N1 for clarity only.
  • The cluster foundation inspects the communication links.
  • The cluster foundation CF sends heartbeat messages over the communication link CN to each of those nodes. As long as heartbeat messages are received from each other node, the communication link is considered working and intact. If a heartbeat message from a specific node is not received within a specific amount of time, it is assumed that the communication with that specific node is down. The specific amount of time can be adjusted to account for heavy load on the communication link.
  • The nodes N1 and N2 will stop receiving heartbeat messages from N3 to N5. At around the same time nodes N3 to N5 will no longer receive any heartbeat messages from N1 and N2.
  • After the cluster foundation has noticed a failed communication, it creates a failure indicator and sends this failure indicator to the shutdown facility SF.
  • The shutdown facility waits for a short period of time for outstanding and additional failure indicators. This delay by the shutdown facility must be at least as long as the time between the receipt of two heartbeat messages by the cluster foundation. Additional failure indicators, indicating that the communication with other nodes is also down, are collected by the shutdown facility during the delay time.
  • The shutdown facility SF of node N1 will first receive a failure indicator for node N4 and then wait for another 1 to 2 seconds. Shortly afterwards it will receive the failure indicators for nodes N5 and N3 sent by the cluster foundation.
  • The cluster foundation CF of nodes N3 to N5 will create and send only two failure indicators to the shutdown facilities SF on those nodes.
  • One failure indicator marks node N1 as down, the other marks N2 as down.
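The collection window for failure indicators can be sketched as a pure function over timestamped events. The function name, the event representation as `(timestamp, node)` pairs and the window parameter are assumptions for illustration; the text only states a wait of about 1 to 2 seconds after the first indicator.

```python
def collect_failures(events, window_s):
    """Return the nodes whose failure indicators arrive within the window.

    `events` is an iterable of (timestamp_seconds, node_name) pairs.
    The window starts when the first failure indicator arrives.
    """
    ordered = sorted(events)
    if not ordered:
        return set()
    deadline = ordered[0][0] + window_s
    return {node for ts, node in ordered if ts <= deadline}
```

For node N1 in the example, indicators for N4, N5 and N3 arriving within the window are all consumed before the nodes to be shut down are determined; an indicator arriving much later would belong to a new round.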
  • The shutdown facility determines which nodes shall be killed to solve the split-brain condition.
  • The shutdown facilities of nodes N1 and N2 in this example both declare nodes N3, N4 and N5 to be shut down. Accordingly, the shutdown facilities of nodes N3 to N5 declare nodes N1 and N2 to be shut down.
  • The shutdown facility SF on each node calculates the local host weight.
  • For this purpose it uses the reliant monitoring system RMS, which provides a user application weight. It also has information about the local machine weight. This is given by a table which is stored on each node and has the same entries throughout the cluster. The sum of both weights is the total local host weight.
  • The shutdown facility determines whether the weight is greater than 50% of the total cluster weight. If so, it can immediately start shutting down all other nodes in the cluster which are to be shut down, because even the sum of the weights of all nodes to be shut down cannot outrank its total local weight. This step can also be omitted or delayed.
  • The nodes to be shut down are the nodes from which no heartbeat message was received.
  • The shutdown facility determines the master node of its subcluster.
  • The master node of a subcluster is the node with the lowest alphanumeric number or name for which the communication link is still working.
  • In subcluster SC1, the node N1 has the lowest alphanumeric name and is considered master of that subcluster.
  • Node N3 is considered master of the subcluster SC2.
  • The shutdown facility SF of node N4 has received a failure indicator for nodes N1 and N2 but not for node N3. It therefore assumes that node N3 still has an active communication link and declares node N3 as master node for the subcluster SC2.
  • The shutdown facility SF of node N5 will come to the same conclusion, and the shutdown facility SF of node N3 will declare itself as master node.
  • The shutdown facility of N2 will declare node N1 as master, and the shutdown facility of node N1 will declare its own node as master node.
  • The step of determining the master node of the subcluster and the step of calculating the total host weight can also be swapped.
  • The shutdown facility of each node will create an advertisement report for each node of the other subcluster that is to be shut down.
  • The advertisement reports include the name of the local node, the name of the determined master node, the local weight and the node to be shut down. If the local weight does not include a user application weight and is known from the table entry, the weight can be left out.
  • The advertisement reports are then sent over the administrative network to each of the other nodes of the total cluster.
  • The nodes N1 and N2 will each send three advertisement reports with shutdown requests for the nodes N3 to N5.
  • The nodes N3 to N5 of the subcluster SC2 will each send two advertisement reports.
  • The three advertisement reports of node N2 will look similar to the example in the table below:
  • The shutdown facility then waits for a specific amount of time to collect the advertisement reports sent by the other nodes.
  • The shutdown facility of node N1 will receive advertisement reports for shutting down nodes N3 to N5 from node N2.
  • The shutdown facility will also receive advertisement reports from the nodes N3 to N5 for shutting down nodes N1 and N2.
  • After collecting all the reports, the shutdown facility of each node will determine the subcluster to which it belongs. For this step it uses the declared master node in each report.
  • The nodes which have declared the same node as master node are assumed to belong to the same subcluster.
  • The shutdown facilities of N5 and N4 have declared the node N3 as their master. Therefore the nodes N4 and N5 belong to the same subcluster SC2 as node N3.
  • The nodes N1 and N2, which both declare node N1 as master node, belong to the subcluster SC1.
  • The shutdown facility calculates the subcluster weight. This is done simply by adding the local weights in the advertisement reports sent by the nodes belonging to the same subcluster. If the subcluster weight exceeds 50%, then the shutdown facility of the master node of that subcluster can automatically start sending shutdown commands to the nodes of the other subcluster, because the other subcluster cannot outrank it.
  • Node N3 has received the advertisement reports of nodes N4 and N5, which belong to the same subcluster.
  • The local weights of N4, N5 and N3 sum to 3, which is greater than 50% of the total cluster weight of 5. Therefore node N3 can immediately start sending shutdown commands to nodes N1 and N2 of the subcluster SC1.
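Grouping the collected reports by declared master and summing the weights can be sketched as below. The tuple layout `(sender, master, weight)` and the function name are assumptions. Because each node sends one report per node to be shut down, the sketch counts every sender only once; the source does not spell out this deduplication, but without it the weights would be multiplied by the number of targets.

```python
from collections import defaultdict


def subcluster_weights(reports):
    """Map each declared master to the total weight of the nodes declaring it.

    `reports` is an iterable of (sender, master, weight) tuples.
    """
    per_sender = {}
    for sender, master, weight in reports:
        per_sender[sender] = (master, weight)   # one entry per sender
    totals = defaultdict(int)
    for master, weight in per_sender.values():
        totals[master] += weight
    return dict(totals)
```

Applied to the split of Figure 3, the subcluster with master N3 reaches weight 3 and the one with master N1 reaches weight 2, out of a total cluster weight of 5.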
  • The split-brain condition is assumed to be a three-way split.
  • The master node of the 50% subcluster can immediately start sending shutdown commands to all other nodes not in its subcluster, or to the nodes determined to be shut down. No other subcluster can outrank it.
  • The subcluster which contains the lowest alphanumeric name begins sending shutdown commands to the other subclusters first.
  • The surviving subcluster will therefore contain the node with the lowest alphanumeric node name. It is also possible to use a parameter other than the node name in the case of an exact 50% split.
  • A subcluster weight smaller than 50% for each subcluster can occur if not all nodes send an advertisement report or declare a specific node as master. If the subcluster weight is smaller than 50%, each of the shutdown facilities in the subcluster calculates a delay time. This delay time depends on the local weight of the local node and also on the position of the node in the subcluster. Additionally, the delay time should include a value which is the sum of all timeouts of all shutdown facilities to be used in the subcluster.
  • The shutdown facility SF of node N1 will wait for five seconds before starting to send the shutdown commands.
  • The shutdown facility of node N2 in subcluster SC1 will wait for five seconds plus another two seconds, representing the second position in the ranking of subcluster SC1.
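The position-dependent part of the delay can be sketched with the two values given in the example. The source does not state a formula, so the linear form below, the function name and the default parameters (5 seconds base, 2 seconds per position) are assumptions that merely reproduce the stated delays; the dependence on the local weight is not modelled.

```python
def shutdown_delay(local_name, members, base_s=5.0, step_s=2.0):
    """Delay before a node in a minority subcluster sends shutdown commands.

    Position 0 is the lowest-named member (the master); each later
    position adds `step_s` seconds. Both constants are assumed values.
    """
    position = sorted(members).index(local_name)
    return base_s + step_s * position
```

For subcluster SC1 this yields 5 seconds for N1 and 7 seconds for N2, so the master acts first and N2 only serves as a fallback.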
  • The shutdown facility SF checks for an indicator signal. This signal indicates whether the shutdown process of the nodes N3 to N5 to be shut down has already begun. If that is the case and all nodes to be shut down have sent their signal indicating the shutdown process, the facility can stop here. If an indicator signal is not received, the shutdown facility assumes that a prior shutdown facility with a shorter delay time had problems sending the shutdown command. It therefore immediately starts to shut down the nodes of the other subclusters. This is a failsafe mechanism.
  • The master node of a subcluster normally gets the shortest delay time compared to all other nodes in that subcluster. Hence it will start sending the shutdown commands to the other nodes before the delay time of any other node in that subcluster expires. Therefore it is necessary to ensure that no shutdown command was sent before starting the failsafe mechanism in other nodes. This prevents a node from receiving two shutdown commands in a short time, which normally causes panic or a computer crash on that node.
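The failsafe check performed when a node's delay expires can be sketched as follows. The function name, the representation of received indicator signals as a set of node names and the `send_shutdown` callback are assumptions for illustration. Sending only to nodes that have not yet signalled is one way to avoid the duplicate shutdown commands the text warns about.

```python
def run_failsafe(targets, shutdown_signals, send_shutdown):
    """Send shutdown commands only to targets that have not signalled yet.

    `shutdown_signals` holds the nodes that already indicated a shutdown
    in progress. Returns the list of nodes a command was sent to.
    """
    pending = sorted(set(targets) - set(shutdown_signals))
    for node in pending:
        send_shutdown(node)   # hypothetical transport callback
    return pending
```

If every target has already signalled, the function sends nothing and the facility can stop, which corresponds to the normal case in which the master node acted first.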
  • Another embodiment of this invention is presented in Figure 4. It shows a cluster comprising three nodes N1, N2 and N3 which are connected to a cluster network CN and to an administrative network AN.
  • The cluster communication between the node N1 and the node N3 is down, while the communication between node N1 and node N2 as well as between node N2 and node N3 is still intact.
  • The cluster foundations of the node N1 and the node N3 will not receive heartbeat messages from each other and will each send a failure indicator for the other node to the shutdown facility at about the same time.
  • The node N2 can still communicate with the node N1 and the node N3 and therefore receives no failure indicator.
  • Figure 5 shows an overview of the actions taken by the shut- down facilites SF of the nodes in the cluster.
  • the shutdown facility SF in node NI will first determine node N3 as the node to be shut down, while the shutdown facility of node N3 determines NI as node to be shut down. Because no communication failure is reported with node N2 , the shutdown facility of node NI assumes that node N2 is in the same subcluster. Since node NI has a lower alphanumeric than a node N2 the shutdown of node NI declares node NI as master node.
  • the shutdown facility of node N3 as- sumes node N2 and node N3 to be in the same subcluster and declares node N2 as a master node of that subcluster. Both shutdown facilities calculate their weight and generate the advertisement reports requesting a shutdown progress for the other node. They will then send those advertisement reports over the administrative network AN. The shutdown facility of node N2 receives those advertisement reports but does not take any action, because it can still communicate with both nodes and therefore will automatically belong to the surviving subcluster.
  • the shutdown facilities delay for one to two seconds, waiting for all advertisements sent by other nodes.
  • the shutdown facility of N1 receives the advertisement of node N3 and the shutdown facility of N3 receives the advertisements of the shutdown facility of N1.
  • the delay time is calculated based on the received advertisements.
  • the shutdown facilities are considering only their own weight of 33% of the total cluster weight because the shutdown facility of node N2 has not advertised. Thus the shutdown facilities of nodes N1 and N3 cannot assume that node N2 is part of their subcluster.
  • the shutdown facility of node N3, which declared node N2 as master node of its subcluster, adds some additional time to the calculated delay time because it is not the master node of the subcluster. Therefore the total calculated delay time of the shutdown facility of node N1 is shorter than the delay of the shutdown facility of node N3.
  • the shutdown facility SF of N1 sends the shutdown command to node N3.
  • Normally node N3 would start to shut down and no shutdown command would be sent by the shutdown facility of N3.
  • Node N1 and N2 would be the remaining nodes.
  • the kill command over the administrative network AN is not received by N3 due to a temporary failure of transmission.
  • the delay time of node N3 expires without having received a shutdown signal.
  • the shutdown facility SF of N3 now assumes that, even though its subcluster is not the one selected to survive the process, the highest-weight subcluster, comprising node N1 and node N2, is not performing the node elimination it should be doing. Therefore it sends a shutdown command to node N1 and node N1 shuts down.
  • the split-brain condition is solved, but the surviving subcluster of node N2 and node N3 is not the optimal subcluster due to its weight of 3 compared to the optimal subcluster weight of node N1 and node N2.
  • Figure 6 shows another aspect of the invention.
  • a cluster consisting of three nodes N1, N2 and N3 which are connected to a cluster network CN and an administrative network AN.
  • the communication between the nodes N2 and N3 as well as the nodes N1 and N3 is broken.
  • the cluster foundations CF transmit a failure indicator after the heartbeat messages are not received for some time.
  • the shutdown facilities of node N1 and node N2 receive a failure indicator for node N3.
  • the shutdown facility of node N3 receives a failure indicator for node N1 and node N2.
  • the shutdown facilities determine the masters of the respective subclusters.
  • the node N1 is declared master by the shutdown facilities of node N1 and node N2, the shutdown facility of node N3 declares itself as master, since communication with another node is not possible any more.
  • the total weight is calculated.
  • the total weight of the subcluster of node N1 and node N2 is 11 while the total weight of subcluster N3 is only 10.
  • the advertisement reports are created.
  • the shutdown facility of node N2 will send the advertisement report only to the shutdown facility of node N1, while the shutdown facility of node N1 will only send its advertisement report to the shutdown facility of N2. It will not send the advertisement report to N3.
  • the shutdown facility of N3 does not need to send the advertisement reports to N1 or N2.
  • the shutdown facility of node N1 calculates the total subcluster weight to be greater than 50% of the total cluster weight. It sets its calculated delay time to zero and starts sending the shutdown command to node N3 immediately. After some time node N1 and node N2 should receive a signal indicating that node N3 has been shut down. The split-brain condition is solved.
  • the described method in this invention is easily adaptable to different cluster systems. Furthermore for the calculation of the local as well as of the subcluster weight it is possible to include other weight values than just the local machine and the user application weight.
  • the machine weight of each node in the cluster system is written in a configuration file. It is useful to generate a "LEFTCLUSTER" signal by the cluster foundation CF which is broadcast to all other cluster nodes in the surviving cluster indicating the change of the cluster structure.
  • the delay time is calculated using the local weight, the node name and the weight of the surviving subcluster. If not all cluster nodes in a subcluster have advertised their weights, it is necessary to rely on an algorithm that allows the greatest-weight subcluster to delay the least time and that factors in the bias of the alphanumeric node name. A possible solution for this delay is given by the formula:
  • the factor includes a relative ranking of the nodes in the subcluster as well as a relative weight compared to the total cluster weight.
  • the formula should result in a delay time where nodes in a subcluster of a small weight compared to the total cluster receive a very large delay time. Nodes of a subcluster whose relative weight is high will calculate a small delay time. The delay times of the nodes in one subcluster remain different, depending on the relative ranking in that subcluster.
  • This invention makes the communication failure indicators as well as the shutdown requests local knowledge of one specific node of the cluster. Therefore it is necessary that determination of the membership of the subclusters waits until all shutdown requests have been advertised, sent and received. However it is not necessary to send advertisement reports to nodes other than the members of their own subcluster.
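The formula itself is not reproduced in this text, so the following Python sketch is only one hypothetical function with the properties stated above: the delay grows sharply as the subcluster's share of the total weight shrinks, and nodes of the same subcluster get different delays according to their rank. The function name, the `base` and `rank_step` constants and the exact shape of the expression are assumptions for illustration, not the formula of the invention.

```python
def delay_seconds(subcluster_weight, total_weight, rank, base=10.0, rank_step=1.0):
    """Hypothetical delay in seconds for one node of a subcluster.

    subcluster_weight / total_weight is the relative weight of the subcluster.
    rank is the 1-based position of the node in its subcluster (1 = master).
    base and rank_step are illustrative tuning constants (assumptions).
    """
    if total_weight <= 0 or not 0 < subcluster_weight <= total_weight:
        raise ValueError("weights must be positive and consistent")
    share = subcluster_weight / total_weight
    # A small share gives a large first term; a share close to 1 gives almost none.
    weight_term = base * (1.0 - share) / share
    # Nodes of one subcluster are staggered by their rank.
    return weight_term + rank_step * rank


# A node of a 1-of-5 subcluster waits longer than a node of a 2-of-5 subcluster,
# and the second-ranked node waits longer than the master of the same subcluster.
light = delay_seconds(1, 5, rank=1)
heavy_master = delay_seconds(2, 5, rank=1)
heavy_second = delay_seconds(2, 5, rank=2)
```

Any monotone function of the relative weight would serve the same purpose; the ratio used here was chosen only because it makes low-weight subclusters wait disproportionately long.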

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Computer And Data Communications (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

In a cluster system comprising at least two nodes connected via a communication network and having a name and a host weight assigned to it a method is implemented comprising the steps of inspecting the communication link, determining which node has to be shut down after a failure, creating an advertisement report for the node to be shut down, sending the advertisement report to at least one node of the cluster system, calculating a delay time depending on the weight of the first node and sending the shutdown command to the node for which a failure report was received. In a preferred embodiment of the invention the advertisement reports include a master node, which allows identifying and specifying the surviving subcluster. The method will send shutdown signals to those nodes of a subcluster with lower weight than the surviving subcluster. A failsafe mechanism is implemented.

Description

METHOD OF SOLVING A SPLIT-BRAIN CONDITION
The invention relates to a method in a cluster system comprising a first and at least one second node, said nodes connected to a communication network and having a name and a host weight assigned to it, the method implemented in at least one of the first and of the at least one second node.
Computer cluster systems consist of individual computers called nodes which are connected via a communication network. The communication network allows them to establish a communication link or channel between two nodes. Often computer clusters also include a shared storage device which is connected to each of the nodes of the computer cluster. On this shared device data is stored which is used by more than one node in the cluster. To prevent data inconsistency, means for data transmission between the nodes and the shared devices are required. For example, if one node in the cluster writes data in a file on the shared storage device, a second node is not allowed to read that file until the first node has finished the writing process. In normal conditions the first node writing the data in the file of the shared device will tell the second node of the writing process, thereby preventing the second node from reading the now outdated file. This task is done via the working communication channel between the two nodes.
If one node in the computer cluster breaks down, it will normally stop using the shared device. Other nodes in the computer cluster can use the data on the shared device without the risk of data corruption. However, if the communication channel between two nodes breaks down such that the members of the cluster are still operating yet cannot communicate with each other, data corruption on the shared devices can occur. A breakdown of the communication channel is called a split-brain condition. In this case a node in one of the resulting subclusters might write data in the file on a shared device while a second node in the other resulting subcluster reads or writes the file at the same time. Thus a broken communication channel might lead to uncoordinated activities on shared devices. Therefore, it is necessary to shut down one of the resulting subclusters completely.
A shutdown process of a subcluster system is normally done in the way that the nodes of one subcluster send shutdown commands to the nodes of the second subcluster. However this can lead to the situation that a node of one subcluster is the target of multiple shutdown requests, which may cause panic and undesired crashes among the nodes receiving those requests. Furthermore the members of the surviving subcluster might not be known prior to the beginning of the shutdown attempts. This might lead to the situation that a non-optimal subcluster will survive, which is not able to handle all necessary applications running on the cluster system.
The task of this invention is to provide a method in a cluster system for a shutdown process in a split-brain condition.
The method should solve a split-brain condition with a remaining defined and determined subcluster.
The task is solved using the method with features of the present invention.
The inventive method is implemented in a cluster system, comprising a first and at least one second node. The said nodes are connected to a communication network and are able to communicate with each other via the communication network.
Each of said nodes has a name and a host weight assigned to it. The method comprises the steps of:
a) Inspecting a communication link via the computer network between the first and the at least one second node;
b) Determining which of the at least one second node is to be shut down after a failure message of the communication link via the communication network between the first and the at least one second node is received;
c) Creating an advertisement report for the at least one second node determined to be shut down;
d) Sending the advertisement report to at least one node of the cluster system comprising the first and the at least one second node;
e) Calculating a delay time depending at least on the weight of the first node;
f) Sending a shutdown command to the at least one second node after the expiry of the calculated delay time.
This method introduces a particular order in which the shutdown commands are sent. Each shutdown command is sent after the expiry of a calculated delay time, which depends on at least the weight of the first node. Calculating the delay time will ensure that in a split-brain condition the subcluster with the greatest weight will automatically survive. Due to the advertisement reports sent by the first node it is also determined which is the optimum subcluster.
In a preferred embodiment of the invention calculating a delay time depending on the weight of the first node includes calculating a delay time depending on the weight of the subcluster defined by or in the advertisement reports.
In another preferred embodiment of the invention inspecting the connection comprises the steps of listening to a heartbeat message sent by the at least one second node over the communication link and setting a failure indicator if the heartbeat message of the at least one second node is not received during a specific amount of time. The heartbeat message, which is in a preferred embodiment a periodic signal, is sent over the communication network which connects the first node and the at least one second node together. It is also sent over the communication link in said communication network, over which the first node and the at least one second node also communicate with each other. If the heartbeat message is not received during a specific amount of time, the first node assumes that the communication link or the at least one second node is broken or down. The failure indicator set by the first node indicates the at least one second node which is to be shut down to prevent data corruption in the cluster and especially on a shared device.
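The heartbeat inspection described above can be pictured with a short Python sketch. It is a minimal illustration, assuming a per-peer timestamp table and an injectable clock; the class and method names are hypothetical and do not come from the cluster foundation described in this document.

```python
import time


class HeartbeatMonitor:
    """Tracks when each peer was last heard from (illustrative sketch only)."""

    def __init__(self, peers, timeout, clock=time.monotonic):
        self.timeout = timeout      # the "specific amount of time" in seconds
        self.clock = clock          # injectable so the logic can be tested
        now = clock()
        self.last_seen = {peer: now for peer in peers}

    def heartbeat(self, peer):
        """Record that a heartbeat message from peer has just arrived."""
        self.last_seen[peer] = self.clock()

    def failed_peers(self):
        """Return the peers for which a failure indicator would be set."""
        now = self.clock()
        return sorted(p for p, seen in self.last_seen.items()
                      if now - seen > self.timeout)


# Usage with a fake clock: N2 keeps sending heartbeats, N3 goes silent.
fake_time = [0.0]
monitor = HeartbeatMonitor(["N2", "N3"], timeout=2.0, clock=lambda: fake_time[0])
fake_time[0] = 1.5
monitor.heartbeat("N2")
fake_time[0] = 3.0
failed = monitor.failed_peers()
```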
In another preferred embodiment of the invention step b) of the method comprises the steps of waiting a specific amount of time, after a failure of a communication link is received, for an additional failure of a communication link between the first node and a second of the at least one second node, and then determining the at least one second node to be shut down. Waiting for other failure indicators prevents a wrong error indicator due to an overloaded communication link. It further allows all failure indicators to be consumed before determining the nodes to be shut down.
In a further preferred embodiment of the invention creating an advertisement report also comprises determining a node for which no failure of communication is reported as a master node. In this embodiment of the invention the first node will define a node of the computer cluster with a working channel as a master node. Defining a master node allows all nodes of a subcluster system to be specified and identified easily and dynamically.
In a preferred embodiment of this invention the master node is the node with the lowest alphanumeric name for which no failure of communication is reported or received. Therefore the first node will declare a working node with the lowest alphanumeric name as the master node. A second node, which declares a node with the same alphanumeric name as master node belongs to the same subcluster, if no failure indicator is reported for this node. This embodiment of the invention is an easy way to define and identify nodes in a subcluster system. Furthermore it allows dynamic change of the total cluster.
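As a rough illustration of this rule, the Python sketch below picks the master from the point of view of one node. The function name and the use of plain string comparison for "lowest alphanumeric name" are assumptions; note that plain string order would rank "N10" before "N2", so a real implementation might need a different comparison.

```python
def choose_master(local_node, all_nodes, failed_nodes):
    """Return the lowest-named node for which no failure is reported.

    The local node always counts as reachable from its own point of view.
    """
    reachable = {n for n in all_nodes if n not in failed_nodes}
    reachable.add(local_node)
    return min(reachable)


nodes = ["N1", "N2", "N3", "N4", "N5"]
# Seen from N4 after failure indicators for N1 and N2 have arrived:
master_seen_by_n4 = choose_master("N4", nodes, {"N1", "N2"})
# Seen from N2 after failure indicators for N3, N4 and N5 have arrived:
master_seen_by_n2 = choose_master("N2", nodes, {"N3", "N4", "N5"})
```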
A further embodiment of the invention comprises the step of creating at least one list including the name of the first node, the name of the master node and the name of the at least one second node to be shut down. This list is preferably part of the advertisement reports. Therefore the first node creates a report comprising a list including its name, the name of the master node determined in the previous step and the name of the node for which a failure message is received. It is preferred to create an advertisement report for each of the second nodes to be shut down. Such an embodiment will be preferable if there is more than one second node for which a failure message is received.
In another preferred embodiment of the invention the list of the advertisement report also includes the host weight of the first node. The host weight might include a machine weight and an individual application weight of executed applications on the first node.
In another preferred embodiment sending the advertisement report comprises sending the advertisement report to each of the first and the at least one second node of the cluster. In this embodiment of the invention the advertisement report will be received by each node in the cluster. This allows each node to determine its own subcluster and also to calculate the total subcluster weight compared to the total cluster weight.
The calculated delay time in step e) can be set to zero if the host weight assigned to the first node is greater than 50% of the total weight of the first and the at least one second node. Since the first node has the greatest weight of the total cluster system, the first node can automatically begin to send shutdown commands to the at least one second node determined to be shut down. The surviving subcluster, which will include the first node, will be the optimal subcluster. In another preferred embodiment of the invention the delay time calculated in step e) of the inventive method is set to zero if the sum of the weight of the first node and the nodes for which no failure of communication is received is greater than 50% of the total weight of the first node and the at least one second node. In a further preferable embodiment of the invention the delay time is set to zero if the weight of the nodes which belong to the same subcluster exceeds 50% of the total cluster weight. Nodes which belong to the same subcluster have the same node declared as master node. In other words, if the weight of this subcluster exceeds 50%, then one of those nodes can start immediately sending shutdown commands to the other nodes for which a failure report is received.
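The 50% rule can be stated compactly in code. The Python sketch below is an illustration only; the function name and the dictionary of weights are assumptions, and integer arithmetic is used to avoid rounding at exactly 50%.

```python
def may_skip_delay(weights, members):
    """True if the given nodes together hold more than 50% of the total weight.

    weights maps every node name in the cluster to its host weight;
    members is the set of nodes whose combined weight is being checked.
    """
    total = sum(weights.values())
    combined = sum(weights[name] for name in members)
    # combined > total / 2, written without division.
    return 2 * combined > total


weights = {"N1": 1, "N2": 1, "N3": 1, "N4": 1, "N5": 1}
majority = may_skip_delay(weights, {"N3", "N4", "N5"})   # 3 of 5
minority = may_skip_delay(weights, {"N1", "N2"})         # 2 of 5
```

The same check covers all three variants in the paragraph above: a single node, a node together with the nodes it can still reach, or the nodes sharing one master.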
In another preferred embodiment of this invention the shutdown command is sent to the at least one second node when an indicator signal of the at least one second node indicating a shutdown process is lacking. In that case a node will send a second shutdown command if another node of the same subcluster which has a shorter delay time has not sent a successful shutdown command to the at least one second node. If the indicator signal is lacking, the node must assume that a problem occurred and the at least one second node has not performed a shutdown process yet.
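A minimal Python sketch of this failsafe check follows. It assumes the node keeps a set of targets that have already signalled a shutdown; the function name and the callback used to send a command are hypothetical.

```python
def run_failsafe(targets, shutdown_indicated, send_shutdown):
    """After the local delay has expired, shut down only the silent targets.

    targets: nodes determined to be shut down.
    shutdown_indicated: nodes that already signalled a shutdown process.
    send_shutdown: callable that sends one shutdown command (assumed).
    Returns the list of nodes a command was sent to.
    """
    pending = [t for t in targets if t not in shutdown_indicated]
    for target in pending:
        send_shutdown(target)
    return pending


sent = []
# All targets already signalled their shutdown: nothing is sent.
first = run_failsafe(["N3", "N4", "N5"], {"N3", "N4", "N5"}, sent.append)
# Only N4 signalled: the failsafe covers N3 and N5.
second = run_failsafe(["N3", "N4", "N5"], {"N4"}, sent.append)
```

Skipping targets that already signalled is what keeps a node from receiving two shutdown commands in a short time.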
It is preferred to send the advertisement reports via the UDP protocol. This protocol has less overhead than a normal TCP connection. In another preferred embodiment of the invention all communication regarding the method is sent over an administrative network, which connects every node in the cluster system.
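To make the transport concrete, the Python sketch below sends one report as a single UDP datagram using the standard `socket` module. The JSON wire format, the field names and the helper names are assumptions for illustration; the document does not specify how a report is encoded.

```python
import json
import socket


def encode_report(sender, master, weight, target):
    """Serialize one advertisement report (hypothetical JSON format)."""
    report = {"sender": sender, "master": master,
              "weight": weight, "shutdown": target}
    return json.dumps(report).encode("utf-8")


def decode_report(payload):
    """Parse a datagram produced by encode_report."""
    return json.loads(payload.decode("utf-8"))


def send_report(payload, host, port):
    """Send one datagram; UDP gives no delivery guarantee."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


payload = encode_report("N2", "N1", 1, "N3")
roundtrip = decode_report(payload)
```

Because a datagram can be lost, a sender cannot tell whether a report arrived, which is one reason a failsafe based on the missing indicator signal is useful.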
The present invention can be more fully understood from the detailed description provided hereinafter with reference to the accompanying drawings, which are given by way of illustration only and thus are not limitative of the present invention, and wherein:
Figure 1 shows a cluster system with five nodes, each including the inventive method;
Figure 2 gives an overview over the logical level structure of one node of this cluster;
Figure 3 shows the cluster structure with a broken communication link between nodes;
Figure 4 shows a cluster system consisting of three nodes with a split brain condition;
Figure 5 shows the method steps in each node of the cluster in the previous figure;
Figure 6 illustrates the same cluster in a different split brain condition;
Figure 7 shows the method steps for each node in the cluster of the previous figure;
Figure 8 shows a detailed illustration of the inventive method.
Figure 1 illustrates a typical cluster system with five separate cluster nodes N1 to N5 in which the inventive method is implemented. The cluster nodes N1 to N5 are normal computers having a central processor unit, a memory and at least one network card as well as a connection to a storage device SD. The network cards in each of the nodes N1 to N5 connect the nodes N1 to N5 to each other via a communication network. The communication network establishes a communication link CN between the nodes. The communication link is also called cluster network, which allows the nodes N1 to N5 of the computer cluster to communicate with each other. The communication link CN is also connected to a software program called Cluster Foundation in each node, which works as a logical layer in each node and provides all basic cluster functionality. The software Cluster Foundation is running on each of those nodes and controls and monitors the communication between the nodes N1 to N5. The communication itself is performed over the Cluster Foundation IP protocol CFIP. Cluster software running on the nodes communicates over the CFIP and the communication link CN with other nodes.
Furthermore, the communication network includes an administrative network AN which is also connected to each of those nodes. Commands not concerning the cluster software are sent over the administrative network AN. For example, over the administrative network AN a system administrator can send shutdown commands to one of those nodes. Furthermore each node of the cluster is connected to a shared storage device SD. Data are loaded from the storage device SD or written into the storage device SD via the storage device communication SDC.
Cluster software running on one of those nodes and communicating with other cluster software on other nodes controls the reading and writing on the storage device in a way that no data inconsistency or data corruption occurs. For example, if cluster software running on node N2 tries to read a file from the shared storage device SD over the storage device communication SDC, other cluster software on node N3 will not write data in that file until the read process is finished. The access to the shared storage device SD might also be controlled by the cluster foundation CF and the cluster foundation IP protocol CFIP on each of the cluster nodes.
Furthermore, each node in the cluster system has a name as well as a weight assigned to it. In this embodiment of the invention each of the nodes N1 to N5 has the weight 1. The weight is a measure of the importance of a specific node in the total cluster system. The total cluster weight is calculated by adding the host weight of each node. In this embodiment of the invention the node N1 has the host weight 1, which is one-fifth of the total cluster weight. The host weight of N1 consists of the local machine weight, which is given in this embodiment by the physical parameters of the host, for example CPU type, memory size and so on. Furthermore the host weight includes a user application weight. The user application weight gives information about the executed applications on that specific node and also includes a user-definable weight.
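The weight arithmetic of this paragraph is simple enough to show directly. In the Python sketch below the function names are hypothetical, and splitting each host weight of 1 into a machine weight of 1 with no application weights is an assumption made only to reproduce the numbers of the example.

```python
def host_weight(machine_weight, application_weights):
    """Host weight = local machine weight + user application weights."""
    return machine_weight + sum(application_weights)


def total_cluster_weight(host_weights):
    """Total cluster weight = sum of the host weights of all nodes."""
    return sum(host_weights.values())


# Five nodes with host weight 1 each, as in this embodiment.
host_weights = {name: host_weight(1, []) for name in ["N1", "N2", "N3", "N4", "N5"]}
total = total_cluster_weight(host_weights)
share_of_n1 = host_weights["N1"] / total   # one-fifth of the total cluster weight
```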
Figure 2 shows a sketch of a logical layer structure of the node N1. The other nodes N2 to N5 in the cluster of Figure 1 will have the same layer structure. The node N1 includes the cluster foundation layer CF including the cluster foundation IP protocol CFIP, which allows the cluster software to communicate with each node. The cluster foundation CF also controls, maintains and monitors the cluster foundation IP protocol and the communication between the nodes. If a node shuts down due to an administrative shutdown command, for example, the cluster foundation CF of this node sends out a signal NODE_DOWN to indicate to all other nodes that the node N1 will shut down immediately. Cluster foundations CF on other nodes will receive that NODE_DOWN signal and change their priority and, for example, take over the programs executed on said node.
This task is also performed by the next layer, comprising a reliant monitoring system RMS. The RMS is responsible for the high availability of user applications. The reliant monitoring system starts and stops cluster software and also monitors the user application weight.
The layer also includes the shutdown facility SF. The shutdown facility SF receives failure messages from the cluster foundation CF if a communication link between one node and another node is broken. A broken communication is assumed to be a split-brain condition. Therefore, the shutdown facility SF has to send a shutdown command over the administrative network AN to the node to be shut down. It also sends out a signal NODE_LEFT_DOWN_COMBO to inform all remaining cluster nodes of the status change. Furthermore it receives a signal indicating the shutdown progress from the node to be shut down.
An example of a possible split-brain condition is shown in Figure 3. In the example the cluster network has been split between node N2 and node N3. However, neither the shared device communication SDC nor the administrative network AN is split. This situation results in two subclusters which still share the same storage device SD. One subcluster SC1 consists of the two nodes N1 and N2. The second subcluster SC2 includes the nodes N3, N4 and N5. A cluster communication over the cluster IP protocol CFIP between node N1 and node N2 as well as between node N3, node N4 and node N5 is still possible. However, using the same shared device leads to inconsistencies. Therefore one of those subclusters has to be shut down.
The method used by the cluster foundation CF and the shutdown facility SF in the nodes is shown in Figure 8. The inventive method is implemented in the shutdown facility and is performed by a special program, which takes care of the shutdown process. It can also be split into different programs or implemented in a different way. However the method steps will be similar. The method of Figure 8 is shown for node N1 for clarity only.
As mentioned earlier the cluster foundation inspects the communication links. The cluster foundation CF sends heartbeat messages over the communication link CN to each of those nodes. As long as heartbeat messages are received from each other node the communication link is considered working and intact. If a heartbeat message from a specific node is not received over a specific amount of time it is assumed that the communication with that specific node is down. The specific amount of time can be changed due to heavy load in the communication link. In the example of Figure 3 the nodes N1 and N2 will stop receiving heartbeat messages from N3 to N5. At around the same time nodes N3 to N5 will not receive any heartbeat messages from N1 and N2 anymore.
After the cluster foundation has noticed a failed communication it creates a failure indicator and sends this failure indicator to the shutdown facility SF. The shutdown facility waits for a short period of time for outstanding and additional failure indicators. This delay by the shutdown facility must be at least as long as the time between the receiving of two heartbeat messages by the cluster foundation. Additional failure indicators indicating that the communication links to other nodes are also down are collected by the shutdown facility during the delay time. In the example the shutdown facility SF of node N1 will first receive a failure indicator for node N4 and then wait for another 1 to 2 seconds. It will shortly afterward receive the failure indicators of node N5 and N3 sent by the cluster foundation. On the other hand the cluster foundation CF of nodes N3 to N5 will create and send only two failure indicators to the shutdown facilities SF on those nodes. One failure indicator indicates node N1 as down, the other marks N2 as down.
After the delay the shutdown facility determines which nodes shall be killed to solve the split-brain condition. The shutdown facilities of node N1 and N2 in this example both declare nodes N3, N4 and N5 to be shut down. Accordingly the shutdown facilities of nodes N3 to N5 declare node N1 and N2 to be shut down.
In the next step the shutdown facility SF on each node calculates the local host weight. For this purpose it uses the reliant monitoring system RMS which provides a user application weight. It also has information about the local machine weight. This is given by a table list, which is stored on each node and has the same entries throughout the cluster. The sum of both weights is the total local host weight.
The shutdown facility then determines whether the weight is greater than 50% of the total cluster weight. If yes, it can immediately start shutting down all other nodes in the cluster which are to be shut down, because even the sum of the weights of all nodes to be shut down cannot outrank its total local weight. This step can also be left out or delayed. The nodes to be shut down are the nodes for which no heartbeat message was received.
If the local weight is less than 50% of the total cluster weight a shutdown facility determines the master node of its subcluster. In this preferred embodiment of the invention the master node of a subcluster is the node with the lowest alphanumeric number or name for which the communication link is still working. For example, in subcluster SC1 the node N1 has the lowest alphanumeric name and is considered as master of that subcluster. In subcluster SC2 node N3 is considered master of the subcluster.
The shutdown facility SF of node N4 has received a failure indicator for node N1 and N2 but not for node N3. It therefore assumes that node N3 still has an active communication link and declares node N3 as master node for the subcluster SC2. The shutdown facility SF of node N5 will come to the same conclusion and the shutdown facility SF of the node N3 will declare itself as master node.
In the subcluster SC1 the shutdown facility of N2 will declare node N1 as master and the shutdown facility of node N1 will declare its own node as master node. The step of determining the master node of their subcluster and the step of calculating the total host weight can also be switched. After the calculation of the total host weight and the determination of the subcluster master node, the shutdown facilities of each node will create an advertisement report for each node to be shut down of the other subcluster. In this embodiment the advertisement reports include the name of the local node, the name of the determined master node, the local weight as well as the node to be shut down. If the local weight does not include a user application weight and is known due to the table entry, the weight can be left out. The advertisement reports are then sent over the administrative network to each of the other nodes of the total cluster.
The nodes N1 and N2 will send three advertisement reports with shutdown requests for the nodes N3 to N5. The nodes N3 to N5 of the subcluster SC2 will send two advertisement reports. For example the three advertisement reports of node N2 will look similar to the example in the table below:
[Table: the three advertisement reports of node N2 (image not reproduced)]
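The original table is only available as an image and is not reproduced in this text. As a stand-in, the Python sketch below builds three reports for node N2 from the four fields named in the description (local node, master node, local weight, node to be shut down). The class name, the field names and the ordering of the reports are assumptions; the actual layout of the table may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdvertisementReport:
    sender: str   # name of the local node
    master: str   # name of the declared master node
    weight: int   # local host weight
    target: str   # node requested to be shut down


def build_reports(sender, master, weight, targets):
    """Create one advertisement report per node to be shut down."""
    return [AdvertisementReport(sender, master, weight, t) for t in sorted(targets)]


# Node N2 of subcluster SC1: master N1, weight 1, targets N3 to N5.
reports_of_n2 = build_reports("N2", "N1", 1, {"N3", "N4", "N5"})
for report in reports_of_n2:
    print(report.sender, report.master, report.weight, report.target)
```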
The shutdown facility then waits for a specific amount of time for collecting the advertisement reports sent by the other nodes. The shutdown facility of node N1 will receive an advertisement report for shutting down nodes N3 to N5 from node N2. In this embodiment the shutdown facility will also receive an advertisement report by the nodes N3 to N5 for shutting down nodes N1 and N2.
In the next step the shutdown facilities of each node will, after collecting all the reports, determine the subcluster to which they belong. For this step they will use the declared master node in each report. The nodes which have declared the same node as master node are supposed to belong to the same subcluster. For example, the shutdown facilities of N5 and N4 have declared the node N3 as their master. Therefore the nodes N4 and N5 belong to the same subcluster SC2 as node N3. On the other hand, the nodes N1 and N2 which both declare node N1 as master node belong to the subcluster SC1.
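Grouping nodes by their declared master can be sketched in a few lines of Python. The function name and the input mapping (node name to declared master) are assumptions chosen for illustration.

```python
from collections import defaultdict


def group_by_master(declared_master):
    """Map each declared master to the sorted list of nodes that declared it."""
    groups = defaultdict(set)
    for node, master in declared_master.items():
        groups[master].add(node)
    return {master: sorted(members) for master, members in groups.items()}


# Declarations as they would appear in the reports of the Figure 3 example.
declared = {"N1": "N1", "N2": "N1", "N3": "N3", "N4": "N3", "N5": "N3"}
subclusters = group_by_master(declared)
```

Here the group under "N1" corresponds to subcluster SC1 and the group under "N3" to subcluster SC2.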
After this step the shutdown facility calculates the subcluster weight. This is done simply by adding the local weights in the advertisement reports sent by the nodes belonging to the same subcluster. If the subcluster weight exceeds 50% of the total cluster weight, then the shutdown facility of the master node of that subcluster can automatically start sending shutdown commands to the nodes of the other subcluster, because the other subcluster cannot outrank it.
In the example of Figure 3 the shutdown facility of node N3 has received the advertisement reports of nodes N4 and N5, which belong to the same subcluster. The local weights of N4, N5 and N3 add up to 3, which is greater than 50% of the total cluster weight of 5. Therefore node N3 can immediately start sending shutdown commands to nodes N1 and N2 of the subcluster SC1.
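The weight check of this example can be written as a short sketch. The function names are hypothetical, and a local weight of 1 per node is assumed, which is consistent with the subcluster weight of 3 and the total cluster weight of 5 stated above. The comparison uses integers to avoid rounding at exactly 50%.

```python
from typing import Dict, Iterable


def subcluster_weight(members: Iterable[str], weights: Dict[str, int]) -> int:
    # Sum of the local weights advertised by the members of one subcluster.
    return sum(weights[node] for node in members)


def has_majority(sub_weight: int, total_weight: int) -> bool:
    # True only if the subcluster holds strictly more than 50%.
    return 2 * sub_weight > total_weight


# Assumed local weights: 1 per node, total cluster weight 5.
weights = {"N1": 1, "N2": 1, "N3": 1, "N4": 1, "N5": 1}
total = sum(weights.values())

sc2 = subcluster_weight(["N3", "N4", "N5"], weights)
sc1 = subcluster_weight(["N1", "N2"], weights)
print(sc2, has_majority(sc2, total))
print(sc1, has_majority(sc1, total))
```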
If the calculated weight is exactly 50% of the total cluster weight and the nodes of every other subcluster together claim less than 50%, then the split-brain condition is assumed to be a three-way split. The master node of the 50% subcluster can immediately start sending shutdown commands to all other nodes not in its subcluster, or to the nodes determined to be shut down. No other subcluster can outrank it.
In case of a split condition resulting in two subclusters with 50% weight each, the subcluster which contains the lowest alphanumeric node name begins sending shutdown commands to the other subcluster first. The surviving subcluster will therefore contain the node with the lowest alphanumeric node name. It is also possible to use a parameter other than the node name in the case of an exact 50% split.
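The three cases described so far (more than 50%, exactly 50% without a rival, exactly 50% with a rival) can be combined in one decision function. This is a sketch under assumptions: the function name and the returned labels are hypothetical, node names are compared as plain strings, and the label "wait" for the 50% subcluster that does not hold the lowest name is an illustrative choice, since the text only states which subcluster sends first.

```python
from typing import List, Set, Tuple


def decide(own_members: Set[str], own_weight: int,
           others: List[Tuple[Set[str], int]], total_weight: int) -> str:
    if 2 * own_weight > total_weight:
        return "send_now"        # more than 50%: cannot be outranked
    if 2 * own_weight == total_weight:
        rivals = [members for members, weight in others
                  if 2 * weight == total_weight]
        if not rivals:
            return "send_now"    # exactly 50%, all others below 50%
        if all(min(own_members) < min(members) for members in rivals):
            return "send_first"  # holds the lowest alphanumeric name
        return "wait"            # the rival 50% subcluster sends first
    return "delay"               # below 50%: calculate a delay time


# Four nodes with an assumed weight of 1 each.
print(decide({"N1", "N2"}, 2, [({"N3", "N4"}, 2)], 4))
print(decide({"N3", "N4"}, 2, [({"N1", "N2"}, 2)], 4))
print(decide({"N1", "N2"}, 2, [({"N3"}, 1), ({"N4"}, 1)], 4))
```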
A subcluster weight of less than 50% for each subcluster occurs if not all nodes send an advertisement report or declare a specific node as master. If the subcluster weight is less than 50%, each of the shutdown facilities in the subcluster calculates a delay time. This delay time depends on the local weight of the local node and also on the position of the node in the subcluster. Additionally, the delay time should include a value which is the sum of all timeouts of all shutdown facilities to be used in the subcluster.
For example, in the subcluster SC1 the shutdown facility SF of node N1 will wait for five seconds before starting to send the shutdown commands. The shutdown facility of node N2 in subcluster SC1 will wait for five seconds plus another two seconds, representing the second position in the ranking of subcluster SC1. Finally the shutdown facility SF checks for an indicating signal. This signal indicates whether the shutdown process of the nodes N3 to N5 to be shut down has already begun. If that is the case and all nodes to be shut down have sent their signal indicating the shutdown process, the facility can stop here. If an indication signal is not received, then the shutdown facility assumes that a prior shutdown facility with a shorter delay time had some problems sending the shutdown signal. It therefore starts immediately to shut down the nodes of the other subclusters. This is a fail-safe mechanism.
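The timing of this example and the fail-safe check can be sketched as follows. The function names and the parameter names are hypothetical. The values of five and two seconds are taken from the example above, and the position is counted from 0 for the master node.

```python
from typing import Set


def delay_seconds(position: int, base: float = 5.0, step: float = 2.0) -> float:
    # Base delay plus a fixed step per position in the subcluster ranking.
    return base + step * position


def must_send_shutdown(targets: Set[str], indicated: Set[str]) -> bool:
    # After the delay has expired: send shutdown commands only if at least
    # one node to be shut down has not indicated that its shutdown began.
    return not targets <= indicated


print(delay_seconds(0))  # node N1, first position in SC1
print(delay_seconds(1))  # node N2, second position in SC1

targets = {"N3", "N4", "N5"}
print(must_send_shutdown(targets, {"N3", "N4", "N5"}))
print(must_send_shutdown(targets, {"N3"}))
```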
Thus the master node of a subcluster normally gets the shortest delay time compared to all other nodes in that subcluster. Hence it will start sending the shutdown commands to the other nodes before the delay time of any other node in that subcluster expires. It is therefore necessary to ensure that no shutdown command was already sent before the fail-safe mechanism starts in other nodes. This prevents a node from receiving two shutdown commands within a short time, which normally causes a panic or a computer crash on that node.
Another embodiment of this invention is presented in Figure 4. It shows a cluster comprising three different nodes N1, N2 and N3, which are connected to a cluster network CN and to an administrative network AN. As can be seen, the cluster communication between node N1 and node N3 is down, while the communication between node N1 and node N2 as well as between node N2 and node N3 is still working. In this embodiment of the invention the cluster foundations of node N1 and node N3 will not receive heartbeat messages from each other, and each will send a failure indicator for the other node to its shutdown facility at about the same time. However, node N2 can still communicate with node N1 and node N3 and therefore receives no failure indicator.
Figure 5 shows an overview of the actions taken by the shutdown facilities SF of the nodes in the cluster. The shutdown facility SF in node N1 will first determine node N3 as the node to be shut down, while the shutdown facility of node N3 determines N1 as the node to be shut down. Because no communication failure is reported with node N2, the shutdown facility of node N1 assumes that node N2 is in the same subcluster. Since node N1 has a lower alphanumeric name than node N2, the shutdown facility of node N1 declares node N1 as master node.
At roughly the same time the shutdown facility of node N3 assumes node N2 and node N3 to be in the same subcluster and declares node N2 as the master node of that subcluster. Both shutdown facilities calculate their weight and generate the advertisement reports requesting a shutdown process for the other node. They will then send those advertisement reports over the administrative network AN. The shutdown facility of node N2 receives those advertisement reports but does not take any action, because it can still communicate with both nodes and will therefore automatically belong to the surviving subcluster.
The shutdown facilities delay for one to two seconds, waiting for all advertisements sent by other nodes. The shutdown facility of N1 receives the advertisement of node N3, and the shutdown facility of N3 receives the advertisement of the shutdown facility of N1. The delay time is calculated based on the received advertisements.
The shutdown facilities consider only their own weight of 33% of the total cluster weight, because the shutdown facility of node N2 has not advertised. Thus the shutdown facilities of nodes N1 and N3 cannot assume that node N2 is part of their subcluster. The shutdown facility of node N3, which declared node N2 as master node of its subcluster, adds some additional time to the calculated delay time because it is not the master node of the subcluster. Therefore the total calculated delay time of the shutdown facility of node N1 is shorter than the delay of the shutdown facility of node N3.
After the calculated delay time the shutdown facility SF of N1 sends the shutdown command to node N3. Normally node N3 would start to shut down and no shutdown command would be sent by the facility of N3. Nodes N1 and N2 would be the remaining nodes.
However, in this example, as can be seen from Figure 5, the kill command sent over the administrative network AN is not received by N3 due to a temporary transmission failure. The delay time of node N3 expires without a shutdown signal having been received. The shutdown facility SF of N3 now assumes that, even though its own subcluster is not the one selected to survive the process, the highest-weight subcluster, comprising node N1 and node N2, is not performing the node elimination it should be doing. It therefore sends a shutdown command to node N1, and node N1 shuts down. In this embodiment, though the split-brain condition is solved, the surviving subcluster of node N2 and node N3 is not the optimal subcluster due to its weight of 3 compared to the optimal subcluster weight of node N1 and node N2.
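The sequence of Figures 4 and 5 can be replayed with a small simulation. This is a simplified sketch and not the shutdown facility itself: the function name, the data layout and the delay values of 1.0 and 3.0 seconds are hypothetical. Only nodes N1 and N3 act, since node N2 takes no action.

```python
from typing import Dict, FrozenSet, Set, Tuple


def resolve(delays: Dict[str, float], targets: Dict[str, str],
            lost: FrozenSet[Tuple[str, str]] = frozenset()) -> Set[str]:
    """Return the acting nodes that are still running at the end."""
    alive = set(delays)
    # Process the nodes in the order in which their delay times expire.
    for node in sorted(delays, key=delays.get):
        if node not in alive:
            continue  # already shut down by a node with a shorter delay
        target = targets[node]
        if (node, target) not in lost:
            alive.discard(target)  # shutdown command was delivered
    return alive


delays = {"N1": 1.0, "N3": 3.0}    # N3 adds time for not being master
targets = {"N1": "N3", "N3": "N1"}

print(resolve(delays, targets))
print(resolve(delays, targets, frozenset({("N1", "N3")})))
```

In the normal case only N1 remains of the two acting nodes. When the command from N1 to N3 is lost, the delay of N3 expires and N3 shuts down N1 instead.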
Figure 6 shows another aspect of the invention. It shows a cluster consisting of three nodes N1, N2 and N3, which are connected to a cluster network CN and an administrative network AN. The communication between the nodes N2 and N3 as well as between the nodes N1 and N3 is broken. The cluster foundations CF transmit a failure indicator after the heartbeat messages have not been received for some time. The shutdown facilities of node N1 and node N2 receive a failure indicator for node N3, while the shutdown facility of node N3 receives a failure indicator for node N1 and node N2.
In the next step the shutdown facilities determine the masters of the respective subclusters. Node N1 is declared master by the shutdown facilities of node N1 and node N2. The shutdown facility of node N3 declares itself as master, since communication with any other node is no longer possible.
In the next step the total weight is calculated. In this case the total weight of the subcluster of node N1 and node N2 is 11, while the total weight of subcluster N3 is only 10. After that the advertisement reports are created. In this embodiment of the invention the shutdown facility of node N2 will send its advertisement report only to the shutdown facility of node N1, while the shutdown facility of node N1 will send its advertisement report only to the facility of N2. It will not send the advertisement report to N3. The shutdown facility of N3 does not need to send its advertisement reports to N1 or N2.
However, it will wait some time before starting the calculation of the delay time, to compensate for the time which is needed for receiving the advertisements. After that the delay time is calculated. The shutdown facility of node N1 calculates the total subcluster weight to be greater than 50% of the total cluster weight. It sets its calculated delay time to zero and starts sending the shutdown command to node N3 immediately. After some time node N1 and node N3 should receive a signal indicating that node N3 has been shut down. The split-brain condition is solved.
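The majority check of this example can be written out as worked arithmetic, using only the two subcluster weights stated above. The symbols are introduced here for illustration, and the total cluster weight is taken as the sum of the two subcluster weights, since the cluster consists of these three nodes only.

```latex
\[
\frac{w_{\{N1,N2\}}}{w_{\mathrm{total}}} = \frac{11}{11 + 10} = \frac{11}{21} \approx 52.4\,\% > 50\,\%,
\qquad
\frac{w_{\{N3\}}}{w_{\mathrm{total}}} = \frac{10}{21} \approx 47.6\,\% < 50\,\%
\]
```

The subcluster of node N1 and node N2 therefore holds the majority, which is why the delay time of node N1 is set to zero.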
The method described in this invention is easily adaptable to different cluster systems. Furthermore, for the calculation of the local weight as well as of the subcluster weight, it is possible to include weight values other than just the local machine weight and the user application weight. In a preferred embodiment of the invention the machine weight of each node in the cluster system is written in a configuration file. It is useful for the cluster foundation CF to generate a "LEFTCLUSTER" signal, which is broadcast to all other cluster nodes in the surviving cluster, indicating the change of the cluster structure.
In this preferred embodiment of the invention the delay time is calculated using the local weight, the node name and the weight of the surviving subcluster. If not all cluster nodes in a subcluster have advertised their weights, it is necessary to rely on an algorithm that allows the subcluster with the greatest weight to delay the least time and that also factors in the bias of the alphanumeric node name. A possible solution for this delay is given by the formula:
delay = (maximum delay) * factor
The factor includes a relative ranking of the nodes in the subcluster as well as a relative weight compared to the total cluster weight. The formula should result in a delay time such that nodes in a subcluster of small weight compared to the total cluster receive a very large delay time. Nodes of a subcluster whose relative weight is high will calculate a small delay time. The delay times of the nodes within one subcluster remain different, depending on the relative ranking in that subcluster.
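One possible shape of such a factor is sketched below. The text does not define the factor, so the expression, the function name, the parameter names and the split of 0.9 to 0.1 between the weight term and the ranking term are assumptions chosen for illustration only. With this choice the weight term dominates, but the ordering between subclusters of nearly equal weight is not guaranteed.

```python
def delay_time(max_delay: float, sub_weight: float, total_weight: float,
               position: int, size: int, rank_share: float = 0.1) -> float:
    """delay = (maximum delay) * factor, with a hypothetical factor.

    position is the rank of the node in its subcluster (0 = master),
    size is the number of nodes in the subcluster.
    """
    rel_weight = sub_weight / total_weight
    # A low relative weight gives a large weight term and a long delay.
    weight_term = (1.0 - rank_share) * (1.0 - rel_weight)
    # Nodes of one subcluster differ by their relative ranking.
    rank_term = rank_share * position / size
    return max_delay * (weight_term + rank_term)


# Assumed values: maximum delay of 30 seconds, total cluster weight 5.
d_master = delay_time(30.0, 2, 5, position=0, size=2)
d_second = delay_time(30.0, 2, 5, position=1, size=2)
d_light = delay_time(30.0, 1, 5, position=0, size=1)
print(d_master, d_second, d_light)
```

Under these assumptions the master of the heavier subcluster gets the shortest delay, the second node of that subcluster follows, and the node of the lighter subcluster waits longest.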
This invention makes the communication failure indicators as well as the shutdown requests local knowledge of one specific node of the cluster. Therefore the determination of the membership of the subclusters must wait until all shutdown requests have been advertised, sent and received. However, it is not necessary to send advertisement reports to nodes other than the members of the own subcluster.

Claims

1. A method for a shutdown process in a cluster system comprising a first and at least one second node, said nodes connected to a communication network and having a name and a host weight assigned to it, the method implemented in at least one of the first and at least one second node and the method comprising the steps of:
a) inspecting a communication link via the communication network between the first and the at least one second node;
b) determining which of the at least one second node is to be shutdown after a failure of the communication link via the communication network between the first and the at least one second node;
c) creating an advertisement report for the at least one second node determined to be shutdown;
d) sending the advertisement report to at least one node of the cluster system comprising the first and the at least one second node;
e) calculating a delay time depending at least on the weight of the first node;
f) sending a shut down command to the at least one second node to be shut down after the expiry of the calculated delay time.
2. The method of claim 1, wherein inspecting the connection comprises the step of:
- listening to heartbeat messages sent by the at least one second node over the communication link and
- setting a failure indicator, if the heartbeat messages of the at least one second node is not received during a specific amount of time.
3. The method of claim 1 or 2, wherein step b) comprises the steps of:
- waiting a specific amount of time after a failure of a communication link is received for an additional failure of a communication link between the first and a second of the at least one second node and
- determining the at least one second node to be shut down.
4. The method of one of claims 1 to 3, wherein creating an advertisement report comprises the step of determining a node for which no failure of communication is reported as a master node.
5. The method of claim 4, wherein the master node is the node with the lowest alphanumeric name for which no failure of communication is reported.
6. The method of one of claims 4 to 5, wherein creating an advertisement report comprises the step of creating at least one list including the name of the first node, the name of the master node and the name of the at least one second node to be shutdown.
7. The method of claim 6, wherein one list is created for each of the at least one second node to be shutdown.
8. The method of one of claims 6 or 7, wherein the list includes the host weight of the first node.
9. The method of one of claims 1 to 8, wherein sending advertisement report comprises sending the advertisement report to each node of the first and at least one second node in the cluster.
10. The method of one of claims 1 to 9, wherein in step e) the delay time is set to zero, if the host weight assigned to the first node is greater than 50% of the total weight of the first and the at least one second node.
11. The method of one of claims 1 to 10, wherein in step e) the delay time is set to zero if the sum of the weight of the first node and the nodes for which no failure of communication is received is greater than 50% of the total weight of the first and the at least one second node.
12. The method of one of claims 1 to 11, wherein, sending a shut down command to the at least one second node in step f) is performed after lacking an indicator signal of the at least one second node indicating a shut down process.
13. The method of one of claims 1 to 12, wherein the host weight includes a predefined machine weight and a user application weight, which depends on executed application on the host.
14. The method of one of claims 1 to 13, wherein the advertisement reports are sent via the UDP protocol.
PCT/EP2003/010985 2002-10-07 2003-10-02 Method of solving a split-brain condition WO2004031979A2 (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
EP03798928A EP1550036B1 (en) 2002-10-07 2003-10-02 Method of solving a split-brain condition in a cluster computer system
DE60318468T DE60318468T2 (en) 2002-10-07 2003-10-02 METHOD FOR SOLVING DECISION-FREE POSSIBILITIES IN A CLUSTER COMPUTER SYSTEM
AU2003276045A AU2003276045A1 (en) 2002-10-07 2003-10-02 Method of solving a split-brain condition
US11/101,720 US7843811B2 (en) 2002-10-07 2005-04-07 Method of solving a split-brain condition

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US41678302P 2002-10-07 2002-10-07
US60/416,783 2002-10-07

Publications (2)

Publication Number Publication Date
WO2004031979A2 true WO2004031979A2 (en) 2004-04-15
WO2004031979A3 WO2004031979A3 (en) 2004-07-29

Family

ID=32069953

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2003/010985 WO2004031979A2 (en) 2002-10-07 2003-10-02 Method of solving a split-brain condition

Country Status (5)

Country Link
US (1) US7843811B2 (en)
EP (1) EP1550036B1 (en)
AU (1) AU2003276045A1 (en)
DE (1) DE60318468T2 (en)
WO (1) WO2004031979A2 (en)

Families Citing this family (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6360100B1 (en) 1998-09-22 2002-03-19 Qualcomm Incorporated Method for robust handoff in wireless communication system
US7668541B2 (en) 2003-01-31 2010-02-23 Qualcomm Incorporated Enhanced techniques for using core based nodes for state transfer
US6862446B2 (en) 2003-01-31 2005-03-01 Flarion Technologies, Inc. Methods and apparatus for the utilization of core based nodes for state transfer
US8983468B2 (en) 2005-12-22 2015-03-17 Qualcomm Incorporated Communications methods and apparatus using physical attachment point identifiers
US8982778B2 (en) 2005-09-19 2015-03-17 Qualcomm Incorporated Packet routing in a wireless communications environment
US8509799B2 (en) 2005-09-19 2013-08-13 Qualcomm Incorporated Provision of QoS treatment based upon multiple requests
US8982835B2 (en) 2005-09-19 2015-03-17 Qualcomm Incorporated Provision of a move indication to a resource requester
US9736752B2 (en) 2005-12-22 2017-08-15 Qualcomm Incorporated Communications methods and apparatus using physical attachment point identifiers which support dual communications links
US9078084B2 (en) * 2005-12-22 2015-07-07 Qualcomm Incorporated Method and apparatus for end node assisted neighbor discovery
US9066344B2 (en) 2005-09-19 2015-06-23 Qualcomm Incorporated State synchronization of access routers
US9083355B2 (en) 2006-02-24 2015-07-14 Qualcomm Incorporated Method and apparatus for end node assisted neighbor discovery
US9155008B2 (en) 2007-03-26 2015-10-06 Qualcomm Incorporated Apparatus and method of performing a handoff in a communication network
US8830818B2 (en) 2007-06-07 2014-09-09 Qualcomm Incorporated Forward handover under radio link failure
US9094173B2 (en) 2007-06-25 2015-07-28 Qualcomm Incorporated Recovery from handoff error due to false detection of handoff completion signal at access terminal
EP2227057B1 (en) * 2009-03-04 2012-12-26 Fujitsu Limited Improvements to short-range wireless networks
US8671218B2 (en) * 2009-06-16 2014-03-11 Oracle America, Inc. Method and system for a weak membership tie-break
US8615241B2 (en) 2010-04-09 2013-12-24 Qualcomm Incorporated Methods and apparatus for facilitating robust forward handover in long term evolution (LTE) communication systems
US8806264B2 (en) * 2010-08-30 2014-08-12 Oracle International Corporation Methods for detecting split brain in a distributed system
US8595546B2 (en) * 2011-10-28 2013-11-26 Zettaset, Inc. Split brain resistant failover in high availability clusters
JP5541275B2 (en) * 2011-12-28 2014-07-09 富士通株式会社 Information processing apparatus and unauthorized access prevention method
CN103297396B (en) 2012-02-28 2016-05-18 国际商业机器公司 The apparatus and method that in cluster system, managing failures shifts
JP5790723B2 (en) * 2013-09-12 2015-10-07 日本電気株式会社 Cluster system, information processing apparatus, cluster system control method, and program
US9450852B1 (en) * 2014-01-03 2016-09-20 Juniper Networks, Inc. Systems and methods for preventing split-brain scenarios in high-availability clusters
US10846219B2 (en) * 2015-07-31 2020-11-24 Hewlett Packard Enterprise Development Lp Data copy to non-volatile memory
US10764144B2 (en) * 2016-10-13 2020-09-01 International Business Machines Corporation Handling a split within a clustered environment
US10592342B1 (en) * 2018-02-02 2020-03-17 EMC IP Holding Company LLC Environmental aware witness for active-active storage cluster nodes
US11558454B2 (en) 2018-07-31 2023-01-17 Hewlett Packard Enterprise Development Lp Group leader role queries
CN111651291B (en) * 2020-04-23 2023-02-03 国网河南省电力公司电力科学研究院 Method, system and computer storage medium for preventing split brain of shared storage cluster

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6467050B1 (en) * 1998-09-14 2002-10-15 International Business Machines Corporation Method and apparatus for managing services within a cluster computer system
US7076783B1 (en) * 1999-05-28 2006-07-11 Oracle International Corporation Providing figure of merit vote from application executing on a partitioned cluster
US6728897B1 (en) * 2000-07-25 2004-04-27 Network Appliance, Inc. Negotiating takeover in high availability cluster
US7228453B2 (en) * 2000-10-16 2007-06-05 Goahead Software, Inc. Techniques for maintaining high availability of networked systems
US7197660B1 (en) * 2002-06-26 2007-03-27 Juniper Networks, Inc. High availability network security systems
US7403993B2 (en) * 2002-07-24 2008-07-22 Kasenna, Inc. System and method for highly-scalable real-time and time-based data delivery using server clusters

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6192401B1 (en) * 1997-10-21 2001-02-20 Sun Microsystems, Inc. System and method for determining cluster membership in a heterogeneous distributed system
US6363495B1 (en) * 1999-01-19 2002-03-26 International Business Machines Corporation Method and apparatus for partition resolution in clustered computer systems
EP1117040A2 (en) * 2000-01-10 2001-07-18 Sun Microsystems, Inc. Method and apparatus for resolving partial connectivity in a clustered computing system
EP1117039A2 (en) * 2000-01-10 2001-07-18 Sun Microsystems, Inc. Controlled take over of services by remaining nodes of clustered computing system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
FUJISTSU SIEMENS COMPUTER: "PRIMECLUSTER for mySAP Business Suite, Shutdown Facility (SF) (Linux), Configuration and Administration Guide" WHITE PAPER, October 2003 (2003-10), pages 1-70, XP002275067 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012072344A1 (en) * 2010-12-03 2012-06-07 International Business Machines Corporation Endpoint-to-endpoint communications status monitoring
US8433760B2 (en) 2010-12-03 2013-04-30 International Business Machines Corporation Inter-node communication scheme for node status sharing
US8634328B2 (en) 2010-12-03 2014-01-21 International Business Machines Corporation Endpoint-to-endpoint communications status monitoring
US8806007B2 (en) 2010-12-03 2014-08-12 International Business Machines Corporation Inter-node communication scheme for node status sharing
US8824335B2 (en) 2010-12-03 2014-09-02 International Business Machines Corporation Endpoint-to-endpoint communications status monitoring
US9553789B2 (en) 2010-12-03 2017-01-24 International Business Machines Corporation Inter-node communication scheme for sharing node operating status
US8634330B2 (en) 2011-04-04 2014-01-21 International Business Machines Corporation Inter-cluster communications technique for event and health status communications
US8891403B2 (en) 2011-04-04 2014-11-18 International Business Machines Corporation Inter-cluster communications technique for event and health status communications
WO2014110063A1 (en) * 2013-01-09 2014-07-17 Microsoft Corporation Automated failure handling through isolation
WO2016050074A1 (en) * 2014-09-29 2016-04-07 中兴通讯股份有限公司 Cluster split brain processing method and apparatus
US20210349860A1 (en) * 2020-05-07 2021-11-11 Hewlett Packard Enterprise Development Lp Assignment of quora values to nodes based on importance of the nodes
US11544228B2 (en) * 2020-05-07 2023-01-03 Hewlett Packard Enterprise Development Lp Assignment of quora values to nodes based on importance of the nodes

Also Published As

Publication number Publication date
WO2004031979A3 (en) 2004-07-29
US7843811B2 (en) 2010-11-30
DE60318468D1 (en) 2008-02-14
DE60318468T2 (en) 2008-05-21
EP1550036B1 (en) 2008-01-02
EP1550036A2 (en) 2005-07-06
AU2003276045A1 (en) 2004-04-23
US20050268153A1 (en) 2005-12-01
AU2003276045A8 (en) 2004-04-23

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
WWE Wipo information: entry into national phase

Ref document number: 2003798928

Country of ref document: EP

WWP Wipo information: published in national office

Ref document number: 2003798928

Country of ref document: EP

NENP Non-entry into the national phase

Ref country code: JP

WWW Wipo information: withdrawn in national office

Country of ref document: JP

WWG Wipo information: grant in national office

Ref document number: 2003798928

Country of ref document: EP