CN111651291B

CN111651291B - Method, system and computer storage medium for preventing split brain of shared storage cluster

Info

Publication number: CN111651291B
Application number: CN202010326284.9A
Authority: CN
Inventors: 宫灿锋; 吴坡; 张江南; 贺勇; 任鹏凌; 阮冲; 王丹; 李斌
Original assignee: State Grid Corp of China SGCC; State Grid Henan Electric Power Co Ltd; Electric Power Research Institute of State Grid Henan Electric Power Co Ltd
Current assignee: State Grid Corp of China SGCC; State Grid Henan Electric Power Co Ltd; Electric Power Research Institute of State Grid Henan Electric Power Co Ltd
Priority date: 2020-04-23
Filing date: 2020-04-23
Publication date: 2023-02-03
Anticipated expiration: 2040-04-23
Also published as: CN111651291A

Abstract

The invention relates to a method, a system and a computer storage medium for preventing split brain of a shared storage cluster.

Description

Method, system and computer storage medium for preventing split brain of a shared storage cluster

Technical Field

The present application relates to the field of shared storage cluster technologies, and in particular, to a method, a system, and a computer storage medium for preventing split brain in a shared storage cluster.

Background

A shared storage cluster is a server cluster that shares a storage device. The shared storage device is connected to a plurality of servers at the same time, and user service data are stored in it. A main server provides services externally and accesses the shared storage device to read and write the data. Once the main server fails (for example, operating system shutdown, accidental power failure of the server, or network failure), the system automatically switches the service applications to a standby server, which takes over the access rights to the shared storage device and continues to provide the external service, so that the service applications run without interruption.

The servers are connected with each other through a heartbeat line to form the whole server cluster. If the heartbeat between the servers fails, that is, the servers cannot detect each other's heartbeat within the specified time, the cluster, which was originally one coordinated whole, splits into several independent individuals, each of which starts its failover function to acquire ownership of resources and services. In other words, because the servers have lost contact with each other, each considers the other to have failed and instinctively contends for the shared storage and the application service. The consequences are serious: either the shared resources are carved up and the service cannot start on either side; or the services start on both sides but read and write the shared storage at the same time, resulting in data corruption.

Disclosure of Invention

The technical problem to be solved by the invention is as follows: the method aims to solve the problem that data are damaged due to split brains of shared storage clusters in the prior art.

In order to solve the above technical problem, the invention provides a method, a system and a computer storage medium for preventing split brain of a shared storage cluster.

The method comprises the steps of initializing server nodes in a shared storage cluster, sequencing the positions of the server nodes in the shared storage cluster, carrying out prejudgment detection after a heartbeat network fault occurs, finding out the server nodes with the fault, taking measures in advance, avoiding the split brain condition, ensuring that cluster system data operate consistently and uninterruptedly, and further improving the availability and reliability of the cluster.

The technical scheme adopted by the invention for solving the technical problem is as follows:

The first aspect of the invention provides a method for preventing split brain of a shared storage cluster, comprising the following steps:

initializing and sequencing server nodes in a shared storage cluster;

when a plurality of resource access requests are detected, judging that the shared storage device is about to have a split brain, triggering a split brain detection mechanism, sending a heartbeat detection instruction to the current main server node, and judging whether the current main server node has a fault;

if the current main server node fails, switching the main server and the standby server, switching the standby server node with the optimal sequence into a new main server node, and enabling the new main server node to be used as a unique server node to perform resource interactive access with the shared storage device.

The second aspect of the present invention provides a system for preventing split brain of a shared storage cluster, comprising: a split brain prevention module disposed in a shared storage device, the split brain prevention module comprising:

the initialization unit is used for carrying out initialization sequencing on the server nodes in the shared storage cluster;

the split brain detection unit is used for triggering a split brain detection mechanism when detecting that the shared storage device is about to have a split brain, sending a heartbeat detection instruction to the current main server node, and judging whether the current main server node has a fault;

and the control unit is used for switching the main server and the standby server if the current main server node fails, switching the standby server node with the optimal sequence into a new main server node, and enabling the new main server node to be used as a unique server node to perform resource interactive access with the shared storage equipment.

A third aspect of the invention provides a computer storage medium having stored thereon a computer program for implementing the method of the first aspect of the invention when executed by a processor.

The beneficial effects of the invention are as follows: a heartbeat detection module is arranged in each node of the shared storage cluster, and a split brain prevention module is arranged in the shared storage device. The heartbeat detection module detects the server nodes with a heartbeat fault in the shared storage cluster, the split brain prevention module then judges whether the current main server node has failed, and if it has, a standby server node is switched in promptly as the new main server node. The invention can determine the server node with the heartbeat failure before a split brain occurs in the shared storage cluster and take effective measures against the failed server node, ensuring that the main server node running in the shared storage cluster is a normal node, avoiding split brain in the shared storage cluster, and improving the reliability and availability of its operation.

Drawings

The technical solution of the present application is further explained below with reference to the drawings and examples.

FIG. 1 is a flow chart of the operation of the split brain prevention module of an embodiment of the present application;

FIG. 2 is a flow chart of the operation of the heartbeat detection module according to an embodiment of the present application;

FIG. 3 is a schematic structural diagram of a split brain prevention system according to an embodiment of the present application.

Detailed Description

It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.

The technical solutions of the present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.

Example 1

The embodiment provides a method for preventing split brain of a shared storage cluster, as shown in fig. 1, including:

S1, performing initialization sequencing on the server nodes in the shared storage cluster;

S2, when a plurality of resource access requests are detected, judging that the shared storage device is about to have a split brain, triggering a split brain detection mechanism, sending a heartbeat detection instruction to the current main server node, and judging whether the current main server node has a fault;

S3, if the current main server node fails, performing main/standby switching: the standby server node with the best ranking is switched to be the new main server node, and the new main server node serves as the only server node that accesses resources on the shared storage device.

The shared storage device of this embodiment may be a shared disk, and a certain storage space is partitioned in the shared disk to hold the split brain prevention module.

When a heartbeat network in the shared cluster fails, server nodes in the cluster cannot mutually detect heartbeats of the other party within a specified time, so that a fault transfer function is started to acquire ownership of resources and services, namely, a phenomenon of contending for accessing the shared disk and acquiring read-write permission of the disk occurs.

When the split brain prevention module in the shared disk detects that a plurality of disk access requests exist, it judges that the shared disk is about to have a split brain and triggers the split brain detection mechanism.
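By way of illustration only, the trigger condition described above might be sketched as follows. The class name `AccessMonitor`, the method `request_access` and the rule "requests from more than one distinct node" are assumptions made for this sketch; the embodiment only states that a plurality of disk access requests triggers the detection mechanism.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class AccessMonitor:
    """Hypothetical stand-in for the trigger logic of the split brain prevention module."""

    # IPs of nodes that currently request read-write access to the shared disk.
    pending: Set[str] = field(default_factory=set)

    def request_access(self, node_ip: str) -> bool:
        """Record a disk access request.

        Returns True when requests from more than one distinct node are
        pending, i.e. when split brain detection should be triggered
        (assumed interpretation of "a plurality of access requests").
        """
        self.pending.add(node_ip)
        return len(self.pending) > 1


monitor = AccessMonitor()
first = monitor.request_access("10.0.0.1")   # only one requester so far
second = monitor.request_access("10.0.0.2")  # a second node contends for the disk
```

In this sketch a repeated request from the same node does not count as contention; a real module might instead count requests within a time window.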

Optionally, in S1 of this embodiment, the step of performing initialization sequencing on the server nodes in the shared storage cluster is:

S11: first, ranking the designated main server node, and placing the sequence number of the designated main server node at the first position of the shared storage cluster;

S12: next, ranking the standby server nodes in sequence;

S13: mapping the IP address of each server node one-to-one to its sequence number to generate a node sorting table.

In this embodiment, 5 server nodes in a shared storage cluster are taken as an example, that is: the main server node is node A, and the standby server nodes are node B, node C, node D and node E.

The result of the initial ordering is: the main server node A has sequence number 1, the standby server node B has sequence number 2, node C has sequence number 3, node D has sequence number 4, and node E has sequence number 5.

In this embodiment, the IP address of each server node corresponds to the sorted serial number one by one, and a node sorting table is generated and stored as a basis for switching between the active and standby server nodes.
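As a minimal sketch of steps S11 to S13, the node sorting table can be modelled as a mapping from IP address to sequence number. The function name `build_node_sorting_table` and the IP addresses are hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def build_node_sorting_table(primary_ip: str, standby_ips: List[str]) -> Dict[str, int]:
    """Return {ip: sequence_number} with the designated main node first (S11),
    the standby nodes following in the given order (S12), and a one-to-one
    mapping between IP address and sequence number (S13)."""
    table = {primary_ip: 1}
    for seq, ip in enumerate(standby_ips, start=2):
        table[ip] = seq
    return table


# Five nodes as in the embodiment: A is the main node, B to E are standbys.
# The addresses below are placeholders, not taken from the source.
table = build_node_sorting_table(
    "192.168.1.1",                      # node A
    ["192.168.1.2", "192.168.1.3",      # nodes B, C
     "192.168.1.4", "192.168.1.5"],     # nodes D, E
)
```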

Optionally, in S2 of this embodiment, after the heartbeat detection instruction is sent, the method further includes the step of generating fault detection information:

S21, after the heartbeat detection instruction is sent, if no heartbeat detection instruction response information fed back by the main server node is received within a preset time, judging that the main server node has a fault, and generating fault detection information containing the IP of the main server node;

S22, if heartbeat detection instruction response information fed back by the main server node is received within the preset time, judging that the current main server node is normal, and using as the fault detection information the fault node table compiled by the main server node from the failed standby server nodes, wherein the fault node table comprises the IP of each failed server node.

In order to prevent split brain, when the split brain prevention module in the shared disk detects that a split brain is about to occur, it sends a heartbeat detection instruction to the current main server node to judge whether the main server node has failed.

Each server node is provided with a heartbeat detection module, the heartbeat detection module comprises an Address Resolution Protocol (ARP) table, and the ARP table in a server node comprises the heartbeat IP addresses of all server nodes in the shared storage cluster other than that server node. ARP is a protocol that resolves an IP address into an Ethernet MAC address (or physical address).

After receiving the heartbeat detection instruction, the current main server node looks up the heartbeat IP addresses of the server nodes in its ARP table, sends an ARP heartbeat request message to all standby server nodes in the shared storage cluster, and receives the response messages.

If the current main server node receives no response message from any standby server node in the shared storage cluster within the preset time, this indicates that the heartbeat network of the current main server node has failed: it cannot receive the heartbeat information of the standby server nodes and cannot feed back the heartbeat detection instruction response information.

If the main server node can receive response messages from the standby server nodes, the main server node is in a normal operation state. By checking the received response messages against the IP addresses in the ARP table, it screens out the standby server nodes that sent no response message, namely the failed standby server nodes, and compiles them into a fault node table, which is fed back to the split brain prevention module.
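The screening performed by the main server node might be sketched as below. This is an illustration under stated assumptions, not the patented implementation: the function name `screen_failed_standbys` is hypothetical, the ARP exchange itself is not modelled, and the set of nodes that answered within the preset time is simply passed in.

```python
from typing import Iterable, List, Optional


def screen_failed_standbys(
    arp_table_ips: Iterable[str],
    responded_ips: Iterable[str],
) -> Optional[List[str]]:
    """Compare received responses with the ARP table of the main node.

    Returns None when no standby answered at all, which the embodiment
    treats as a heartbeat network fault of the main node itself (no
    response information can be fed back). Otherwise returns the fault
    node table: the IPs of standbys that sent no response message.
    """
    peers = list(arp_table_ips)
    responded = set(responded_ips)
    if not any(ip in responded for ip in peers):
        return None
    return [ip for ip in peers if ip not in responded]


# Placeholder heartbeat IPs of standby nodes B, C and D.
peers = ["10.0.0.2", "10.0.0.3", "10.0.0.4"]
fault_table = screen_failed_standbys(peers, ["10.0.0.2", "10.0.0.4"])
isolated = screen_failed_standbys(peers, [])
```

An empty list means every standby answered; `None` is used here to keep that case distinct from the case in which the main node itself is cut off.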

If the split brain prevention module in the shared disk receives no heartbeat detection instruction response information from the main server node within the preset time, it judges that the main server node has failed because it could not respond to the detection instruction, and generates fault detection information containing the IP of the main server node.

If the split brain prevention module in the shared disk receives heartbeat detection instruction response information from the main server node within the preset time, it judges that the main server node is in a normal operation state and takes the fault node table fed back by the current main server node as the fault detection information.
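The two cases above (no response within the preset time, or a fault node table received in time) can be illustrated with a small sketch. It assumes, purely for illustration, that the reply of the main server node arrives on an in-process `queue.Queue`; the function name `collect_fault_detection_info` and the timeout value are hypothetical, and the actual transport between disk module and server node is not specified by the source.

```python
import queue
from typing import List


def collect_fault_detection_info(
    primary_ip: str,
    reply_queue: "queue.Queue[List[str]]",
    timeout_s: float,
) -> List[str]:
    """Wait up to timeout_s for the main node's reply.

    No reply in time: the main node is judged faulty and the fault
    detection information contains its IP (S21).
    Reply in time: the fault node table fed back by the main node is
    used as the fault detection information (S22).
    """
    try:
        fault_node_table = reply_queue.get(timeout=timeout_s)
    except queue.Empty:
        return [primary_ip]
    return list(fault_node_table)


replies: "queue.Queue[List[str]]" = queue.Queue()

# Case 1: nothing arrives within the preset time.
info_timeout = collect_fault_detection_info("10.0.0.1", replies, timeout_s=0.05)

# Case 2: the main node reports that standby 10.0.0.3 did not answer.
replies.put(["10.0.0.3"])
info_reply = collect_fault_detection_info("10.0.0.1", replies, timeout_s=0.05)
```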

The fault detection information is then obtained, and the IP addresses of the failed server nodes are matched against the node sorting table. If a matched IP is the IP of the current main server node, the current main server node has failed and main/standby switching is needed: the standby server node with the best ranking is switched to be the new main server node, which then serves as the only server node that accesses resources on the shared storage device.
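A minimal sketch of this matching decision follows. It assumes that the current main server node is the entry with the smallest sequence number in the node sorting table, which is consistent with the ordering rules of this embodiment but is not stated as a lookup rule in the source; the function names and IP addresses are hypothetical.

```python
from typing import Dict, Iterable


def current_primary(sorting_table: Dict[str, int]) -> str:
    """Return the IP with the smallest sequence number (assumed to be the main node)."""
    return min(sorting_table, key=sorting_table.__getitem__)


def needs_switchover(sorting_table: Dict[str, int], fault_ips: Iterable[str]) -> bool:
    """True when the fault detection information names the current main node."""
    return current_primary(sorting_table) in set(fault_ips)


# Placeholder addresses for nodes A (main), B and C.
sorting_table = {"10.0.0.1": 1, "10.0.0.2": 2, "10.0.0.3": 3}

primary_failed = needs_switchover(sorting_table, ["10.0.0.1"])
standby_failed = needs_switchover(sorting_table, ["10.0.0.3"])
```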

Optionally, the method for preventing split brain according to this embodiment further includes the step of updating the node sorting table:

S4, matching the IP addresses of the failed server nodes in the fault detection information against the node sorting table;

S5, if the matching result is the IP of the main server node, the failed node is the current main server node; moving the failed current main server node to the last position of the shared storage cluster, and updating its sequence number in the node sorting table;

S6, if the matching result is the IP of a standby server node, the current main server node operates normally and the failed node is a standby server node; moving the failed standby server nodes in sequence to the last positions of the shared storage cluster, and updating their sequence numbers in the node sorting table.

In this embodiment, if the match in S4 yields the IP of the current main server node, the current main server node is faulty, and its sequence number is adjusted in the node sorting table as follows:

If the current main server node fails, it is moved to position m+1, where m is the last sequence number in the current node sorting table, and the ordering of the other nodes is unchanged. The updated sequence numbers of the server nodes are 2, 3, 4, …, m, m+1.

In this embodiment, taking the 5 server nodes initially sorted in S1 as an example, the last-ranked server node E has sequence number 5. If the current main server node A fails, its sequence number is updated to 6 in the node sorting table, and the resulting ordering of the server nodes is: node B has sequence number 2, node C has sequence number 3, node D has sequence number 4, node E has sequence number 5, and node A has sequence number 6.

After the node sorting table is updated, main/standby switching information is sent to the split brain prevention module of the shared disk, and the main/standby server node switching process starts. Before the switch, the node sorting table is queried and the server node with the best ranking is selected. In the current node sorting table node B ranks best, so node B is selected as the new main server node. After the new main server node is selected, the switch begins: all disk resources occupied by the original main server node A are released, and the selected new main server node B is allowed to access the disk resources as the only server node, which completes the main/standby server node switch.
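The reordering and selection for a failed main server node might look like the following sketch. The function name `demote_failed_primary` is hypothetical, node letters stand in for the IP addresses used as keys in the actual node sorting table, and releasing the disk resources is outside the scope of the sketch.

```python
from typing import Dict, Tuple


def demote_failed_primary(
    sorting_table: Dict[str, int], primary: str
) -> Tuple[Dict[str, int], str]:
    """Move the failed main node to position m+1 and pick the new main node.

    m is the largest sequence number currently in the table; all other
    nodes keep their sequence numbers. The new main node is the entry
    with the smallest remaining sequence number (best ranking).
    """
    m = max(sorting_table.values())
    updated = dict(sorting_table)        # leave the caller's table untouched
    updated[primary] = m + 1
    new_primary = min(updated, key=updated.__getitem__)
    return updated, new_primary


# The five-node example of this embodiment, keyed by node letter for brevity.
initial = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}
updated, new_primary = demote_failed_primary(initial, "A")
```

With the five-node example, node A receives sequence number 6 and node B, with sequence number 2, becomes the new main server node.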

In this embodiment, if the match in S4 yields the IP of a standby server node, the main server node operates normally and a heartbeat failure has occurred in the standby server node. The sequence number of that standby server node is updated in the node sorting table. The adjustment method is as follows:

The failed standby server nodes are moved to the last positions of the whole shared storage cluster. That is, if the shared storage cluster has n nodes whose sequence numbers are 1, 2, 3, 4, …, m, where m ≥ n, and the number of failed standby nodes is d, the original sequence numbers of the failed standby server nodes are reset and the node sorting table is updated. The updated sequence numbers of the d failed standby server nodes in the node sorting table are, in order, m+1, m+2, m+3, …, m+d; the ordering of the other server nodes is unchanged.

In this embodiment, taking 5 server nodes initially sorted in S1 as an example, if it is detected that server node B and node C are failed server nodes, according to the sorting method of this embodiment, it is necessary to reset the sorting sequence numbers of server node B and node C in the node sorting table, where the sequence number of the reset node B is 6 and the sequence number of node C is 7.

The ordering of the 5 server nodes in the updated node sorting table is as follows: node A has sequence number 1, node B has sequence number 6, node C has sequence number 7, node D has sequence number 4, and node E has sequence number 5. The current main server node still ranks first.
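The renumbering of d failed standby server nodes to m+1, …, m+d can be sketched as below. The function name `demote_failed_standbys` is hypothetical, node letters again stand in for IP addresses, and keeping the failed nodes in their previous relative order is an assumption that matches the example (B becomes 6, C becomes 7) but is not spelled out in the source.

```python
from typing import Dict, Iterable


def demote_failed_standbys(
    sorting_table: Dict[str, int], failed: Iterable[str]
) -> Dict[str, int]:
    """Assign the d failed standbys the sequence numbers m+1 .. m+d.

    m is the largest sequence number currently in the table. Failed
    nodes are renumbered in their previous relative order (assumption);
    every other node keeps its sequence number.
    """
    m = max(sorting_table.values())
    updated = dict(sorting_table)
    ordered = sorted(set(failed), key=sorting_table.__getitem__)
    for k, node in enumerate(ordered, start=1):
        updated[node] = m + k
    return updated


initial = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}
updated = demote_failed_standbys(initial, ["C", "B"])
```

Because the sequence numbers of healthy nodes are never compacted, the largest sequence number m can exceed the node count n after earlier failures, which is why the text requires only m ≥ n.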

Because the current main server node has not failed, no main/standby switching information needs to be sent, and the whole shared storage cluster keeps the current main server node running normally without interruption.

According to the embodiment of the invention, the server node with the heartbeat fault can be determined before the shared storage cluster has the split brain, and effective measures are taken for the fault server node, so that the main server node operating in the shared storage cluster is ensured to be a normal node, the split brain phenomenon of the shared storage cluster is avoided, and the operation reliability and the availability of the shared storage cluster are improved.

Example 2:

the embodiment provides a system for preventing split brain of a shared storage cluster, which comprises: a split brain prevention module disposed in a shared storage device, the split brain prevention module comprising:

the initialization unit is used for initializing and sequencing the server nodes in the shared storage cluster;

the split brain detection unit is used for triggering a split brain detection mechanism when detecting that the shared storage device is about to have a split brain, sending a heartbeat detection instruction to the current main server node, and judging whether the current main server node has a fault;

and the switching control unit is used for switching the main server and the standby server if the current main server node fails, switching the standby server node with the optimal sequence into a new main server node, and enabling the new main server node to be used as a unique server node to perform resource interactive access with the shared storage equipment.

Further, the system also includes a heartbeat detection module disposed in each server node, the heartbeat detection module comprising:

the ARP table is used for storing the IP addresses of all the server nodes except the server node in the shared storage cluster;

the heartbeat detection unit is used for sending ARP heartbeat request messages to all standby server nodes in the shared storage cluster by inquiring the ARP table after receiving the heartbeat detection instruction, and receiving response messages;

and the fault node detection unit is used for screening out the standby server nodes which do not send the response messages, namely the standby server nodes which are judged to be in fault, by detecting the response messages, counting the standby server nodes in fault, generating a fault node table and feeding the fault node table back to the split brain detection unit.

In this embodiment, the heartbeat detection unit looks up the node IP addresses in the ARP table, sends an ARP heartbeat request message to the standby server nodes of the shared storage cluster, and receives the response messages. If the current main server node receives no response message from any standby server node in the cluster within a certain time, the heartbeat network of the current main server node has failed and it cannot receive the heartbeat information of the standby nodes; the fault node detection unit then judges that the current main server node has failed and feeds this information back to the split brain detection unit of the split brain prevention module in the disk. If response messages can be received, the main server node is in a normal operation state; the fault node detection process checks the received response messages against the ARP table, screens out the standby server nodes that sent no response message, namely the failed nodes, and compiles them into a fault node table, which is fed back to the split brain detection unit of the split brain prevention module in the disk.

Optionally, in this embodiment, the initialization unit is configured to perform initialization sequencing on each node of the cluster, and first sequence the designated master server node, arrange the sequence number of the master server node at the first position of the shared storage cluster, sequence other backup server nodes in sequence, and generate a node sequencing table, where the IP address of each server node corresponds to the sequencing sequence number of the node one by one.

Optionally, in this embodiment, the split brain detection unit is further configured to generate fault detection information, specifically:

after a heartbeat detection instruction is sent, if heartbeat detection instruction response information fed back by a main server node is not received within preset time, the main server node is judged to be in fault, and fault detection information containing a main server node IP is generated at the same time;

if heartbeat detection instruction response information fed back by the main server node is received within preset time, the current main server node is judged to be normal, meanwhile, fault detection information is generated by a fault node table which is composed of fault standby server nodes counted by the main server node, and the fault node table comprises the IP of each fault server node.

The split brain detection unit of the embodiment is used for detecting whether a shared disk has a split brain, and when the disk receives a plurality of resource access requests, a split brain detection mechanism is triggered to judge that the disk is about to have the split brain. In order to prevent the split brain, a split brain detection process is started, a heartbeat detection instruction is sent to the current main server node, a heartbeat detection module in the current main server node is activated, fault detection information fed back by the current main server node is received, and if the feedback information of the current main server node is not received within preset time, the current main server node is judged to be in fault.

Please refer to example 1 for the specific implementation of the split brain detection unit.

Optionally, the split brain prevention module further includes a node information control unit, configured to:

matching the IP address of the fault server node in the fault detection information in the node sorting table;

if the matching result is the IP of the main server node, the fault node is the current main server node, the current fault main server node is sequenced to the last of the shared storage cluster, and the sequence number of the current fault main server node in the node sequencing table is updated;

and if the matching result is the IP of the standby server node, indicating that the current main server operates normally, and the fault node is the standby server node, sequentially arranging the fault standby server node at the last of the shared storage cluster, and updating the sequence number of the fault standby server node in the node sequencing list.

And the node information control unit is used for storing the node sequencing list in the shared storage cluster and updating the sequencing serial number of the fault node in time. And judging whether the sequencing of the cluster main and standby server nodes needs to be changed or not by reading the fault detection information of the split brain detection unit.

If the received fault detection information shows that the main server node operates normally and a standby server node has failed, the sequence number of the failed standby server node is looked up in the stored node sorting table by its IP address and reset, that is, the failed standby server node is moved to the last position of the cluster. If a plurality of standby server nodes have failed, they are moved in sequence to the last few positions, the sequence numbers of the other nodes are kept unchanged, and the node sorting table is updated.

If the received fault detection information indicates that the main server node has failed, the sequence number of the current main server node is looked up in the previously stored node sorting table by its IP address and reset, so that the failed current main server node is moved to the last position of the cluster; the sequence numbers of the other nodes are kept unchanged, the node sorting table is updated, and the main/standby server switching process of the switching control unit is then started.

For the details of the sorting, please refer to example 1.

The switching control unit of this embodiment is configured to control the shared disk to receive access from a currently and normally operating main server node, and to remove access from a failed server node.

The fault detection information of the split brain detection unit is transmitted to the node information control unit and indicates whether the main server node has failed. If the information reports a main server node fault, the main/standby server switching process of the switching control unit is started and the main role is switched to a standby server node in time. If the information reports no abnormality in the main server node, the switching control unit lets the main server node continue to access the disk and does not start the main/standby server switching process.

Through the main/standby switching mechanism of the switching control unit, only a normally operating main server node can access the disk resources, which effectively avoids a split brain on the disk.
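The access rule enforced by the switching control unit can be pictured with the following sketch. The class `DiskAccessGate` and its methods are hypothetical names introduced for illustration; how the shared disk actually grants or revokes access is not specified in the source.

```python
class DiskAccessGate:
    """Hypothetical model of the switching control unit's access rule:
    exactly one node, the current main server node, may access the disk."""

    def __init__(self, primary_ip: str) -> None:
        self.owner = primary_ip

    def switch_to(self, new_primary_ip: str) -> None:
        """Main/standby switch: the old owner loses access, the new one gains it."""
        self.owner = new_primary_ip

    def is_allowed(self, node_ip: str) -> bool:
        """Only the current main server node is allowed to access disk resources."""
        return node_ip == self.owner


gate = DiskAccessGate("10.0.0.1")       # placeholder IP of node A
before = (gate.is_allowed("10.0.0.1"), gate.is_allowed("10.0.0.2"))
gate.switch_to("10.0.0.2")              # node B becomes the new main node
after = (gate.is_allowed("10.0.0.1"), gate.is_allowed("10.0.0.2"))
```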

Please refer to embodiment 1 for the specific implementation of the switching control unit.

The specific implementations of the split brain prevention module and the heartbeat detection module of this embodiment are the same as those of embodiment 1, and are not described herein again.

In this embodiment, a heartbeat detection module is deployed in each cluster server node, and a split brain prevention module is deployed in the shared disk. The heartbeat detection module detects the nodes with a heartbeat failure in the cluster, the split brain prevention module then judges whether the main server node has failed, and if it has, the main role is switched to a standby server node in time. With this split brain prevention system, the node where the heartbeat fault occurs can be determined before the cluster has a split brain and effective measures can be taken against the failed node, which ensures that the main server node operating in the cluster is a normal node, avoids cluster split brain, and improves the reliability and availability of cluster operation.

Example 3:

the present embodiment provides a computer storage medium, on which a computer program is stored, and the computer program is used for implementing the method of embodiment 1 when executed by a processor.

In light of the foregoing description of the preferred embodiments according to the present application, it is to be understood that various changes and modifications may be made by those skilled in the art without departing from the scope of the invention as defined by the appended claims. The technical scope of the present application is not limited to the contents of the specification, and must be determined according to the scope of the claims.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Claims

1. A method for preventing split brain of a shared storage cluster, comprising:

initializing and sequencing server nodes in the shared storage cluster;

when a plurality of resource access requests are detected, determining that split brain is about to occur at the shared storage device, triggering a split brain detection mechanism, sending a heartbeat detection instruction to the current main server node, and determining whether the current main server node has failed;

if the current main server node has failed, performing a main/standby switchover in which the highest-ranked standby server node becomes the new main server node, so that the new main server node is the only server node performing resource access with the shared storage device;

wherein the step of initializing and sequencing the server nodes in the shared storage cluster comprises:

sequencing the designated main server node by placing its sequence number first in the shared storage cluster;

sequencing all standby server nodes in order;

associating the IP address of each server node one-to-one with its sequence number, and generating a node sorting table;

further comprising the step of generating fault detection information:

after the heartbeat detection instruction is sent, if no response to the heartbeat detection instruction is received from the main server node within a preset time, determining that the current main server node has failed, and generating fault detection information containing the IP of the main server node;

if a response to the heartbeat detection instruction is received from the main server node within the preset time, determining that the current main server node is normal, and generating the fault detection information from a fault node table compiled by the main server node from the failed standby server nodes, wherein the fault node table comprises the IP of each failed server node;

the method further comprising the step of updating the node sorting table:

acquiring the fault detection information, and matching the IP address of the failed server node in the fault detection information against the node sorting table;

and if the match is the IP of a standby server node, which indicates that the current main server is operating normally and the failed node is a standby server node, moving the failed standby server node to the last position of the shared storage cluster, and updating its sequence number in the node sorting table.
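By way of illustration only, and not as part of the claims, the node sorting table and its update in claim 1 might be sketched as follows. The class name `NodeSortingTable`, the function `fault_detection_info`, and the example IP addresses are hypothetical. The claim only recites moving a failed standby server node to the last position; demoting a failed main server node in the same way is an assumption made here so that the highest-ranked standby ends up in first position.

```python
from typing import Dict, List, Optional


class NodeSortingTable:
    """Ordered list of server node IPs; position 1 is the main server node."""

    def __init__(self, main_ip: str, standby_ips: List[str]) -> None:
        # Initialization sequencing: designated main first, standbys after it.
        self.order: List[str] = [main_ip] + list(standby_ips)

    @property
    def main_ip(self) -> str:
        return self.order[0]

    def sequence_numbers(self) -> Dict[str, int]:
        # One-to-one mapping of IP address to sequence number.
        return {ip: n for n, ip in enumerate(self.order, start=1)}

    def apply(self, faulty_ips: List[str]) -> None:
        # Move every matched faulty node to the end of the ordering.
        # For a faulty standby this is the update recited in claim 1.
        # ASSUMPTION: a faulty main is demoted the same way, which leaves
        # the highest-ranked standby in position 1 as the new main.
        for ip in faulty_ips:
            if ip in self.order:
                self.order.remove(ip)
                self.order.append(ip)


def fault_detection_info(main_ip: str, main_reply: Optional[List[str]]) -> List[str]:
    """Build the list of faulty IPs.

    main_reply is None when no heartbeat response arrived within the preset
    time; otherwise it is the fault node table reported by the main node.
    """
    if main_reply is None:
        return [main_ip]
    return list(main_reply)


table = NodeSortingTable("10.0.0.1", ["10.0.0.2", "10.0.0.3"])
table.apply(fault_detection_info(table.main_ip, ["10.0.0.2"]))  # standby fault
table.apply(fault_detection_info(table.main_ip, None))          # main fault
print(table.main_ip)
```

In this sketch the standby at 10.0.0.2 is first moved behind 10.0.0.3, so when the main server node later stops responding, 10.0.0.3 is the node promoted to main.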

2. A system for preventing split brain in a shared storage cluster, comprising: a split brain prevention module disposed in a shared storage device, the split brain prevention module comprising:

a switching control unit for performing, if the current main server node has failed, a main/standby switchover in which the highest-ranked standby server node becomes the new main server node, so that the new main server node is the only server node performing resource access with the shared storage device;

the initialization unit is further configured to:

associate the IP address of each server node one-to-one with its sequence number, and generate a node sorting table;

the split brain detection unit is further configured to:

the split brain prevention module further comprises a node information control unit for:

3. The system of claim 2, further comprising a heartbeat detection module disposed in each server node, the heartbeat detection module comprising:

and a fault node detection unit for screening out the standby server nodes that did not send a response message, which are determined to have failed, compiling the failed standby server nodes into a fault node table, and feeding the fault node table back to the split brain detection unit.
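As a non-limiting illustration of the screening performed by the fault node detection unit, the fault node table can be thought of as the standby server nodes from which no response message was collected. The function name `build_fault_node_table` and its parameters are hypothetical, and how responses are gathered (transport, timeout) is outside this sketch.

```python
from typing import Iterable, List, Set


def build_fault_node_table(standby_ips: Iterable[str], responded: Set[str]) -> List[str]:
    """Return the IPs of standby nodes that sent no response message.

    The order of standby_ips is preserved, so the table can be fed back
    unchanged to whichever component consumes it.
    """
    return [ip for ip in standby_ips if ip not in responded]


# Hypothetical example: three standbys, only two of which answered.
fault_table = build_fault_node_table(
    ["10.0.0.2", "10.0.0.3", "10.0.0.4"],
    {"10.0.0.2", "10.0.0.4"},
)
print(fault_table)
```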

4. A computer storage medium on which a computer program is stored, characterized in that the computer program, when being executed by a processor, is adapted to carry out the method of claim 1.