CN111651291B - Method, system and computer storage medium for preventing split brain of shared storage cluster - Google Patents

Info

Publication number
CN111651291B
CN111651291B CN202010326284.9A CN202010326284A CN111651291B CN 111651291 B CN111651291 B CN 111651291B CN 202010326284 A CN202010326284 A CN 202010326284A CN 111651291 B CN111651291 B CN 111651291B
Authority
CN
China
Prior art keywords
node
fault
server node
main server
shared storage
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010326284.9A
Other languages
Chinese (zh)
Other versions
CN111651291A (en)
Inventor
宫灿锋
吴坡
张江南
贺勇
任鹏凌
阮冲
王丹
李斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Corp of China SGCC
State Grid Henan Electric Power Co Ltd
Electric Power Research Institute of State Grid Henan Electric Power Co Ltd
Original Assignee
State Grid Corp of China SGCC
State Grid Henan Electric Power Co Ltd
Electric Power Research Institute of State Grid Henan Electric Power Co Ltd
Application filed by State Grid Corp of China SGCC, State Grid Henan Electric Power Co Ltd, and Electric Power Research Institute of State Grid Henan Electric Power Co Ltd
Priority to CN202010326284.9A
Publication of CN111651291A
Application granted
Publication of CN111651291B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0709 Error or fault processing not based on redundancy, the processing taking place in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
    • G06F11/0751 Error or fault detection not based on redundancy
    • G06F11/0793 Remedial or corrective actions

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Hardware Redundancy (AREA)

Abstract

The invention relates to a method, a system and a computer storage medium for preventing split brain of a shared storage cluster.

Description

Method, system and computer storage medium for preventing split brain of a shared storage cluster
Technical Field
The present application relates to the field of shared storage cluster technologies, and in particular, to a method, a system, and a computer storage medium for preventing split brain in a shared storage cluster.
Background
A shared storage cluster is a server cluster in which multiple servers are connected to the same shared storage device at the same time, and user service data is stored on that device. The main server provides services externally and reads and writes data on the shared storage device. Once the main server fails (for example, operating system shutdown, unexpected power loss of the server, or network failure), the system automatically switches the service applications to a standby server, which takes over access to the shared storage device and continues to provide the service externally, so that the service applications run without interruption.
The servers are connected to one another through heartbeat lines to form the server cluster as a whole. If the heartbeat between servers fails, that is, the servers cannot detect each other's heartbeat within the specified time, a cluster that originally acted as one coordinated whole splits into several independent parts. Each part starts its failover function to acquire ownership of resources and services: having lost contact, each server assumes the other has failed and contends for the shared storage and the application services. The consequences are serious: either the shared resources are carved up and the services cannot start on any node, or the services start on every node and read and write the shared storage at the same time, resulting in data corruption.
Disclosure of Invention
The technical problem to be solved by the invention is that, in the prior art, data is corrupted when a shared storage cluster suffers split brain.
In order to solve this technical problem, the invention provides a method, a system and a computer storage medium for preventing split brain of a shared storage cluster.
The server nodes in the shared storage cluster are initialized and ordered within the cluster. After a heartbeat network fault occurs, a pre-judgment detection is carried out to find the failed server nodes and take measures in advance, so that split brain is avoided, the cluster system data remains consistent and the cluster operates without interruption, which further improves the availability and reliability of the cluster.
The technical scheme adopted by the invention for solving the technical problem is as follows:
A first aspect of the invention provides a method for preventing split brain of a shared storage cluster, comprising the following steps:
initializing and ordering the server nodes in the shared storage cluster;
when a plurality of resource access requests are detected, judging that the shared storage device is about to suffer split brain, triggering a split brain detection mechanism, sending a heartbeat detection instruction to the current main server node, and judging whether the current main server node has failed;
if the current main server node has failed, performing a main/standby switch in which the best-ranked standby server node becomes the new main server node, so that the new main server node is the only server node performing resource access with the shared storage device.
The second aspect of the present invention provides a system for preventing split brain of a shared storage cluster, comprising: a split brain prevention module disposed in a shared storage device, the split brain prevention module comprising:
the initialization unit is used for carrying out initialization sequencing on the server nodes in the shared storage cluster;
the split brain detection unit is used for triggering a split brain detection mechanism when detecting that the shared storage device is about to have a split brain, sending a heartbeat detection instruction to the current main server node, and judging whether the current main server node has a fault;
and the control unit is used for switching the main server and the standby server if the current main server node fails, switching the standby server node with the optimal sequence into a new main server node, and enabling the new main server node to be used as a unique server node to perform resource interactive access with the shared storage equipment.
A third aspect of the invention provides a computer storage medium having stored thereon a computer program for implementing the method of the first aspect of the invention when executed by a processor.
The beneficial effects of the invention are as follows. A heartbeat detection module is arranged in each node of the shared storage cluster, and a split brain prevention module is arranged in the shared storage device. The heartbeat detection module detects the server nodes with heartbeat faults in the shared storage cluster, and the split brain prevention module then judges whether the current main server node has failed; if it has, a standby server node is promptly switched in as the new main server node. The invention can identify the server nodes with heartbeat faults before the shared storage cluster suffers split brain and take effective measures against the failed server nodes, ensuring that the main server node running in the shared storage cluster is a normal node, avoiding split brain of the shared storage cluster, and improving its operational reliability and availability.
Drawings
The technical solution of the present application is further explained below with reference to the drawings and examples.
FIG. 1 is a flow chart of the operation of the split brain prevention module of an embodiment of the present application;
FIG. 2 is a flow chart of the operation of the heartbeat detection module of an embodiment of the present application;
FIG. 3 is a schematic structural diagram of the split brain prevention system of an embodiment of the present application.
Detailed Description
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
The technical solutions of the present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Example 1
This embodiment provides a method for preventing split brain of a shared storage cluster, as shown in FIG. 1, including:
S1, initializing and ordering the server nodes in the shared storage cluster;
S2, when a plurality of resource access requests are detected, judging that the shared storage device is about to suffer split brain, triggering a split brain detection mechanism, sending a heartbeat detection instruction to the current main server node, and judging whether the current main server node has failed;
S3, if the current main server node has failed, performing a main/standby switch in which the best-ranked standby server node becomes the new main server node, so that the new main server node is the only server node performing resource access with the shared storage device.
The shared storage device of this embodiment may be a shared disk, in which a certain storage space is set aside for the split brain prevention module.
When the heartbeat network in the shared storage cluster fails, the server nodes in the cluster cannot detect each other's heartbeat within the specified time, so each starts its failover function to acquire ownership of resources and services; that is, the nodes contend to access the shared disk and obtain read-write permission on it.
When the split brain prevention module in the shared disk detects a plurality of disk access requests, it judges that the shared disk is about to suffer split brain and triggers the split brain detection mechanism.
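As a non-limiting illustration, the trigger condition described above might be sketched as follows. The sketch assumes that "a plurality of disk access requests" means pending requests from more than one distinct node; the function name, the request format and the 10.0.0.x addresses are hypothetical and do not appear in the original disclosure.

```python
def split_brain_imminent(pending_requests):
    """Return True when more than one distinct node is asking for disk access.

    pending_requests: iterable of (node_ip, operation) tuples (assumed format).
    """
    requesters = {node_ip for node_ip, _operation in pending_requests}
    return len(requesters) > 1


# Heartbeat network is healthy: only the main server node accesses the disk.
normal = [("10.0.0.1", "write"), ("10.0.0.1", "read")]
# Heartbeat network has failed: a standby node also tries to take over the disk.
contended = [("10.0.0.1", "write"), ("10.0.0.2", "acquire")]

print(split_brain_imminent(normal))     # False
print(split_brain_imminent(contended))  # True
```

Counting distinct requesters rather than raw requests keeps ordinary traffic from a single main server node from triggering the detection mechanism.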
Optionally, in S1 of this embodiment, the server nodes in the shared storage cluster are initialized and ordered as follows:
S11: first, the designated main server node is ordered, and its sequence number is placed first in the shared storage cluster;
S12: next, the standby server nodes are ordered in turn;
S13: the IP address of each server node is mapped one-to-one to its sequence number to generate a node sorting table.
In this embodiment, 5 server nodes in a shared storage cluster are taken as an example, that is: the main server node is node A, and the standby server nodes are node B, node C, node D and node E.
The result of the initial ordering is: main server node A has sequence number 1, and standby server nodes B, C, D and E have sequence numbers 2, 3, 4 and 5, respectively.
In this embodiment, the IP address of each server node corresponds to the sorted serial number one by one, and a node sorting table is generated and stored as a basis for switching between the active and standby server nodes.
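As a non-limiting illustration, steps S11 to S13 might be sketched in Python as follows. The function name, the dictionary representation of the node sorting table and the 10.0.0.x heartbeat addresses are assumptions made for the example.

```python
def build_node_sorting_table(main_ip, standby_ips):
    """Map each node IP to its sequence number: main node first, standbys in order."""
    table = {main_ip: 1}                                # S11: main node gets number 1
    for seq, ip in enumerate(standby_ips, start=2):     # S12: standbys follow in turn
        if ip in table:
            raise ValueError(f"duplicate node IP: {ip}")
        table[ip] = seq                                 # S13: one IP, one sequence number
    return table


# Hypothetical heartbeat IPs for nodes A to E.
IPS = {"A": "10.0.0.1", "B": "10.0.0.2", "C": "10.0.0.3",
       "D": "10.0.0.4", "E": "10.0.0.5"}

table = build_node_sorting_table(IPS["A"], [IPS[n] for n in "BCDE"])
for name, ip in IPS.items():
    print(name, ip, table[ip])
```

Rejecting duplicate addresses preserves the one-to-one correspondence between IP address and sequence number that the later matching steps rely on.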
Optionally, in S2 of this embodiment, after the heartbeat detection instruction is sent, the method further includes a step of generating fault detection information:
S21, after the heartbeat detection instruction is sent, if no response to the heartbeat detection instruction is received from the main server node within the preset time, judging that the main server node has failed and generating fault detection information containing the IP of the main server node;
S22, if a response to the heartbeat detection instruction is received from the main server node within the preset time, judging that the current main server node is normal and taking as the fault detection information the fault node table compiled by the main server node, which lists the failed standby server nodes and contains the IP of each.
To prevent split brain, when the split brain prevention module in the shared disk detects that split brain is about to occur, it sends a heartbeat detection instruction to the current main server node to judge whether the main server node has failed.
Each server node is provided with a heartbeat detection module that includes an Address Resolution Protocol (ARP) table. The ARP table in a server node contains the heartbeat IP addresses of all other server nodes in the shared storage cluster. ARP is a protocol that resolves an IP address into an Ethernet MAC address (or physical address).
After receiving the heartbeat detection instruction, the current main server node looks up the heartbeat IP addresses in its ARP table, sends an ARP heartbeat request message to every standby server node in the shared storage cluster, and waits for response messages.
If the current main server node receives no response message from any standby server node within the preset time, the heartbeat network of the current main server node has failed: it cannot receive heartbeat information from the standby server nodes and cannot feed back a response to the heartbeat detection instruction.
If the main server node does receive response messages from standby server nodes, the main server node is operating normally. It matches the received response messages against the IP addresses in the ARP table to screen out the standby server nodes that sent no response, which are the failed standby server nodes, compiles them into a fault node table, and feeds the table back to the split brain prevention module.
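As a non-limiting illustration, the screening performed on the main server node might be sketched as follows. The `probe` callable is a hypothetical stand-in for sending one ARP heartbeat request and waiting up to the preset time for a response; no real ARP traffic is generated, and all names and addresses are assumptions.

```python
def main_node_heartbeat_check(arp_table_ips, probe):
    """Return the fault node table, or None if no standby node answered at all.

    arp_table_ips: heartbeat IPs of every other node in the cluster.
    probe: callable(ip) -> bool, True if that node answered within the preset time.
    """
    responders = {ip for ip in arp_table_ips if probe(ip)}
    if not responders:
        # No reply from anyone: the main node's own heartbeat network is faulty,
        # so it has nothing to feed back to the split brain prevention module.
        return None
    # Nodes listed in the ARP table that did not answer are the failed standbys.
    return [ip for ip in arp_table_ips if ip not in responders]


arp_table = ["10.0.0.2", "10.0.0.3", "10.0.0.4", "10.0.0.5"]

alive = {"10.0.0.4", "10.0.0.5"}
print(main_node_heartbeat_check(arp_table, lambda ip: ip in alive))
# ['10.0.0.2', '10.0.0.3']

print(main_node_heartbeat_check(arp_table, lambda ip: False))
# None
```

Returning `None` rather than an empty list keeps "no response from anyone" distinct from "every standby node answered", which yields an empty fault node table.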
If the split brain prevention module in the shared disk receives no response to the heartbeat detection instruction from the main server node within the preset time, it judges that the main server node has failed and generates fault detection information containing the IP of the main server node.
If the split brain prevention module in the shared disk receives a response to the heartbeat detection instruction from the main server node within the preset time, it judges that the main server node is operating normally and takes the fault node table fed back by the current main server node as the fault detection information.
The IP addresses of the failed server nodes in the fault detection information are then matched against the node sorting table. If a matched IP is the IP of the current main server node, the current main server node has failed and a main/standby switch is required: the best-ranked standby server node becomes the new main server node and is the only server node performing resource access with the shared storage device.
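As a non-limiting illustration, the decision made by the module in the shared disk (S21 and S22, followed by the match against the current main server node) might be sketched as follows. The sketch assumes a timeout is represented by `None`; the function names and addresses are hypothetical.

```python
def fault_detection_info(main_ip, main_reply):
    """Build the fault detection information as a list of failed node IPs.

    main_reply: None if the main node did not answer within the preset time (S21),
                otherwise the fault node table it fed back (S22).
    """
    if main_reply is None:
        return [main_ip]        # S21: the main node itself is judged to have failed
    return list(main_reply)     # S22: the main node is normal; use its fault node table


def needs_switch(fault_ips, main_ip):
    """A main/standby switch is required only when the main node is among the failed nodes."""
    return main_ip in fault_ips


MAIN = "10.0.0.1"

timed_out = fault_detection_info(MAIN, None)
print(timed_out, needs_switch(timed_out, MAIN))
# ['10.0.0.1'] True

standby_faults = fault_detection_info(MAIN, ["10.0.0.2", "10.0.0.3"])
print(standby_faults, needs_switch(standby_faults, MAIN))
# ['10.0.0.2', '10.0.0.3'] False
```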
Optionally, the method for preventing split brain of this embodiment further includes steps of updating the node sorting table:
S4, matching the IP addresses of the failed server nodes in the fault detection information against the node sorting table;
S5, if the matching result is the IP of the main server node, the failed node is the current main server node; the failed main server node is moved to the last position of the shared storage cluster and its sequence number in the node sorting table is updated;
S6, if the matching result is the IP of a standby server node, the current main server node is operating normally and the failed node is a standby server node; the failed standby server nodes are moved in turn to the last positions of the shared storage cluster and their sequence numbers in the node sorting table are updated.
In this embodiment, following S4, if the matching result is the IP of the current main server node, the current main server node has failed and its sequence number is adjusted in the node sorting table as follows:
if the largest sequence number currently in the node sorting table is m, the failed main server node is moved to position m+1 and the order of the other nodes is unchanged, so the updated sequence numbers of the server nodes are 2, 3, 4, …, m, m+1.
Taking the 5 server nodes initially ordered in S1 as an example, the last-ranked server node E has sequence number 5. If the current main server node A fails, its sequence number is updated to 6 in the node sorting table, and the resulting order is: node B has sequence number 2, node C has sequence number 3, node D has sequence number 4, node E has sequence number 5, and node A has sequence number 6.
After the node sorting table is updated, main/standby switching information is sent to the split brain prevention module of the shared disk and the main/standby server node switching process starts. Before the switch, the node sorting table is queried and the best-ranked server node is selected. In the current node sorting table node B is ranked best, so node B is selected as the new main server node. Once the new main server node is selected, the switch begins: all disk resources occupied by the original main server node A are released, and the selected new main server node B is allowed to access the disk resources as the only server node, which completes the main/standby server node switch.
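As a non-limiting illustration, the table update for a failed main server node and the selection of the new main server node might be sketched as follows. Node names stand in for IP addresses to keep the example short, "best-ranked" is taken to mean the smallest sequence number, and the function names are hypothetical.

```python
def demote_failed_main(table, main_node):
    """Move the failed main node to position m+1, where m is the largest sequence number."""
    updated = dict(table)                       # leave the caller's table untouched
    updated[main_node] = max(table.values()) + 1
    return updated


def best_ranked(table):
    """Return the node with the smallest sequence number."""
    return min(table, key=table.get)


table = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}

updated = demote_failed_main(table, "A")
print(sorted(updated.items(), key=lambda item: item[1]))
# [('B', 2), ('C', 3), ('D', 4), ('E', 5), ('A', 6)]

print(best_ranked(updated))
# B
```

Because the other sequence numbers are left unchanged, the numbering may contain gaps after an update; selection depends only on the relative order.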
In this embodiment, following S4, if the matching result is the IP of a standby server node, the main server node is operating normally and a heartbeat fault has occurred in the standby server node. The sequence numbers of the failed standby server nodes are updated in the node sorting table as follows:
the failed standby server nodes are moved to the last positions of the whole shared storage cluster. That is, if the shared storage cluster has n nodes with sequence numbers 1, 2, 3, 4, …, m, where m ≥ n, and d standby nodes have failed, the original sequence numbers of the failed standby server nodes are reset and the node sorting table is updated. The updated sequence numbers of the d failed standby server nodes in the node sorting table are, in order, m+1, m+2, m+3, …, m+d; the order of the other server nodes is unchanged.
Taking the 5 server nodes initially ordered in S1 as an example, if server nodes B and C are detected as failed, their sequence numbers in the node sorting table are reset according to the method of this embodiment: node B is given sequence number 6 and node C sequence number 7.
The order of the 5 server nodes in the updated node sorting table is: node A has sequence number 1, node B has sequence number 6, node C has sequence number 7, node D has sequence number 4, and node E has sequence number 5. The current main server node is still ranked first.
Because the current main server node has not failed, no main/standby switching information needs to be sent, and the whole shared storage cluster keeps the current main server node running normally without interruption.
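As a non-limiting illustration, the renumbering of d failed standby server nodes to m+1, …, m+d might be sketched as follows. Node names stand in for IP addresses, and the sketch assumes that the failed nodes keep their previous relative order when moved to the end; the function name is hypothetical.

```python
def demote_failed_standbys(table, failed_nodes):
    """Renumber the failed standby nodes to m+1 .. m+d; other nodes keep their numbers."""
    updated = dict(table)
    m = max(table.values())                     # largest sequence number before the update
    # Assumption: failed nodes are appended in their existing relative order.
    in_order = sorted(failed_nodes, key=lambda node: table[node])
    for offset, node in enumerate(in_order, start=1):
        updated[node] = m + offset
    return updated


table = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}

updated = demote_failed_standbys(table, {"C", "B"})
print(updated)
# {'A': 1, 'B': 6, 'C': 7, 'D': 4, 'E': 5}

print(min(updated, key=updated.get))
# A
```

Node A still holds the smallest sequence number after the update, so no main/standby switch is triggered.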
According to the embodiment of the invention, the server node with the heartbeat fault can be determined before the shared storage cluster has the split brain, and effective measures are taken for the fault server node, so that the main server node operating in the shared storage cluster is ensured to be a normal node, the split brain phenomenon of the shared storage cluster is avoided, and the operation reliability and the availability of the shared storage cluster are improved.
Example 2:
This embodiment provides a system for preventing split brain of a shared storage cluster, which comprises a split brain prevention module disposed in a shared storage device, the split brain prevention module comprising:
an initialization unit, used for initializing and ordering the server nodes in the shared storage cluster;
a split brain detection unit, used for triggering a split brain detection mechanism when detecting that the shared storage device is about to suffer split brain, sending a heartbeat detection instruction to the current main server node, and judging whether the current main server node has failed;
and a switching control unit, used for performing a main/standby switch if the current main server node has failed, in which the best-ranked standby server node becomes the new main server node, so that the new main server node is the only server node performing resource access with the shared storage device.
Further, the system includes a heartbeat detection module arranged in each server node, the heartbeat detection module comprising:
an ARP table, used for storing the IP addresses of all other server nodes in the shared storage cluster;
a heartbeat detection unit, used for querying the ARP table after receiving the heartbeat detection instruction, sending ARP heartbeat request messages to all standby server nodes in the shared storage cluster, and receiving response messages;
and a fault node detection unit, used for examining the response messages to screen out the standby server nodes that sent no response message, which are judged to be failed standby server nodes, compiling the failed standby server nodes into a fault node table, and feeding the table back to the split brain detection unit.
In this embodiment, the heartbeat detection unit looks up the node IP addresses in the ARP table, sends ARP heartbeat request messages to the standby server nodes of the shared storage cluster, and receives response messages. If the current main server node receives no response message from any standby server node in the cluster within a certain time, the heartbeat network of the current main server node has failed and the heartbeat information of the standby nodes cannot be received; the fault node detection unit judges that the current main server node has failed and feeds this information back to the split brain detection unit of the disk's split brain prevention module. If response messages are received, the main server node is operating normally; the fault node detection unit matches the received response messages against the ARP table, screens out the standby server nodes that sent no response message, which are the failed nodes, and compiles them into a fault node table that is fed back to the split brain detection unit of the disk's split brain prevention module.
Optionally, in this embodiment, the initialization unit is configured to perform initialization sequencing on each node of the cluster, and first sequence the designated master server node, arrange the sequence number of the master server node at the first position of the shared storage cluster, sequence other backup server nodes in sequence, and generate a node sequencing table, where the IP address of each server node corresponds to the sequencing sequence number of the node one by one.
Optionally, in this embodiment, the split brain detection unit is further configured to generate fault detection information, specifically:
after a heartbeat detection instruction is sent, if heartbeat detection instruction response information fed back by a main server node is not received within preset time, the main server node is judged to be in fault, and fault detection information containing a main server node IP is generated at the same time;
if heartbeat detection instruction response information fed back by the main server node is received within preset time, the current main server node is judged to be normal, meanwhile, fault detection information is generated by a fault node table which is composed of fault standby server nodes counted by the main server node, and the fault node table comprises the IP of each fault server node.
The split brain detection unit of this embodiment detects whether the shared disk is about to suffer split brain: when the disk receives a plurality of resource access requests, it judges that split brain is imminent and triggers the split brain detection mechanism. To prevent split brain, a split brain detection process is started: a heartbeat detection instruction is sent to the current main server node, the heartbeat detection module in the current main server node is activated, and the fault detection information fed back by the current main server node is received. If no feedback is received from the current main server node within the preset time, the current main server node is judged to have failed.
Please refer to example 1 for the specific implementation of the split brain detection unit.
Optionally, the split brain prevention module further includes a node information control unit, configured to:
match the IP addresses of the failed server nodes in the fault detection information against the node sorting table;
if the matching result is the IP of the main server node, the failed node is the current main server node: move the failed main server node to the last position of the shared storage cluster and update its sequence number in the node sorting table;
if the matching result is the IP of a standby server node, the current main server node is operating normally and the failed node is a standby server node: move the failed standby server nodes in turn to the last positions of the shared storage cluster and update their sequence numbers in the node sorting table.
The node information control unit stores the node sorting table of the shared storage cluster and promptly updates the sequence numbers of failed nodes. It reads the fault detection information from the split brain detection unit to judge whether the order of the main and standby server nodes in the cluster needs to change.
If the received fault detection information shows that the main server node is operating normally and a standby server node has failed, the unit looks up the sequence number of the failed standby server node by its IP address in the stored node sorting table and resets it, that is, the failed standby server node is moved to the last position of the cluster. If several standby server nodes have failed, they are moved in turn to the last positions; the sequence numbers of the other nodes are unchanged and the node sorting table is updated.
If the received fault detection information shows that the main server node has failed, the unit looks up the current main server node by its IP address in the stored node sorting table and resets its sequence number, moving the failed main server node to the last position of the cluster; the sequence numbers of the other nodes are unchanged, the node sorting table is updated, and the main/standby switching process of the switching control unit is then started.
For the details of the sorting, please refer to example 1.
The switching control unit of this embodiment controls the shared disk so that it accepts access from the main server node that is currently operating normally and revokes access from failed server nodes.
The fault detection information of the split brain detection unit is passed to the node information control unit and indicates whether the main server node has failed. If it reports a main server node fault, the main/standby switching process of the switching control unit is started and the main server role is promptly switched to a standby server node; if it reports that the main server node is normal, the switching control unit lets the main server node continue to access the disk and does not start the main/standby switching process.
Through the main/standby switching mechanism of the switching control unit, only a normally operating main server node can access the disk resources, which effectively prevents split brain on the disk.
Please refer to embodiment 1 for the specific implementation of the switching control unit.
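The switching decision can be illustrated with a short sketch. Everything named here (`SharedDisk`, `grant_exclusive`, `switch_if_needed`) is hypothetical and only models the behaviour described above: the disk accepts exactly one server node, and on a main server fault that role passes to the standby node with the smallest sequence number. The sketch assumes the node sequencing table has already been updated so that the failed main server node is ranked last.

```python
from typing import Dict, Optional


class SharedDisk:
    """Toy model of a shared disk that accepts access from one node only."""

    def __init__(self) -> None:
        self.allowed_ip: Optional[str] = None

    def grant_exclusive(self, ip: str) -> None:
        # Granting access to one node implicitly revokes it from all others.
        self.allowed_ip = ip

    def can_access(self, ip: str) -> bool:
        return ip == self.allowed_ip


def switch_if_needed(disk: SharedDisk, table: Dict[str, int],
                     master_ip: str, master_failed: bool) -> str:
    """Return the IP of the node that holds disk access after the decision."""
    if not master_failed:
        # No abnormality: the current main server node keeps its access.
        disk.grant_exclusive(master_ip)
        return master_ip
    candidates = {ip: seq for ip, seq in table.items() if ip != master_ip}
    if not candidates:
        raise RuntimeError("no standby server node available")
    new_master = min(candidates, key=candidates.get)  # best-ranked standby
    disk.grant_exclusive(new_master)
    return new_master


if __name__ == "__main__":
    disk = SharedDisk()
    table = {"10.0.0.1": 4, "10.0.0.2": 2, "10.0.0.3": 3}  # failed master ranked last
    print(switch_if_needed(disk, table, "10.0.0.1", master_failed=True))
```

Modelling access as a single allowed IP makes the split brain invariant explicit: at no point can two nodes pass `can_access` at the same time.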
The specific implementations of the split brain prevention module and the heartbeat detection module of this embodiment are the same as those of embodiment 1 and are not described herein again.
In this embodiment, a heartbeat detection module is deployed in each cluster server node and a split brain prevention module is deployed in the shared disk. The heartbeat detection module detects nodes with heartbeat faults in the cluster; the split brain prevention module then judges whether the main server node has failed and, if so, switches the main role to a standby server node in time. With this split brain prevention system, the node with the heartbeat fault can be identified before split brain occurs in the cluster and effective measures can be taken against it, which ensures that the main server node operating in the cluster is a normal node, avoids cluster split brain, and improves the reliability and availability of cluster operation.
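As a rough sketch of how the fault detection information might be produced, the code below follows the rule stated in the claims: no response from the main server node within a preset time means the main server node is faulty, and otherwise the response carries the table of failed standby nodes counted by the main server node. The names `FaultInfo`, `probe_master`, and the `send_heartbeat` callback are assumptions introduced for illustration; the real transport (for example the ARP heartbeat messages mentioned in the claims) is not modelled.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Hypothetical transport: returns the list of failed standby IPs reported by
# the main server node, or None if no response arrived within the timeout.
HeartbeatFn = Callable[[str, float], Optional[List[str]]]


@dataclass
class FaultInfo:
    master_failed: bool
    failed_ips: List[str] = field(default_factory=list)


def probe_master(master_ip: str, send_heartbeat: HeartbeatFn,
                 timeout_s: float = 2.0) -> FaultInfo:
    """Send one heartbeat detection instruction and build fault detection information."""
    reply = send_heartbeat(master_ip, timeout_s)
    if reply is None:
        # No response within the preset time: the main server node is faulty.
        return FaultInfo(master_failed=True, failed_ips=[master_ip])
    # Main server node is normal; pass on its table of failed standby nodes.
    return FaultInfo(master_failed=False, failed_ips=list(reply))


if __name__ == "__main__":
    def silent(ip: str, timeout: float) -> Optional[List[str]]:
        return None  # simulates a main server node that never answers

    print(probe_master("10.0.0.1", silent))
```

The timeout value of 2.0 seconds is arbitrary; the source only speaks of a "preset time".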
Example 3:
The present embodiment provides a computer storage medium on which a computer program is stored; when executed by a processor, the computer program implements the method of embodiment 1.
In light of the foregoing description of the preferred embodiments according to the present application, it is to be understood that various changes and modifications may be made by those skilled in the art without departing from the scope of the invention as defined by the appended claims. The technical scope of the present application is not limited to the contents of the specification, and must be determined according to the scope of the claims.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Claims (4)

1. A method for preventing split brain of a shared storage cluster, comprising:
initializing and sequencing server nodes in the shared storage cluster;
when a plurality of resource access requests are detected, judging that the shared storage equipment is about to have split brain, triggering a split brain detection mechanism, sending a heartbeat detection instruction to the current main server node, and judging whether the current main server node has a fault;
if the current main server node fails, switching the main server and the standby server, switching the standby server node with the optimal sequence into a new main server node, and enabling the new main server node to be used as a unique server node to perform resource interactive access with the shared storage equipment;
the step of performing initialization sequencing on the server nodes in the shared storage cluster comprises the following steps:
sequencing the appointed main server nodes, and sequencing the serial numbers of the appointed main server nodes at the first position of the shared storage cluster;
sequencing all standby server nodes in sequence;
the IP address of each server node corresponds to the sequenced serial number one by one, and a node sequencing table is generated;
further comprising the step of generating fault detection information:
after a heartbeat detection instruction is sent, if heartbeat detection instruction response information fed back by a main server node is not received within preset time, judging that the current main server node has a fault, and simultaneously generating fault detection information containing a main server node IP;
if heartbeat detection instruction response information fed back by the main server node is received within preset time, judging that the current main server node is normal, and generating fault detection information by a fault node table composed of fault standby server nodes counted by the main server node, wherein the fault node table comprises the IP of each fault server node;
the method also comprises the step of updating the node sorting table:
acquiring the fault detection information, and matching the IP address of the fault server node in the fault detection information in the node sorting table;
if the matching result is the IP of the main server node, the fault node is the current main server node, the current fault main server node is sequenced to the last of the shared storage cluster, and the sequence number of the current fault main server node in the node sequencing table is updated;
and if the matching result is the IP of the standby server node, indicating that the current main server operates normally, and the fault node is the standby server node, sequentially arranging the fault standby server node at the last of the shared storage cluster, and updating the sequence number of the fault standby server node in the node sequencing table.
2. A system for preventing split brain in a shared storage cluster, comprising: a split brain prevention module disposed in a shared storage device, the split brain prevention module comprising:
the initialization unit is used for initializing and sequencing the server nodes in the shared storage cluster;
the split brain detection unit is used for triggering a split brain detection mechanism when detecting that the shared storage device is about to have a split brain, sending a heartbeat detection instruction to the current main server node, and judging whether the current main server node has a fault;
the switching control unit is used for switching the main server and the standby server if the current main server node fails, switching the standby server node with the optimal sequence into a new main server node, and enabling the new main server node to be used as a unique server node to perform resource interactive access with the shared storage equipment;
the initialization unit is further configured to:
the IP addresses of all the server nodes correspond to the corresponding sequencing serial numbers one by one, and a node sequencing table is generated;
the split brain detection unit is further configured to:
after a heartbeat detection instruction is sent, if heartbeat detection instruction response information fed back by a main server node is not received within preset time, the main server node is judged to be in fault, and fault detection information containing a main server node IP is generated at the same time;
if heartbeat detection instruction response information fed back by the main server node is received within preset time, judging that the current main server node is normal, and generating fault detection information by a fault node table composed of fault standby server nodes counted by the main server node, wherein the fault node table comprises the IP of each fault server node;
the split brain prevention module further comprises a node information control unit for:
acquiring the fault detection information, and matching the IP address of the fault server node in the fault detection information in the node sorting table;
if the matching result is the IP of the main server node, the fault node is the current main server node, the current fault main server node is sequenced to the last of the shared storage cluster, and the sequence number of the current fault main server node in the node sequencing table is updated;
and if the matching result is the IP of the standby server node, indicating that the current main server operates normally, and the fault node is the standby server node, sequentially arranging the fault standby server node at the last of the shared storage cluster, and updating the sequence number of the fault standby server node in the node sequencing table.
3. The system of claim 2, further comprising a heartbeat detection module disposed in each server node, the heartbeat detection module comprising:
the ARP table is used for storing the IP addresses of all the server nodes except the server node in the shared storage cluster;
the heartbeat detection unit is used for sending ARP heartbeat request messages to all standby server nodes in the shared storage cluster by inquiring the ARP table after receiving the heartbeat detection instruction, and receiving response messages;
and the fault node detection unit is used for screening out the standby server nodes which do not send out the response messages, namely the standby server nodes which are judged to be in fault, counting the standby server nodes in fault, generating a fault node table and feeding the fault node table back to the split brain detection unit.
4. A computer storage medium on which a computer program is stored, characterized in that the computer program, when being executed by a processor, is adapted to carry out the method of claim 1.
CN202010326284.9A 2020-04-23 2020-04-23 Method, system and computer storage medium for preventing split brain of shared storage cluster Active CN111651291B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010326284.9A CN111651291B (en) 2020-04-23 2020-04-23 Method, system and computer storage medium for preventing split brain of shared storage cluster

Publications (2)

Publication Number Publication Date
CN111651291A CN111651291A (en) 2020-09-11
CN111651291B true CN111651291B (en) 2023-02-03

Family

ID=72346465

Country Status (1)

Country Link
CN (1) CN111651291B (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101291243A (en) * 2007-04-16 2008-10-22 广东省新支点技术服务有限公司 Split brain preventing method for highly available cluster system
CN101309167A (en) * 2008-06-27 2008-11-19 华中科技大学 Disaster allowable system and method based on cluster backup
JP2008305353A (en) * 2007-06-11 2008-12-18 Hitachi Ltd Cluster system and fail-over method
CN101582787A (en) * 2008-05-16 2009-11-18 中兴通讯股份有限公司 Double-computer backup system and backup method
CN102231681A (en) * 2011-06-27 2011-11-02 中国建设银行股份有限公司 High availability cluster computer system and fault treatment method thereof
CN102868560A (en) * 2012-09-28 2013-01-09 南京恩瑞特实业有限公司 System and method for realizing hot standby of servers
CN103279386A (en) * 2013-06-09 2013-09-04 浪潮电子信息产业股份有限公司 Method for achieving high availability of computer operation scheduling system
CN105934929A (en) * 2014-12-31 2016-09-07 华为技术有限公司 Post-cluster brain split quorum processing method and quorum storage device and system
CN107147528A (en) * 2017-05-23 2017-09-08 郑州云海信息技术有限公司 One kind stores gateway intelligently anti-fissure system and method
CN107454155A (en) * 2017-07-25 2017-12-08 北京三快在线科技有限公司 A kind of fault handling method based on load balancing cluster, device and system
CN108366086A (en) * 2017-12-25 2018-08-03 聚好看科技股份有限公司 A kind of method and device of control business processing
CN109271280A (en) * 2018-08-30 2019-01-25 重庆富民银行股份有限公司 Storage failure is switched fast processing method

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2003276045A1 (en) * 2002-10-07 2004-04-23 Fujitsu Siemens Computers, Inc. Method of solving a split-brain condition
JP2006253900A (en) * 2005-03-09 2006-09-21 Hitachi Ltd Method for ip address takeover, ip-address takeover program, server and network system
US10411948B2 (en) * 2017-08-14 2019-09-10 Nicira, Inc. Cooperative active-standby failover between network systems
TWI686696B (en) * 2018-08-14 2020-03-01 財團法人工業技術研究院 Compute node, failure detection method thereof and cloud data processing system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
实时数据库系统双机热备机制设计与实现;杨晓芬等;《计算机工程与应用》;20120116(第29期);全文 *

Also Published As

Publication number Publication date
CN111651291A (en) 2020-09-11

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant