CN109495312B

CN109495312B - Method and system for realizing high-availability cluster based on arbitration disk and double links

Info

Publication number: CN109495312B
Application number: CN201811479176.4A
Authority: CN
Inventors: 郑伟; 陈鹏; 王子骏
Original assignee: GUANGZHOU DINGJIA COMPUTER TECHNOLOGY Co Ltd
Current assignee: GUANGZHOU DINGJIA COMPUTER TECHNOLOGY Co Ltd
Priority date: 2018-12-05
Filing date: 2018-12-05
Publication date: 2020-01-17
Anticipated expiration: 2038-12-05
Also published as: CN109495312A

Abstract

The invention relates to a method and a system for implementing a high-availability cluster based on an arbitration disk and dual links, and belongs to the technical field of networks. The method comprises the following steps: when the first heartbeat network is detected to be abnormal, determining the state of a second heartbeat network; when the second heartbeat network is determined to be abnormal according to its state, determining the state of a storage network, wherein the storage network is the network corresponding to the storage server and is used to represent the state of the node server; and controlling the node server to perform active-standby switching according to the state of the storage network. This technical scheme addresses the problem of disordered active-standby switching in high-availability clusters: it effectively reduces erroneous switching caused by abnormal node server heartbeats and ensures that active-standby switching in the high-availability cluster proceeds normally.

Description

Method and system for realizing high-availability cluster based on arbitration disk and double links

Technical Field

The present invention relates to the field of network technologies, and in particular, to a method, a system, a computer device, and a storage medium for implementing a high-availability cluster based on an arbitration disk and a dual link.

Background

A High Availability (HA) cluster is a server clustering technique whose purpose is to reduce service interruption time. By protecting the application services that a user must provide without interruption, such as database services and web services, it minimizes the impact on the business of application server failures caused by human, software, or hardware factors. Existing high-availability cluster technology mainly monitors the state of a server through a heartbeat network.

In the process of implementing the invention, the inventors found at least the following problems in the prior art: at present, all nodes in a high-availability cluster are in a master-slave relationship with one another and are connected through a private network. When the private network is abnormal, corresponding switching or repair is carried out. In this case, however, erroneous switching occurs easily, which causes phenomena such as split brain and makes the active-standby switching of the high-availability cluster disordered.

Disclosure of Invention

Based on this, the embodiments of the present invention provide a method, a system, a computer device, and a storage medium for implementing a high-availability cluster based on an arbitration disk and a dual link, which can ensure normal operation of active-standby switching in the high-availability cluster.

The content of the embodiment of the invention is as follows:

A method for implementing a high-availability cluster based on an arbitration disk and dual links comprises the following steps: when the first heartbeat network is detected to be abnormal, determining the state of a second heartbeat network; when the second heartbeat network is determined to be abnormal according to its state, determining the state of a storage network, wherein the storage network is the network corresponding to the storage server and is used to represent the state of the node server; and controlling the node server to perform active-standby switching according to the state of the storage network.

In one embodiment, the step of controlling the node server to perform active-standby switching according to the state of the storage network includes: when the storage network is determined to be abnormal according to its state, judging that the current active node server is down, determining a new active node server, and performing active-standby switching; when the storage network is determined to be normal according to its state, judging that the heartbeat network is abnormal, unloading the application service of the current active node server, determining a new active node server, and performing active-standby switching.

In one embodiment, after the step of determining the state of the second heartbeat network when the abnormality of the first heartbeat network is detected, the method further includes: when the second heartbeat network is determined to be normal according to its state, controlling repair of the current active node server.

In one embodiment, the step of controlling repair of the current active node server includes: sending a repair instruction to the current active node server, the repair instruction being used to control the current active node server to repair its resources; acquiring the repair state of the current active node server; and when the repair state shows that the current active node server has failed to be repaired, unloading the application service of the current active node server, determining a new active node server, and performing active-standby switching.

In one embodiment, the step of determining a new active node server and performing active-standby switching includes: selecting a standby node server as the new active node server; and demoting the current active node server to a standby node server and promoting the new active node server to the active node server.

In one embodiment, after the step of promoting the new active node server to the active node server, the method further includes: sending a heartbeat update instruction to the new active node server, the heartbeat update instruction being used to control the new active node server to update its storage heartbeat information, wherein the storage heartbeat is the heartbeat corresponding to the storage server; and monitoring the first heartbeat network, the second heartbeat network, and/or the storage network.

In one embodiment, before the step of determining the state of the second heartbeat network when the abnormality of the first heartbeat network is detected, the method further includes: sending a heartbeat network connection instruction to the connected node server; the heartbeat network connection instruction is used for controlling the node server to connect the first heartbeat network and the second heartbeat network.

In one embodiment, before the step of determining the state of the storage network when the second heartbeat network is determined to be abnormal according to its state, the method further includes: sending a disk partitioning instruction to the storage server, the disk partitioning instruction being used to control the storage server to partition a specific disk space as an arbitration disk; establishing a storage network according to the arbitration disk, and connecting to the storage network; and sending a storage network connection instruction to the connected node server, the storage network connection instruction being used to control the node server to connect to the storage network.

Correspondingly, an embodiment of the present invention provides a high availability cluster system based on an arbitration disk and a dual link, including: the heartbeat network state determining module is used for determining the state of the second heartbeat network when the first heartbeat network is detected to be abnormal; the storage network state determining module is used for determining the state of the storage network when the second heartbeat network is determined to be abnormal according to the state of the second heartbeat network; the storage network is a network corresponding to the storage server and is used for representing the state of the node server; and the active-standby switching module is used for controlling the node server to switch the active-standby according to the state of the storage network.

According to the above method and system for implementing a high-availability cluster based on an arbitration disk and dual links, whether the heartbeat network of the high-availability cluster is abnormal is determined from the states of the two heartbeat networks; when the heartbeat network is confirmed to be abnormal, the active-standby switching of the node server is controlled according to the state of the storage network. This effectively reduces erroneous switching caused by abnormal node server heartbeats and ensures that active-standby switching in the high-availability cluster proceeds normally.

A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program: when the first heartbeat network is detected to be abnormal, determining the state of a second heartbeat network; when the second heartbeat network is determined to be abnormal according to the state of the second heartbeat network, determining the state of a storage network; the storage network is a network corresponding to the storage server and is used for representing the state of the node server; and controlling the node server to perform active-standby switching according to the state of the storage network.

The computer device can effectively reduce erroneous switching caused by abnormal node server heartbeats and ensure that active-standby switching in the high-availability cluster proceeds normally.

A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of: when the first heartbeat network is detected to be abnormal, determining the state of a second heartbeat network; when the second heartbeat network is determined to be abnormal according to the state of the second heartbeat network, determining the state of a storage network; the storage network is a network corresponding to the storage server and is used for representing the state of the node server; and controlling the node server to perform active-standby switching according to the state of the storage network.

The computer-readable storage medium can effectively reduce erroneous switching caused by abnormal node server heartbeats and ensure that active-standby switching in the high-availability cluster proceeds normally.

Drawings

FIG. 1 is a diagram of an embodiment of an application environment for a method for implementing a high availability cluster based on an arbitration disk and a dual link;

FIG. 2 is a flow chart illustrating a method for implementing a high availability cluster based on an arbitration disk and a dual link according to an embodiment;

FIG. 3 is a schematic structural diagram of a system for implementing a high availability cluster based on an arbitration disk and a dual link according to an embodiment;

FIG. 4 is a schematic flow chart of a method for implementing a high availability cluster based on an arbitration disk and a dual link in another embodiment;

FIG. 5 is a block diagram of a high availability cluster system based on an arbitration disk and a dual link in another embodiment;

FIG. 6 shows an internal structure of a computer device according to an embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

The method for implementing a high-availability cluster based on an arbitration disk and dual links can be applied to the application environment shown in FIG. 1. The application environment includes a control server 101 (which may also be referred to as a high-availability server, or HA server), a node server 102, a storage server 103, a first heartbeat network 104, and a second heartbeat network 105. These servers may be connected via an IP network. In practice, the control server detects the network states of the first heartbeat network and the second heartbeat network; when both are abnormal, it detects the storage network corresponding to the storage server and performs active-standby switching of the node servers according to the state of the storage network. The control server, the node server, and the storage server can each be implemented as an independent server or as a server cluster consisting of a plurality of servers; the first heartbeat network and the second heartbeat network may be implemented as a private network or a public network.

In one embodiment, the control server mainly implements the following functions: 1. serving as the centralized configuration management entry point and providing a web/graphical interface for configuring and managing high-availability clusters; 2. controlling the storage server to partition its storage space and generate an arbitration disk, initializing the arbitration disk, and allocating the usable space of the storage server; 3. monitoring the active node server to ensure that it can run the corresponding application service normally; 4. when the active node server is abnormal or has failed, selecting a standby node server as the new active node server and issuing an instruction that notifies the new active node server to take over the service.

The storage server mainly provides the following functions: 1. receiving instructions from the control server to partition the storage space; 2. controlling access permissions to ensure that the node servers can access the storage server normally.

The node server mainly implements the following functions: 1. receiving instructions from the control server to monitor and upload the storage heartbeat and the network heartbeat; 2. receiving instructions from the control server to take over the application service and become the new active node server.

The embodiment of the invention provides a method, a system, computer equipment and a storage medium for realizing a high-availability cluster based on an arbitration disk and double links. The following are detailed below.

In one embodiment, as shown in FIG. 2, a method for implementing a high availability cluster based on an arbitration disk and a dual link is provided. Taking the application of the method to the control server in fig. 1 as an example for explanation, the method includes the following steps:

S201, when the first heartbeat network is detected to be abnormal, the state of the second heartbeat network is determined.

The heartbeat network (comprising the first heartbeat network and the second heartbeat network) is a network over which every node server in the cluster can report heartbeat information, resource state information, and abnormal resource reports to the HA server; it can be implemented as a private network. The state of the heartbeat network may refer either to the access state of a node server to the heartbeat network or to the network state of the heartbeat network itself. For example, when the active node server S connected to heartbeat network A cannot access heartbeat network A normally (in which case the active node server S may have a resource abnormality), the state of heartbeat network A appears abnormal; when the active node server S can access heartbeat network A normally but heartbeat network A itself has failed, the state of heartbeat network A also appears abnormal. In addition, the heartbeat network and the service network (the network used when the control server, the node server, and the storage server perform service processing) are both, in essence, local area networks or wide area networks. The heartbeat network can be isolated from the service network, which prevents a service network failure from making the heartbeats of all nodes in the cluster abnormal and leaving the HA server unable to judge the state of the node servers correctly.

There are at least two possible causes of a first heartbeat network abnormality: 1. the node server has a resource abnormality, which makes the heartbeat of the node server abnormal and in turn makes the first heartbeat network appear abnormal; 2. the heartbeat network itself is abnormal. When only the heartbeat network has failed, some resources on the node server may still be normally accessible. If the storage and services of the original active node server are not cleared during active-standby switching, two processes may access the same file at the same time after the switch, or the new active node may become inaccessible because of inconsistent data. In that case, all resources on the active node must be stopped before switching; otherwise some resources may be accessible on both nodes at the same time, with unpredictable results. For this reason, the state of the second heartbeat network is used to determine whether the abnormality of the first heartbeat network is a node heartbeat abnormality or a heartbeat network abnormality, thereby reducing the risk of erroneous switching.

It should be noted that, in the embodiment of the present invention, the node server is also referred to as a node. In addition, the two groups of heartbeat networks in the embodiment of the invention can be called as a double-heartbeat network, a double-heartbeat link and the like.

S202, when the second heartbeat network is determined to be abnormal according to the state of the second heartbeat network, determining the state of a storage network; the storage network is a network corresponding to the storage server and is used for representing the state of the node server.

The node server may also be referred to as an application server, which refers to a server providing an application service; the node servers can be multiple, and can comprise an active node server and a standby node server. When the current primary node server fails, the control server selects one standby server as a new primary node server. The node server communicates with the control server over the private network and maintains the heartbeat.

The storage server can set aside a certain amount of space as an arbitration disk, and a corresponding storage network is established on the basis of the arbitration disk. The arbitration disk may be FC-SAN, IP-SAN, NAS, or the like, and the corresponding storage network may be referred to as a shared storage network. Both the node server and the control server may be connected to the storage server. After a node server establishes a connection with the storage server, it can update its storage heartbeat and upload the storage heartbeat to the storage server. Whether the arbitration disk receives the storage heartbeat of a node server indicates whether that node server is online. That is, the state of the storage network may refer to the heartbeat state of a node server connected to the storage server: when the storage server can receive the storage heartbeat of the node server, the storage network is considered normal and the corresponding node server is online; when the storage server cannot receive the storage heartbeat of the node server, the storage network is considered abnormal and the corresponding node server is offline (down).
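The storage heartbeat check described above can be illustrated with a minimal sketch. The class `ArbitrationDisk`, its in-memory `slots` dictionary, and the 5-second `timeout` are assumptions made for illustration only; the patent does not specify how the heartbeat is stored on the arbitration disk or how long a missing heartbeat is tolerated.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ArbitrationDisk:
    """In-memory stand-in for the arbitration disk (hypothetical layout)."""

    # Maps a node identifier to the time of its last storage heartbeat.
    slots: Dict[str, float] = field(default_factory=dict)

    def write_heartbeat(self, node_id: str, now: float) -> None:
        # A node server updates its storage heartbeat on the shared disk.
        self.slots[node_id] = now

    def is_online(self, node_id: str, now: float, timeout: float = 5.0) -> bool:
        # The node counts as online only if a heartbeat was received and
        # it is not older than the (assumed) timeout.
        last = self.slots.get(node_id)
        return last is not None and (now - last) <= timeout
```

With this sketch, a node that wrote its heartbeat at time 100.0 is reported online at 103.0 and offline at 106.0, and a node that never wrote a heartbeat is reported offline.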

Further, when the first heartbeat network and the second heartbeat network are both abnormal, the state detection of the arbitration disk can be performed, that is, the control server detects the state of the storage network to determine whether the currently-operating active node server is down, so as to determine whether the active-standby switching needs to be performed after the resources of the active node server are cleared, thereby reducing the risk of split brain.

S203, controlling the node server to perform active-standby switching according to the state of the storage network.

The active-standby switching of the node server may refer to degrading the current active node server into a standby node server; and one of the original standby node servers is selected as a new main node server.

In this step, the control server controls the active-standby switching of the node server according to the state of the storage network. This prevents resource data that is accessible from several nodes from falling out of sync and prevents resource anomalies.

Existing high-availability cluster technology mainly monitors the state of a server through a heartbeat network. All nodes are in a master-slave relationship with one another and are connected through a private network; each node detects its own resource state and broadcasts it to all nodes in the cluster through the private heartbeat network, and if a resource is abnormal, the heartbeat information changes. After the other nodes capture the change, corresponding switching or repair is carried out. However, this heartbeat monitoring technique has the following problems: 1. There is a risk of split brain. When a single monitored environment is abnormal for a short time, the cluster switches over, but the faulty environment recovers quickly; resource data can then fall out of sync or become abnormal, and split brain can result. 2. There is a risk of erroneous failover. When the heartbeat of a node is abnormal for an instant and then returns to normal, the change or interruption of the heartbeat information can trigger a switch, but a switch in this case is wrong. 3. Configuration and management are complex. There is no unified management entry point within the high-availability cluster, and the high-availability configuration must be carried out by connecting to each node server individually. A node server must be configured for high availability before it joins the high-availability cluster; this configuration may include the resource information monitored by the node, the expected initial resource state of the node, the heartbeat detection mode of the node, the fault handling mechanism, some basic configuration required by the HA agent, and the like.

According to this embodiment, erroneous switching caused by abnormal node server heartbeats can be effectively reduced, and active-standby switching in the high-availability cluster proceeds normally. Meanwhile, the control server serves as a unified configuration management entry point through which the high-availability cluster is configured centrally, which effectively simplifies the configuration and management of the high-availability cluster.
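Steps S201 to S203 amount to a small decision procedure on the control server, which the following sketch expresses as a pure function. The names `Action` and `decide_action` are hypothetical, and the `NONE` result for a healthy first heartbeat network is an assumption, since the method only describes what happens once the first heartbeat network is abnormal.

```python
from enum import Enum


class Action(Enum):
    NONE = "no action"  # assumption: nothing to do while heartbeat 1 is normal
    REPAIR_ACTIVE = "repair the current active node server"
    SWITCH_DIRECTLY = "active node is down: switch without unloading"
    UNLOAD_THEN_SWITCH = "heartbeat networks failed: unload, then switch"


def decide_action(first_hb_ok: bool, second_hb_ok: bool, storage_ok: bool) -> Action:
    """Map the three observed states to the action described in S201-S203."""
    if first_hb_ok:
        return Action.NONE
    # S201: the first heartbeat network is abnormal, so consult the second.
    if second_hb_ok:
        return Action.REPAIR_ACTIVE
    # S202: both heartbeat networks are abnormal, so consult the storage network.
    if not storage_ok:
        # S203: no storage heartbeat either, so the active node is judged down.
        return Action.SWITCH_DIRECTLY
    # S203: the node is still online, so its application service must be
    # unloaded before the switch to avoid two nodes serving the same resources.
    return Action.UNLOAD_THEN_SWITCH
```

Keeping the decision free of side effects makes each branch easy to check against the text before any instruction is actually sent to a node server.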

In one embodiment, before the states of the first heartbeat network and the second heartbeat network are detected, the method may further include steps of creating the heartbeat networks and connecting to them. That is, before detecting the state of the first heartbeat network, the method further comprises: sending a heartbeat network connection instruction to the connected node server, the heartbeat network connection instruction being used to control the node server to connect to the first heartbeat network and the second heartbeat network.

Specifically, the heartbeat networks may be established and connected as follows: two sets of private networks, independent of each other, are established (such as private networks A and B in FIG. 3, which is a schematic structural diagram of a system for implementing a high-availability cluster based on an arbitration disk and dual links). All node servers and the control server in the cluster join both private networks, so that all servers in the cluster can communicate normally over the two private networks.

The node server may send a network heartbeat to one of the heartbeat networks after establishing a connection with the private network. The control server can indirectly acquire the state of the node server by monitoring the state of the heartbeat.

The node server checks in real time whether its resource state matches the expected state set at initialization. When the resource state changes, the heartbeat carries the resource abnormality information, and the corresponding heartbeat network then appears abnormal. The resources of the node server are the IP addresses, storage, services, and application programs monitored in the HA task (the monitoring services registered by the application). A resource abnormality is an abnormal resource state, including an unreachable IP, inaccessible storage, an abnormal service state, and the like. A resource abnormality on the node server can also be regarded as an application service failure.
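The comparison of observed resource states against the expected states can be sketched as follows. The function names, the state strings, and the dictionary shape of the heartbeat message are all hypothetical; the patent does not define a heartbeat message format.

```python
from typing import Dict, List


def find_abnormal_resources(expected: Dict[str, str], observed: Dict[str, str]) -> List[str]:
    """Return the names of resources whose observed state differs from the expected one."""
    return sorted(
        name for name, state in expected.items() if observed.get(name) != state
    )


def build_heartbeat(node_id: str, expected: Dict[str, str], observed: Dict[str, str]) -> dict:
    """Build a heartbeat message that carries any resource abnormality information."""
    abnormal = find_abnormal_resources(expected, observed)
    return {
        "node": node_id,
        "status": "abnormal" if abnormal else "normal",
        "abnormal_resources": abnormal,
    }
```

For example, if the expected state of a resource named `service` is `running` and the node observes `stopped`, the heartbeat reports the status `abnormal` and lists `service` as the abnormal resource.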

In this embodiment, after two sets of heartbeat networks are established, all the node servers and the control servers are controlled to connect with the heartbeat network, so that a dual-heartbeat confirmation mechanism can be implemented, and whether the abnormality of the heartbeat network is caused by the network abnormality or the node server abnormality can be correctly identified.

In one embodiment, before the step of determining the state of the storage network when the second heartbeat network is determined to be abnormal according to its state, the method may further include: sending a disk partitioning instruction to the storage server, the disk partitioning instruction being used to control the storage server to partition a specific disk space as an arbitration disk; establishing a storage network according to the arbitration disk, and connecting to the storage network; and sending a storage network connection instruction to the connected node server, the storage network connection instruction being used to control the node server to connect to the storage network. After the disk space corresponding to the arbitration disk has been partitioned from the storage server, the control server may also format that disk space.

The control server can also send an initialization instruction to the node server, so that the node server initializes the arbitration disk, initializes its storage heartbeat, and sends the storage heartbeat to the control server. In this way, the node server can communicate with the control server through the shared storage and maintain its storage heartbeat. In addition, after the node server and the control server have accessed the storage server, normal reading and writing of the arbitration disk must be guaranteed.
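The order of the setup instructions issued by the control server can be summarized in a short sketch. The function `arbitration_setup_plan`, the target name `control-server`, the instruction strings, and the `size_mb` parameter are illustrative assumptions; the patent names the instructions but not their format or the size of the arbitration disk.

```python
from typing import List, Tuple


def arbitration_setup_plan(
    storage_server: str, node_servers: List[str], size_mb: int
) -> List[Tuple[str, str]]:
    """Return (target, instruction) pairs in the order they would be issued."""
    plan = [
        (storage_server, f"partition {size_mb} MB as arbitration disk"),
        (storage_server, "format arbitration disk"),
        ("control-server", "establish and connect storage network"),
    ]
    # Every node server is told to connect to the storage network ...
    for node in node_servers:
        plan.append((node, "connect storage network"))
    # ... and then to initialize the arbitration disk and its storage heartbeat.
    for node in node_servers:
        plan.append((node, "initialize arbitration disk and storage heartbeat"))
    return plan
```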

After all the node servers and the control server are connected with the heartbeat network and the storage network, the state monitoring of application services (including services, storage and networks) and the state monitoring of the node servers can be carried out through the existing high-availability cluster method, and the normal operation of the node servers is ensured.

In this embodiment, an arbitration disk is partitioned on the storage server and the corresponding storage network is established, so that the connection state of each connected node server can be acquired in real time, and the state of a node server can be determined from its storage heartbeat when both heartbeat networks are abnormal.

In an embodiment, the step of controlling the node server to perform active-standby switching according to the state of the storage network includes: when the storage network is determined to be abnormal according to its state, judging that the current active node server is down, determining a new active node server, and performing active-standby switching; when the storage network is determined to be normal according to its state, judging that the heartbeat network is abnormal, unloading the application service of the current active node server, determining a new active node server, and performing active-standby switching.

If the current active node server is down, it is offline, so even if active-standby switching is performed, no resource can be accessible on both nodes at once. A new active node server can therefore be determined directly and active-standby switching carried out. When only the heartbeat network is abnormal, on the other hand, some resources of the active node may still be normally accessible; all resources on the active node must then be stopped first, and only afterwards is active-standby switching performed.

In this embodiment, the state of the storage network determines whether the resources of the node server must be cleared before the failover is performed. By means of the arbitration disk mechanism, the application service on the active node server can be unloaded when both heartbeat networks are abnormal, which ensures the consistency of application service migration and of service states and prevents split brain.
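The ordering constraint in this embodiment (unload before switching when the old active node is still online) can be made explicit in a small sketch. The function `failover_steps` and its step strings are hypothetical and only record the order of operations; they do not represent real cluster commands.

```python
from typing import List


def failover_steps(storage_ok: bool, active: str, new_active: str) -> List[str]:
    """List the failover operations in order, given the storage network state."""
    steps: List[str] = []
    if storage_ok:
        # The old active node is still online (only the heartbeat networks
        # failed), so its application service is unloaded before the switch.
        steps.append(f"unload application service on {active}")
    # If the storage network is abnormal, the old active node is judged down
    # and the switch can proceed directly.
    steps.append(f"demote {active} to standby")
    steps.append(f"promote {new_active} to active")
    return steps
```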

In one embodiment, after the step of determining the state of the second heartbeat network when the abnormality of the first heartbeat network is detected, the method further includes: and when the second heartbeat network is determined to be normal according to the state of the second heartbeat network, controlling and repairing the current active node server. In the process of repairing the active node server, the method may further include a step of sending a repair alarm, where the repair alarm may be: the first heartbeat network fails.

When the second heartbeat network is determined to be normal, it is indicated that the current active node server can normally access the network, so that it is possible that the active node network has a resource abnormality, that is, the node server is abnormal rather than the network abnormality.

Further, the step of controlling and repairing the current active node server includes: sending a repair instruction to the current main node server; the repair instruction is used for controlling the current main node server to repair resources; acquiring the repair state of the current main node server; and when the current node main server is determined to fail to be repaired according to the repair state, unloading the application service of the current main node server, determining a new main node server and performing main-standby switching.

Sending the repair instruction to control the repair of the node main server can be understood as automatic repair. The automatic repair can control the main node server to automatically repair the resource state. In addition, when a fault which cannot be automatically repaired occurs, the fault can be manually repaired.

In this embodiment, when only one heartbeat network is abnormal, the active node server is repaired first. Only when the repair fails is the application service on the current active node server cleared and active-standby switching performed. This effectively prevents the split-brain that would result from switching the active and standby node servers as soon as one heartbeat network has a problem, and ensures normal fault repair when the active node server has a resource abnormality.
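As an illustration only, the repair-first handling described in this embodiment can be sketched in Python as follows. The function name `repair_or_switch`, its two callbacks and the action strings are hypothetical and do not appear in the original disclosure; the sketch only records the order in which the control server would act.

```python
from typing import Callable, List


def repair_or_switch(
    send_repair: Callable[[], None],
    repair_succeeded: Callable[[], bool],
) -> List[str]:
    """Return the ordered actions taken when only one heartbeat network is abnormal.

    Both callbacks are hypothetical stand-ins: `send_repair` sends the repair
    instruction, `repair_succeeded` reports the acquired repair state.
    """
    actions = ["send_repair_instruction"]
    send_repair()
    if repair_succeeded():
        # Repair worked: the current active node server keeps its role.
        actions.append("keep_current_active")
    else:
        # Repair failed: unload first, then pick a new active node and switch.
        actions += [
            "unload_application_service",
            "select_new_active",
            "switch_roles",
        ]
    return actions
```

Passing the repair check as a callback keeps the sketch independent of how the repair state is actually acquired.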

In an embodiment, the step of determining a new active node server and performing active-standby switching includes: selecting a standby node server as the new active node server; and demoting the current active node server to a standby node server and promoting the new active node server to the active node server. As shown in fig. 3, one node server may be selected from a plurality of standby node servers as the new active node server.

Each standby node server may be assigned a priority. When the control server selects a new active node server from the standby node servers, a standby node server with a higher priority is selected in preference.

In addition, the control server may also detect the status of the standby node servers. For example, when standby node C is abnormal, C is skipped even if it is the first-priority standby node, and a standby node in a normal state is selected for switching.

Meanwhile, the abnormality of the standby node server can also be repaired, for example, manually repaired by a user.

Further, after the current active node server is demoted to a standby node server, it can be marked as abnormal and the user can be notified that repair is needed.

In this embodiment, one standby node server can be selected as a new active node server, and it can be effectively ensured that the high-availability cluster continuously provides services to the outside.
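A minimal Python sketch of the selection and role swap described above is given below. The `Node` fields, the convention that a smaller `priority` value means a higher priority, and the function names are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    priority: int            # assumed convention: smaller value = higher priority
    healthy: bool = True     # state detected by the control server
    role: str = "standby"    # "active" or "standby"
    flagged: bool = False    # marked abnormal, user should repair it


def select_new_active(standbys: List[Node]) -> Optional[Node]:
    """Pick the healthy standby with the highest priority, skipping abnormal ones."""
    candidates = [n for n in standbys if n.role == "standby" and n.healthy]
    return min(candidates, key=lambda n: n.priority) if candidates else None


def switch_roles(old_active: Node, standbys: List[Node]) -> Optional[Node]:
    """Demote the old active node, promote the selected standby, return it."""
    new_active = select_new_active(standbys)
    if new_active is None:
        return None  # no usable standby; roles are left unchanged
    old_active.role = "standby"
    old_active.flagged = True
    new_active.role = "active"
    return new_active
```

With standby C given the first priority but marked unhealthy, the sketch skips C and promotes the next healthy standby, as in the example of standby node C above.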

After the active-standby switching is completed, the storage heartbeat of the new active node server may not match the storage server. To ensure normal operation of the high-availability cluster, the new active node server must be controlled to update the storage heartbeat information.

After the active-standby switching is completed, the control server continues to monitor the states of the first heartbeat network, the second heartbeat network and the storage network in real time. If a heartbeat network fails again, heartbeat detection is repeated over the dual heartbeat links, and the node server is controlled according to the state of the other heartbeat network.

In this embodiment, after the active-standby switching is completed, the new active node server is controlled to update the storage heartbeat, and the control server monitors the states of the heartbeat networks and the storage network in real time. This ensures that the high-availability cluster continues to operate normally after the active-standby switching.
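The disclosure does not specify the format of the storage heartbeat. Purely as an assumption, the sketch below models it as a record holding the writing node and a write time, and treats the storage heartbeat as normal only when the record belongs to the current active node server and is recent enough. The names `StorageHeartbeat`, `storage_heartbeat_ok` and the `timeout` default are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StorageHeartbeat:
    owner: str         # node server that last wrote the record (assumed field)
    written_at: float  # write time in seconds on any consistent clock (assumed field)


def storage_heartbeat_ok(
    hb: StorageHeartbeat, active: str, now: float, timeout: float = 5.0
) -> bool:
    """True when the record was written by the active node and is not stale."""
    return hb.owner == active and (now - hb.written_at) <= timeout
```

Under this model, a record left by the old active node server is reported as abnormal until the new active node server writes its own record, which is why the update is needed after switching.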

To aid understanding of the above method, an application example of the method for implementing a high-availability cluster based on an arbitration disk and a dual link is described in detail below, as shown in fig. 4.

Step s11: after the high-availability cluster is established, the control server sends a disk dividing instruction to the storage server to control the storage server to divide the space, format the arbitration disk and establish the corresponding storage network. The control server and the node servers then both access this storage network.

Step s12: the control server sends a heartbeat initialization instruction to the node server. According to this instruction, the node server initializes the arbitration disk information and initializes the storage heartbeat.

Step s13: two private networks, A and B, are established. The control server and the node servers all access both private networks.

Step s14: the node server sends the storage heartbeat to the storage network.

Step s15: the control server monitors the storage heartbeat.

Step s16: the node server sends the network heartbeat to private network A.

Step s21: when the control server detects that heartbeat network A has failed, it checks the state of heartbeat network B; if heartbeat network B is normal, it sends a repair alarm and performs normal fault repair and switching (proceeding to step s31); if heartbeat network B is abnormal, step s22 is executed.

Step s22: the control server checks the storage heartbeat of the arbitration disk; if the storage heartbeat is normal, step s23 is executed; if the storage heartbeat is abnormal, step s24 is executed.

Step s23: the control server sends an instruction to unload the application service to the current active node server through the arbitration disk. The current active node server (the old active node server) unloads the application service according to the instruction.

Step s24: the control server selects a standby node server for application service migration and promotes it to be the new active node server; it demotes the old active node server, sets it to an abnormal state, and sends warning information.

Step s25: the new active node server takes over the application service of the old active node server.

Step s26: the new active node server updates the storage heartbeat information.

Step s27: the control server monitors the new storage heartbeat and the network heartbeat. If the network heartbeat becomes abnormal again, step s21 is executed.
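The decision made in steps s21 to s24 can be summarized by the following Python sketch. The enum `Action` and the function `arbitrate` are hypothetical names; the handling of a failure on heartbeat network B alone follows step s31, which treats an abnormality of only one of the two heartbeat networks as a repair case.

```python
from enum import Enum


class Action(Enum):
    KEEP_MONITORING = "keep monitoring"
    REPAIR_ACTIVE = "repair active node server (step s31)"
    UNLOAD_THEN_SWITCH = "unload via arbitration disk, then switch (steps s23, s24)"
    SWITCH_DIRECTLY = "active node server judged down, switch (step s24)"


def arbitrate(hb_a_ok: bool, hb_b_ok: bool, storage_hb_ok: bool) -> Action:
    """Map the three observed states to the control server's next action."""
    if hb_a_ok and hb_b_ok:
        return Action.KEEP_MONITORING
    if hb_a_ok or hb_b_ok:
        # Only one heartbeat network is abnormal: repair before any switching.
        return Action.REPAIR_ACTIVE
    if storage_hb_ok:
        # Both heartbeat networks are abnormal but the node still writes its
        # storage heartbeat: the node is alive, so unload it before switching.
        return Action.UNLOAD_THEN_SWITCH
    # No heartbeat on any path: the active node server is judged to be down.
    return Action.SWITCH_DIRECTLY
```

The storage heartbeat is consulted only after both heartbeat networks are found abnormal, which is what keeps a single link failure from triggering a switch.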

Step s31: when the control server receives a failure message for the application service of the active node server, or when only one of heartbeat network A and heartbeat network B is abnormal, the control server sends a repair instruction to the current active node server; if the repair of the active node server fails, step s32 is executed.

Step s32: the control server sends an instruction to unload the application service to the current active node server through the network.

Step s33: the control server selects a new active node server and controls it to take over the application service of the old active node server.

Step s34: the new active node server updates the storage heartbeat information on the arbitration disk, and the control server monitors the new storage heartbeat.
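Steps s23 and s32 deliver the unload instruction over different channels. The sketch below, with the hypothetical function `unload_channel`, shows one way to express that choice: the network is used while at least one heartbeat network is normal, and the arbitration disk is used when both heartbeat networks are abnormal but the storage heartbeat is normal. Returning `None` when nothing is reachable is an assumption that corresponds to the active node server being judged down.

```python
from typing import Optional


def unload_channel(
    hb_a_ok: bool, hb_b_ok: bool, storage_hb_ok: bool
) -> Optional[str]:
    """Pick the channel for the unload instruction (illustrative names only)."""
    if hb_a_ok or hb_b_ok:
        return "network"            # as in step s32
    if storage_hb_ok:
        return "arbitration_disk"   # as in step s23
    return None                     # active node judged down; nothing to unload
```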

In this embodiment, the arbitration disk and dual-link cluster mode avoids erroneous switching as far as possible while the application service is normal, and effectively prevents split-brain, making the high-availability cluster more stable and reliable.

It should be noted that, for the sake of simplicity, the foregoing method embodiments are described as a series of acts or combinations, but those skilled in the art should understand that the present invention is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present invention.

Based on the same idea as the implementation method of the arbitration disk and double link-based high availability cluster in the above embodiment, the present invention further provides an arbitration disk and double link-based high availability cluster system, which can be used to execute the implementation method of the arbitration disk and double link-based high availability cluster. For convenience of illustration, the structure diagram of the embodiment of the high availability cluster system based on the arbitration disk and the dual link only shows a part related to the embodiment of the present invention, and those skilled in the art will understand that the illustrated structure does not constitute a limitation to the system, and may include more or less components than those illustrated, or combine some components, or arrange different components.

As shown in fig. 5, the arbitration disk and dual-link based high-availability cluster system includes a heartbeat network state determining module 501, a storage network state determining module 502, and an active/standby switching module 503, which are described in detail as follows:

A heartbeat network state determining module 501, configured to determine the state of the second heartbeat network when the first heartbeat network is detected to be abnormal.

A storage network state determining module 502, configured to determine the state of the storage network when the second heartbeat network is determined to be abnormal according to the state of the second heartbeat network; the storage network is a network corresponding to the storage server and is used for representing the state of the node server.

An active/standby switching module 503, configured to control the node server to perform active-standby switching according to the state of the storage network.

According to the embodiment, the phenomenon of mistaken switching caused by abnormal heartbeat of the node server can be effectively reduced, and normal operation of main-standby switching in the high-availability cluster is ensured.

In one embodiment, the active/standby switching module 503 includes: a first switching module, configured to judge, when the storage network is determined to be abnormal according to the state of the storage network, that the current active node server is down, determine a new active node server and perform active-standby switching; and a second switching module, configured to judge, when the storage network is determined to be normal according to the state of the storage network, that the heartbeat network is abnormal, unload the application service of the current active node server, determine a new active node server and perform active-standby switching.

In one embodiment, the system further comprises: a repair module, configured to control repair of the current active node server when the second heartbeat network is determined to be normal according to the state of the second heartbeat network.

In one embodiment, the repair module includes: a repair instruction sending submodule, configured to send a repair instruction to the current active node server, the repair instruction being used to control the current active node server to repair its resources; a repair state obtaining submodule, configured to obtain the repair state of the current active node server; and an active/standby switching submodule, configured to unload the application service of the current active node server, determine a new active node server and perform active-standby switching when it is determined from the repair state that the repair of the current active node server has failed.

In one embodiment, the system further comprises: a server selection module, configured to select one standby node server as the new active node server; and an active-standby switching module, configured to demote the current active node server to a standby node server and promote the new active node server to the active node server.

In one embodiment, the system further comprises: a heartbeat updating module, configured to send a heartbeat updating instruction to the new active node server, the heartbeat updating instruction being used to control the new active node server to update the storage heartbeat information, the storage heartbeat being a heartbeat corresponding to the storage server; and a heartbeat monitoring module, configured to monitor the first heartbeat network, the second heartbeat network and/or the storage network.

In one embodiment, the system further comprises: a heartbeat network connection module, configured to send a heartbeat network connection instruction to the connected node server, the heartbeat network connection instruction being used to control the node server to connect to the first heartbeat network and the second heartbeat network.

In one embodiment, the system further comprises: a storage network connection module, configured to send a disk dividing instruction to the storage server, the disk dividing instruction being used to control the storage server to divide a specific disk space as an arbitration disk; establish a storage network according to the arbitration disk and connect to the storage network; and send a storage network connection instruction to the connected node server, the storage network connection instruction being used to control the node server to connect to the storage network.

It should be noted that the arbitration disk and dual-link based high-availability cluster system of the present invention corresponds one to one to the arbitration disk and dual-link based high-availability cluster implementation method of the present invention. The technical features and beneficial effects described in the method embodiments above also apply to the system embodiments; for specific content, refer to the description of the method embodiments, which is not repeated here.

In addition, in the above exemplary embodiment of the arbitration disk and dual-link based high availability cluster system, the logical division of the program modules is only an example, and in practical applications, the above function allocation may be performed by different program modules according to needs, for example, due to configuration requirements of corresponding hardware or due to convenience of implementation of software, that is, the internal structure of the arbitration disk and dual-link based high availability cluster system is divided into different program modules to perform all or part of the above described functions.

In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 6. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing data such as network states. The network interface of the computer device is used for communicating with an external terminal, a storage server, a node server, and the like through network connection. The computer program is executed by a processor to implement a method for implementing a high availability cluster based on an arbitration disk and a dual link.

Those skilled in the art will appreciate that the architecture shown in fig. 6 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.

In one embodiment, a computer device is provided, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program: when the first heartbeat network is detected to be abnormal, determining the state of a second heartbeat network; when the second heartbeat network is determined to be abnormal according to the state of the second heartbeat network, determining the state of a storage network; the storage network is a network corresponding to the storage server and is used for representing the state of the node server; and controlling the node server to perform active-standby switching according to the state of the storage network.

In one embodiment, the processor, when executing the computer program, further performs the steps of: when determining that the storage network is abnormal according to the state of the storage network, judging that the current main node server is down, determining a new main node server and performing main-standby switching; when the storage network is determined to be normal according to the state of the storage network, judging that the heartbeat network is abnormal; and unloading the application service of the current main node server, determining a new main node server and performing main-standby switching.

In one embodiment, the processor, when executing the computer program, further performs the steps of: and when the second heartbeat network is determined to be normal according to the state of the second heartbeat network, controlling and repairing the current active node server.

In one embodiment, the processor, when executing the computer program, further performs the steps of: sending a repair instruction to the current active node server, the repair instruction being used to control the current active node server to repair its resources; acquiring the repair state of the current active node server; and, when it is determined from the repair state that the repair of the current active node server has failed, unloading the application service of the current active node server, determining a new active node server and performing active-standby switching.

In one embodiment, the processor, when executing the computer program, further performs the steps of: selecting a standby node server as a new main node server; and degrading the current main node server into a standby node server, and upgrading the new main node server into a main node server.

In one embodiment, the processor, when executing the computer program, further performs the steps of: sending a heartbeat updating instruction to the new active node server, the heartbeat updating instruction being used to control the new active node server to update the storage heartbeat information, the storage heartbeat being a heartbeat corresponding to the storage server; and monitoring the first heartbeat network, the second heartbeat network and/or the storage network.

In one embodiment, the processor, when executing the computer program, further performs the steps of: sending a heartbeat network connection instruction to the connected node server; the heartbeat network connection instruction is used for controlling the node server to connect the first heartbeat network and the second heartbeat network.

In one embodiment, the processor, when executing the computer program, further performs the steps of: sending a disk partitioning instruction to the storage server; the disk dividing instruction is used for controlling the storage server to divide a specific disk space as an arbitration disk; establishing a storage network according to the arbitration disk, and connecting the storage network; sending a storage network connection instruction to the connected node server; the storage network connection instruction is used for controlling the node server to connect the storage network.

In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of: when the first heartbeat network is detected to be abnormal, determining the state of a second heartbeat network; when the second heartbeat network is determined to be abnormal according to the state of the second heartbeat network, determining the state of a storage network; the storage network is a network corresponding to the storage server and is used for representing the state of the node server; and controlling the node server to perform active-standby switching according to the state of the storage network.

In one embodiment, the computer program when executed by the processor further performs the steps of: when determining that the storage network is abnormal according to the state of the storage network, judging that the current main node server is down, determining a new main node server and performing main-standby switching; when the storage network is determined to be normal according to the state of the storage network, judging that the heartbeat network is abnormal; and unloading the application service of the current main node server, determining a new main node server and performing main-standby switching.

In one embodiment, the computer program when executed by the processor further performs the steps of: and when the second heartbeat network is determined to be normal according to the state of the second heartbeat network, controlling and repairing the current active node server.

In one embodiment, the computer program when executed by the processor further performs the steps of: sending a repair instruction to the current active node server, the repair instruction being used to control the current active node server to repair its resources; acquiring the repair state of the current active node server; and, when it is determined from the repair state that the repair of the current active node server has failed, unloading the application service of the current active node server, determining a new active node server and performing active-standby switching.

In one embodiment, the computer program when executed by the processor further performs the steps of: selecting a standby node server as a new main node server; and degrading the current main node server into a standby node server, and upgrading the new main node server into a main node server.

In one embodiment, the computer program when executed by the processor further performs the steps of: sending a heartbeat updating instruction to the new active node server, the heartbeat updating instruction being used to control the new active node server to update the storage heartbeat information, the storage heartbeat being a heartbeat corresponding to the storage server; and monitoring the first heartbeat network, the second heartbeat network and/or the storage network.

In one embodiment, the computer program when executed by the processor further performs the steps of: sending a heartbeat network connection instruction to the connected node server; the heartbeat network connection instruction is used for controlling the node server to connect the first heartbeat network and the second heartbeat network.

In one embodiment, the computer program when executed by the processor further performs the steps of: sending a disk partitioning instruction to the storage server; the disk dividing instruction is used for controlling the storage server to divide a specific disk space as an arbitration disk; establishing a storage network according to the arbitration disk, and connecting the storage network; sending a storage network connection instruction to the connected node server; the storage network connection instruction is used for controlling the node server to connect the storage network.

It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by a computer program, which is stored in a computer readable storage medium and sold or used as a stand-alone product. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.

It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.

The terms "comprises" and "comprising," and any variations thereof, of embodiments of the present invention are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or (module) elements is not limited to only those steps or elements but may alternatively include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.

The above-described examples merely represent several embodiments of the present invention and should not be construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims

1. A method for realizing a high-availability cluster based on an arbitration disk and a double link is characterized by comprising the following steps:

the control server is connected with the first heartbeat network and the second heartbeat network and sends a heartbeat network connection instruction to the connected node server; the heartbeat network connection instruction is used for controlling the node server to connect the first heartbeat network and the second heartbeat network;

when the first heartbeat network is detected to be abnormal, the control server determines the state of a second heartbeat network;

when the second heartbeat network is determined to be abnormal according to the state of the second heartbeat network, the control server determines the state of a storage network; the storage network is a network corresponding to the storage server and is used for representing the state of the node server;

when the second heartbeat network is determined to be normal according to the state of the second heartbeat network, the control server controls and repairs the current main node server;

the control server controls the node server to carry out active-standby switching according to the state of the storage network; the master-slave switching comprises switching between a master node server and a plurality of standby node servers;

the step of controlling the node server to switch between the main node and the standby node by the control server according to the state of the storage network comprises the following steps: when determining that the storage network is abnormal according to the state of the storage network, the control server judges that the current main node server is down, determines a new main node server and performs main-standby switching; when the storage network is determined to be normal according to the state of the storage network, the control server judges that the heartbeat network is abnormal; and unloading the application service of the current main node server, determining a new main node server and performing main-standby switching.

2. The method according to claim 1, wherein the step of controlling and repairing the current active node server includes:

sending a repair instruction to the current main node server; the repair instruction is used for controlling the current main node server to repair resources;

acquiring the repair state of the current main node server;

and when the repair of the current main node server is determined to have failed according to the repair state, unloading the application service of the current main node server, determining a new main node server and performing main-standby switching.

3. The method for implementing a high availability cluster based on an arbitration disk and a dual link according to claim 1 or 2, wherein the step of determining a new active node server and performing active-standby switching comprises:

selecting a standby node server as a new main node server;

and degrading the current main node server into a standby node server, and upgrading the new main node server into a main node server.

4. The method for implementing a high availability cluster based on an arbitration disk and a dual link according to claim 3, wherein after the step of upgrading the new active node server to an active node server, the method further comprises:

sending a heartbeat updating instruction to the new main node server; the heartbeat updating instruction is used for controlling the new main node server to update the storage heartbeat information; the storage heartbeat is a heartbeat corresponding to the storage server;

the first heartbeat network, the second heartbeat network and/or the storage network are monitored.

5. The method for implementing a high availability cluster based on an arbitration disk and a dual link according to claim 1, 2 or 4, wherein the step of determining the state of the second heartbeat network when the first heartbeat network is detected to be abnormal is preceded by the steps of:

sending a disk partitioning instruction to the storage server; the disk dividing instruction is used for controlling the storage server to divide a specific disk space as an arbitration disk; establishing a storage network according to the arbitration disk, and connecting the storage network; sending a storage network connection instruction to the connected node server; the storage network connection instruction is used for controlling the node server to connect the storage network.

6. A high availability cluster system based on an arbitration disk and a dual link, comprising:

the heartbeat network connection module is used for the control server to connect to the first heartbeat network and the second heartbeat network and to send a heartbeat network connection instruction to the connected node server; the heartbeat network connection instruction is used for controlling the node server to connect the first heartbeat network and the second heartbeat network;

the heartbeat network state determining module is used for determining the state of the second heartbeat network by the control server when the first heartbeat network is detected to be abnormal;

the storage network state determining module is used for determining the state of the storage network by the control server when the second heartbeat network is determined to be abnormal according to the state of the second heartbeat network; the storage network is a network corresponding to the storage server and is used for representing the state of the node server;

the repair module is used by the control server to repair the current main node server when the second heartbeat network is determined to be normal according to the state of the second heartbeat network;

the active-standby switching module is used by the control server to control the node server to perform active-standby switching according to the state of the storage network; the active-standby switching comprises switching between a main node server and a plurality of standby node servers;

the active-standby switching module comprises: a first switching module, used by the control server, when the storage network is determined to be abnormal according to the state of the storage network, to determine that the current main node server is down, determine a new main node server and perform active-standby switching; and a second switching module, used by the control server, when the storage network is determined to be normal according to the state of the storage network, to determine that the heartbeat network is abnormal, unload the application service of the current main node server, determine a new main node server and perform active-standby switching.
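
As an illustration only, the decision order recited in claim 6 can be sketched in Python. The names `Action` and `decide` and the boolean inputs are hypothetical and do not appear in the claims; the sketch assumes each network state has already been reduced to a normal/abnormal flag, and it assumes that no action is taken while the first heartbeat network is normal, a case the claim does not address.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"                                  # assumption: nothing to do
    REPAIR_MAIN = "repair_main"                    # repair module
    FAILOVER_MAIN_DOWN = "failover_main_down"      # first switching module
    UNLOAD_THEN_FAILOVER = "unload_then_failover"  # second switching module


def decide(first_hb_ok: bool, second_hb_ok: bool, storage_ok: bool) -> Action:
    """Map the three network states to the action taken by the control server."""
    if first_hb_ok:
        # Assumption: the claim only describes the abnormal case.
        return Action.NONE
    if second_hb_ok:
        # Second heartbeat network is normal: repair the current main node server.
        return Action.REPAIR_MAIN
    if not storage_ok:
        # Both heartbeat networks and the storage network are abnormal:
        # the current main node server is judged to be down.
        return Action.FAILOVER_MAIN_DOWN
    # Storage network is normal, so only the heartbeat networks are abnormal:
    # unload the application service before switching.
    return Action.UNLOAD_THEN_FAILOVER
```

The storage network is consulted only after both heartbeat networks have been found abnormal, which is what lets the control server distinguish a failed node server from a failed heartbeat link.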

7. The high availability cluster system based on an arbitration disk and a dual link according to claim 6, wherein the repair module comprises:

a repair instruction sending submodule, configured to send a repair instruction to the current main node server; the repair instruction is used for controlling the current main node server to repair resources; a repair state obtaining submodule, configured to obtain a repair state of the current main node server; and an active-standby switching submodule, configured to unload the application service of the current main node server, determine a new main node server and perform active-standby switching when it is determined according to the repair state that the repair of the current main node server has failed.
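
The interplay of the three submodules in claim 7 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function `repair_or_switch`, its callable parameters, and the state string `"failed"` are all hypothetical, and the claims do not specify how the repair state is encoded or polled.

```python
from typing import Callable


def repair_or_switch(
    send_repair: Callable[[], None],
    get_repair_state: Callable[[], str],
    unload_services: Callable[[], None],
    switch: Callable[[], None],
) -> str:
    """Try to repair the current main node server; switch over only if repair fails."""
    # Repair instruction sending submodule.
    send_repair()
    # Repair state obtaining submodule (assumed to return "failed" on failure).
    if get_repair_state() == "failed":
        # Active-standby switching submodule: unload first, then switch.
        unload_services()
        switch()
        return "switched"
    # Any other state leaves the current main node server in place.
    return "kept"
```

In this sketch the application service is unloaded before the switch, following the order in which the claim lists the two operations.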

8. The high availability cluster system based on an arbitration disk and a dual link according to claim 6 or 7, further comprising:

a server selection module, used for selecting one standby node server as the new main node server; and the active-standby switching module is further used for demoting the current main node server to a standby node server and promoting the new main node server to the main node server.
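
A minimal Python sketch of the role change in claim 8 is given below. The function `switch_roles`, the role strings `"main"` and `"standby"`, and the dictionary representation are hypothetical. The claim does not say how the standby node server is selected; picking the first standby in name order is purely an assumption made so the example is deterministic.

```python
from typing import Dict


def switch_roles(roles: Dict[str, str]) -> Dict[str, str]:
    """Return a new role map with one standby promoted and the current main demoted."""
    # Server selection: assumed policy is "first standby in sorted name order".
    standbys = sorted(name for name, role in roles.items() if role == "standby")
    if not standbys:
        raise ValueError("no standby node server available")
    new_main = standbys[0]

    updated = {}
    for name, role in roles.items():
        if role == "main":
            updated[name] = "standby"   # demote the current main node server
        elif name == new_main:
            updated[name] = "main"      # promote the selected standby node server
        else:
            updated[name] = role        # other standby node servers are unchanged
    return updated
```

The input map is left unmodified, so a caller could compare the old and new role assignments before applying the switch.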

9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method of any of claims 1 to 5 are implemented by the processor when executing the computer program.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 5.