CN112181660A - High-availability method based on server cluster

Info

Publication number: CN112181660A
Application number: CN202011083292.1A
Authority: CN
Inventors: 赵博颖; 申玉京; 谭智敏; 詹少博
Original assignee: Beijing Institute of Computer Technology and Applications
Current assignee: Beijing Institute of Computer Technology and Applications
Priority date: 2020-10-12
Filing date: 2020-10-12
Publication date: 2021-01-05

Abstract

The invention relates to a high-availability method based on a server cluster, comprising the following steps: when a node in the server cluster fails and emergency switching is performed, a spare node in the current cluster is first selected as the takeover node; the takeover node informs the failed node to stop all application services and release its resources; after obtaining the released resources, the takeover node starts the application services that were running on the failed node, wherein the released resources comprise shared storage devices and IP addresses, so that conflicts caused by the failed node and the takeover node using them simultaneously are avoided; when a fault occurs, the failed node switches the database instance and the routing of service access to the database to the takeover node in real time; when the fault is removed, the original failed node is added back into the available sequence and the data is backed up to the takeover node in real time, so that the database data is completely self-healed.

Description

High-availability method based on server cluster

Technical Field

The invention relates to the technical field of server clusters, in particular to a high-availability implementation method based on a server cluster.

Background

Once a node in a server cluster that processes core tasks fails, the data link may be broken and information may be lost, with potentially catastrophic results. To ensure that the platform provides uninterrupted service and to guarantee information security, the invention provides a high-availability management method. The method is mainly used for the cooperative management of all computing nodes in the platform and for resolving system faults caused by the failure of a single computing or control unit. The method monitors key services and the running state of each node, recovers from service faults and, where necessary, migrates services between the main node and the standby node using the redundant computing and control units, thereby improving the stability, availability and load balancing capability of the service system and its tolerance of software and hardware faults.

Disclosure of Invention

The invention aims to provide a high-availability method based on a server cluster, addressing the urgent need, in computing and many other fields, for service systems that are highly available and run without interruption.

According to an embodiment of the server cluster-based high availability method of the present invention, the real-time status information of a node or application is periodically transmitted to all other nodes over the heartbeat network as a heartbeat signal, and if the other nodes do not receive the heartbeat signal of a node within a certain time, that node is considered to have failed.

An embodiment of the server cluster-based high availability method according to the present invention includes three node state detection mechanisms: a ping mechanism for checking communication state, a register mechanism for reporting resource state, and a health check mechanism whose scripts can be customized by the user.

According to an embodiment of the server cluster-based high availability method of the present invention, the takeover node restarts the failed node through the STONITH device to release resources.

According to an embodiment of the server cluster-based high availability method of the present invention, the state information in the heartbeat signal, including the application service state, the connectivity of the node to the external network, the operating system state, and the resource occupation condition, is used to determine whether the node is normal and to select the takeover node when the application is switched.

According to an embodiment of the server cluster-based high availability method of the present invention, encryption and authentication are performed when the heartbeat signal is transmitted.

According to one embodiment of the server cluster-based high availability method, when it is determined that one of the nodes providing services has failed or malfunctioned, the applications on the failed node are switched to another node, according to a predetermined policy, so that service continues.

According to an embodiment of the server cluster-based high availability method, the lowest layer of the system is a heartbeat layer, in which all nodes in the server cluster monitor one another in real time, and the heartbeat layer components send heartbeat information and data and report their own working state to the upper layer; the middle layer is an application distribution layer, which is responsible for managing and scheduling the applications running in the system, each action of the application distribution layer being managed by a system application; the top layer is an application layer, in which starting, stopping and monitoring of applications are implemented as shell scripts.

According to an embodiment of the server cluster-based high availability method, the middle layer comprises application management, information base and scheduling policy components; the application management component is responsible for managing and scheduling the applications running on the nodes; the information base stores the cluster configuration, state, nodes, resources and constraints; the scheduling policy component provides the fault migration strategies of the nodes, including a directional strategy and a load balancing strategy.

According to an embodiment of the server cluster-based high availability method of the present invention, application services are monitored through customized application agent scripts.

Aimed at server clusters that place high demands on platform availability and reliability, the invention provides a general high-availability implementation method based on a three-layer system architecture. The method mainly comprises the following steps: constructing a high-availability cluster system architecture; monitoring node faults based on three detection mechanisms; performing node failover based on a resource isolation mechanism; and achieving high availability of data based on a real-time database synchronization mechanism.

Drawings

FIG. 1 is a layered structure diagram of a high availability cluster system;

FIG. 2 is a diagram of a database high availability architecture.

Detailed Description

In order to make the objects, contents, and advantages of the present invention clearer, the following detailed description of the embodiments of the present invention will be made in conjunction with the accompanying drawings and examples.

The invention designs a layered structure for realizing a high-availability system, as shown in FIG. 1. The lowest layer is the heartbeat layer, in which all nodes in the server cluster monitor one another in real time so that a fault in a node or service is detected as early as possible; the components of the heartbeat layer send heartbeat information and data and report their own working state to the upper layer. The second layer is the application distribution layer, which mainly consists of the application management, information base and scheduling policy components and is responsible for managing and scheduling the applications run by the system. Each action of the application distribution layer is managed by a system application, which is the basis for maintaining system information. The third layer is the application layer, which mainly comprises application agents; starting, stopping and monitoring of applications are implemented as shell scripts.
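By way of illustration only, the scheduling policy component of the application distribution layer might be sketched as follows. The disclosure names a directional strategy and a load balancing strategy without specifying them further; the reading of "directional" as a pre-configured order of preferred target nodes, the `NodeInfo` record, and both function names are assumptions made for this sketch and are not part of the invention as filed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NodeInfo:
    """Hypothetical per-node record, as might be kept in the information base."""
    name: str
    healthy: bool
    load: float  # resource occupancy; the 0..1 scale is an assumption


def directional_policy(nodes: List[NodeInfo], preferred_order: List[str]) -> Optional[str]:
    """Return the first healthy node in a pre-configured order of preference."""
    by_name = {n.name: n for n in nodes}
    for name in preferred_order:
        node = by_name.get(name)
        if node is not None and node.healthy:
            return node.name
    return None  # no configured target is currently usable


def load_balancing_policy(nodes: List[NodeInfo]) -> Optional[str]:
    """Return the healthy node with the lowest resource occupancy."""
    candidates = [n for n in nodes if n.healthy]
    if not candidates:
        return None
    return min(candidates, key=lambda n: n.load).name
```

Under these assumptions, the directional strategy gives the operator a predictable migration target, while the load balancing strategy chooses the target from the resource occupancy carried in the heartbeat signal.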

The invention performs fault monitoring through a heartbeat mechanism. At regular intervals, the high-availability system transmits the real-time state information of a node or application as a heartbeat signal to all other nodes over the heartbeat network; if the other nodes do not receive a node's heartbeat signal within a certain time, that node is considered to have failed. To improve the accuracy and speed of fault monitoring, the following optimization measures are provided:

Three node state detection mechanisms are provided: a ping mechanism that checks communication state, a register mechanism that reports resource state, and a health check mechanism whose scripts can be customized by the user. In addition, application services are monitored through custom application agent scripts.

To reduce false alarms, several items of state information are added to the heartbeat signal, including the application service state, the connectivity of the node to the external network, the operating system state and the resource occupancy. These serve as the basis for judging whether a node is normal and for selecting the takeover node when an application is switched.

To secure communication between nodes, mechanisms such as encryption and authentication are applied when heartbeat signals are transmitted, which prevents important data from being stolen, prevents unauthorized nodes from joining the high-availability system, and prevents state information from unauthorized nodes from influencing node switching.
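By way of illustration only, heartbeat authentication and timeout-based fault detection might be sketched as follows. The sketch covers authentication with an HMAC over a shared key and does not show encryption; the message layout, the payload field names, the shared-key arrangement and the `HeartbeatMonitor` class are assumptions made for this sketch and are not specified by the invention.

```python
import hashlib
import hmac
import json


def sign_heartbeat(key: bytes, payload: dict) -> bytes:
    """Serialize a heartbeat payload and prefix it with an HMAC-SHA256 tag."""
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    tag = hmac.new(key, body, hashlib.sha256).hexdigest().encode("ascii")
    return tag + b"." + body


def verify_heartbeat(key: bytes, message: bytes):
    """Return the payload if the tag matches, otherwise None."""
    tag, sep, body = message.partition(b".")
    if not sep:
        return None
    expected = hmac.new(key, body, hashlib.sha256).hexdigest().encode("ascii")
    if not hmac.compare_digest(tag, expected):
        return None  # heartbeat from a node that does not hold the key
    return json.loads(body)


class HeartbeatMonitor:
    """Records when each node was last heard from and reports timeouts."""

    def __init__(self, key: bytes, timeout: float):
        self.key = key
        self.timeout = timeout
        self.last_seen = {}   # node name -> time of last valid heartbeat
        self.last_state = {}  # node name -> last reported state information

    def receive(self, message: bytes, now: float) -> bool:
        """Accept a heartbeat only if it is authentic and names its sender."""
        payload = verify_heartbeat(self.key, message)
        if not isinstance(payload, dict) or "node" not in payload:
            return False
        self.last_seen[payload["node"]] = now
        self.last_state[payload["node"]] = payload
        return True

    def failed_nodes(self, now: float):
        """Nodes whose last valid heartbeat is older than the timeout."""
        return sorted(n for n, t in self.last_seen.items() if now - t > self.timeout)
```

In this sketch a heartbeat signed with the wrong key is discarded before it can update any node state, which corresponds to preventing unauthorized state information from influencing switching. The state information stored in `last_state` could then feed the selection of a takeover node.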

When it is determined that one of the nodes providing services has failed or malfunctioned, the high-availability system automatically and transparently switches the applications on the failed node to another node, according to the established policy, so that service continues. The invention proposes the following improvements to the implementation of failover:

The takeover node first informs the current node to stop all application services and release its resources, and the takeover node can start the services after obtaining the released resources. The released resources mainly comprise shared storage devices, IP addresses and the like, which prevents conflicts arising from both nodes using them simultaneously. The takeover node restarts the failed node through the STONITH device to release resources, wherein the STONITH device is an intelligent power supply device that supplies power to the server node; the power of the node can be controlled by sending a disconnect or reset command to the STONITH device over a serial line or a network cable.

To handle the case where the failed node hangs and cannot release its resources, a resource isolation mechanism is introduced: the takeover node restarts the failed node through the intelligent power supply device so that the resources are released.
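By way of illustration only, the order of operations in a takeover might be sketched as follows. The three callables stand in for the release request, the STONITH reset and the service start; their names and signatures are hypothetical. Refusing to start services when the reset itself fails is an assumption of this sketch, chosen so that shared storage and IP addresses are never used by two nodes at once; the invention does not state what happens in that case.

```python
from typing import Callable, List


def take_over(
    request_release: Callable[[], bool],
    fence: Callable[[], bool],
    start_services: Callable[[], None],
) -> List[str]:
    """Run one takeover and return the steps performed, in order."""
    steps = ["request_release"]
    # Step 1: ask the failed node to stop its services and release resources.
    released = request_release()
    if not released:
        # Step 2: the failed node did not confirm release (it may be hung),
        # so power-cycle it through the STONITH device.
        steps.append("fence")
        if not fence():
            # Assumption: without a confirmed release or reset, starting
            # services could put shared resources in use on two nodes.
            raise RuntimeError("fencing failed; refusing to start services")
    # Step 3: resources are free, so start the services on the takeover node.
    steps.append("start_services")
    start_services()
    return steps
```

A cooperative failed node leads to the sequence `request_release`, `start_services`; a hung one adds the `fence` step in between.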

To meet the requirement for high availability of data, the invention provides a real-time database synchronization mechanism: when service data generated by the service system is written to the database, it is also backed up to other nodes in real time. The database real-time synchronization mechanism is shown in FIG. 2. When a fault occurs, the server node switches the database instance and the routing of service access to the database to the backup node in real time; when the fault is removed, the system automatically adds the recovered node back into the available sequence and at the same time backs up the data to the backup node in real time, so that the data in the database is ultimately completely self-healed.
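By way of illustration only, the route switching and self-healing behaviour might be sketched with an in-memory model. The `ReplicatedStore` class, its method names, the use of dictionaries in place of database instances, and the choice to leave routing on the takeover node after recovery are all assumptions of this sketch. In particular, copying the active node's data to the recovered node is one possible reading of "self-healing" and is not stated by the invention.

```python
class ReplicatedStore:
    """Toy model: every write goes to all available nodes; reads follow the route."""

    def __init__(self, nodes):
        if not nodes:
            raise ValueError("at least one node is required")
        self.data = {n: {} for n in nodes}  # per-node copy of the database
        self.available = list(nodes)        # the "available sequence"
        self.active = self.available[0]     # node that service access is routed to

    def write(self, key, value):
        """Store on the active node and back up to the other available nodes."""
        for node in self.available:
            self.data[node][key] = value

    def read(self, key):
        return self.data[self.active][key]

    def on_failure(self, node):
        """Remove the failed node and, if it was active, switch the route."""
        if node in self.available:
            self.available.remove(node)
        if node == self.active:
            if not self.available:
                raise RuntimeError("no backup node available")
            self.active = self.available[0]

    def on_recovery(self, node):
        """Bring the recovered node up to date, then add it back."""
        if node not in self.available:
            # Assumption: the recovered node catches up from the active node.
            self.data[node] = dict(self.data[self.active])
            self.available.append(node)
```

In this model, writes made while a node is down reach only the remaining nodes, and the recovered node receives them when it rejoins the available sequence.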

The invention also provides several improved strategies addressing common problems in current high-availability systems, such as false alarms, communication security and resource migration: fault monitoring based on three detection mechanisms, failover based on a resource isolation mechanism, and high availability of data based on a real-time database synchronization mechanism.

The above description is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make several modifications and variations without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as falling within the protection scope of the present invention.

Claims

1. A high-availability method based on a server cluster, comprising:

when a node in the server cluster fails and emergency switching is performed, a spare node in the current cluster is first selected as the takeover node; the takeover node informs the failed node to stop all application services and release its resources; after obtaining the released resources, the takeover node starts the application services that were running on the failed node, wherein the released resources comprise shared storage devices and IP addresses, so that conflicts caused by the failed node and the takeover node using them simultaneously are avoided;

when a fault occurs, the failed node switches the database instance and the routing of service access to the database to the takeover node in real time; when the fault is removed, the original failed node is added back into the available sequence and the data is backed up to the takeover node in real time, so that the database data is completely self-healed.

2. The server cluster-based high availability method according to claim 1, wherein the real-time status information of a node or application is periodically transmitted to all other nodes over the heartbeat network as a heartbeat signal, and if the other nodes do not receive the heartbeat signal of a node within a certain time, that node is considered to have failed.

3. The server cluster-based high availability method according to claim 1, comprising three node state detection mechanisms: a ping mechanism for checking communication state, a register mechanism for reporting resource state, and a health check mechanism whose scripts can be customized by the user.

4. The server cluster-based high availability method of claim 1, wherein the takeover node reboots the failed node through a STONITH device to release resources.

5. The server cluster-based high availability method of claim 1, wherein the state information in the heartbeat signal, including application service state, connectivity of the node to external networks, operating system state, and resource occupancy, is used to determine whether the node is normal and to select a takeover node when the application is handed over.

6. The server cluster-based high availability method of claim 1, wherein encryption and authentication are performed when the heartbeat signal is transmitted.

7. The server cluster-based high availability method of claim 1, wherein when it is determined that one of the nodes providing services has failed or malfunctioned, the applications on the failed node are switched to another node, according to a predetermined policy, so that service continues.

8. The server cluster-based high availability method according to claim 1, wherein the lowest layer of the system is a heartbeat layer, in which the nodes in the server cluster monitor one another in real time, and the heartbeat layer components send heartbeat information and data and report their own working state to the upper layer; the middle layer is an application distribution layer, which is responsible for managing and scheduling the applications running in the system, each action of the application distribution layer being managed by a system application; the top layer is an application layer, in which starting, stopping and monitoring of applications are implemented as shell scripts.

9. The server cluster-based high availability method of claim 8, wherein the middle layer comprises application management, information base and scheduling policy components; the application management component is responsible for managing and scheduling the applications running on the nodes; the information base stores the cluster configuration, state, nodes, resources and constraints; the scheduling policy component provides the fault migration strategies of the nodes, including a directional strategy and a load balancing strategy.

10. The server cluster-based high availability method of claim 1, wherein application services are monitored through customized application agent scripts.