CN112181660A - High-availability method based on server cluster - Google Patents

High-availability method based on server cluster

Info

Publication number
CN112181660A
Authority
CN
China
Prior art keywords
node
server cluster
application
takeover
fault
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011083292.1A
Other languages
Chinese (zh)
Inventor
赵博颖
申玉京
谭智敏
詹少博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Computer Technology and Applications
Original Assignee
Beijing Institute of Computer Technology and Applications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Computer Technology and Applications
Priority to CN202011083292.1A
Publication of CN112181660A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5022 Mechanisms to release resources
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3006 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083 Techniques for rebalancing the load in a distributed system
    • G06F9/5088 Techniques for rebalancing the load in a distributed system involving task migration

Abstract

The invention relates to a high-availability method based on a server cluster, comprising the following steps. When a node in the server cluster fails and emergency switching is performed, a spare node in the current cluster is first selected as the takeover node. The takeover node notifies the failed node to stop all application services and release its resources and, after obtaining the released resources, starts the application services that were running on the failed node. The released resources include shared storage devices and IP addresses, which avoids conflicts caused by the failed node and the takeover node using them at the same time. When a fault occurs, the failed node switches the database instance, and the service route through which services access the database, to the takeover node in real time. When the fault is cleared, the original failed node is added back into the available sequence and the data is backed up to the takeover node in real time, so that the database data fully self-heals.

Description

High-availability method based on server cluster
Technical Field
The invention relates to the technical field of server clusters, in particular to a high-availability implementation method based on a server cluster.
Background
If a node in a server cluster that processes core tasks fails, the data link may be broken and information may be lost, with potentially catastrophic results. To ensure that the platform provides uninterrupted service and to safeguard information security, the invention provides a high-availability management method. The method is mainly used for the cooperative management of all computing nodes in the platform and for resolving system faults caused by the failure of a single computing or control unit. It monitors key services and the running state of each node, recovers from service faults, and, where necessary, migrates services between primary and standby nodes using the redundant computing and control units, thereby improving the stability, availability, and load-balancing capability of the service system and its tolerance of software and hardware faults.
Disclosure of Invention
The invention aims to provide a high-availability method based on a server cluster, addressing the need, now urgent in many fields such as computing, for service systems that are highly available and run without interruption.
The invention relates to a high-availability method based on a server cluster, comprising the following steps. When a node in the server cluster fails and emergency switching is performed, a spare node in the current cluster is first selected as the takeover node. The takeover node notifies the failed node to stop all application services and release its resources and, after obtaining the released resources, starts the application services that were running on the failed node. The released resources include shared storage devices and IP addresses, which avoids conflicts caused by the failed node and the takeover node using them at the same time. When a fault occurs, the failed node switches the database instance, and the service route through which services access the database, to the takeover node in real time. When the fault is cleared, the original failed node is added back into the available sequence and the data is backed up to the takeover node in real time, so that the database data fully self-heals.
According to an embodiment of the server cluster-based high-availability method of the present invention, the real-time status information of a node or application is periodically transmitted to all nodes over the heartbeat network as a heartbeat signal, and if the other nodes do not receive a node's heartbeat signal within a set time, that node is considered to have failed.
According to an embodiment of the server cluster-based high-availability method of the present invention, three node state detection mechanisms are provided: a ping mechanism for checking communication state, a register mechanism for reporting resource state, and a health check mechanism whose scripts can be customized by the user.
According to an embodiment of the server cluster-based high-availability method of the present invention, the takeover node restarts the failed node through the STONITH device to release resources.
According to an embodiment of the server cluster-based high-availability method of the present invention, the state information in the heartbeat signal, including the application service state, the node's connectivity to the external network, the operating system state, and resource usage, is used to determine whether the node is normal and to select the takeover node when an application is switched.
According to an embodiment of the server cluster-based high-availability method of the present invention, encryption and authentication are applied when the heartbeat signal is transmitted.
According to an embodiment of the server cluster-based high-availability method of the present invention, when it is determined that one of the nodes providing service has failed or become unavailable, the applications on the failed node are switched to another node, which continues providing service according to a predetermined policy.
According to an embodiment of the server cluster-based high-availability method of the present invention, the lowest layer of the system is a heartbeat layer, in which all nodes in the server cluster monitor one another in real time and the heartbeat layer components send heartbeat information and data to publish their own working state to the upper layer; the middle layer is an application distribution layer, responsible for managing and scheduling the applications running in the system, and each of its actions is managed through the system application; the top layer is an application layer, in which starting, stopping, and monitoring of applications are implemented through shell scripts.
According to an embodiment of the server cluster-based high-availability method of the present invention, the middle layer comprises application management, information reference, and scheduling policy components. Application management is responsible for managing and scheduling the applications running on the nodes; the information reference stores the cluster configuration, state, nodes, resources, and constraints; the scheduling policy provides the failover strategies for nodes, including a directional strategy and a load-balancing strategy.
According to an embodiment of the invention, the high availability method based on the server cluster is characterized in that the application service is monitored through a customized application proxy script.
Aimed at server clusters with high requirements on platform availability and reliability, the invention provides a general high-availability implementation method based on a three-layer system architecture. The method mainly comprises the following steps: constructing a high-availability cluster system architecture; implementing node fault monitoring based on three detection mechanisms; implementing node failover based on a resource isolation mechanism; and implementing high availability of data based on a real-time database synchronization mechanism.
Drawings
FIG. 1 is a layered structure diagram of a high availability cluster system;
FIG. 2 is a diagram of a database high availability architecture.
Detailed Description
In order to make the objects, contents, and advantages of the present invention clearer, the following detailed description of the embodiments of the present invention will be made in conjunction with the accompanying drawings and examples.
The invention designs a layered structure for implementing a high-availability system, as shown in FIG. 1. The lowest layer is the heartbeat layer: all nodes in the server cluster monitor one another in real time so that the fault state of a node or service is detected as early as possible, and the components of the heartbeat layer send heartbeat information and related data to publish their own working state to the upper layer. The second layer is the application distribution layer, which consists mainly of the application management, information reference, and scheduling policy components and is responsible for managing and scheduling the applications run by the system. Each action of the application distribution layer is managed by the system application, which is the basis for maintaining system information. The third layer is the application layer, which consists mainly of application agents; starting, stopping, and monitoring of applications are implemented through shell scripts.
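The scheduling policy component of the application distribution layer can be illustrated with a minimal Python sketch of the two failover strategies named in the disclosure (directional and load balancing). The source does not define either strategy in detail, so the record type `NodeStatus`, the `load` metric, the preference list, and the node names below are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class NodeStatus:
    """Hypothetical per-node record, as the information reference might store it."""
    name: str
    healthy: bool
    load: float  # assumed metric: fraction of resources in use, 0.0 to 1.0


def directional_policy(candidates: List[NodeStatus],
                       preferred: List[str]) -> Optional[NodeStatus]:
    """Pick the first healthy node in a preconfigured preference order."""
    by_name: Dict[str, NodeStatus] = {n.name: n for n in candidates}
    for name in preferred:
        node = by_name.get(name)
        if node is not None and node.healthy:
            return node
    return None  # no configured node is currently usable


def load_balancing_policy(candidates: List[NodeStatus]) -> Optional[NodeStatus]:
    """Pick the healthy node with the lowest reported load."""
    healthy = [n for n in candidates if n.healthy]
    return min(healthy, key=lambda n: n.load) if healthy else None


# Example cluster state (illustrative values only).
nodes = [
    NodeStatus("node-b", healthy=True, load=0.70),
    NodeStatus("node-c", healthy=False, load=0.10),
    NodeStatus("node-d", healthy=True, load=0.25),
]
directional_choice = directional_policy(nodes, preferred=["node-c", "node-b", "node-d"])
balanced_choice = load_balancing_policy(nodes)
```

Under these assumptions the directional strategy skips the unhealthy first preference and falls through to the next configured node, while the load-balancing strategy ignores the preference order and chooses the least-loaded healthy node.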
The invention carries out fault monitoring through a heartbeat mechanism. The high-availability system transmits the real-time state information of the node or the application as a heartbeat signal to all other nodes through a heartbeat network at regular intervals, and if the other nodes do not receive the heartbeat signal of the node within a certain time, the node is considered to be in fault. In order to improve the accuracy and the rapidity of fault monitoring, the following optimization measures are provided:
Three node state detection mechanisms are provided: a ping mechanism that checks communication state, a register mechanism that reports resource state, and a health check mechanism whose scripts can be customized by the user. In addition, application services are monitored through user-defined application proxy scripts.
To reduce false alarms, several items of state information are added to the heartbeat signal, including the application service state, the node's connectivity to the external network, the operating system state, and resource usage. These serve both to judge whether a node is normal and as the basis for selecting the takeover node when an application is switched.
To secure communication between nodes, mechanisms such as encryption and authentication are applied when the heartbeat signal is transmitted. This prevents important data from being stolen, prevents unauthorized nodes from joining the high-availability system, and prevents state information from unauthorized nodes from influencing node switching.
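The heartbeat measures above can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: it shows timeout-based fault detection and message authentication with an HMAC, and leaves out encryption and network transport entirely. The shared key, the 3-second timeout, the message layout, and all names are assumptions; the source only says "a certain time" and "encryption, authentication and the like".

```python
import hashlib
import hmac
import json
from typing import Dict, List, Optional

SHARED_KEY = b"example-cluster-key"  # assumption: one pre-shared key per cluster
TIMEOUT_SECONDS = 3.0                # assumption: the source gives no concrete value


def make_heartbeat(node: str, status: Dict[str, object],
                   key: bytes = SHARED_KEY) -> bytes:
    """Serialize node state and prefix it with an HMAC-SHA256 tag."""
    payload = json.dumps({"node": node, "status": status}, sort_keys=True).encode()
    tag = hmac.new(key, payload, hashlib.sha256).hexdigest().encode()
    return tag + b"." + payload


def verify_heartbeat(message: bytes, key: bytes = SHARED_KEY) -> Optional[dict]:
    """Return the decoded heartbeat, or None if the tag does not match."""
    tag, sep, payload = message.partition(b".")
    if not sep:
        return None
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest().encode()
    if not hmac.compare_digest(tag, expected):
        return None
    return json.loads(payload)


class HeartbeatMonitor:
    """Tracks the last authenticated heartbeat from each node."""

    def __init__(self, timeout: float = TIMEOUT_SECONDS) -> None:
        self.timeout = timeout
        self.last_seen: Dict[str, float] = {}
        self.last_status: Dict[str, dict] = {}

    def receive(self, message: bytes, now: float) -> bool:
        """Record a heartbeat; unauthenticated messages are ignored."""
        decoded = verify_heartbeat(message)
        if decoded is None:
            return False
        self.last_seen[decoded["node"]] = now
        self.last_status[decoded["node"]] = decoded["status"]
        return True

    def failed_nodes(self, now: float) -> List[str]:
        """Nodes whose last heartbeat is older than the timeout."""
        return sorted(n for n, t in self.last_seen.items() if now - t > self.timeout)


# Timestamps are passed in explicitly so the example is deterministic.
state = {"service": "up", "external_network": True, "os": "ok", "cpu": 0.4}
monitor = HeartbeatMonitor()
monitor.receive(make_heartbeat("node-a", state), now=0.0)
monitor.receive(make_heartbeat("node-b", state), now=0.0)
monitor.receive(make_heartbeat("node-b", state), now=2.5)
accepted_forged = monitor.receive(make_heartbeat("node-x", {}, key=b"wrong-key"), now=2.5)
failed = monitor.failed_nodes(now=4.0)
```

In this run `node-a` has been silent for longer than the timeout and is reported as failed, `node-b` is still considered alive, and the heartbeat signed with the wrong key is rejected, so the unauthorized `node-x` never enters the monitor's state.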
When it is determined that one of the nodes providing service has failed or become unavailable, the high-availability system automatically and transparently switches the applications on the failed node to another node, which continues providing service according to the established policy. The present invention proposes the following improvement strategies for implementing failover:
The takeover node first notifies the current node to stop all application services and release its resources; the takeover node can start the services only after obtaining the released resources, which mainly comprise shared storage devices, IP addresses, and the like, so that the two nodes do not conflict by using them at the same time. The takeover node restarts the failed node through the STONITH device to release resources. Here the STONITH device is an intelligent power supply device that powers the server node; the node's power can be controlled by sending a power-off or reset instruction to the STONITH device over a serial line or a network cable.
To prevent the failed node from hanging without releasing its resources, a resource isolation mechanism is introduced: the takeover node restarts the failed node through the intelligent power supply device so that the resources are released.
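The takeover sequence described above can be sketched in Python as follows. The ordering shown, in which the takeover node falls back to fencing only when the failed node does not confirm release, is one reading of the text and is an assumption. The four callables are hypothetical hooks standing in for real cluster operations; no actual STONITH device API is used.

```python
from typing import Callable, List


def take_over(failed_node: str,
              request_release: Callable[[str], bool],
              fence: Callable[[str], bool],
              acquire_resources: Callable[[], None],
              start_services: Callable[[], None]) -> List[str]:
    """Run the takeover sequence and return the steps performed, in order."""
    steps: List[str] = ["request_release"]
    released = request_release(failed_node)  # ask the node to stop services and release
    if not released:
        # The node is hung or unreachable: isolate it by power-cycling (hypothetical hook).
        steps.append("fence")
        if not fence(failed_node):
            # Starting services without a confirmed release risks both nodes
            # using the shared storage and IP address at the same time.
            raise RuntimeError(f"could not release or fence {failed_node}")
    acquire_resources()  # e.g. mount shared storage, bring up the service IP
    steps.append("acquire_resources")
    start_services()
    steps.append("start_services")
    return steps


# Simulated run: the failed node does not answer, so it is fenced first.
log: List[str] = []
steps = take_over(
    "node-a",
    request_release=lambda node: False,
    fence=lambda node: True,
    acquire_resources=lambda: log.append("shared storage and IP acquired"),
    start_services=lambda: log.append("services started"),
)
```

The point of the ordering is that services start only after the resources are known to be free, either because the failed node released them or because it was restarted.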
To meet the requirement for high availability of data, the invention provides a real-time database synchronization mechanism: when the service data generated by the service system is written to the database, it is backed up to the other nodes in real time. The real-time database synchronization mechanism is shown in FIG. 2. When a fault occurs, the server node switches the database instance, and the service route through which services access the database, to the backup node in real time. When the fault is cleared, the system automatically adds the recovered node back into the available sequence and at the same time backs up the data to the backup node in real time, so that the data in the database ultimately fully self-heals.
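The synchronization and self-healing behaviour can be modelled with a toy in-memory Python sketch. It is not a database implementation: each "node" is a dictionary, replication is a synchronous copy, and the class name, method names, and the resynchronize-on-recovery step are assumptions chosen to illustrate the routing switch and the return of a recovered node to the available sequence.

```python
from typing import Dict, List, Optional


class ReplicatedStore:
    """Toy model: one active node serves requests; every available node holds a copy."""

    def __init__(self, nodes: List[str], active: str) -> None:
        self.data: Dict[str, Dict[str, str]] = {n: {} for n in nodes}
        self.available: List[str] = list(nodes)  # the "available sequence"
        self.active = active                     # node that service routes point to

    def write(self, key: str, value: str) -> None:
        """Store a record and back it up to every available node immediately."""
        for node in self.available:
            self.data[node][key] = value

    def read(self, key: str) -> Optional[str]:
        return self.data[self.active].get(key)

    def fail(self, node: str) -> None:
        """Remove a node; if it was active, switch the route to a backup node."""
        self.available.remove(node)
        if node == self.active:
            if not self.available:
                raise RuntimeError("no backup node available")
            self.active = self.available[0]

    def recover(self, node: str) -> None:
        """Resynchronize a repaired node from the active copy, then re-add it."""
        self.data[node] = dict(self.data[self.active])
        self.available.append(node)


store = ReplicatedStore(["node-a", "node-b"], active="node-a")
store.write("order-1", "created")
store.fail("node-a")               # route switches to node-b
store.write("order-2", "created")  # node-a misses this write while it is down
store.recover("node-a")            # node-a catches up and rejoins
```

After recovery both copies hold the same records, including the write made while `node-a` was unavailable, which is the self-healing outcome the description aims for.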
Aimed at server clusters with high requirements on platform availability and reliability, the invention provides a general high-availability implementation method based on a three-layer system architecture. The method mainly comprises the following steps: constructing a high-availability cluster system architecture; implementing node fault monitoring based on three detection mechanisms; implementing node failover based on a resource isolation mechanism; and implementing high availability of data based on a real-time database synchronization mechanism.
The invention also provides several improved strategies for problems common in current high-availability systems, such as false alarms, communication security, and resource migration: fault monitoring based on three detection mechanisms, failover based on a resource isolation mechanism, high availability of data based on a real-time database synchronization mechanism, and the like.
The above is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make several modifications and variations without departing from the technical principle of the present invention, and such modifications and variations shall also be regarded as falling within the protection scope of the present invention.

Claims (10)

1. A high-availability method based on a server cluster, comprising:
when a node in the server cluster fails and emergency switching is performed, first selecting a spare node in the current cluster as the takeover node, wherein the takeover node notifies the failed node to stop all application services and release its resources, and the takeover node starts the application services that were running on the failed node after obtaining the released resources, the released resources comprising shared storage devices and IP addresses, so that conflicts caused by the failed node and the takeover node using them at the same time are avoided;
when a fault occurs, the failed node switches the database instance, and the service route through which services access the database, to the takeover node in real time; when the fault is cleared, the original failed node is added back into the available sequence and the data is backed up to the takeover node in real time, so that the database data fully self-heals.
2. The server cluster-based high-availability method according to claim 1, wherein the real-time status information of a node or application is periodically transmitted to all nodes over the heartbeat network as a heartbeat signal, and if the other nodes do not receive a node's heartbeat signal within a set time, that node is considered to have failed.
3. The server cluster-based high-availability method according to claim 1, comprising three node state detection mechanisms: a ping mechanism that checks communication state, a register mechanism that reports resource state, and a health check mechanism whose scripts can be customized by the user.
4. The server cluster-based high availability method of claim 1, wherein the takeover node reboots the failed node through a STONITH device to release resources.
5. The server cluster-based high availability method of claim 1, wherein the state information in the heartbeat signal, including application service state, connectivity of the node to external networks, operating system state, and resource occupancy, is used to determine whether the node is normal and to select a takeover node when the application is handed over.
6. The server cluster-based high-availability method of claim 1, wherein encryption and authentication are applied when the heartbeat signal is transmitted.
7. The server cluster-based high-availability method of claim 1, wherein when it is determined that one of the nodes providing service has failed or become unavailable, the applications on the failed node are switched to another node, which continues providing service according to a predetermined policy.
8. The server cluster-based high-availability method according to claim 1, wherein the lowest layer of the system is a heartbeat layer, in which the nodes in the server cluster monitor one another in real time and the heartbeat layer components send heartbeat information and data to publish their own working state to the upper layer; the middle layer is an application distribution layer, responsible for managing and scheduling the applications running in the system, and each of its actions is managed through the system application; the top layer is an application layer, in which starting, stopping, and monitoring of applications are implemented through shell scripts.
9. The server cluster-based high-availability method of claim 8, wherein the middle layer comprises application management, information reference, and scheduling policy components; the application management is responsible for managing and scheduling the applications running on the nodes; the information reference stores the cluster configuration, state, nodes, resources, and constraints; the scheduling policy provides the failover strategies for nodes, including a directional strategy and a load-balancing strategy.
10. The server cluster-based high availability method of claim 1, wherein application services are monitored through customized application proxy scripts.
CN202011083292.1A 2020-10-12 2020-10-12 High-availability method based on server cluster Pending CN112181660A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011083292.1A CN112181660A (en) 2020-10-12 2020-10-12 High-availability method based on server cluster

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011083292.1A CN112181660A (en) 2020-10-12 2020-10-12 High-availability method based on server cluster

Publications (1)

Publication Number Publication Date
CN112181660A true CN112181660A (en) 2021-01-05

Family

ID=73948161

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011083292.1A Pending CN112181660A (en) 2020-10-12 2020-10-12 High-availability method based on server cluster

Country Status (1)

Country Link
CN (1) CN112181660A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112698992A (en) * 2021-03-23 2021-04-23 腾讯科技(深圳)有限公司 Disaster recovery management method and related device for cloud cluster
CN113315653A (en) * 2021-04-30 2021-08-27 新华三大数据技术有限公司 Network equipment management method and device, network equipment and computer equipment
CN113515349A (en) * 2021-07-28 2021-10-19 中国工商银行股份有限公司 High-performance emergency back-switch method and device
CN113596190A (en) * 2021-07-23 2021-11-02 浪潮云信息技术股份公司 Application distributed multi-activity system and method based on Kubernetes
CN114697191A (en) * 2022-03-29 2022-07-01 浪潮云信息技术股份公司 Resource migration method, device, equipment and storage medium
CN114978875A (en) * 2021-02-23 2022-08-30 广州汽车集团股份有限公司 Vehicle-mounted node management method and device and storage medium
CN115134219A (en) * 2022-06-29 2022-09-30 北京飞讯数码科技有限公司 Device resource management method and device, computing device and storage medium
CN115994045A (en) * 2023-02-22 2023-04-21 深圳计算科学研究院 Transaction hosting method and device based on shared storage database cluster
CN117370316A (en) * 2023-12-07 2024-01-09 本原数据(北京)信息技术有限公司 High availability management method and device for database, electronic equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050172161A1 (en) * 2004-01-20 2005-08-04 International Business Machines Corporation Managing failover of J2EE compliant middleware in a high availability system
CN103401712A (en) * 2013-07-31 2013-11-20 北京华易互动科技有限公司 Content distribution based intelligent high-availability task processing method and system
CN103647668A (en) * 2013-12-16 2014-03-19 上海证券交易所 Host group decision system in high availability cluster and switching method for host group decision system
CN106790565A (en) * 2016-12-27 2017-05-31 中国电子科技集团公司第五十二研究所 A kind of network attached storage group system
CN110488701A (en) * 2019-08-20 2019-11-22 北京计算机技术及应用研究所 The High Availabitity heat backup method of network and FlexRay bus based on production domesticization processor
CN110784350A (en) * 2019-10-25 2020-02-11 北京计算机技术及应用研究所 Design method of real-time available cluster management system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
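Ge Jiangwei (葛江伟), Tian Jie (田捷), Cui Weidong (崔伟东): "一种集群环境下高可用的NFS服务器" [A highly available NFS server in a cluster environment], 工业控制计算机 [Industrial Control Computer], no. 08, pages 22 - 24 *
Chen Xiaoquan (陈小全) et al. (eds.): "《Linux服务器架设、性能调优、集群管理教程》" [Linux Server Setup, Performance Tuning, and Cluster Management Tutorial], vol. 1, 30 April 2011, 北京邮电大学出版社 [Beijing University of Posts and Telecommunications Press], pages: 301 - 304 *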

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114978875A (en) * 2021-02-23 2022-08-30 广州汽车集团股份有限公司 Vehicle-mounted node management method and device and storage medium
CN112698992A (en) * 2021-03-23 2021-04-23 腾讯科技(深圳)有限公司 Disaster recovery management method and related device for cloud cluster
CN113315653A (en) * 2021-04-30 2021-08-27 新华三大数据技术有限公司 Network equipment management method and device, network equipment and computer equipment
CN113315653B (en) * 2021-04-30 2022-07-12 新华三大数据技术有限公司 Network equipment management method and device, network equipment and computer equipment
CN113596190A (en) * 2021-07-23 2021-11-02 浪潮云信息技术股份公司 Application distributed multi-activity system and method based on Kubernetes
CN113515349A (en) * 2021-07-28 2021-10-19 中国工商银行股份有限公司 High-performance emergency back-switch method and device
CN114697191A (en) * 2022-03-29 2022-07-01 浪潮云信息技术股份公司 Resource migration method, device, equipment and storage medium
CN115134219A (en) * 2022-06-29 2022-09-30 北京飞讯数码科技有限公司 Device resource management method and device, computing device and storage medium
CN115994045A (en) * 2023-02-22 2023-04-21 深圳计算科学研究院 Transaction hosting method and device based on shared storage database cluster
CN117370316A (en) * 2023-12-07 2024-01-09 本原数据(北京)信息技术有限公司 High availability management method and device for database, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN112181660A (en) High-availability method based on server cluster
EP3210367B1 (en) System and method for disaster recovery of cloud applications
CN102916825A (en) Management equipment of dual-computer hot standby system, management method and dual-computer hot standby system
CN105302661A (en) System and method for implementing virtualization management platform high availability
CN101179432A (en) Method of implementing high availability of system in multi-machine surroundings
CN110830283B (en) Fault detection method, device, equipment and system
CN107508694B (en) Node management method and node equipment in cluster
CN103036719A (en) Cross-regional service disaster method and device based on main cluster servers
CN103490914A (en) Switching system and switching method for multi-machine hot standby of network application equipment
CN111581287A (en) Control method, system and storage medium for database management
CN113127270A (en) Cloud computing-based 2-out-of-3 safety computer platform
CN113515408A (en) Data disaster tolerance method, device, equipment and medium
CN111385134B (en) Access device dynamic migration method and device access platform
US8370897B1 (en) Configurable redundant security device failover
CN112910751A (en) Method and device for detecting and recovering abnormity of VPN (virtual private network) equipment
CN106027313B (en) Network link disaster tolerance system and method
JP5285044B2 (en) Cluster system recovery method, server, and program
CN101119242B (en) Communication system cluster method, device and cluster service system applying the same
CN114598594B (en) Method, system, medium and equipment for processing application faults under multiple clusters
CN114124803B (en) Device management method and device, electronic device and storage medium
CN111367711A (en) Safety disaster recovery method based on super fusion data
CN114301763A (en) Distributed cluster fault processing method and system, electronic device and storage medium
CN114328033A (en) Method and device for keeping service configuration consistency of high-availability equipment group
US11954509B2 (en) Service continuation system and service continuation method between active and standby virtual servers
CN115408199A (en) Disaster tolerance processing method and device for edge computing node

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination