WO2016202051A1 - Method and device for managing active and backup nodes in communication system and high-availability cluster - Google Patents

Info

Publication number
WO2016202051A1
Authority
WO
WIPO (PCT)
Prior art keywords
node
standby
active
primary
service
Prior art date
Application number
PCT/CN2016/078490
Other languages
French (fr)
Chinese (zh)
Inventor
白涛
陈河堆
Original Assignee
中兴通讯股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司 (ZTE Corporation)
Publication of WO2016202051A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H04L41/0668 Management of faults, events, alarms or notifications using network fault recovery by dynamic selection of recovery network elements, e.g. replacement by the most appropriate element after failure

Definitions

  • This document relates to, but is not limited to, the field of communications, and in particular, to a method and apparatus for managing active and standby nodes and a highly available cluster in a communication system.
  • typically, a cluster is composed of two servers, a primary server and a standby server, and under normal conditions the primary server provides services externally.
  • when the primary server fails, the secondary server (also called the standby server) takes over and continues to provide services.
  • the fault detection and takeover process between the primary and standby servers is a technical difficulty.
  • the typical method relies on third-party arbitration: the primary and standby servers regularly report their status to an arbitrator, and the arbitrator determines whether the conditions for triggering the fault takeover process have been met.
  • however, if the arbitrator itself fails, the fault takeover cannot be completed normally. Therefore, a new active/standby management mechanism is needed to manage the primary and standby servers.
  • the embodiments of the present invention provide a method and device for managing active and standby nodes in a communication system, and a high-availability cluster, which can provide a new management mode for active/standby switchover.
  • a method for managing active and standby nodes in a communication system comprising:
  • the standby node detects whether the active node is working normally;
  • after detecting that the active node is not working normally, the standby node triggers execution of the active/standby switching operation.
  • the standby node detects whether the active node is working normally, including:
  • the standby node detects, through the link between the active node and the standby node, whether a heartbeat message from the active node can be received;
  • the method further includes:
  • after detecting that the active node is not working normally, and before performing the active/standby switching operation, the standby node continues to detect whether the active node is working normally within a preset waiting time;
  • if the active node does not resume normal operation within the waiting time, the standby node performs the active/standby switching operation.
  • the method further includes:
  • if the active node resumes normal operation within the waiting time, the standby node forwards the received service requests to the active node.
  • the method further includes:
  • within the waiting time, if the standby node receives a service request sent by a client, the standby node sends the client a service response corresponding to the service request, where the service response includes information indicating that the service for processing the service request is currently unavailable.
  • a device for managing active and standby nodes in a communication system comprising:
  • the detection module is configured to detect whether the active node is working normally
  • the control module is configured to trigger execution of the active/standby switching operation after detecting that the active node is not working normally.
  • the detection module includes:
  • a first detecting unit configured to detect whether a heartbeat message from the active node can be received through a link between the primary node and the standby node;
  • a determining unit is configured to determine that the primary node has failed if a heartbeat message from the primary node is not received through the link.
  • the second detecting unit is configured to, after detecting that the active node is not working normally and before the active/standby switching operation is performed, continue to detect whether the active node is working normally within a preset waiting time;
  • the switching module is configured to perform an active/standby switching operation if the primary node does not resume normal operation within the waiting time.
  • the device further comprises:
  • the first sending unit is configured to forward the received service request to the active node if the primary node resumes normal operation within the waiting time.
  • the device further comprises:
  • a second sending unit configured to, if a service request sent by a client is received within the waiting time, send the client a service response corresponding to the service request, where the service response includes information indicating that the service for processing the service request is currently unavailable.
  • a high-availability cluster comprising a first node and a second node, the second node including any of the devices described above.
  • the first node is further configured to, if the first node resumes working after the second node has become the active node through an active/standby switchover, initiate a state switch request to the second node, and, after receiving the consent message of the second node, perform the operation of the first node becoming the active node.
  • the embodiment provided by the present invention completes fault detection and takeover between the active and standby nodes by the standby node without relying on third-party arbitration, and provides a new management mode for active/standby switchover, thereby achieving the purpose of providing high-availability services externally.
  • FIG. 1 is a flowchart of a method for managing active and standby nodes in a communication system according to an embodiment of the present invention;
  • FIG. 2 is a flowchart of the method performed by the client in implementing the method for managing active and standby nodes according to an embodiment of the present invention;
  • FIG. 3 is a flowchart of the method performed by the primary server in implementing the method for managing active and standby nodes according to an embodiment of the present invention;
  • FIG. 4 is a flowchart of the method performed by the standby server in implementing the method for managing active and standby nodes according to an embodiment of the present invention;
  • FIG. 5 is a structural diagram of an apparatus for managing active and standby nodes in a communication system according to an embodiment of the present invention.
  • FIG. 1 is a flowchart of a method for managing active and standby nodes in a communication system according to an embodiment of the present invention. The method shown in Figure 1 includes:
  • Step 101: The standby node detects whether the active node is working normally.
  • Step 102: After detecting that the active node is not working normally, the standby node triggers execution of an active/standby switchover operation.
  • the method embodiment provided by the present invention completes fault detection and takeover between the active and standby nodes by the standby node without relying on third-party arbitration, and provides a new management mode for active/standby switchover, thereby achieving the purpose of providing high-availability services externally.
  • the active node may proactively initiate a connection request to the standby node. After the link is successfully established, the active node sends a state switch request message, indicating that it requests to switch to the active state. Because the standby node is itself in the inactive state at this time, it considers that the active node can switch to the active state immediately, and replies with a response agreeing to the switch. After receiving the response, the active node sets its own service state to active and starts to provide services externally.
  • the standby node detects whether the active node is working normally, including:
  • the standby node detects, through the link between the active node and the standby node, whether a heartbeat message from the active node can be received;
  • the message from the active node may be sent by the active node on its own initiative or may be a response message to a message sent by the standby node.
  • the method further includes:
  • after detecting that the active node is not working normally, and before performing the active/standby switching operation, the standby node continues to detect whether the active node is working normally within a preset waiting time;
  • if the active node does not resume normal operation within the waiting time, the standby node performs the active/standby switching operation.
  • this gives the active node a period of time to resolve its own fault, which reduces the likelihood that service processing has to be migrated, preserves the progress of data processing, and improves the stability of the system.
  • within the waiting time, if the standby node receives a service request sent by a client, the standby node sends the client a service response corresponding to the service request, where the service response includes information indicating that the service for processing the service request is currently unavailable.
  • if the active node resumes normal operation within the waiting time, the standby node forwards the received service requests to the active node.
  • a network communication system includes a primary server, a standby server, and one or more clients, where each client has a communication link with the primary server and with the standby server, and there is a communication link between the primary server and the standby server.
  • the primary server communicates externally through a physical network interface; its external communication includes communication with the standby server and with one or more clients, and the primary server has a unique IP address.
  • the standby server likewise communicates externally through a physical network interface; its external communication includes communication with the primary server and with one or more clients, and the standby server has a unique IP address that differs from that of the primary server. Therefore, if the primary server fails and goes offline, its communication links with the standby server and all clients are disconnected; if the standby server fails and goes offline, its communication links with the primary server and all clients are disconnected.
  • FIG. 2 to FIG. 4 are flowcharts of the methods performed by the client, the primary server, and the standby server, respectively, in the method for managing active and standby nodes.
  • the description of Figures 2 to 4 is as follows:
  • the fault detection and takeover between the primary and standby servers depends on counting the current number of external links and on determining whether the link to the peer server exists.
  • a link mapping table is used to save all external communication link information of the current host.
  • the key can be any identifier that uniquely identifies the communication peer, such as the peer's IP address plus port.
  • the value is the time at which the last heartbeat or heartbeat response message was received.
  • the communication client periodically sends a heartbeat message to the communication server, and the communication server returns a heartbeat response message to the communication client.
  • after receiving the heartbeat message, the communication server considers that a link has been established, adds a record to its link mapping table, and increases its link count by one.
  • after receiving the heartbeat response message, the communication client likewise considers that a link has been successfully established, adds a record to its link mapping table, and increases its link count by one.
  • if the communication server does not receive a heartbeat message from the same communication client for a certain (configurable) period of time, it considers that the link has been disconnected, removes the record from its link mapping table, and decreases its link count by one. Similarly, if the communication client does not receive a heartbeat response message from the communication server for a certain (configurable) period of time, it considers that the link has been disconnected, removes the record from its link mapping table, and decreases its link count by one.
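The link mapping table described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the class name `LinkTable`, its method names, and the `timeout_s` parameter are hypothetical, and timestamps are passed in explicitly so that the expiry rule is easy to follow.

```python
import time
from typing import Dict, List, Optional


class LinkTable:
    """Hypothetical link mapping table: key is the peer "ip:port",
    value is the time the last heartbeat (or heartbeat response) arrived."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s  # assumed name for the configurable expiry time
        self._last_seen: Dict[str, float] = {}

    def on_heartbeat(self, peer: str, now: Optional[float] = None) -> None:
        """Record a heartbeat from peer; a new peer adds one link to the table."""
        self._last_seen[peer] = time.monotonic() if now is None else now

    def expire(self, now: Optional[float] = None) -> List[str]:
        """Remove links that have been silent longer than timeout_s; return them."""
        now = time.monotonic() if now is None else now
        dead = [p for p, t in self._last_seen.items() if now - t > self.timeout_s]
        for peer in dead:
            del self._last_seen[peer]
        return dead

    def has_link(self, peer: str) -> bool:
        """True while the link to peer is still considered established."""
        return peer in self._last_seen

    @property
    def link_count(self) -> int:
        """Current number of external links."""
        return len(self._last_seen)
```

Calling `expire` periodically keeps `link_count` and `has_link` in step with the two quantities the text relies on: the number of external links and whether the peer server's link still exists.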
  • for convenience of explanation, the following parameters are set for each of the three roles involved in the solution: the client, the primary server, and the standby server.
  • the main server :
  • the client sends a service request message to the primary (standby) server, and the primary (standby) server returns a response message.
  • the primary server sends a status switch request message to the standby server, and the standby server returns a response message.
  • the above two response message formats should include an error code.
  • the response message format is: error code + response message content, and the error code is mainly used to determine whether the request operation is successfully processed, and whether the request needs to be resent.
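The "error code + response message content" format can be illustrated with a small encoder and decoder. The source does not specify a wire layout or error code values, so everything here is an assumption made for illustration: the 2-byte big-endian code, the `ErrorCode` values, and the names `Response` and `needs_resend` are all hypothetical.

```python
import struct
from dataclasses import dataclass
from enum import IntEnum


class ErrorCode(IntEnum):
    # Hypothetical values; the source only says that an error code is present.
    OK = 0
    SERVICE_UNAVAILABLE = 1
    SWITCH_REFUSED = 2


@dataclass(frozen=True)
class Response:
    error_code: ErrorCode
    content: bytes

    def encode(self) -> bytes:
        # Assumed layout: 2-byte big-endian error code followed by the content.
        return struct.pack("!H", self.error_code) + self.content

    @classmethod
    def decode(cls, raw: bytes) -> "Response":
        (code,) = struct.unpack("!H", raw[:2])
        return cls(ErrorCode(code), raw[2:])

    @property
    def needs_resend(self) -> bool:
        # Any code other than OK means the request was not processed.
        return self.error_code != ErrorCode.OK
```

Both response types mentioned above (service responses and state switch responses) could share this shape, with the receiver checking `needs_resend` to decide whether to send the request again.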
  • the communication convention between the primary server and the standby server is that the party acting as the communication client initiates the connection request to the other party.
  • the primary server actively initiates a connection request to the standby server, and there is only one communication link between the primary server and the standby server.
  • Step 1: Start the primary server and the standby server. Their initial service state is inactive, and they cannot yet provide services externally.
  • the primary server initiates a connection request to the standby server. After the link is successfully established, the primary server sends a state switch request message, indicating that it requests to switch to the active state. Because the standby server is itself in the inactive state at this time, it considers that the primary server can switch to the active state immediately, and replies with a response agreeing to the switch. After receiving the response, the primary server sets its own service state to active and starts to provide services externally.
  • Step 2: The client sends a specific service message to the primary or standby server and receives a response message.
  • the response message includes an error code, and the error code is used to identify whether the request message is actually processed.
  • client access follows this principle: if the link with the primary server is normal, the request message is sent to the primary server; otherwise, it is sent to the standby server. When the primary or standby server receives a client request while its service state is inactive, it replies to the client with the service-unavailable error code. Unless the links between the client and both the primary and standby servers are disconnected, the client keeps retrying the request message until it receives a different error code, which indicates that the request message has been processed; the result of processing the specific service request can then be parsed from the response message.
  • the retry related logic can be encapsulated into an API for upper layer application calls, and the upper layer application does not need to care about communication details such as retry.
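A retry wrapper of the kind described in Step 2 might look like the sketch below. It is only an illustration under assumptions: `call_with_retry`, the `Sender` callable type, the `SERVICE_UNAVAILABLE` value, and the `max_attempts` bound are hypothetical. The source describes retrying until a different error code arrives and does not mention an attempt limit; the limit is added here so that the example terminates.

```python
import time
from typing import Callable, Optional, Tuple

SERVICE_UNAVAILABLE = 1  # hypothetical error code value

# A sender takes a request and returns (error_code, content),
# or None when the link to that server is down.
Sender = Callable[[bytes], Optional[Tuple[int, bytes]]]


def call_with_retry(request: bytes, primary: Sender, standby: Sender,
                    retry_delay_s: float = 0.5,
                    max_attempts: int = 20) -> Tuple[int, bytes]:
    """Prefer the primary; fall back to the standby; retry while unavailable."""
    for _ in range(max_attempts):
        reply = primary(request)
        if reply is None:
            # Link to the primary is down, so try the standby instead.
            reply = standby(request)
        if reply is None:
            raise ConnectionError("links to both servers are down")
        code, content = reply
        if code != SERVICE_UNAVAILABLE:
            # Any other code means the request was actually processed.
            return code, content
        time.sleep(retry_delay_s)
    raise TimeoutError("service still unavailable after max_attempts")
```

An upper-layer application would call `call_with_retry` once per request and never see the intermediate service-unavailable replies.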
  • Step 3: If the primary server fails and goes offline, its links to the clients and the standby server are disconnected. When the standby server detects that its link with the primary server is disconnected, it immediately starts a (configurable) waiting time and waits for the link with the primary server to recover. If the link recovers within this time, the standby server receives the primary server's state switch request message again and agrees to it directly, and the whole system returns to its original state. If the waiting time elapses and the link with the primary server has still not recovered, the standby server sets its own state to active, completing the failover. During this process, the client first detects that its link with the primary server is unavailable and can only send requests to the standby server; until it switches its state to active, the standby server always replies to the client with the service-unavailable error code.
  • once the standby server has switched to the active state, it returns a response after processing the service request, and the response contains another (non-service-unavailable) error code. If the primary server recovers during the waiting time, the client sends its requests to the primary server again until it receives a response message containing a non-service-unavailable error code.
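The waiting-time behaviour of Step 3 can be sketched as a small state machine on the standby side. This is a simplified illustration, not the patented implementation: the names `StandbyServer`, `wait_s`, and `tick` are hypothetical, and link events are fed in by the caller rather than detected from real heartbeats.

```python
from enum import Enum
from typing import Optional


class State(Enum):
    INACTIVE = "inactive"
    ACTIVE = "active"


class StandbyServer:
    def __init__(self, wait_s: float) -> None:
        self.wait_s = wait_s  # assumed name for the configurable waiting time
        self.state = State.INACTIVE
        self._link_lost_at: Optional[float] = None

    def on_primary_link_lost(self, now: float) -> None:
        """Start the waiting window the first time the link is seen down."""
        if self._link_lost_at is None:
            self._link_lost_at = now

    def on_primary_link_restored(self) -> None:
        """The primary came back within the window: cancel the takeover."""
        self._link_lost_at = None

    def tick(self, now: float) -> None:
        """Take over once the primary has been unreachable for the whole wait."""
        if (self.state is State.INACTIVE
                and self._link_lost_at is not None
                and now - self._link_lost_at >= self.wait_s):
            self.state = State.ACTIVE

    def can_serve(self) -> bool:
        """While False, clients should get the service-unavailable error code."""
        return self.state is State.ACTIVE
```

Because the takeover decision depends only on the standby's own view of the link and a timer, no third-party arbitrator is involved.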
  • Step 4: Suppose the primary server has failed and gone offline, and the standby server has switched to the active state. If the primary server is then repaired and comes back online, it again sends a state switch request message to the standby server. At this time the standby server is in the active state and may be processing client service requests; because the in-progress requests must be completed first, it cannot agree to the switch request immediately and replies that it does not agree. If a new service request is then sent to the standby server, the standby server replies that the service is unavailable. After all current service requests have been processed, the standby server replies to the primary server agreeing to its state switch request.
  • when the primary server first receives a response from the standby server refusing its state switch, it keeps resending the state switch request message until it receives the standby server's consent response. During this process, the client must send new service requests to the primary server; if it receives the service-unavailable error code, it retries until it receives a response containing another error code.
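The hand-back in Step 4 amounts to draining in-flight requests before agreeing to the switch. The sketch below shows one way this could be modelled, under assumptions: the class `ActiveStandby`, its counters, and its method names are hypothetical and stand in for whatever bookkeeping a real server would use.

```python
class ActiveStandby:
    """Hypothetical model of a standby server that is currently active
    and must finish in-flight work before handing service back."""

    def __init__(self) -> None:
        self.active = True
        self.handing_back = False
        self.in_flight = 0

    def on_service_request(self) -> bool:
        """Return True if the request is accepted.
        False corresponds to replying 'service unavailable'."""
        if not self.active or self.handing_back:
            return False
        self.in_flight += 1
        return True

    def on_request_done(self) -> None:
        """Mark one in-flight request as completed."""
        self.in_flight -= 1

    def on_switch_request(self) -> bool:
        """Return True to agree to the primary's state switch request."""
        self.handing_back = True  # stop admitting new requests from now on
        if self.in_flight > 0:
            return False  # still draining, so refuse for now
        self.active = False
        return True
```

The primary would call `on_switch_request` repeatedly (that is, resend its state switch request) until it returns `True`, matching the resend behaviour described above.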
  • the embodiment of the invention further provides a computer storage medium, wherein the computer storage medium stores computer executable instructions, and the computer executable instructions are used to execute the above method.
  • FIG. 5 is a structural diagram of an apparatus for managing active and standby nodes in a communication system according to an embodiment of the present invention.
  • the device shown in Figure 5 includes:
  • the detecting module 501 is configured to detect whether the active node is working normally
  • the control module 502 is configured to trigger execution of the active/standby switching operation after detecting that the active node is not working normally.
  • the detecting module 501 includes:
  • a first detecting unit configured to detect whether a message from the active node can be received through a link between the primary node and the standby node
  • a determining unit is configured to determine that the primary node has failed if a message from the primary node is not received over the link.
  • the device further comprises:
  • the second detecting unit is configured to, after detecting that the active node is not working normally and before the active/standby switching operation is performed, continue to detect whether the active node is working normally within a preset waiting time;
  • the switching module is configured to perform an active/standby switching operation if the primary node does not resume normal operation within the waiting time.
  • the device further comprises:
  • the first sending unit is configured to forward the received service request to the active node if the primary node resumes normal operation within the waiting time.
  • the device further comprises:
  • a second sending unit configured to, if a service request sent by a client is received within the waiting time, send the client a service response corresponding to the service request, where the service response includes information indicating that the service for processing the service request is currently unavailable.
  • the device embodiment provided by the present invention completes fault detection and takeover between the active and standby nodes by the standby node without relying on third-party arbitration, and provides a new management mode for active/standby switchover, thereby achieving the purpose of providing high-availability services externally.
  • an embodiment of the present invention provides a high-availability cluster, including a first node and a second node, the second node including the apparatus shown in FIG. 5.
  • the first node is configured to, if the first node resumes working after the second node has become the active node through an active/standby switchover, initiate a state switch request to the second node, and, after receiving the consent message of the second node, perform the operation of the first node becoming the active node.
  • the embodiment provided by the present invention completes fault detection and takeover between the active and standby nodes by the standby node without relying on third-party arbitration, and provides a new management mode for active/standby switchover, thereby achieving the purpose of providing high-availability services externally.
  • each module/unit in the above embodiments may be implemented in hardware, for example by an integrated circuit that implements its corresponding function, or as a software functional module, for example by a processor executing a program/instructions stored in a memory to implement its corresponding function.
  • the invention is not limited to any specific form of combination of hardware and software.
  • the foregoing technical solution can implement fault detection and takeover between the active and standby nodes by the standby node without relying on the third-party arbitration, and provide a new management mode of the active/standby switchover, thereby achieving the purpose of providing high-availability services externally.

Landscapes

  • Engineering & Computer Science
  • Computer Networks & Wireless Communication
  • Signal Processing
  • Hardware Redundancy
  • Data Exchanges In Wide-Area Networks

Abstract

A method and device for managing active and backup nodes in a communication system, and a high-availability cluster. The method comprises: a backup node detecting whether an active node is operating normally; and, upon detecting that the active node is not operating normally, the backup node triggering execution of an active-backup switching operation.

Description

Method and device for managing active and standby nodes in a communication system, and high-availability cluster
Technical field
This document relates to, but is not limited to, the field of communications, and in particular to a method and device for managing active and standby nodes in a communication system, and to a high-availability cluster.
Background
In large commercial software systems, stable operation usually requires that the failure of one server must not interrupt the entire service; that is, a single point of failure should be avoided. Typically, a cluster is composed of two servers, a primary server and a standby server. Under normal conditions the primary server provides services externally; when the primary server fails, the secondary server (also called the standby server) takes over and continues to provide services. The fault detection and takeover process between the primary and standby servers is a technical difficulty. At present, the typical method relies on third-party arbitration: the primary and standby servers regularly report their status to an arbitrator, and the arbitrator determines whether the conditions for triggering the fault takeover process have been met. In practice, however, if the arbitrator itself fails, the fault takeover cannot be completed normally. A new active/standby management mechanism is therefore urgently needed to manage the primary and standby servers.
Summary of the invention
The following is an overview of the subject matter described in detail in this document. This overview is not intended to limit the scope of protection of the claims.
The embodiments of the present invention provide a method and device for managing active and standby nodes in a communication system, and a high-availability cluster, which can provide a new management mode for active/standby switchover.
The embodiments of the present invention provide the following technical solutions:
A method for managing active and standby nodes in a communication system, comprising:
the standby node detects whether the active node is working normally;
after detecting that the active node is not working normally, the standby node triggers execution of the active/standby switching operation.
Wherein the standby node detecting whether the active node is working normally includes:
through the link between the active node and the standby node, the standby node detects whether a heartbeat message from the active node can be received;
if no heartbeat message from the active node is received through the link, it is determined that the active node has failed.
Wherein the method further includes:
after detecting that the active node is not working normally, and before performing the active/standby switching operation, the standby node continues to detect whether the active node is working normally within a preset waiting time;
if the active node does not resume normal operation within the waiting time, the standby node performs the active/standby switching operation.
Wherein the method further includes:
if the active node resumes normal operation within the waiting time, the standby node forwards the received service requests to the active node.
Wherein the method further includes:
within the waiting time, if the standby node receives a service request sent by a client, the standby node sends the client a service response corresponding to the service request, where the service response includes information indicating that the service for processing the service request is currently unavailable.
A device for managing active and standby nodes in a communication system, comprising:
a detection module, configured to detect whether the active node is working normally;
a control module, configured to trigger execution of the active/standby switching operation after detecting that the active node is not working normally.
Wherein the detection module includes:
a first detecting unit, configured to detect, through the link between the active node and the standby node, whether a heartbeat message from the active node can be received;
a determining unit, configured to determine that the active node has failed if no heartbeat message from the active node is received through the link.
Wherein the device further includes:
a second detecting unit, configured to, after detecting that the active node is not working normally and before the active/standby switching operation is performed, continue to detect whether the active node is working normally within a preset waiting time;
a switching module, configured to perform the active/standby switching operation if the active node does not resume normal operation within the waiting time.
Wherein the device further includes:
a first sending unit, configured to forward the received service requests to the active node if the active node resumes normal operation within the waiting time.
Wherein the device further includes:
a second sending unit, configured to, if a service request sent by a client is received within the waiting time, send the client a service response corresponding to the service request, where the service response includes information indicating that the service for processing the service request is currently unavailable.
A high-availability cluster, comprising a first node and a second node, the second node including any of the devices described above.
Wherein the first node is further configured to, if the first node resumes working after the second node has become the active node through an active/standby switchover, initiate a state switch request to the second node, and, after receiving the consent message of the second node, perform the operation of the first node becoming the active node.
The embodiments provided by the present invention complete fault detection and takeover between the active and standby nodes by the standby node without relying on third-party arbitration, and provide a new management mode for active/standby switchover, thereby achieving the purpose of providing high-availability services externally.
Other aspects will be apparent upon reading and understanding the drawings and the detailed description.
Brief Description of the Drawings
FIG. 1 is a flowchart of a method for managing primary and standby nodes in a communication system according to an embodiment of the present invention;
FIG. 2 is a flowchart of the method performed by the client in implementing the primary/standby node management method according to an embodiment of the present invention;
FIG. 3 is a flowchart of the method performed by the primary server in implementing the primary/standby node management method according to an embodiment of the present invention;
FIG. 4 is a flowchart of the method performed by the standby server in implementing the primary/standby node management method according to an embodiment of the present invention;
FIG. 5 is a structural diagram of a device for managing primary and standby nodes in a communication system according to an embodiment of the present invention.
Embodiments of the Invention
The present invention is described in further detail below with reference to the drawings and specific embodiments. Note that, provided they do not conflict, the embodiments in this application and the features in those embodiments may be combined with one another arbitrarily.
FIG. 1 is a flowchart of a method for managing primary and standby nodes in a communication system according to an embodiment of the present invention. The method shown in FIG. 1 includes:
Step 101: The standby node detects whether the primary node is working normally.
Step 102: After detecting that the primary node is not working normally, the standby node triggers execution of a primary/standby switchover.
With the method embodiment provided by the present invention, the standby node performs fault detection and takeover between the primary and standby nodes without relying on third-party arbitration. This provides a new way of managing primary/standby switchover and achieves the purpose of providing a highly available service externally.
The method embodiment provided by the present invention is further described below:
After the primary node and the standby node are started, the initial service state of both is inactive, and neither can provide service externally. The primary node may first initiate a connection request to the standby node. After the link is established, the primary node sends a state switch request message indicating that it requests to switch to the active state. Because the standby node is itself in the inactive state at this point, it considers that the primary node may switch to the active state immediately and replies with a response agreeing to the switch. On receiving the response, the primary node sets its own service state to active and begins providing service externally.
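The startup handshake described above can be sketched as a small state machine. This is only an illustrative sketch: the names `ServiceState`, `Node`, `handle_switch_request`, and `request_active` are hypothetical and do not come from the patent, and the network exchange is reduced to a direct method call.

```python
from enum import Enum


class ServiceState(Enum):
    INACTIVE = "inactive"
    ACTIVE = "active"


class Node:
    """A node that starts in the inactive state, as in the startup sequence."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.state = ServiceState.INACTIVE

    def handle_switch_request(self) -> bool:
        """The peer asks to become active; at startup we agree if we are inactive."""
        return self.state is ServiceState.INACTIVE

    def request_active(self, peer: "Node") -> bool:
        """Send a state switch request to the peer and act on its reply."""
        agreed = peer.handle_switch_request()
        if agreed:
            # Only after the peer consents does this node start serving.
            self.state = ServiceState.ACTIVE
        return agreed


primary = Node("primary")
standby = Node("standby")
became_active = primary.request_active(standby)
```

After the exchange the primary is active and the standby remains inactive, which matches the state in which the system begins serving clients.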
The standby node detecting whether the primary node is working normally includes:
the standby node detects, over the link between the primary node and the standby node, whether it can receive a heartbeat message from the primary node;
if no heartbeat message from the primary node is received over the link, the standby node determines that the primary node has failed.
The message from the primary node may be sent by the primary node on its own initiative, or it may be a response to a message sent by the standby node.
As can be seen from the above, using the link between the primary and standby nodes to detect whether the primary node is working normally is simple and convenient to implement.
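A minimal sketch of this link-based detection follows, assuming the standby node treats the primary as failed once no message has arrived for longer than a timeout. The class name `HeartbeatMonitor` and the `timeout` parameter are assumptions for illustration; the patent does not specify how the silence interval is measured. Timestamps are passed in explicitly so the logic stays deterministic; in practice a caller might pass `time.monotonic()`.

```python
class HeartbeatMonitor:
    """Tracks the last message seen from the primary node over the link."""

    def __init__(self, timeout: float, now: float) -> None:
        self.timeout = timeout  # assumed: seconds of silence tolerated
        self.last_seen = now    # the link is considered fresh at creation

    def on_message(self, now: float) -> None:
        """Record a heartbeat, or a response to a message the standby sent."""
        self.last_seen = now

    def primary_failed(self, now: float) -> bool:
        """True once the primary has been silent for longer than the timeout."""
        return now - self.last_seen > self.timeout


monitor = HeartbeatMonitor(timeout=3.0, now=0.0)
monitor.on_message(now=1.0)
still_ok = monitor.primary_failed(now=3.5)   # 2.5 s of silence
has_failed = monitor.primary_failed(now=4.5)  # 3.5 s of silence
```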
The method further includes:
after detecting that the primary node is not working normally and before performing the primary/standby switchover, the standby node continues, within a preset waiting time, to detect whether the primary node is working normally;
if the primary node does not resume normal operation within the waiting time, the standby node performs the primary/standby switchover.
In practice, a node is quite likely to suffer a short-lived fault while running. If the fault can be resolved quickly, there is no need to initiate a primary/standby switchover, which would migrate service processing and delay its progress. Setting a waiting time therefore gives the primary node a period in which to resolve its own fault, which reduces the likelihood of migrating service processing, safeguards the progress of data processing, and improves system stability.
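The waiting time can be sketched as a polling loop that gives the primary node a chance to recover before the standby takes over. The function name `wait_then_decide`, the polling interval, and the string results are hypothetical; the patent only states that the standby keeps detecting during a preset waiting time.

```python
import time
from typing import Callable


def wait_then_decide(
    is_primary_alive: Callable[[], bool],
    waiting_time: float,
    poll_interval: float,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll the primary during the waiting time and decide what to do.

    Returns "stay_standby" if the primary recovers in time,
    otherwise "switch_over".
    """
    waited = 0.0
    while waited < waiting_time:
        if is_primary_alive():
            return "stay_standby"
        sleep(poll_interval)
        waited += poll_interval
    return "switch_over"


# Simulated probes: the primary recovers on the third check.
probes = iter([False, False, True])
recovered = wait_then_decide(lambda: next(probes), 3.0, 1.0, sleep=lambda s: None)

# The primary never recovers, so the standby takes over.
timed_out = wait_then_decide(lambda: False, 3.0, 1.0, sleep=lambda s: None)
```

Injecting `sleep` keeps the example fast to run; a real deployment would leave the default and choose the waiting time to cover the short faults it wants to ride out.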
In addition, if the standby node receives a service request from a client within the waiting time, the standby node sends the client a service response corresponding to the service request, where the service response includes information indicating that the service for processing the service request is currently unavailable.
Informing the requesting client that the service is currently unavailable lets the client know the node's processing capability and gives it a basis for its subsequent operations.
Of course, if the primary node resumes normal operation within the waiting time, the standby node forwards the service requests it has received to the primary node.
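One way to picture the standby node's behaviour during the waiting time is sketched here. It assumes the standby remembers each request it answered with "service unavailable" so that it can forward them if the primary recovers; the class `WaitingStandby`, the `forward` callback, and the numeric value of `SERVICE_UNAVAILABLE` are all hypothetical.

```python
from typing import Callable, Dict, List

SERVICE_UNAVAILABLE = 1  # hypothetical error code value


class WaitingStandby:
    """Standby node behaviour while it waits for the primary to recover."""

    def __init__(self, forward: Callable[[str], None]) -> None:
        self.forward = forward          # delivers a request to the primary
        self.received: List[str] = []   # requests seen during the wait

    def on_client_request(self, request: str) -> Dict[str, object]:
        """Remember the request and tell the client the service is unavailable."""
        self.received.append(request)
        return {"error_code": SERVICE_UNAVAILABLE,
                "body": "service currently unavailable"}

    def on_primary_recovered(self) -> int:
        """Forward every remembered request to the primary; return the count."""
        count = len(self.received)
        for request in self.received:
            self.forward(request)
        self.received.clear()
        return count


forwarded: List[str] = []
standby = WaitingStandby(forward=forwarded.append)
reply = standby.on_client_request("req-1")
forwarded_count = standby.on_primary_recovered()
```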
The following description takes servers as an example of the nodes:
Before the method provided by the embodiments of the present invention is described, its application scenario is briefly introduced:
A network communication system includes one primary server, one standby server, and one or more clients. A communication link exists between each client and each of the primary and standby servers, and one communication link exists between the primary server and the standby server. The primary server communicates externally through one physical network interface; its external communication includes communication with the standby server and with the one or more clients, and the primary server has a unique IP address. The standby server likewise communicates externally through one physical network interface; its external communication includes communication with the primary server and with the one or more clients, and the standby server also has a unique IP address, different from that of the primary server. Consequently, if the primary server fails and goes offline, its communication links with the standby server and with all clients are broken. If the standby server fails and goes offline, its communication links with the primary server and with all clients are broken.
FIGS. 2 to 4 are, respectively, flowcharts of the methods performed by the client, the primary server, and the standby server in implementing the primary/standby node management method. FIGS. 2 to 4 are described as follows:
Fault detection and takeover between the primary and standby servers rely on counting the current number of external links and on determining whether the link to the peer server exists.
Link mapping table: stores information on all external communication links of the current host. The key can be any identifier that uniquely identifies the communication peer, such as the peer's IP address plus port; the value is the time at which a heartbeat or heartbeat response message was most recently received.
Calculation of the number of links:
During communication, the communication client (the end that initiated the link) periodically sends heartbeat messages to the communication server, and the communication server replies to the communication client with heartbeat response messages. On receiving a heartbeat message, the communication server considers a link established, adds a record to its link mapping table, and increases its link count by 1. Likewise, on receiving the heartbeat response message, the communication client considers a link successfully established, adds a record to its link mapping table, and increases its link count by 1.
If the link is broken, the communication server stops receiving heartbeat messages from that communication client. Once a certain time (configurable) has elapsed, the communication server considers the link disconnected, removes the record from its link mapping table, and decreases its link count by 1. Similarly, the communication client stops receiving heartbeat response messages from the communication server; once a certain time (configurable) has elapsed, the communication client considers the link disconnected, removes the record from its link mapping table, and decreases its link count by 1.
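The link mapping table and the link count derived from it can be sketched as a dictionary keyed by peer address. The class name `LinkTable`, its method names, and the sample addresses are hypothetical; the key and value layout (peer IP address plus port, mapped to the time of the most recent heartbeat or heartbeat response) follows the description above.

```python
from typing import Dict, List, Tuple

Peer = Tuple[str, int]  # (IP address, port) uniquely identifies the peer


class LinkTable:
    """Maps each peer to the time its last heartbeat (or response) arrived."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout  # configurable silence limit, in seconds
        self.last_heard: Dict[Peer, float] = {}

    def on_heartbeat(self, peer: Peer, now: float) -> None:
        """Add a record for a new link, or refresh an existing one."""
        self.last_heard[peer] = now

    def expire(self, now: float) -> List[Peer]:
        """Remove links that have been silent too long; return those removed."""
        stale = [p for p, t in self.last_heard.items() if now - t > self.timeout]
        for peer in stale:
            del self.last_heard[peer]
        return stale

    def link_count(self) -> int:
        return len(self.last_heard)

    def has_link(self, peer: Peer) -> bool:
        return peer in self.last_heard


table = LinkTable(timeout=5.0)
table.on_heartbeat(("10.0.0.1", 9000), now=0.0)
table.on_heartbeat(("10.0.0.2", 9000), now=4.0)
count_before = table.link_count()
removed = table.expire(now=6.0)  # only the first peer has been silent > 5 s
count_after = table.link_count()
```

Because the link count is just the size of the table, adding or expiring a record changes the count by exactly 1, as the text describes.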
For ease of description, the following parameters are defined for the three roles involved in the scheme (client, primary server, and standby server):
1. Client:
Whether the communication link with the primary server is normal (determined by checking whether the link mapping table contains a record for the primary server)
Whether the communication link with the standby server is normal (determined by checking whether the link mapping table contains a record for the standby server)
2. Primary server:
Number of links (the number of records currently in the link mapping table)
Whether the communication link with the standby server is normal (determined by checking whether the link mapping table contains a record for the standby server)
Service state (active or inactive)
3. Standby server:
Number of links (the number of records currently in the link mapping table)
Whether the communication link with the primary server is normal (determined by checking whether the link mapping table contains a record for the primary server)
Service state (active or inactive)
The client sends service request messages to the primary (or standby) server, and the primary (or standby) server returns response messages.
The primary server sends state switch request messages to the standby server, and the standby server returns response messages.
Both kinds of response message should include an error code; for example, the response format may be: error code + response content. The error code is mainly used to determine whether the requested operation was processed successfully and whether the request needs to be resent.
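As an illustration of the "error code + response content" layout, the sketch below packs the error code into a 2-byte big-endian field followed by UTF-8 content. The field width, the byte order, and the code values `OK` and `SERVICE_UNAVAILABLE` are assumptions; the patent fixes only the order of the two parts.

```python
import struct
from typing import Tuple

OK = 0                   # hypothetical: request was processed
SERVICE_UNAVAILABLE = 1  # hypothetical: request must be resent


def encode_response(error_code: int, content: str) -> bytes:
    """Serialize a response as a 2-byte error code followed by the content."""
    return struct.pack("!H", error_code) + content.encode("utf-8")


def decode_response(data: bytes) -> Tuple[int, str]:
    """Split a serialized response back into (error code, content)."""
    (error_code,) = struct.unpack("!H", data[:2])
    return error_code, data[2:].decode("utf-8")


def should_resend(error_code: int) -> bool:
    """The request was not processed, so the sender should try again."""
    return error_code == SERVICE_UNAVAILABLE


wire = encode_response(SERVICE_UNAVAILABLE, "retry later")
decoded = decode_response(wire)
```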
For communication between the primary server and the standby server, it is agreed that one side acts as the communication client and initiates the connection request to the other. Here it is assumed that the primary server initiates the connection request to the standby server, and that there is only one communication link between the primary server and the standby server.
Step one: The primary server and the standby server are started. The initial service state of both is inactive, and neither can provide service externally. The primary server first initiates a connection request to the standby server. After the link is established, the primary server sends a state switch request message indicating that it requests to switch to the active state. Because the standby server is itself in the inactive state at this point, it considers that the primary server may switch to the active state immediately and replies with a response agreeing to the switch. On receiving the response, the primary server sets its own service state to active and begins providing service externally.
Step two: The client sends a specific service message to the primary or standby server and receives a response message. The response message includes an error code that indicates whether the request message was actually processed.
The client follows this rule when accessing the servers: if its link with the primary server is normal, it sends the request message to the primary server; otherwise it sends it to the standby server. When the primary or standby server receives a client request while its service state is inactive, it replies to the client with a service-unavailable error code. Unless the client's links to both the primary and standby servers are broken, the client must keep retrying the request until it receives some other error code, which indicates that the request message has been successfully processed; the result of the specific service request can then be parsed from the response message. The retry logic can be encapsulated in an API for upper-layer applications to call, so that upper-layer applications need not concern themselves with communication details such as retries.
Step three: If the primary server fails and goes offline, its links with the clients and with the standby server are broken. On detecting that the link to the primary server is broken, the standby server immediately starts a waiting time (configurable) and waits for the link to the primary server to recover. If the link recovers within this time, the standby server again receives the primary server's state switch request message and directly replies with its consent, and the whole system returns to its original state. If the time is exceeded and the link to the primary server has still not recovered, the standby server sets its own state to active, completing the failover. During this process, a client that detects that its link to the primary server is unavailable can only send requests to the standby server. Until its state switches to active, the standby server keeps replying to the client with the service-unavailable error code; after switching to active, it processes each service request and returns a response containing some other error code (not service-unavailable). If the primary server recovers during this period, the client instead sends its requests to the primary server until it receives a response message containing an error code other than service-unavailable.
Step four: Suppose the primary server has failed and gone offline and the standby server has taken over and switched to the active state. If the primary server is then repaired and comes back online, it sends a state switch request message to the standby server. The standby server then sets itself to the inactive state, but it may still be processing clients' service requests and must wait for those existing requests to complete, so it cannot agree to the switch request immediately and replies with a refusal. If a new service request reaches the standby server in the meantime, the standby server replies that the service is unavailable. Once all current service requests have been processed, the standby server replies to the primary server agreeing to its state switch request. For its part, when the primary server receives a response refusing its state switch, it keeps resending the state switch request message until it receives the standby server's consent. During this process, clients must send new service requests to the primary server and, if the error code received is service-unavailable, must retry until they receive a response containing some other error code.
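The client behaviour that runs through steps two to four (prefer the primary server, fall back to the standby server, and retry while the reply is "service unavailable") could be wrapped in a single API along the following lines. The names `send_with_retry`, `NoLinkError`, `link_ok`, and `send`, the retry delay, and the error code values are all hypothetical.

```python
import time
from typing import Callable, List, Tuple

SERVICE_UNAVAILABLE = 1  # hypothetical error code value

Reply = Tuple[int, str]  # (error code, response content)


class NoLinkError(Exception):
    """Raised when the links to both servers are down."""


def send_with_retry(
    request: str,
    link_ok: Callable[[str], bool],
    send: Callable[[str, str], Reply],
    sleep: Callable[[float], None] = time.sleep,
    retry_delay: float = 0.1,
) -> Reply:
    """Send a request, retrying until a server actually processes it."""
    while True:
        # Re-evaluate the links on every attempt, preferring the primary.
        if link_ok("primary"):
            target = "primary"
        elif link_ok("standby"):
            target = "standby"
        else:
            raise NoLinkError("links to both servers are down")
        error_code, body = send(target, request)
        if error_code != SERVICE_UNAVAILABLE:
            return error_code, body
        sleep(retry_delay)


# Simulation: primary link is down; standby becomes active on the third try.
replies = iter([(SERVICE_UNAVAILABLE, ""), (SERVICE_UNAVAILABLE, ""), (0, "done")])
calls: List[str] = []


def fake_send(server: str, request: str) -> Reply:
    calls.append(server)
    return next(replies)


result = send_with_retry("req", lambda s: s == "standby", fake_send,
                         sleep=lambda d: None)
```

Upper-layer code sees only the final reply, which is the point of encapsulating the retry logic.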
An embodiment of the present invention further provides a computer storage medium storing computer-executable instructions for performing the method described above.
FIG. 5 is a structural diagram of a device for managing primary and standby nodes in a communication system according to an embodiment of the present invention. The device shown in FIG. 5 includes:
a detection module 501, configured to detect whether the primary node is working normally;
a control module 502, configured to trigger execution of a primary/standby switchover after detecting that the primary node is not working normally.
The detection module 501 includes:
a first detection unit, configured to detect, over the link between the primary node and the standby node, whether a message from the primary node can be received;
a determining unit, configured to determine that the primary node has failed if no message from the primary node is received over the link.
The device further includes:
a second detection unit, configured to, after detecting that the primary node is not working normally and before the primary/standby switchover is performed, continue to detect within a preset waiting time whether the primary node is working normally;
a switching module, configured to perform the primary/standby switchover if the primary node does not resume normal operation within the waiting time.
The device further includes:
a first sending unit, configured to forward the service requests already received to the primary node if the primary node resumes normal operation within the waiting time.
The device further includes:
a second sending unit, configured to, if a service request sent by a client is received within the waiting time, send the client a service response corresponding to the service request, where the service response includes information indicating that the service for processing the service request is currently unavailable.
With the device embodiment provided by the present invention, the standby node performs fault detection and takeover between the primary and standby nodes without relying on third-party arbitration. This provides a new way of managing primary/standby switchover and achieves the purpose of providing a highly available service externally.
In addition, an embodiment of the present invention provides a high-availability cluster, including a first node and a second node that includes the device shown in FIG. 5.
The first node is configured to, after the second node has undergone a primary/standby switchover and become the primary node, initiate a state switch request to the second node if the first node resumes operation and, after receiving a consent message from the second node, perform the operation by which the first node becomes the primary node.
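The hand-back from the second node to the recovered first node can be sketched as follows, assuming the second node refuses the state switch request until its in-flight requests have drained. The names `TakenOverNode` and `fail_back` are hypothetical, and the `max_attempts` bound is an assumption added so the sketch terminates; the description has the first node resend the request until it receives consent.

```python
from typing import Callable, Optional


class TakenOverNode:
    """Second node after takeover: active, possibly with requests in flight."""

    def __init__(self) -> None:
        self.state = "active"
        self.in_flight = 0

    def begin_request(self) -> bool:
        """Accept a client request only while active; False means unavailable."""
        if self.state != "active":
            return False
        self.in_flight += 1
        return True

    def finish_request(self) -> None:
        self.in_flight -= 1

    def handle_switch_request(self) -> bool:
        """Stop accepting new work; consent once existing requests are done."""
        self.state = "inactive"
        return self.in_flight == 0


def fail_back(node: TakenOverNode, max_attempts: int,
              between_attempts: Callable[[], None]) -> Optional[int]:
    """First node resends the switch request; returns the attempt that succeeded."""
    for attempt in range(1, max_attempts + 1):
        if node.handle_switch_request():
            return attempt
        between_attempts()
    return None


second = TakenOverNode()
accepted = second.begin_request()  # one client request is in flight
# Between attempts the in-flight request completes, so attempt 2 succeeds.
attempts = fail_back(second, max_attempts=5, between_attempts=second.finish_request)
accepted_after = second.begin_request()  # now inactive: service unavailable
```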
With the embodiments provided by the present invention, the standby node performs fault detection and takeover between the primary and standby nodes without relying on third-party arbitration. This provides a new way of managing primary/standby switchover and achieves the purpose of providing a highly available service externally.
Those of ordinary skill in the art will understand that all or some of the steps of the above method may be completed by a program instructing the relevant hardware (for example, a processor), and that the program may be stored in a computer-readable storage medium such as a read-only memory, a magnetic disk, or an optical disc. Optionally, all or some of the steps of the above embodiments may also be implemented using one or more integrated circuits. Accordingly, each module/unit in the above embodiments may be implemented in hardware, for example by an integrated circuit that performs the corresponding function, or as a software functional module, for example by a processor executing programs/instructions stored in a memory to perform the corresponding function. The present invention is not limited to any specific combination of hardware and software.
Those of ordinary skill in the art should understand that modifications or equivalent substitutions may be made to the technical solutions of the present invention without departing from their spirit and scope, and all such modifications and substitutions shall fall within the scope of the claims of the present invention.
Industrial Applicability
The above technical solution enables the standby node to perform fault detection and takeover between the primary and standby nodes without relying on third-party arbitration. It provides a new way of managing primary/standby switchover and achieves the purpose of providing a highly available service externally.

Claims (12)

  1. A method for managing primary and standby nodes in a communication system, comprising:
    detecting, by a standby node, whether a primary node is working normally; and
    after detecting that the primary node is not working normally, triggering, by the standby node, execution of a primary/standby switchover.
  2. The method according to claim 1, wherein the standby node detecting whether the primary node is working normally comprises:
    detecting, by the standby node over a link between the primary node and the standby node, whether a heartbeat message from the primary node can be received; and
    if no heartbeat message from the primary node is received over the link, determining that the primary node has failed.
  3. The method according to claim 1, further comprising:
    after detecting that the primary node is not working normally and before performing the primary/standby switchover, continuing, by the standby node within a preset waiting time, to detect whether the primary node is working normally; and
    if the primary node does not resume normal operation within the waiting time, performing, by the standby node, the primary/standby switchover.
  4. The method according to claim 3, further comprising:
    if the primary node resumes normal operation within the waiting time, forwarding, by the standby node, the service requests already received to the primary node.
  5. The method according to claim 3, further comprising:
    if the standby node receives a service request sent by a client within the waiting time, sending, by the standby node to the client, a service response corresponding to the service request, wherein the service response includes information indicating that the service for processing the service request is currently unavailable.
  6. A device for managing primary and standby nodes in a communication system, comprising:
    a detection module, configured to detect whether a primary node is working normally; and
    a control module, configured to trigger execution of a primary/standby switchover after detecting that the primary node is not working normally.
  7. The device according to claim 6, wherein the detection module comprises:
    a first detection unit, configured to detect, over a link between the primary node and a standby node, whether a heartbeat message from the primary node can be received; and
    a determining unit, configured to determine that the primary node has failed if no heartbeat message from the primary node is received over the link.
  8. The device according to claim 6, further comprising:
    a second detection unit, configured to, after detecting that the primary node is not working normally and before the primary/standby switchover is performed, continue to detect within a preset waiting time whether the primary node is working normally; and
    a switching module, configured to perform the primary/standby switchover if the primary node does not resume normal operation within the waiting time.
  9. The device according to claim 8, further comprising:
    a first sending unit, configured to forward the service requests already received to the primary node if the primary node resumes normal operation within the waiting time.
  10. The device according to claim 8, further comprising:
    a second sending unit, configured to, if a service request sent by a client is received within the waiting time, send the client a service response corresponding to the service request, wherein the service response includes information indicating that the service for processing the service request is currently unavailable.
  11. A high-availability cluster, comprising a first node and a second node that comprises the device according to any one of claims 6 to 10.
  12. The high-availability cluster according to claim 11, wherein the first node is configured to, after the second node has undergone a primary/standby switchover and become the primary node, initiate a state switch request to the second node if the first node resumes operation and, after receiving a consent message from the second node, perform the operation by which the first node becomes the primary node.
PCT/CN2016/078490 2015-06-15 2016-04-05 Method and device for managing active and backup nodes in communication system and high-availability cluster WO2016202051A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510331124.2A CN106330475B (en) 2015-06-15 2015-06-15 Method and device for managing main and standby nodes in communication system and high-availability cluster
CN201510331124.2 2015-06-15

Publications (1)

Publication Number Publication Date
WO2016202051A1 true WO2016202051A1 (en) 2016-12-22

Family

ID=57544964

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/078490 WO2016202051A1 (en) 2015-06-15 2016-04-05 Method and device for managing active and backup nodes in communication system and high-availability cluster

Country Status (2)

Country Link
CN (1) CN106330475B (en)
WO (1) WO2016202051A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112787917A (en) * 2019-11-11 2021-05-11 中兴通讯股份有限公司 Protection method, end node, protection group network and storage medium for flexible Ethernet
CN114257500A (en) * 2021-12-24 2022-03-29 苏州浪潮智能科技有限公司 Fault switching method, system and device for internal network of super-converged cluster
CN114466391A (en) * 2022-03-21 2022-05-10 中国电信股份有限公司 Network element equipment state updating method and device, storage medium and electronic equipment
CN116582618A (en) * 2023-07-13 2023-08-11 天津金城银行股份有限公司 Method and device for realizing high availability of electric pin, machine room management platform and computer

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106911524B (en) * 2017-04-27 2020-07-07 新华三信息技术有限公司 HA implementation method and device
CN107528724B (en) * 2017-07-20 2020-09-29 奇安信科技集团股份有限公司 Optimization processing method and device for node cluster
CN109428740B (en) * 2017-08-21 2020-09-08 华为技术有限公司 Method and device for recovering equipment failure
CN108023775A (en) * 2017-12-07 2018-05-11 湖北三新文化传媒有限公司 High-availability cluster architecture system and method
CN108023891A (en) * 2017-12-12 2018-05-11 北京安博通科技股份有限公司 IPSEC-based tunnel switching method, device and gateway
CN109101367A (en) * 2018-08-15 2018-12-28 郑州云海信息技术有限公司 Method and device for managing components in cloud computing system
CN109344015B (en) * 2018-10-10 2022-05-24 武汉达梦数据库股份有限公司 Method and system for preventing dual master nodes in a database service by using HA
CN110300023A (en) * 2019-06-28 2019-10-01 上海智臻智能网络科技股份有限公司 State switching method, device, node, node group and storage medium
CN115134219A (en) * 2022-06-29 2022-09-30 北京飞讯数码科技有限公司 Device resource management method and device, computing device and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040034807A1 (en) * 2002-08-14 2004-02-19 Gnp Computers, Inc. Roving servers in a clustered telecommunication distributed computer system
CN101039172A (en) * 2007-05-15 2007-09-19 华为技术有限公司 Ethernet ring network system and its protection method and standby host node
CN101179432A (en) * 2007-12-13 2008-05-14 浪潮电子信息产业股份有限公司 Method of implementing high availability of system in multi-machine surroundings
CN101335702A (en) * 2008-07-07 2008-12-31 中兴通讯股份有限公司 Disaster recovery method of serving GPRS support node
CN102118309A (en) * 2010-12-31 2011-07-06 中国科学院计算技术研究所 Method and system for double-machine hot backup

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2015023458A (en) * 2013-07-19 2015-02-02 富士通株式会社 Communication system, redundancy control method in communication system, and transmission device
CN103490969B (en) * 2013-09-17 2016-07-06 烽火通信科技股份有限公司 System and method for realizing fast convergence of VPWS redundancy protection

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112787917A (en) * 2019-11-11 2021-05-11 中兴通讯股份有限公司 Protection method, end node, protection group network and storage medium for flexible Ethernet
CN114257500A (en) * 2021-12-24 2022-03-29 苏州浪潮智能科技有限公司 Failover method, system and device for internal network of hyper-converged cluster
CN114257500B (en) * 2021-12-24 2023-06-09 苏州浪潮智能科技有限公司 Failover method, system and device for internal network of hyper-converged cluster
CN114466391A (en) * 2022-03-21 2022-05-10 中国电信股份有限公司 Network element equipment state updating method and device, storage medium and electronic equipment
CN116582618A (en) * 2023-07-13 2023-08-11 天津金城银行股份有限公司 Method and device for realizing high availability of electric pin, machine room management platform and computer
CN116582618B (en) * 2023-07-13 2023-10-10 天津金城银行股份有限公司 Method and device for realizing high availability of electric pin, machine room management platform and computer

Also Published As

Publication number Publication date
CN106330475B (en) 2020-12-04
CN106330475A (en) 2017-01-11

Similar Documents

Publication Publication Date Title
WO2016202051A1 (en) Method and device for managing active and backup nodes in communication system and high-availability cluster
US11163653B2 (en) Storage cluster failure detection
US6952766B2 (en) Automated node restart in clustered computer system
US6983324B1 (en) Dynamic modification of cluster communication parameters in clustered computer system
US11330071B2 (en) Inter-process communication fault detection and recovery system
JP5863942B2 (en) Provision of witness service
US20080288812A1 (en) Cluster system and an error recovery method thereof
JP2010045760A (en) Connection recovery device for redundant system, method and processing program
WO2012097588A1 (en) Data storage method, apparatus and system
US11889330B2 (en) Methods and related devices for implementing disaster recovery
US20140359340A1 (en) Subscriptions that indicate the presence of application servers
WO2017215430A1 (en) Node management method in cluster and node device
WO2016107443A1 (en) Snapshot processing method and related device
TW200920027A (en) Intelligent failover in a load-balanced networking environment
WO2017071384A1 (en) Message processing method and apparatus
AU2014321418A1 (en) Email webclient notification queuing
CN109189854B (en) Method and node equipment for providing continuous service
CN108200151B (en) ISCSI Target load balancing method and device in distributed storage system
US20130185425A1 (en) Method for Optimizing Network Performance After A Temporary Loss of Connection
CN110351122B (en) Disaster recovery method, device, system and electronic equipment
CN110661599B (en) HA implementation method, device and storage medium between main node and standby node
CN113596195B (en) Public IP address management method, device, main node and storage medium
WO2021238579A1 (en) Method for managing sata hard disk by means of storage system, and storage system
JP2009075710A (en) Redundant system
JP2007141129A (en) System switching method, computer system and program

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 16810792

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 16810792

Country of ref document: EP

Kind code of ref document: A1