WO2016202051A1 - Method and device for managing active and backup nodes in communication system and high-availability cluster - Google Patents
Method and device for managing active and backup nodes in communication system and high-availability cluster
- Publication number
- WO2016202051A1 (application PCT/CN2016/078490)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- node
- standby
- active
- primary
- service
- Prior art date
Classifications
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0654—Management of faults, events, alarms or notifications using network fault recovery
- H04L41/0668—Management of faults, events, alarms or notifications using network fault recovery by dynamic selection of recovery network elements, e.g. replacement by the most appropriate element after failure
Definitions
- This document relates to, but is not limited to, the field of communications, and in particular, to a method and apparatus for managing active and standby nodes and a highly available cluster in a communication system.
- a cluster usually consists of two servers, a primary and a standby; under normal conditions the primary server provides the service externally.
- when the primary server fails, the secondary server (also called the standby server) takes over and continues to provide the service.
- fault detection and takeover between the primary and standby servers is technically difficult.
- the typical method relies on third-party arbitration: the primary and standby servers regularly report their status to an arbitrator, and the arbitrator decides whether the conditions for triggering the fault takeover process have been met.
- if the arbitrator itself fails, the fault takeover cannot be completed normally. A new active/standby management mechanism is therefore needed to manage the primary and standby servers.
- the embodiments of the present invention provide a method and device for managing active and standby nodes in a communication system, and a high-availability cluster, which can provide a new management mode for active/standby switchover.
- a method for managing active and standby nodes in a communication system comprising:
- the standby node detects whether the active node is working normally
- after detecting that the active node is not working normally, the standby node triggers execution of the active/standby switching operation.
- the standby node detects whether the active node is working normally, including:
- over the link between the active node and the standby node, the standby node detects whether a heartbeat message from the active node can be received;
- the method further includes:
- after detecting that the active node is not working normally and before performing the active/standby switching operation, the standby node continues to detect, within a preset waiting time, whether the active node is working normally;
- if the active node does not resume normal operation within the waiting time, the standby node performs the active/standby switching operation.
- the method further includes:
- if the active node resumes normal operation within the waiting time, the standby node forwards the received service requests to the active node.
- the method further includes:
- if the standby node receives a service request sent by a client within the waiting time, the standby node sends the client a service response corresponding to the service request, where the service response includes information that the service used to process the service request is currently unavailable.
- a device for managing active and standby nodes in a communication system comprising:
- the detection module is configured to detect whether the active node is working normally
- the control module is configured to trigger execution of the active/standby switching operation after detecting that the active node is not working normally.
- the detection module includes:
- a first detecting unit configured to detect whether a heartbeat message from the active node can be received through a link between the primary node and the standby node;
- a determining unit is configured to determine that the primary node has failed if a heartbeat message from the primary node is not received through the link.
- the second detecting unit is configured to, after detecting that the active node is not working normally and before the active/standby switching operation is performed, continue to detect within a preset waiting time whether the active node is working normally;
- the switching module is configured to perform an active/standby switching operation if the primary node does not resume normal operation within the waiting time.
- the device further comprises:
- the first sending unit is configured to forward the received service request to the active node if the primary node resumes normal operation within the waiting time.
- the device further comprises:
- a second sending unit configured to, if a service request sent by a client is received within the waiting time, send the client a service response corresponding to the service request, where the service response includes information that the service used to process the service request is currently unavailable.
- a high-availability cluster comprising a first node and a second node that includes any of the devices described above.
- the first node is further configured to, after the second node has become the active node through an active/standby switchover and if the first node resumes operation, send a state switch request to the second node and, after receiving the second node's consent message, perform the operation by which the first node becomes the active node.
- the embodiments provided by the present invention complete fault detection and takeover between the active and standby nodes by means of the standby node, without relying on third-party arbitration, and provide a new way of managing active/standby switchover, thereby achieving the purpose of providing high-availability services externally.
- FIG. 1 is a flowchart of a method for managing active and standby nodes in a communication system according to an embodiment of the present invention
- FIG. 2 is a flowchart of the client's part in the method for managing active and standby nodes according to an embodiment of the present invention
- FIG. 3 is a flowchart of the primary server's part in the method for managing active and standby nodes according to an embodiment of the present invention
- FIG. 4 is a flowchart of the standby server's part in the method for managing active and standby nodes according to an embodiment of the present invention
- FIG. 5 is a structural diagram of an apparatus for managing active and standby nodes in a communication system according to an embodiment of the present invention.
- FIG. 1 is a flowchart of a method for managing active and standby nodes in a communication system according to an embodiment of the present invention. The method shown in Figure 1 includes:
- Step 101: The standby node detects whether the active node is working normally.
- Step 102: After detecting that the active node is not working normally, the standby node triggers execution of an active/standby switchover operation.
- the method embodiment provided by the present invention completes fault detection and takeover between the active and standby nodes by means of the standby node, without relying on third-party arbitration, and provides a new way of managing active/standby switchover, thereby achieving the purpose of providing high-availability services externally.
- the active node may first initiate a connection request to the standby node. After the link is established, the active node sends a state switch request message indicating that it requests to switch to the active state. Because the standby node is also inactive at this time, it considers that the active node can switch to the active state immediately and replies with a response agreeing to the switch. After receiving the response, the active node sets its own service state to active and starts to provide services externally.
- the standby node detects whether the active node is working normally, including:
- over the link between the active node and the standby node, the standby node detects whether a heartbeat message from the active node can be received;
- the message from the active node may be sent by the active node or may be a response message to the message sent by the standby node.
- the method further includes:
- after detecting that the active node is not working normally and before performing the active/standby switching operation, the standby node continues to detect, within a preset waiting time, whether the active node is working normally;
- if the active node does not resume normal operation within the waiting time, the standby node performs the active/standby switching operation.
- the active node is thus given a period of time to resolve its own fault, which reduces the likelihood that service processing has to migrate, keeps data processing progressing, and improves the stability of the system.
- if the standby node receives a service request sent by a client within the waiting time, the standby node sends the client a service response corresponding to the service request, where the service response includes information that the service used to process the service request is currently unavailable.
- if the active node resumes normal operation within the waiting time, the standby node forwards the received service requests to the active node.
- a network communication system includes a primary server, a standby server, and one or more clients, wherein each client has a communication link with the primary server and with the standby server, and there is a communication link between the primary server and the standby server.
- the primary server communicates externally through a physical network interface; this external communication includes communication with the standby server and with the one or more clients, and the primary server has a unique IP address. The standby server likewise communicates externally through a physical network interface.
- the standby server's external communication includes communication with the primary server and with the one or more clients, and the standby server also has a unique IP address, different from that of the primary server. Therefore, if the primary server fails and goes offline, its communication links with the standby server and with all clients are disconnected. If the standby server fails and goes offline, its communication links with the primary server and with all clients are disconnected.
- FIGS. 2 to 4 are flowcharts of the procedures performed by the client, the primary server, and the standby server, respectively, in the method for managing active and standby nodes.
- the description of Figures 2 to 4 is as follows:
- fault detection and takeover between the primary and standby servers depend on counting the current number of external links and on determining whether the link to the peer server exists.
- a link mapping table is used to store information about all external communication links of the current host.
- the key can be an identifier that uniquely identifies the communication peer, such as the peer's IP address plus port.
- the value is the time at which the last heartbeat or heartbeat response message was received.
- the communication client periodically sends a heartbeat message to the communication server, and the communication server then returns a heartbeat response message to the communication client.
- after receiving the heartbeat message, the communication server considers that a link has been established, adds a record to its link mapping table, and increases its number of links by one.
- after receiving the heartbeat response message, the communication client also considers that a link has been established, adds a record to its link mapping table, and increases its number of links by one.
- if the communication server receives no heartbeat message from a given communication client for a certain (configurable) time, it considers the link disconnected, removes the record from its link mapping table, and reduces its number of links by one. Similarly, if the communication client receives no heartbeat response message from the communication server for a certain (configurable) time, it considers the link disconnected, removes the record from its link mapping table, and reduces its number of links by one.
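The link mapping table described above can be sketched as follows. This is a minimal illustration, not the implementation disclosed in the application: the class name `LinkTable`, the method names, and the 3-second default timeout are all assumptions made for the example. The key is the peer's (IP address, port) pair and the value is the time of the last heartbeat or heartbeat response, as in the text.

```python
import time
from typing import Dict, List, Optional, Tuple

Peer = Tuple[str, int]  # (IP address, port) identifies the communication peer


class LinkTable:
    """Hypothetical link mapping table: peer -> time of last heartbeat."""

    def __init__(self, timeout_s: float = 3.0) -> None:
        self.timeout_s = timeout_s  # assumed default; the text only says "configurable"
        self._last_seen: Dict[Peer, float] = {}

    def on_heartbeat(self, peer: Peer, now: Optional[float] = None) -> None:
        # The first heartbeat (or response) creates the record; later ones refresh it.
        self._last_seen[peer] = time.monotonic() if now is None else now

    def expire(self, now: Optional[float] = None) -> List[Peer]:
        # Remove every link that has been silent for longer than the timeout
        # and report which peers were dropped.
        now = time.monotonic() if now is None else now
        dead = [p for p, t in self._last_seen.items() if now - t > self.timeout_s]
        for peer in dead:
            del self._last_seen[peer]
        return dead

    def has_link(self, peer: Peer) -> bool:
        return peer in self._last_seen

    @property
    def link_count(self) -> int:
        # The "number of links" is simply the number of live records.
        return len(self._last_seen)
```

Deriving the link count from the table itself, rather than keeping a separate counter, avoids the two getting out of step; the explicit `now` argument only exists so the expiry logic can be exercised without waiting.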
- for convenience of explanation, the following parameters are set for each of the three roles involved in the solution, namely the client, the primary server, and the standby server:
- the primary server:
- the client sends a service request message to the primary (standby) server, and the primary (standby) server returns a response message.
- the primary server sends a status switch request message to the standby server, and the standby server returns a response message.
- the above two response message formats should include an error code.
- the response message format is: error code + response message content, and the error code is mainly used to determine whether the request operation is successfully processed, and whether the request needs to be resent.
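The "error code + response message content" format could be encoded along the following lines. The source does not specify a wire format or any error-code values, so the 2-byte big-endian code, the `ErrorCode` members, and the function names here are purely illustrative assumptions.

```python
import struct
from enum import IntEnum
from typing import Tuple


class ErrorCode(IntEnum):
    # Hypothetical values; the source only says that an error code exists.
    OK = 0                   # the request was processed
    SERVICE_UNAVAILABLE = 1  # the node is inactive; the request should be resent
    SWITCH_REFUSED = 2       # the peer does not yet agree to the state switch


def encode_response(code: ErrorCode, content: bytes) -> bytes:
    # Assumed layout: 2-byte big-endian error code followed by the raw content.
    return struct.pack("!H", code) + content


def decode_response(frame: bytes) -> Tuple[ErrorCode, bytes]:
    (raw,) = struct.unpack("!H", frame[:2])
    return ErrorCode(raw), frame[2:]


def should_retry(code: ErrorCode) -> bool:
    # The error code tells the sender whether the request needs to be resent.
    return code in (ErrorCode.SERVICE_UNAVAILABLE, ErrorCode.SWITCH_REFUSED)
```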
- by convention, in communication between the primary server and the standby server, the party acting as the communication client initiates the connection request to the other party.
- the primary server actively initiates the connection request to the standby server, and there is only one communication link between the primary server and the standby server.
- Step 1: Start the primary server and the standby server. The initial service state of both is inactive, and neither can provide services externally.
- the primary server initiates a connection request to the standby server. After the link is established, the primary server sends a state switch request message indicating that it requests to switch to the active state. Because the standby server is also inactive at this time, it considers that the primary server can switch to the active state immediately and replies with a response agreeing to the switch. After receiving the response, the primary server sets its own service state to active and starts to provide services externally.
- Step 2 The client sends a specific service message to the primary or secondary server, and receives a response message.
- the response message includes an error code, and the error code is used to identify whether the request message is actually processed.
- Client access follows this principle: if the link with the primary server is normal, the request message is sent to the primary server; otherwise it is sent to the standby server. When the primary or standby server receives a client request while its service state is inactive, it replies to the client with the service-unavailable error code. Unless the client's links to both the primary and the standby server are disconnected, the client keeps retrying the request message until it receives an error code other than service-unavailable, which indicates that the request message has been processed; the result of the specific service request can then be parsed from the response message.
- the retry related logic can be encapsulated into an API for upper layer application calls, and the upper layer application does not need to care about communication details such as retry.
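A retry wrapper of the kind mentioned above might look like the sketch below. The function name `call_with_retry`, its parameters, the `NoLinkError` exception, the fixed retry delay, and the numeric value of the service-unavailable code are all assumptions introduced for illustration; the source only states that such logic can be hidden behind an API.

```python
import time
from typing import Callable, Optional, Sequence, Tuple

SERVICE_UNAVAILABLE = 1  # hypothetical error-code value


class NoLinkError(Exception):
    """Raised when the links to both the primary and the standby are down."""


def call_with_retry(
    servers: Sequence[str],
    link_up: Callable[[str], bool],
    send: Callable[[str, bytes], Tuple[int, bytes]],
    request: bytes,
    retry_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, bytes]:
    # `servers` is ordered by preference: the primary first, then the standby.
    while True:
        target: Optional[str] = next((s for s in servers if link_up(s)), None)
        if target is None:
            raise NoLinkError("no link to the primary or the standby server")
        code, content = send(target, request)
        if code != SERVICE_UNAVAILABLE:
            return code, content  # any other code means the request was processed
        sleep(retry_delay_s)      # wait, then re-evaluate which link is usable
```

Re-evaluating the links on every attempt means the caller switches back to the primary server as soon as its link is restored, without the upper-layer application being aware of it.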
- Step 3: If the primary server fails and goes offline, its links to the clients and to the standby server are disconnected. When the standby server detects that the link to the primary server is disconnected, it immediately starts a (configurable) waiting time and waits for the link with the primary server to recover. If the link is restored within this time, the standby server again receives the primary server's state switch request message, agrees to it directly, and the whole system returns to its original state. If the waiting time elapses and the link with the primary server has still not been restored, the standby server sets its own state to active and completes the fault takeover. During this process, the client first detects that its link to the primary server is unavailable and can only send requests to the standby server. Until it switches to the active state, the standby server always replies to the client with the service-unavailable error code.
- the waiting time is configurable.
- once a service request has been processed, the response is returned and contains an error code other than service-unavailable. If the primary server recovers during the waiting time, the client sends its requests to the primary server until it receives a response message whose error code is not service-unavailable.
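The waiting-time behaviour of Step 3 can be sketched as a small state machine on the standby side. The class `Standby`, its method names, the 10-second default, and the numeric error codes are hypothetical; timestamps are passed in explicitly so the logic can be followed without real timers.

```python
from enum import Enum
from typing import Optional, Tuple


class State(Enum):
    INACTIVE = "inactive"
    ACTIVE = "active"


class Standby:
    """Hypothetical standby server that takes over after a waiting time."""

    def __init__(self, wait_s: float = 10.0) -> None:
        self.wait_s = wait_s                   # the configurable waiting time
        self.state = State.INACTIVE
        self._lost_at: Optional[float] = None  # when the primary link dropped

    def on_primary_link_lost(self, now: float) -> None:
        if self._lost_at is None:
            self._lost_at = now                # start the waiting time only once

    def on_primary_switch_request(self) -> bool:
        # The primary came back within the waiting time: agree and stay inactive.
        if self.state is State.INACTIVE:
            self._lost_at = None
            return True
        return False  # already active; failback is a separate procedure

    def tick(self, now: float) -> State:
        # Take over only once the whole waiting time has elapsed.
        if self._lost_at is not None and now - self._lost_at >= self.wait_s:
            self.state = State.ACTIVE
            self._lost_at = None
        return self.state

    def handle_request(self, payload: bytes) -> Tuple[int, bytes]:
        if self.state is not State.ACTIVE:
            return (1, b"")   # assumed code 1 = service unavailable
        return (0, payload)   # placeholder for real request processing
```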
- Step 4: Suppose the primary server has failed and gone offline and the standby server has switched to the active state. If the primary server is then repaired and comes back online, it sends a state switch request message to the standby server. At this time the standby server is in the active state and may be processing client service requests; the requests already in progress must be completed first, so the standby server cannot agree to the switch immediately and replies that it does not agree. New service requests sent to the standby server during this period are answered with service unavailable. After all current service requests have been processed, the standby server replies to the primary server that it agrees to the state switch request.
- when the primary server first receives a response from the standby server that does not agree to its state switch, it keeps resending the state switch request message until it receives the standby server's consent. During this process the client sends new service requests to the primary server; if it receives the service-unavailable error code, it retries until it receives a response containing another error code.
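The failback handshake of Step 4 can be illustrated as follows. The names `ActiveStandby` and `request_until_agreed` are invented for this sketch, and the `max_attempts` bound is an assumption added to keep the example finite; the text itself describes the primary server as resending until consent is received.

```python
from typing import Callable


class ActiveStandby:
    """Hypothetical standby that took over and is asked to hand the role back."""

    def __init__(self) -> None:
        self.active = True
        self.draining = False  # True once a handover has been requested
        self.in_flight = 0     # requests accepted but not yet finished

    def begin_request(self) -> bool:
        # New requests are refused (service unavailable) while a handover is pending.
        if not self.active or self.draining:
            return False
        self.in_flight += 1
        return True

    def finish_request(self) -> None:
        self.in_flight -= 1

    def on_switch_request(self) -> bool:
        if not self.active:
            return True        # nothing to hand over; agree immediately
        self.draining = True
        if self.in_flight > 0:
            return False       # existing requests must complete first
        self.active = False    # everything is drained: give up the active role
        self.draining = False
        return True


def request_until_agreed(
    standby: ActiveStandby,
    max_attempts: int,
    between_attempts: Callable[[], None] = lambda: None,
) -> int:
    # The recovered primary resends the state switch request until it is accepted.
    for attempt in range(1, max_attempts + 1):
        if standby.on_switch_request():
            return attempt
        between_attempts()
    raise TimeoutError("standby did not agree to the state switch")
```

In this sketch the first refused switch request is what puts the standby into its draining state, which matches the description: requests in progress are completed, new ones are turned away, and consent is given only when nothing is left in flight.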
- the embodiment of the invention further provides a computer storage medium, wherein the computer storage medium stores computer executable instructions, and the computer executable instructions are used to execute the above method.
- FIG. 5 is a structural diagram of an apparatus for managing active and standby nodes in a communication system according to an embodiment of the present invention.
- the device shown in Figure 5 includes:
- the detecting module 501 is configured to detect whether the active node is working normally
- the control module 502 is configured to trigger execution of the active/standby switching operation after detecting that the active node is not working normally.
- the detecting module 501 includes:
- a first detecting unit configured to detect whether a message from the active node can be received through a link between the primary node and the standby node
- a determining unit is configured to determine that the primary node has failed if a message from the primary node is not received over the link.
- the device further comprises:
- the second detecting unit is configured to, after detecting that the active node is not working normally and before the active/standby switching operation is performed, continue to detect within a preset waiting time whether the active node is working normally;
- the switching module is configured to perform an active/standby switching operation if the primary node does not resume normal operation within the waiting time.
- the device further comprises:
- the first sending unit is configured to forward the received service request to the active node if the primary node resumes normal operation within the waiting time.
- the device further comprises:
- a second sending unit configured to, if a service request sent by a client is received within the waiting time, send the client a service response corresponding to the service request, where the service response includes information that the service used to process the service request is currently unavailable.
- the device embodiment provided by the present invention completes fault detection and takeover between the active and standby nodes by means of the standby node, without relying on third-party arbitration, and provides a new way of managing active/standby switchover, thereby achieving the purpose of providing high-availability services externally.
- an embodiment of the present invention provides a high-availability cluster, including a first node and a second node that includes the apparatus shown in FIG. 5.
- the first node is configured to, after the second node has become the active node through an active/standby switchover and if the first node resumes operation, send a state switch request to the second node and, after receiving the second node's consent message, perform the operation by which the first node becomes the active node.
- the embodiments provided by the present invention complete fault detection and takeover between the active and standby nodes by means of the standby node, without relying on third-party arbitration, and provide a new way of managing active/standby switchover, thereby achieving the purpose of providing high-availability services externally.
- each module/unit in the above embodiments may be implemented in hardware, for example by an integrated circuit that realizes the corresponding function, or as a software function module, for example by a processor executing programs/instructions stored in a memory to realize the corresponding function.
- the invention is not limited to any specific form of combination of hardware and software.
- the foregoing technical solution can implement fault detection and takeover between the active and standby nodes by the standby node without relying on the third-party arbitration, and provide a new management mode of the active/standby switchover, thereby achieving the purpose of providing high-availability services externally.
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Hardware Redundancy (AREA)
- Data Exchanges In Wide-Area Networks (AREA)
Abstract
A method and device for managing active and backup nodes in a communication system, and a high-availability cluster. The method comprises: detecting, by a backup node, whether an active node is operating normally; and, upon detecting that the active node is not operating normally, triggering, by the backup node, execution of an active-backup switching operation.
Description
This document relates to, but is not limited to, the field of communications, and in particular to a method and apparatus for managing active and standby nodes in a communication system, and to a high-availability cluster.
In large commercial software systems, stable operation usually requires that the failure of one server must not interrupt the whole service; that is, single points of failure should be avoided. Typically, a primary server and a standby server form a cluster. Under normal conditions the primary server provides the service externally; when the primary server fails, the secondary server (also called the standby server) takes over and continues to provide the service. Fault detection and takeover between the primary and standby servers is technically difficult. The typical current method relies on third-party arbitration: the primary and standby servers both periodically report their status to an arbitrator, and the arbitrator decides whether the conditions for triggering the fault takeover process have been met. In practice, however, if the arbitrator itself fails, the fault takeover cannot be completed normally. A new active/standby management mechanism is therefore urgently needed to manage the primary and standby servers.
Summary of the invention
The following is an overview of the subject matter described in detail in this document. This overview is not intended to limit the scope of protection of the claims.
The embodiments of the present invention provide a method and apparatus for managing active and standby nodes in a communication system, and a high-availability cluster, which provide a new way of managing active/standby switchover.
The embodiments of the present invention provide the following technical solutions:
A method for managing active and standby nodes in a communication system, comprising:
a standby node detecting whether an active node is working normally;
after detecting that the active node is not working normally, the standby node triggering execution of an active/standby switching operation.
The standby node detecting whether the active node is working normally includes:
over the link between the active node and the standby node, the standby node detecting whether a heartbeat message from the active node can be received;
if no heartbeat message from the active node is received over the link, determining that the active node has failed.
The method further includes:
after detecting that the active node is not working normally and before performing the active/standby switching operation, the standby node continuing to detect, within a preset waiting time, whether the active node is working normally;
if the active node does not resume normal operation within the waiting time, the standby node performing the active/standby switching operation.
The method further includes:
if the active node resumes normal operation within the waiting time, the standby node forwarding the received service requests to the active node.
The method further includes:
if the standby node receives a service request sent by a client within the waiting time, the standby node sending the client a service response corresponding to the service request, where the service response includes information that the service used to process the service request is currently unavailable.
A device for managing active and standby nodes in a communication system, comprising:
a detection module, configured to detect whether the active node is working normally;
a control module, configured to trigger execution of the active/standby switching operation after detecting that the active node is not working normally.
The detection module includes:
a first detecting unit, configured to detect, over the link between the active node and the standby node, whether a heartbeat message from the active node can be received;
a determining unit, configured to determine that the active node has failed if no heartbeat message from the active node is received over the link.
The device further includes:
a second detecting unit, configured to, after detecting that the active node is not working normally and before the active/standby switching operation is performed, continue to detect within a preset waiting time whether the active node is working normally;
a switching module, configured to perform the active/standby switching operation if the active node does not resume normal operation within the waiting time.
The device further includes:
a first sending unit, configured to forward the received service requests to the active node if the active node resumes normal operation within the waiting time.
The device further includes:
a second sending unit, configured to, if a service request sent by a client is received within the waiting time, send the client a service response corresponding to the service request, where the service response includes information that the service used to process the service request is currently unavailable.
A high-availability cluster, comprising a first node and a second node that includes any of the devices described above.
The first node is further configured to, after the second node has become the active node through an active/standby switchover and if the first node resumes operation, send a state switch request to the second node and, after receiving the second node's consent message, perform the operation by which the first node becomes the active node.
The embodiments provided by the present invention complete fault detection and takeover between the active and standby nodes by means of the standby node, without relying on third-party arbitration, and provide a new way of managing active/standby switchover, thereby achieving the purpose of providing high-availability services externally.
在阅读并理解了附图和详细描述后,可以明白其他方面。Other aspects will be apparent upon reading and understanding the drawings and detailed description.
Brief Description of the Drawings
FIG. 1 is a flowchart of a method for managing primary and standby nodes in a communication system according to an embodiment of the present invention;
FIG. 2 is a flowchart of the method performed by the client in managing the primary and standby nodes according to an embodiment of the present invention;
FIG. 3 is a flowchart of the method performed by the primary server in managing the primary and standby nodes according to an embodiment of the present invention;
FIG. 4 is a flowchart of the method performed by the standby server in managing the primary and standby nodes according to an embodiment of the present invention;
FIG. 5 is a structural diagram of an apparatus for managing primary and standby nodes in a communication system according to an embodiment of the present invention.
The present invention is described in further detail below with reference to the drawings and specific embodiments. Note that, provided they do not conflict, the embodiments in this application and the features in those embodiments may be combined with one another in any manner.
FIG. 1 is a flowchart of a method for managing primary and standby nodes in a communication system according to an embodiment of the present invention. The method shown in FIG. 1 includes:
Step 101: the standby node detects whether the primary node is working normally;
Step 102: after detecting that the primary node is not working normally, the standby node triggers execution of an active/standby switchover operation.
In the method embodiment provided by the present invention, the standby node performs fault detection and takeover between the primary and standby nodes without relying on third-party arbitration. This provides a new way of managing active/standby switchover and achieves the goal of providing a highly available service externally.
The method embodiment provided by the present invention is further described below:
After the primary node and the standby node are started, both are initially in the inactive service state, and neither can provide service externally. The primary node may first initiate a connection request to the standby node. Once the link is established, the primary node sends a state switch request message indicating that it requests to switch to the active state. Because the standby node is itself in the inactive state at this point, it considers that the primary node may switch to the active state immediately and replies with a response agreeing to the switch. On receiving the response, the primary node sets its own service state to active and begins providing service externally.
The standby node detecting whether the primary node is working normally includes:
the standby node detecting, over the link between the primary node and the standby node, whether a heartbeat message from the primary node can be received;
if no heartbeat message from the primary node is received over the link, determining that the primary node has failed.
The message from the primary node may be sent by the primary node on its own initiative, or may be a response to a message sent by the standby node.
As can be seen from the above, using the link between the primary and standby nodes to detect whether the primary node is working normally is simple and convenient to implement.
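The heartbeat check described above can be illustrated with a minimal Python sketch. It is not taken from the patent: the class name `HeartbeatMonitor`, the `timeout_s` threshold, and the choice to treat "nothing received yet" as a failure are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HeartbeatMonitor:
    """Tracks when the standby last heard from the primary over their link."""

    timeout_s: float                    # hypothetical detection threshold
    last_seen: Optional[float] = None   # time of the most recent message

    def on_message(self, now: float) -> None:
        # Heartbeats and responses to the standby's own messages both count.
        self.last_seen = now

    def primary_failed(self, now: float) -> bool:
        # Assumption: nothing received yet is treated the same as a timeout.
        if self.last_seen is None:
            return True
        return now - self.last_seen > self.timeout_s
```

Passing the current time in as an argument, rather than reading a clock inside the class, keeps the sketch deterministic and easy to exercise.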
The method further includes:
after detecting that the primary node is not working normally and before performing the active/standby switchover operation, the standby node continuing, within a preset waiting time, to detect whether the primary node is working normally;
if the primary node does not resume normal operation within the waiting time, the standby node performing the active/standby switchover operation.
In practice, a node is quite likely to suffer short-lived faults while running. If such a fault can be resolved quickly, there is no need to initiate an active/standby switchover, which would migrate service processing and delay its progress. Setting a waiting time therefore gives the primary node a period in which to resolve its own fault, which reduces the likelihood of migrating service processing, preserves the progress of data processing, and improves system stability.
In addition, if the standby node receives a service request sent by a client within the waiting time, the standby node sends the client a service response corresponding to the service request, where the service response includes information indicating that the service for processing the service request is currently unavailable.
Informing the client that initiated the service request that the service is currently unavailable lets the client learn the node's processing capability and gives it a basis for its subsequent operations.
Of course, if the primary node resumes normal operation within the waiting time, the standby node forwards the service requests it has received to the primary node.
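The waiting-time behaviour can be sketched as a small state holder on the standby side. This is only an illustration under stated assumptions: the names `StandbyNode`, `tick`, and the string status labels are hypothetical, and discarding the recorded requests at takeover (on the assumption that clients retry them) is a design choice of this sketch, not something the text specifies.

```python
from typing import List, Optional

SERVICE_UNAVAILABLE = "SERVICE_UNAVAILABLE"  # hypothetical status labels
OK = "OK"


class StandbyNode:
    def __init__(self, wait_s: float) -> None:
        self.wait_s = wait_s                       # configurable waiting time
        self.active = False                        # service state
        self.wait_started: Optional[float] = None  # set while waiting
        self.received: List[str] = []              # requests seen while waiting

    def on_primary_lost(self, now: float) -> None:
        # Start the waiting time once; do not restart it on repeated detections.
        if not self.active and self.wait_started is None:
            self.wait_started = now

    def on_primary_recovered(self) -> List[str]:
        # Cancel the pending switchover and return the requests to forward.
        self.wait_started = None
        to_forward, self.received = self.received, []
        return to_forward

    def tick(self, now: float) -> None:
        # Take over only if the primary stayed down for the whole waiting time.
        if self.wait_started is not None and now - self.wait_started >= self.wait_s:
            self.active = True
            self.wait_started = None
            self.received.clear()  # assumption: clients retry these themselves

    def handle_request(self, request: str) -> str:
        if self.active:
            return OK
        if self.wait_started is not None:
            self.received.append(request)
        return SERVICE_UNAVAILABLE
```

A short fault therefore ends with `on_primary_recovered()` and no change of role, while a fault that outlasts `wait_s` ends with the standby in the active state.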
The following description takes servers as an example of the nodes:
Before the method provided by the embodiments of the present invention is described, its application scenario is first described briefly:
A network communication system includes one primary server, one standby server, and one or more clients. There is one communication link between each client and each of the primary and standby servers, and one communication link between the primary server and the standby server. The primary server communicates externally through a single physical network interface; its external communication includes communication with the standby server and with the one or more clients, and the primary server has a unique IP address. The standby server likewise communicates externally through a single physical network interface; its external communication includes communication with the primary server and with the one or more clients, and the standby server also has a unique IP address, different from that of the primary server. Consequently, if the primary server fails and goes offline, its communication links with the standby server and with all clients are broken; if the standby server fails and goes offline, its communication links with the primary server and with all clients are broken.
FIG. 2 to FIG. 4 are flowcharts of the methods performed by the client, the primary server, and the standby server, respectively, in managing the primary and standby nodes. FIG. 2 to FIG. 4 are described as follows:
Fault detection and takeover between the primary and standby servers rely on calculating the current number of external links and on determining whether the link with the peer server exists.
Link mapping table: stores information about all of the current host's external communication links. The key may be any identifier that uniquely identifies the communication peer, such as the peer's IP address plus port; the value is the time at which a heartbeat or heartbeat response message was most recently received.
Calculating the number of links:
During communication, the communication client periodically sends a heartbeat message to the communication server, and the communication server replies with a heartbeat response message. On receiving a heartbeat message, the communication server considers a link to be established, adds a record to its link mapping table, and increments its link count by 1. Likewise, on receiving the heartbeat response message, the communication client considers the link to be established, adds a record to its link mapping table, and increments its link count by 1.
If the link has been interrupted, the communication server no longer receives heartbeat messages from that communication client. After a certain (configurable) time, the communication server considers the link to be broken, removes the record from its link mapping table, and decrements its link count by 1. Similarly, the communication client no longer receives heartbeat response messages from the communication server; after a certain (configurable) time, the communication client considers the link to be broken, removes the record from its link mapping table, and decrements its link count by 1.
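As a rough illustration of the link mapping table and link count described above, the following Python sketch keys each record by the peer's (IP address, port) pair and stores the time of the last heartbeat or heartbeat response. The class and method names (`LinkTable`, `touch`, `expire`) and the `expiry_s` parameter are hypothetical.

```python
from typing import Dict, List, Tuple

Peer = Tuple[str, int]  # (IP address, port) identifies the remote end


class LinkTable:
    def __init__(self, expiry_s: float) -> None:
        self.expiry_s = expiry_s                  # configurable timeout
        self._last_seen: Dict[Peer, float] = {}   # peer -> last heartbeat time

    def touch(self, peer: Peer, now: float) -> None:
        # A heartbeat (or heartbeat response) creates or refreshes the record.
        self._last_seen[peer] = now

    def expire(self, now: float) -> List[Peer]:
        # Remove peers that have been silent for longer than the timeout.
        stale = [p for p, t in self._last_seen.items() if now - t > self.expiry_s]
        for p in stale:
            del self._last_seen[p]
        return stale

    def has_link(self, peer: Peer) -> bool:
        return peer in self._last_seen

    def link_count(self) -> int:
        # The link count is simply the number of records currently held.
        return len(self._last_seen)
```

Deriving the count from the table size avoids keeping a separate counter that could drift out of step with the records.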
For ease of description, the following parameters are defined for the three roles involved in the scheme (client, primary server, and standby server):
1. Client:
Whether the communication link with the primary server is normal (determined by checking whether the link mapping table contains a record for the primary server)
Whether the communication link with the standby server is normal (determined by checking whether the link mapping table contains a record for the standby server)
2. Primary server:
Number of links (the number of records currently in the link mapping table)
Whether the communication link with the standby server is normal (determined by checking whether the link mapping table contains a record for the standby server)
Service state (active or inactive)
3. Standby server:
Number of links (the number of records currently in the link mapping table)
Whether the communication link with the primary server is normal (determined by checking whether the link mapping table contains a record for the primary server)
Service state (active or inactive)
The client sends service request messages to the primary (or standby) server, and the primary (or standby) server returns a response message.
The primary server sends state switch request messages to the standby server, and the standby server returns a response message.
Both of these response message formats should include an error code; for example, the response message format may be: error code + response message content. The error code is mainly used to determine whether the requested operation was processed successfully and whether the request needs to be resent.
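One possible encoding of the "error code + response message content" format is sketched below in Python. The text does not define a wire format, so the 2-byte big-endian code field, the enum values, and the function names are all assumptions chosen for illustration.

```python
import struct
from enum import IntEnum
from typing import Tuple


class ErrorCode(IntEnum):
    OK = 0                    # hypothetical values, not from the source
    SERVICE_UNAVAILABLE = 1   # server is in the inactive state
    SWITCH_REFUSED = 2        # standby does not yet agree to a state switch


def encode_response(code: ErrorCode, content: bytes) -> bytes:
    # Assumed layout: 2-byte big-endian error code, then the content.
    return struct.pack("!H", int(code)) + content


def decode_response(data: bytes) -> Tuple[ErrorCode, bytes]:
    (raw,) = struct.unpack_from("!H", data)
    return ErrorCode(raw), data[2:]


def needs_resend(code: ErrorCode) -> bool:
    # These codes mean the request was not processed and should be repeated.
    return code in (ErrorCode.SERVICE_UNAVAILABLE, ErrorCode.SWITCH_REFUSED)
```

Placing the code at a fixed offset lets the receiver decide whether to resend before it parses any of the content.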
For communication between the primary server and the standby server, it is agreed that one side acts as the communication client and initiates the connection request to the other. Here it is assumed that the primary server initiates the connection request to the standby server, and that there is only one communication link between the primary server and the standby server.
Step 1: The primary server and the standby server are started separately. Both are initially in the inactive service state, and neither can provide service externally. The primary server first initiates a connection request to the standby server. Once the link is established, the primary server sends a state switch request message indicating that it requests to switch to the active state. Because the standby server is itself in the inactive state at this point, it considers that the primary server may switch to the active state immediately and replies with a response agreeing to the switch. On receiving the response, the primary server sets its own service state to active and begins providing service externally.
Step 2: The client sends a specific service message to the primary or standby server and receives a response message. The response message includes an error code that indicates whether the request message was actually processed.
Client access follows this principle: if the link with the primary server is normal, the client sends the request message to the primary server; otherwise it sends it to the standby server. When the primary or standby server receives a client request while its service state is inactive, it replies to the client with a service-unavailable error code. Unless the client's links to both the primary and standby servers are broken, the client must keep retrying the request message until it receives some other error code, which indicates that the request message has been processed successfully; the processing result of the specific service request can then be parsed from the response message. The retry logic can be encapsulated in an API for upper-layer applications to call, so that upper-layer applications need not be concerned with communication details such as retries.
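The retry API mentioned above might look roughly like the following Python sketch. The function names, the `links_up` and `send` callables, and the string error-code label are hypothetical; the `max_attempts` bound is also an assumption added so the sketch terminates, whereas the text describes retrying for as long as at least one link is up.

```python
from typing import Callable, Optional, Tuple

SERVICE_UNAVAILABLE = "SERVICE_UNAVAILABLE"  # hypothetical error-code label

Reply = Tuple[str, str]  # (error code, response content)


def choose_target(primary_up: bool, standby_up: bool) -> Optional[str]:
    # Prefer the primary server; fall back to the standby server.
    if primary_up:
        return "primary"
    if standby_up:
        return "standby"
    return None


def call_with_retry(
    request: str,
    links_up: Callable[[], Tuple[bool, bool]],
    send: Callable[[str, str], Reply],
    max_attempts: int = 100,
) -> Reply:
    for _ in range(max_attempts):
        # Re-evaluate link state on every attempt, since it can change.
        target = choose_target(*links_up())
        if target is None:
            raise ConnectionError("links to both servers are down")
        code, content = send(target, request)
        if code != SERVICE_UNAVAILABLE:
            return code, content  # the request was actually processed
    raise TimeoutError("service still unavailable after max_attempts")
```

Because the target is chosen again before each attempt, a client that started out talking to the standby server moves to the primary server as soon as that link comes back.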
Step 3: If the primary server fails and goes offline, its links with the clients and the standby server are broken. As soon as the standby server detects that the link with the primary server is broken, it starts a (configurable) waiting time and waits for the link with the primary server to recover. If the link recovers within this time, the standby server again receives the primary server's state switch request message and directly replies with consent, and the whole system returns to its original state. If the time elapses and the link with the primary server has still not recovered, the standby server sets its own state to active, completing the fault takeover. During this process, a client that detects that its link to the primary server is unavailable can only send requests to the standby server. Until it switches to the active state, the standby server keeps replying to the client with the service-unavailable error code; once it is active, it processes each service request and replies with a response containing some other error code (not service unavailable). If the primary server recovers during this period, the client instead sends its requests to the primary server until it receives a response message containing an error code other than service unavailable.
Step 4: Suppose the primary server has failed and gone offline, and the standby server has taken over and switched to the active state. If the primary server is then repaired and comes back online, it sends a state switch request message to the standby server. The standby server sets itself to the inactive state at this point, but it may still be processing clients' service requests and must wait for those existing requests to complete, so it cannot agree to the switch request immediately and replies that it does not agree. If a new service request arrives at the standby server during this time, the standby server replies that the service is unavailable. Only after all current service requests have been processed does the standby server reply to the primary server agreeing to its state switch request. After first receiving the standby server's response refusing the state switch, the primary server keeps resending the state switch request message until it receives the standby server's consent. During this process, clients must send new service requests to the primary server; if the error code received is service unavailable, the client must retry until it receives a response containing some other error code.
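The failback handshake in Step 4 can be sketched as follows, again in Python and only as an illustration. `StandbyServer`, `reclaim_primary_role`, the in-flight counter, and the `max_attempts` bound are hypothetical names and simplifications; the real exchange would go over the link between the two servers rather than a direct method call.

```python
from typing import Callable


class StandbyServer:
    def __init__(self, active: bool = False) -> None:
        self.active = active   # service state
        self.in_flight = 0     # client requests still being processed

    def accept_request(self) -> str:
        if not self.active:
            return "SERVICE_UNAVAILABLE"
        self.in_flight += 1
        return "ACCEPTED"

    def finish_request(self) -> None:
        self.in_flight -= 1

    def on_switch_request(self) -> bool:
        # Stop taking new work at once, but agree only when drained.
        self.active = False
        return self.in_flight == 0


def reclaim_primary_role(
    standby: StandbyServer,
    between_attempts: Callable[[], None],
    max_attempts: int = 100,
) -> bool:
    # Primary side: resend the state switch request until the standby agrees.
    for _ in range(max_attempts):
        if standby.on_switch_request():
            return True   # the primary may now set itself to active
        between_attempts()  # e.g. wait before resending
    return False
```

The same `on_switch_request` check also covers the start-up case in Step 1: a standby server that has never been active has nothing in flight and agrees immediately.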
An embodiment of the present invention further provides a computer storage medium storing computer-executable instructions for performing the above method.
FIG. 5 is a structural diagram of an apparatus for managing primary and standby nodes in a communication system according to an embodiment of the present invention. The apparatus shown in FIG. 5 includes:
a detection module 501, configured to detect whether the primary node is working normally;
a control module 502, configured to trigger execution of an active/standby switchover operation after it is detected that the primary node is not working normally.
The detection module 501 includes:
a first detection unit, configured to detect, over the link between the primary node and the standby node, whether a message from the primary node can be received;
a determining unit, configured to determine that the primary node has failed if no message from the primary node is received over the link.
The apparatus further includes:
a second detection unit, configured to, after it is detected that the primary node is not working normally and before the active/standby switchover operation is performed, continue to detect within a preset waiting time whether the primary node is working normally;
a switching module, configured to perform the active/standby switchover operation if the primary node does not resume normal operation within the waiting time.
The apparatus further includes:
a first sending unit, configured to forward the received service requests to the primary node if the primary node resumes normal operation within the waiting time.
The apparatus further includes:
a second sending unit, configured to, if a service request sent by a client is received within the waiting time, send the client a service response corresponding to the service request, where the service response includes information indicating that the service for processing the service request is currently unavailable.
In the apparatus embodiment provided by the present invention, the standby node performs fault detection and takeover between the primary and standby nodes without relying on third-party arbitration. This provides a new way of managing active/standby switchover and achieves the goal of providing a highly available service externally.
In addition, an embodiment of the present invention provides a high-availability cluster, including a first node and a second node that includes the apparatus shown in FIG. 5.
The first node is configured so that, after an active/standby switchover in which the second node has become the primary node, if the first node resumes operation, it notifies the second node by initiating a state switch request and, after receiving the second node's consent message, performs the operation by which the first node becomes the primary node.
In the embodiments provided by the present invention, the standby node performs fault detection and takeover between the primary and standby nodes without relying on third-party arbitration. This provides a new way of managing active/standby switchover and achieves the goal of providing a highly available service externally.
Those of ordinary skill in the art will understand that all or some of the steps of the above method may be carried out by a program instructing the relevant hardware (for example, a processor), and that the program may be stored in a computer-readable storage medium such as a read-only memory, a magnetic disk, or an optical disc. Optionally, all or some of the steps of the above embodiments may also be implemented using one or more integrated circuits. Accordingly, each module or unit in the above embodiments may be implemented in hardware, for example by an integrated circuit that realizes its function, or as a software functional module, for example by a processor executing programs or instructions stored in a memory. The present invention is not limited to any particular combination of hardware and software.
Those of ordinary skill in the art should understand that modifications or equivalent substitutions made to the technical solutions of the present invention without departing from their spirit and scope shall all fall within the scope of the claims of the present invention.
With the above technical solution, the standby node can perform fault detection and takeover between the primary and standby nodes without relying on third-party arbitration. This provides a new way of managing active/standby switchover and achieves the goal of providing a highly available service externally.
Claims (12)
- A method for managing primary and standby nodes in a communication system, comprising: a standby node detecting whether a primary node is working normally; and after detecting that the primary node is not working normally, the standby node triggering execution of an active/standby switchover operation.
- The method according to claim 1, wherein the standby node detecting whether the primary node is working normally comprises: the standby node detecting, over a link between the primary node and the standby node, whether a heartbeat message from the primary node can be received; and if no heartbeat message from the primary node is received over the link, determining that the primary node has failed.
- The method according to claim 1, further comprising: after detecting that the primary node is not working normally and before performing the active/standby switchover operation, the standby node continuing, within a preset waiting time, to detect whether the primary node is working normally; and if the primary node does not resume normal operation within the waiting time, the standby node performing the active/standby switchover operation.
- The method according to claim 3, further comprising: if the primary node resumes normal operation within the waiting time, the standby node forwarding the received service requests to the primary node.
- The method according to claim 3, further comprising: if the standby node receives a service request sent by a client within the waiting time, the standby node sending the client a service response corresponding to the service request, wherein the service response includes information indicating that the service for processing the service request is currently unavailable.
- An apparatus for managing primary and standby nodes in a communication system, comprising: a detection module, configured to detect whether a primary node is working normally; and a control module, configured to trigger execution of an active/standby switchover operation after it is detected that the primary node is not working normally.
- The apparatus according to claim 6, wherein the detection module comprises: a first detection unit, configured to detect, over a link between the primary node and the standby node, whether a heartbeat message from the primary node can be received; and a determining unit, configured to determine that the primary node has failed if no heartbeat message from the primary node is received over the link.
- The apparatus according to claim 6, further comprising: a second detection unit, configured to, after it is detected that the primary node is not working normally and before the active/standby switchover operation is performed, continue to detect within a preset waiting time whether the primary node is working normally; and a switching module, configured to perform the active/standby switchover operation if the primary node does not resume normal operation within the waiting time.
- The apparatus according to claim 8, further comprising: a first sending unit, configured to forward the received service requests to the primary node if the primary node resumes normal operation within the waiting time.
- The apparatus according to claim 8, further comprising: a second sending unit, configured to, if a service request sent by a client is received within the waiting time, send the client a service response corresponding to the service request, wherein the service response includes information indicating that the service for processing the service request is currently unavailable.
- A high-availability cluster, comprising a first node and a second node that includes the apparatus according to any one of claims 6 to 10.
- The high-availability cluster according to claim 11, wherein the first node is configured so that, after an active/standby switchover in which the second node has become the primary node, if the first node resumes operation, it notifies the second node by initiating a state switch request and, after receiving the second node's consent message, performs the operation by which the first node becomes the primary node.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510331124.2A CN106330475B (en) | 2015-06-15 | 2015-06-15 | Method and device for managing main and standby nodes in communication system and high-availability cluster |
CN201510331124.2 | 2015-06-15 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2016202051A1 true WO2016202051A1 (en) | 2016-12-22 |
Family
ID=57544964
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2016/078490 WO2016202051A1 (en) | 2015-06-15 | 2016-04-05 | Method and device for managing active and backup nodes in communication system and high-availability cluster |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN106330475B (en) |
WO (1) | WO2016202051A1 (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112787917A (en) * | 2019-11-11 | 2021-05-11 | 中兴通讯股份有限公司 | Protection method, end node, protection group network and storage medium for flexible Ethernet |
CN114257500A (en) * | 2021-12-24 | 2022-03-29 | 苏州浪潮智能科技有限公司 | Fault switching method, system and device for internal network of super-converged cluster |
CN114466391A (en) * | 2022-03-21 | 2022-05-10 | 中国电信股份有限公司 | Network element equipment state updating method and device, storage medium and electronic equipment |
CN116582618A (en) * | 2023-07-13 | 2023-08-11 | 天津金城银行股份有限公司 | Method and device for realizing high availability of electric pin, machine room management platform and computer |
Families Citing this family (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106911524B (en) * | 2017-04-27 | 2020-07-07 | 新华三信息技术有限公司 | HA implementation method and device |
CN107528724B (en) * | 2017-07-20 | 2020-09-29 | 奇安信科技集团股份有限公司 | Optimization processing method and device for node cluster |
CN109428740B (en) * | 2017-08-21 | 2020-09-08 | 华为技术有限公司 | Method and device for recovering equipment failure |
CN108023775A (en) * | 2017-12-07 | 2018-05-11 | 湖北三新文化传媒有限公司 | High-availability cluster architecture system and method |
CN108023891A (en) * | 2017-12-12 | 2018-05-11 | 北京安博通科技股份有限公司 | A kind of tunnel switching method based on IPSEC, device and gateway |
CN109101367A (en) * | 2018-08-15 | 2018-12-28 | 郑州云海信息技术有限公司 | The management method and device of component in cloud computing system |
CN109344015B (en) * | 2018-10-10 | 2022-05-24 | 武汉达梦数据库股份有限公司 | Method and system for preventing double main nodes by using HA (home agent) for database service |
CN110300023A (en) * | 2019-06-28 | 2019-10-01 | 上海智臻智能网络科技股份有限公司 | A kind of state switching method, device, node, node group and storage medium |
CN115134219A (en) * | 2022-06-29 | 2022-09-30 | 北京飞讯数码科技有限公司 | Device resource management method and device, computing device and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20040034807A1 (en) * | 2002-08-14 | 2004-02-19 | Gnp Computers, Inc. | Roving servers in a clustered telecommunication distributed computer system |
CN101039172A (en) * | 2007-05-15 | 2007-09-19 | 华为技术有限公司 | Ethernet ring network system and its protection method and standby host node |
CN101179432A (en) * | 2007-12-13 | 2008-05-14 | 浪潮电子信息产业股份有限公司 | Method of implementing high availability of system in multi-machine surroundings |
CN101335702A (en) * | 2008-07-07 | 2008-12-31 | 中兴通讯股份有限公司 | Disaster recovery method of serving GPRS support node |
CN102118309A (en) * | 2010-12-31 | 2011-07-06 | 中国科学院计算技术研究所 | Method and system for double-machine hot backup |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2015023458A (en) * | 2013-07-19 | 2015-02-02 | 富士通株式会社 | Communication system, redundancy control method in communication system, and transmission device |
CN103490969B (en) * | 2013-09-17 | 2016-07-06 | 烽火通信科技股份有限公司 | System and method for realizing fast convergence of VPWS redundancy protection |
- 2015-06-15: CN application CN201510331124.2A, patent CN106330475B, status Active
- 2016-04-05: WO application PCT/CN2016/078490, publication WO2016202051A1, status Application Filing
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112787917A (en) * | 2019-11-11 | 2021-05-11 | 中兴通讯股份有限公司 | Protection method, end node, protection group network and storage medium for flexible Ethernet |
CN114257500A (en) * | 2021-12-24 | 2022-03-29 | 苏州浪潮智能科技有限公司 | Fault switching method, system and device for internal network of super-converged cluster |
CN114257500B (en) * | 2021-12-24 | 2023-06-09 | 苏州浪潮智能科技有限公司 | Fault switching method, system and device for super-fusion cluster internal network |
CN114466391A (en) * | 2022-03-21 | 2022-05-10 | 中国电信股份有限公司 | Network element equipment state updating method and device, storage medium and electronic equipment |
CN116582618A (en) * | 2023-07-13 | 2023-08-11 | 天津金城银行股份有限公司 | Method and device for realizing high availability of electric pin, machine room management platform and computer |
CN116582618B (en) * | 2023-07-13 | 2023-10-10 | 天津金城银行股份有限公司 | Method and device for realizing high availability of electric pin, machine room management platform and computer |
Also Published As
Publication number | Publication date |
---|---|
CN106330475B (en) | 2020-12-04 |
CN106330475A (en) | 2017-01-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2016202051A1 (en) | Method and device for managing active and backup nodes in communication system and high-availability cluster | |
US11163653B2 (en) | Storage cluster failure detection | |
US6952766B2 (en) | Automated node restart in clustered computer system | |
US6983324B1 (en) | Dynamic modification of cluster communication parameters in clustered computer system | |
US11330071B2 (en) | Inter-process communication fault detection and recovery system | |
JP5863942B2 (en) | Provision of witness service | |
US20080288812A1 (en) | Cluster system and an error recovery method thereof | |
JP2010045760A (en) | Connection recovery device for redundant system, method and processing program | |
WO2012097588A1 (en) | Data storage method, apparatus and system | |
US11889330B2 (en) | Methods and related devices for implementing disaster recovery | |
US20140359340A1 (en) | Subscriptions that indicate the presence of application servers | |
WO2017215430A1 (en) | Node management method in cluster and node device | |
WO2016107443A1 (en) | Snapshot processing method and related device | |
TW200920027A (en) | Intelligent failover in a load-balanced networking environment | |
WO2017071384A1 (en) | Message processing method and apparatus | |
AU2014321418A1 (en) | Email webclient notification queuing | |
CN109189854B (en) | Method and node equipment for providing continuous service | |
CN108200151B (en) | ISCSI Target load balancing method and device in distributed storage system | |
US20130185425A1 (en) | Method for Optimizing Network Performance After A Temporary Loss of Connection | |
CN110351122B (en) | Disaster recovery method, device, system and electronic equipment | |
CN110661599B (en) | HA implementation method, device and storage medium between main node and standby node | |
CN113596195B (en) | Public IP address management method, device, main node and storage medium | |
WO2021238579A1 (en) | Method for managing sata hard disk by means of storage system, and storage system | |
JP2009075710A (en) | Redundant system | |
JP2007141129A (en) | System switching method, computer system and program |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | 121 | Ep: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 16810792; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: PCT application non-entry in European phase | Ref document number: 16810792; Country of ref document: EP; Kind code of ref document: A1 |