WO2016202051A1

WO2016202051A1 - Method and device for managing active and backup nodes in communication system and high-availability cluster

Info

Publication number: WO2016202051A1
Application number: PCT/CN2016/078490
Authority: WO
Inventors: 白涛; 陈河堆
Original assignee: 中兴通讯股份有限公司
Priority date: 2015-06-15
Filing date: 2016-04-05
Publication date: 2016-12-22
Also published as: CN106330475B; CN106330475A

Abstract

A method and device for managing active and backup nodes in a communication system, and a high-availability cluster. The method comprises: detecting, by a backup node, whether an active node is operating normally; and, upon detecting that the active node is not operating normally, triggering, by the backup node, execution of an active-backup switching operation.

Description

Method and device for managing active and standby nodes in communication system and high availability cluster

Technical field

This document relates to, but is not limited to, the field of communications, and in particular, to a method and apparatus for managing active and standby nodes and a highly available cluster in a communication system.

Background art

In large commercial software systems, stable operation usually requires that the failure of one server must not interrupt the entire service; that is, a single point of failure should be avoided. Typically, a service is provided by two servers, and under normal conditions the primary server provides the service. When the primary server fails, the secondary server (also called the standby server) takes over and continues to provide the service. Fault detection and takeover between the primary and secondary servers is technically difficult. At present, the typical method relies on third-party arbitration: the primary and secondary servers regularly report their status to an arbitrator, and the arbitrator determines whether the conditions for triggering the fault takeover process have been met. In practical applications, however, if the arbitrator itself fails, the fault takeover cannot be completed normally. Therefore, a new active/standby management mechanism is needed to manage the primary and secondary servers.

Summary of the invention

The following is an overview of the topics detailed in this document. This Summary is not intended to limit the scope of the claims.

The embodiments of the present invention provide a method and device for managing active and standby nodes in a communication system, and a high-availability cluster, which can provide a new management mode for active/standby switchover.

The embodiments of the present invention provide the following technical solutions:

A method for managing active and standby nodes in a communication system, comprising:

The standby node detects whether the active node is working normally;

After detecting that the primary node is not working normally, the standby node triggers execution of the active/standby switching operation.

The standby node detecting whether the active node is working normally includes:

Through a link between the primary node and the standby node, the standby node detects whether a heartbeat message from the primary node can be received;

If the heartbeat message from the primary node is not received through the link, it is determined that the primary node has failed.

The method further includes:

After detecting that the active node is not working normally, before performing the active/standby switching operation, the standby node continues to detect whether the active node is working normally within a preset waiting time;

If the primary node does not resume normal operation during the waiting time, the standby node performs an active/standby switching operation.

The method further includes:

If the primary node resumes normal operation during the waiting time, the standby node forwards the received service request to the primary node.

The method further includes:

If the standby node receives a service request sent by a client, the standby node sends a service response corresponding to the service request to the client, where the service response includes information that the service requested by the service request is currently unavailable.

A device for managing active and standby nodes in a communication system, comprising:

The detection module is configured to detect whether the active node is working normally;

The control module is configured to trigger execution of the active/standby switching operation after detecting that the active node is not working normally.

The detection module includes:

a first detecting unit, configured to detect whether a heartbeat message from the active node can be received through a link between the primary node and the standby node;

A determining unit is configured to determine that the primary node has failed if a heartbeat message from the primary node is not received through the link.

Wherein, the device further comprises:

The second detecting unit is configured to: after detecting that the active node is not working normally, and before the active/standby switching operation is performed, continue to detect whether the active node is working normally within a preset waiting time;

The switching module is configured to perform an active/standby switching operation if the primary node does not resume normal operation within the waiting time.

Wherein, the device further comprises:

The first sending unit is configured to forward the received service request to the active node if the primary node resumes normal operation within the waiting time.

Wherein, the device further comprises:

a second sending unit, configured to: if a service request sent by a client is received during the waiting time, send a service response corresponding to the service request to the client, where the service response includes information that the requested service is currently unavailable.

A high-availability cluster comprising a first node and a second node, each comprising any of the above devices.

The first node is further configured to: if the first node resumes operation after the second node has become the active node through an active/standby switchover, send a state switching request to the second node, and, after receiving a consent message from the second node, perform the operation by which the first node becomes the active node.

The embodiments provided by the present invention complete fault detection and takeover between the active and standby nodes by means of the standby node, without relying on third-party arbitration, and provide a new management mode for active/standby switchover, thereby achieving the purpose of providing high-availability services externally.

Other aspects will be apparent upon reading and understanding the drawings and detailed description.

Brief description of the drawings

FIG. 1 is a flowchart of a method for managing active and standby nodes in a communication system according to an embodiment of the present invention;

FIG. 2 is a flowchart of the method for managing active and standby nodes as performed by a client according to an embodiment of the present invention;

FIG. 3 is a flowchart of the method for managing active and standby nodes as performed by a primary server according to an embodiment of the present invention;

FIG. 4 is a flowchart of the method for managing active and standby nodes as performed by a standby server according to an embodiment of the present invention;

FIG. 5 is a structural diagram of an apparatus for managing active and standby nodes in a communication system according to an embodiment of the present invention.

Embodiments of the invention

The invention will be further described in detail below with reference to the drawings and specific embodiments. It should be noted that, in the case of no conflict, the features in the embodiments and the embodiments in the present application may be arbitrarily combined with each other.

FIG. 1 is a flowchart of a method for managing active and standby nodes in a communication system according to an embodiment of the present invention. The method shown in Figure 1 includes:

Step 101: The standby node detects whether the active node is working normally.

Step 102: After detecting that the active node is not working normally, the standby node triggers execution of an active/standby switchover operation.

The method provided by the present invention completes fault detection and takeover between the active and standby nodes by means of the standby node, without relying on third-party arbitration, and provides a new management mode for active/standby switchover, thereby achieving the purpose of providing high-availability services externally.

The method embodiments provided by the present invention are further described below:

After the active node and the standby node are started, their initial service status is inactive and they cannot provide services externally. The active node may actively initiate a connection request to the standby node. After the link is successfully established, the active node sends a status switch request message, indicating that it requests to switch to the active state. Because the standby node is also in the inactive state at this time, it considers that the active node can switch to the active state immediately, and replies with a response agreeing to the switch. After receiving the response, the active node sets its own service state to active and starts to provide services externally.

The message from the active node may be sent by the active node or may be a response message to the message sent by the standby node.

It can be seen from the above that the link between the active and standby nodes is used to detect whether the active node is working normally, and the implementation is simple and convenient.
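The link-based check can be illustrated with a minimal Python sketch. It assumes a timeout rule for deciding that heartbeats have stopped; the class name `HeartbeatMonitor`, the `timeout` parameter, and the explicit `now` timestamps are hypothetical and do not come from the source.

```python
class HeartbeatMonitor:
    """Tracks the last message seen from the active node (hypothetical helper)."""

    def __init__(self, timeout: float, now: float = 0.0):
        self.timeout = timeout        # assumption: seconds of silence tolerated
        self.last_heartbeat = now     # time the last message was received

    def on_heartbeat(self, now: float) -> None:
        # Called whenever a heartbeat (or a response to our own message)
        # arrives over the link from the active node.
        self.last_heartbeat = now

    def active_is_alive(self, now: float) -> bool:
        # The active node is presumed failed once no message has arrived
        # for longer than the timeout.
        return (now - self.last_heartbeat) <= self.timeout


monitor = HeartbeatMonitor(timeout=3.0, now=0.0)
monitor.on_heartbeat(now=1.0)
alive_at_3 = monitor.active_is_alive(now=3.0)   # 2 s of silence
alive_at_5 = monitor.active_is_alive(now=5.0)   # 4 s of silence
```

Passing the time in explicitly keeps the sketch deterministic; a real node would read a monotonic clock instead.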

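The method further includes: after detecting that the active node is not working normally, and before performing the active/standby switching operation, the standby node continues to detect whether the active node is working normally within a preset waiting time; if the active node does not resume normal operation within the waiting time, the standby node performs the active/standby switching operation.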

In practical applications, a node may well suffer a short-term fault while working. If the fault can be resolved quickly, there is no need to initiate an active/standby switchover, which avoids migrating service processing and delaying the progress of the service. Therefore, by setting a waiting time, the primary node is given a period of time to resolve its own fault, which reduces the likelihood that service processing is migrated, safeguards the progress of data processing, and improves the stability of the system.
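The waiting-time rule can be sketched as a small decision function that the standby node evaluates periodically after it has detected a fault. This is an illustrative Python sketch under stated assumptions; the names `Decision` and `decide` and the explicit timestamps are hypothetical.

```python
from enum import Enum


class Decision(Enum):
    KEEP_WAITING = "keep_waiting"   # still inside the waiting time
    STAY_STANDBY = "stay_standby"   # active node recovered, no switchover
    SWITCH_OVER = "switch_over"     # waiting time elapsed without recovery


def decide(failure_detected_at: float, now: float,
           wait_time: float, active_recovered: bool) -> Decision:
    """Decide what the standby node does after detecting a fault."""
    if active_recovered:
        return Decision.STAY_STANDBY
    if now - failure_detected_at >= wait_time:
        return Decision.SWITCH_OVER
    return Decision.KEEP_WAITING
```

For example, with a fault detected at time 10 and a waiting time of 5, the function keeps waiting at time 12 and switches over at time 15 if the active node has not recovered.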

In addition, if the standby node receives a service request sent by a client, the standby node sends a service response corresponding to the service request to the client, where the service response includes information that the service used to process the service request is currently unavailable.

Notifying the client that initiated the service request that the service is currently unavailable lets the client know the processing capability of the node and provides a basis for the client's subsequent operations.

Of course, if the primary node resumes normal operation within the waiting time, the standby node forwards the received service request to the primary node.

The following description takes a server as an example of a node:

Before the method provided by the embodiment of the present invention is described, the application scenario of the method of the present invention is first described briefly:

A network communication system includes a primary server, a standby server, and one or more clients. Each client has a communication link with the primary server and with the standby server, and there is a communication link between the primary server and the standby server. The primary server communicates externally through a physical network interface; its external communication includes communication with the standby server and with the one or more clients, and the primary server has a unique IP address. The standby server likewise communicates externally through a physical network interface; its external communication includes communication with the primary server and with the one or more clients, and the standby server also has a unique IP address, different from that of the primary server. Therefore, if the primary server fails and goes offline, its communication links with the standby server and all clients are disconnected. If the standby server fails and goes offline, its communication links with the primary server and all clients are disconnected.

FIGS. 2 to 4 are flowcharts of the method for managing active and standby nodes as performed by the client, the primary server, and the standby server, respectively. FIGS. 2 to 4 are described as follows:

The fault detection and takeover between the primary and secondary servers depends on the calculation of the current number of external links and the determination of the existence of the peer server link.

Link mapping table: used to save information on all external communication links of the current host. The key can be an identifier that uniquely identifies the communication peer, such as the peer's IP address plus port. The value is the time at which the last heartbeat or heartbeat response message was received.

Calculation of the number of links:

In the communication, the communication client periodically sends a heartbeat message to the communication server, and then the communication server returns a heartbeat response message to the communication client. After receiving the heartbeat message, the communication server considers that a link has been established, adds a record in the link mapping table, and increases the number of links on the communication server by one. At the same time, after receiving the heartbeat response message, the communication client also considers that a link has been successfully established, and adds a record in the link mapping table, and the number of communication client links increases by one.

If the link is interrupted, the communication server no longer receives heartbeat messages from that communication client. After a certain period of time (configurable), the communication server considers the link disconnected, removes the record from its link mapping table, and reduces its number of links by one. Similarly, the communication client no longer receives heartbeat response messages from the communication server. After a certain period of time (configurable), the communication client considers the link disconnected, removes the record from its link mapping table, and reduces its number of links by one.
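The link mapping table and the link count can be sketched in Python as follows. The class name `LinkTable`, the `expiry` parameter, and the peer strings are illustrative assumptions; the source only specifies that the key identifies the peer and the value is the time of the last heartbeat or heartbeat response.

```python
class LinkTable:
    """Link mapping table: peer identifier -> time of last heartbeat message."""

    def __init__(self, expiry: float):
        self.expiry = expiry   # assumption: configurable timeout, in seconds
        self.links = {}        # e.g. "10.0.0.1:9000" -> last message time

    def on_message(self, peer: str, now: float) -> None:
        # A heartbeat or heartbeat response creates or refreshes the record.
        self.links[peer] = now

    def expire(self, now: float) -> None:
        # Remove records whose last message is older than the expiry time.
        stale = [p for p, t in self.links.items() if now - t > self.expiry]
        for peer in stale:
            del self.links[peer]

    def has_link(self, peer: str) -> bool:
        return peer in self.links

    @property
    def link_count(self) -> int:
        # Number of links = number of current records in the table.
        return len(self.links)


table = LinkTable(expiry=5.0)
table.on_message("10.0.0.1:9000", now=0.0)
table.on_message("10.0.0.2:9000", now=4.0)
table.expire(now=6.0)   # first peer has been silent for 6 s, second for 2 s
```

After the call to `expire`, only the second peer remains, so the link count drops from two to one.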

For convenience of explanation: For the three roles involved in the solution, the client, the primary server, and the standby server respectively set the following parameters:

1. Client:

Whether the communication link with the primary server is normal (determined by checking whether the link mapping table has a record corresponding to the primary server)

Whether the communication link with the standby server is normal (determined by checking whether the link mapping table has a record corresponding to the standby server)

2. Primary server:

Number of links (number of current records in the link map)

Service status (active or inactive)

3. Standby server:

Number of links (number of current records in the link map)

Service status (active or inactive)

The client sends a service request message to the primary (standby) server, and the primary (standby) server returns a response message.

The primary server sends a status switch request message to the standby server, and the standby server returns a response message.

The above two response message formats should include an error code. For example, the response message format is: error code + response message content, and the error code is mainly used to determine whether the request operation is successfully processed, and whether the request needs to be resent.
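One possible encoding of the "error code + response message content" format is sketched below in Python. The 4-byte big-endian layout and the numeric values of `OK` and `SERVICE_UNAVAILABLE` are assumptions made for illustration; the source does not define a wire format or specific codes.

```python
import struct

OK = 0                    # hypothetical code: request was processed
SERVICE_UNAVAILABLE = 1   # hypothetical code: node inactive, resend later


def encode_response(error_code: int, content: bytes) -> bytes:
    # Assumed layout: 4-byte big-endian error code followed by the content.
    return struct.pack("!I", error_code) + content


def decode_response(message: bytes) -> tuple:
    (error_code,) = struct.unpack("!I", message[:4])
    return error_code, message[4:]


def needs_resend(error_code: int) -> bool:
    # The error code tells the requester whether to send the request again.
    return error_code == SERVICE_UNAVAILABLE
```

A receiver would call `decode_response` and then `needs_resend` on the decoded error code to decide whether the request has to be sent again.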

By convention, communication between the primary server and the standby server is established by the party acting as the communication client, which initiates a connection request to the other party. We assume that the primary server actively initiates the connection request to the standby server, and that there is only one communication link between the primary server and the standby server.

Step 1: Start the primary server and the standby server. Their initial service status is inactive, and they cannot provide services externally. The primary server initiates a connection request to the standby server. After the link is successfully established, the primary server sends a status switch request message, indicating that it requests to switch to the active state. Because the standby server is also inactive at this time, it considers that the primary server can switch to the active state immediately and replies with a response agreeing to the switch. After receiving the response, the primary server sets its own service state to active and starts to provide services externally.
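The startup handshake in Step 1 can be sketched in Python as below. The sketch models the two servers as in-process objects rather than networked peers and covers only the startup case, where the standby server is inactive and agrees at once; the class and method names are hypothetical.

```python
class Node:
    """A server with an active/inactive service status (illustrative only)."""

    def __init__(self, name: str):
        self.name = name
        self.active = False   # both servers start in the inactive state

    def on_switch_request(self) -> bool:
        # The peer asks to become active. At startup this node is inactive,
        # so it agrees immediately.
        return not self.active

    def request_active(self, peer: "Node") -> bool:
        # Send a status switch request; become active only on consent.
        if peer.on_switch_request():
            self.active = True
        return self.active


primary = Node("primary")
standby = Node("standby")
became_active = primary.request_active(standby)
```

After the exchange, the primary server is active and the standby server remains inactive.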

Step 2: The client sends a specific service message to the primary or secondary server, and receives a response message. The response message includes an error code, and the error code is used to identify whether the request message is actually processed.

Client access follows this principle: if the link to the primary server is normal, the request message is sent to the primary server; otherwise it is sent to the standby server. When the primary or standby server receives a client request while its service status is inactive, it replies to the client with the service-unavailable error code. Unless the client's links to both the primary and standby servers are disconnected, the client keeps retrying the request message until it receives another error code, which indicates that the request message has been successfully processed; the result of the specific service request can then be parsed from the response message. The retry logic can be encapsulated in an API for upper-layer applications to call, so that the upper-layer application does not need to care about communication details such as retries.
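The retry logic that the client could wrap in an API might look like the following Python sketch. The function name `send_with_retry`, the callback parameters, the value of `SERVICE_UNAVAILABLE`, and the `max_attempts` bound are all assumptions; the source describes retrying continually and does not specify a limit or a delay between attempts.

```python
SERVICE_UNAVAILABLE = 1   # hypothetical error code value


def send_with_retry(send, primary_up, standby_up, max_attempts=10):
    """Send a request, preferring the primary server, and retry while the
    reply is "service unavailable".

    send(target) -> (error_code, content)
    primary_up() / standby_up() -> bool, e.g. link mapping table lookups
    """
    for _ in range(max_attempts):
        if primary_up():
            target = "primary"
        elif standby_up():
            target = "standby"
        else:
            raise ConnectionError("no link to either server")
        error_code, content = send(target)
        if error_code != SERVICE_UNAVAILABLE:
            return target, error_code, content
    raise TimeoutError("service still unavailable")


# Simulated standby server that becomes active on the third attempt.
replies = iter([(SERVICE_UNAVAILABLE, b""), (SERVICE_UNAVAILABLE, b""), (0, b"done")])
result = send_with_retry(lambda target: next(replies),
                         primary_up=lambda: False,
                         standby_up=lambda: True)
```

A real client would normally pause between attempts; the pause is omitted here to keep the sketch short.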

Step 3: If the primary server fails and goes offline, its links to the clients and the standby server are disconnected. After the standby server detects that the link to the primary server is disconnected, it immediately starts a waiting time (configurable) and waits for the link to the primary server to recover. If the link is restored within this time, the standby server again receives the status switch request message from the primary server, agrees to it directly, and the entire system returns to its original state. If the waiting time elapses and the link to the primary server has still not been restored, the standby server sets its own state to active and completes the failover. During this process, the client first detects that its link to the primary server is unavailable and can only send requests to the standby server. Before switching its state to active, the standby server always replies to the client with the service-unavailable error code. After it becomes active, it processes each service request and returns a response containing another error code (not service unavailable). If the primary server recovers during the waiting time, the client sends its requests to the primary server again, until it receives a response message containing an error code other than service unavailable.
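The standby server's behaviour in Step 3 can be sketched in Python as a small state holder. The class `StandbyServer`, its method names, the error code values, and the explicit `now` timestamps are hypothetical; real service processing is replaced by echoing the payload.

```python
OK = 0                    # hypothetical error code values
SERVICE_UNAVAILABLE = 1


class StandbyServer:
    def __init__(self, wait_time: float):
        self.wait_time = wait_time    # configurable waiting time
        self.active = False
        self.link_lost_at = None      # None while the primary link is up

    def on_primary_link_lost(self, now: float) -> None:
        if self.link_lost_at is None:
            self.link_lost_at = now   # start the waiting time

    def on_primary_link_restored(self) -> None:
        # Primary came back within the waiting time: cancel the takeover.
        if not self.active:
            self.link_lost_at = None

    def tick(self, now: float) -> None:
        # Called periodically; take over once the waiting time has elapsed.
        if (not self.active and self.link_lost_at is not None
                and now - self.link_lost_at >= self.wait_time):
            self.active = True

    def handle_request(self, payload: bytes) -> tuple:
        if not self.active:
            return SERVICE_UNAVAILABLE, b""
        return OK, payload            # placeholder for real processing


server = StandbyServer(wait_time=5.0)
server.on_primary_link_lost(now=0.0)
server.tick(now=3.0)
before = server.handle_request(b"x")   # still inside the waiting time
server.tick(now=5.0)
after = server.handle_request(b"x")    # failover completed
```

Requests arriving during the waiting time are answered with the service-unavailable code, and the same request succeeds once the standby server has become active.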

Step 4: Suppose the primary server has failed and gone offline, and the standby server has switched to the active state. If the primary server is then repaired and comes back online, it sends a status switch request message to the standby server. At this time the standby server becomes inactive, but it may still be processing client service requests, and this existing request processing must be completed, so it cannot agree to the switch request immediately and replies that it does not agree. Any new service request sent to the standby server in the meantime receives a service-unavailable reply. After all current service requests have been processed, the standby server replies to the primary server that it agrees to the state switching request. When the primary server first receives a response from the standby server that does not agree to its state switch, it keeps resending the status switch request message until it receives the consent response from the standby server. During this process the client sends new service requests to the primary server; if it receives the service-unavailable error code, it retries until it receives a response containing another error code.
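The hand-back in Step 4 can be sketched in Python with an in-flight request counter. This is one possible reading of the step, in which the standby server stops accepting new requests as soon as it receives the switch request and agrees only once outstanding work is finished; the class and method names are hypothetical.

```python
class ActiveStandby:
    """Standby server that took over and must now hand service back."""

    def __init__(self):
        self.active = True
        self.in_flight = 0    # requests accepted but not yet finished

    def begin_request(self) -> bool:
        if not self.active:
            return False      # caller replies "service unavailable"
        self.in_flight += 1
        return True

    def finish_request(self) -> None:
        self.in_flight -= 1

    def on_switch_request(self) -> bool:
        # Stop accepting new work, then agree only when nothing is in flight.
        self.active = False
        return self.in_flight == 0


standby = ActiveStandby()
standby.begin_request()                      # one request being processed
first_reply = standby.on_switch_request()    # primary asks: not yet
new_request_ok = standby.begin_request()     # new work is refused
standby.finish_request()                     # outstanding work completes
second_reply = standby.on_switch_request()   # primary resends: agreed
```

The primary server's repeated resending corresponds to calling `on_switch_request` until it returns `True`.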

The embodiment of the invention further provides a computer storage medium, wherein the computer storage medium stores computer executable instructions, and the computer executable instructions are used to execute the above method.

FIG. 5 is a structural diagram of an apparatus for managing active and standby nodes in a communication system according to an embodiment of the present invention. The device shown in Figure 5 includes:

The detecting module 501 is configured to detect whether the active node is working normally;

The control module 502 is configured to trigger execution of the active/standby switching operation after detecting that the active node is not working normally.

The detecting module 501 includes:

a first detecting unit configured to detect whether a message from the active node can be received through a link between the primary node and the standby node;

A determining unit is configured to determine that the primary node has failed if a message from the primary node is not received over the link.

Wherein, the device further comprises:

The device embodiment provided by the present invention completes fault detection and takeover between the active and standby nodes by means of the standby node, without relying on third-party arbitration, and provides a new management mode for active/standby switchover, thereby achieving the purpose of providing high-availability services externally.

In addition, an embodiment of the present invention provides a high availability cluster, including a first node and a second node including the apparatus shown in FIG. 5.

The first node is configured to: if the first node resumes operation after the second node has switched to become the primary node, send a state switching request to the second node, and, after receiving a consent message from the second node, perform the operation by which the first node becomes the primary node.

One of ordinary skill in the art will appreciate that all or part of the above steps may be performed by a program instructing related hardware, such as a processor; the program may be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk, or an optical disk. Alternatively, all or part of the steps of the above embodiments may be implemented using one or more integrated circuits. Correspondingly, each module or unit in the above embodiments may be implemented in the form of hardware, for example by an integrated circuit that implements its corresponding function, or in the form of a software function module, for example by a processor executing a program or instructions stored in a memory to implement its corresponding function. The invention is not limited to any specific combination of hardware and software.

It should be understood by those skilled in the art that the present invention may be modified or equivalently substituted without departing from the spirit and scope of the invention.

Industrial applicability

The foregoing technical solution can implement fault detection and takeover between the active and standby nodes by the standby node without relying on the third-party arbitration, and provide a new management mode of the active/standby switchover, thereby achieving the purpose of providing high-availability services externally.

Claims

1. A method for managing active and standby nodes in a communication system, comprising:

The standby node detects whether the active node is working normally;

After detecting that the primary node is not working normally, the standby node triggers execution of the active/standby switching operation.

2. The method of claim 1, wherein the standby node detecting whether the active node is working normally includes:

Through a link between the primary node and the standby node, the standby node detects whether a heartbeat message from the primary node can be received;

If the heartbeat message from the primary node is not received through the link, it is determined that the primary node has failed.

3. The method of claim 1, further comprising:

After detecting that the active node is not working normally, and before performing the active/standby switching operation, the standby node continues to detect whether the active node is working normally within a preset waiting time;

If the primary node does not resume normal operation during the waiting time, the standby node performs an active/standby switching operation.

4. The method of claim 3, further comprising:

If the primary node resumes normal operation during the waiting time, the standby node forwards the received service request to the primary node.

5. The method of claim 3, further comprising:

If the standby node receives a service request sent by a client, the standby node sends a service response corresponding to the service request to the client, where the service response includes information that the service requested by the service request is currently unavailable.

6. A device for managing active and standby nodes in a communication system, comprising:

The detection module is configured to detect whether the active node is working normally;

The control module is configured to trigger execution of the active/standby switching operation after detecting that the active node is not working normally.

7. The apparatus of claim 6, wherein the detection module comprises:

a first detecting unit, configured to detect whether a heartbeat message from the active node can be received through a link between the primary node and the standby node;

A determining unit is configured to determine that the primary node has failed if a heartbeat message from the primary node is not received through the link.

8. The apparatus of claim 6, further comprising:

The second detecting unit is configured to: after detecting that the active node is not working normally, and before the active/standby switching operation is performed, continue to detect whether the active node is working normally within a preset waiting time;

The switching module is configured to perform an active/standby switching operation if the primary node does not resume normal operation within the waiting time.

9. The apparatus of claim 8, further comprising:

The first sending unit is configured to forward the received service request to the active node if the primary node resumes normal operation within the waiting time.

10. The apparatus of claim 8, further comprising:

a second sending unit, configured to: if a service request sent by a client is received during the waiting time, send a service response corresponding to the service request to the client, where the service response includes information that the requested service is currently unavailable.

11. A high-availability cluster comprising a first node and a second node, each comprising the apparatus of any of claims 6-10.

12. The high-availability cluster according to claim 11, wherein the first node is configured to: if the first node resumes operation after the second node has switched to become the primary node, send a state switching request to the second node, and, after receiving a consent message from the second node, perform the operation by which the first node becomes the primary node.