CN110661599B

CN110661599B - HA implementation method, device and storage medium between main node and standby node

Info

Publication number: CN110661599B
Application number: CN201810687830.4A
Authority: CN
Inventors: 朱骏
Original assignee: ZTE Corp
Current assignee: ZTE Corp
Priority date: 2018-06-28
Filing date: 2018-06-28
Publication date: 2022-04-29
Anticipated expiration: 2038-06-28
Also published as: CN110661599A

Abstract

The embodiments of the invention disclose a method, a device and a storage medium for implementing HA between a master node and a standby node, belonging to the technical field of communication. The method comprises: monitoring, by an arbitration node, the communication conditions between the arbitration node and the master node and standby node of a communication service, and between the master node and the standby node; analyzing, according to the monitoring result, whether the states of the master node and the standby node are valid; and performing coordinated management of the master node and the standby node of the communication service according to those states. With the embodiments of the invention, when the master node or the standby node becomes abnormal in a cloud network, fast master-standby switching can be achieved without relying on a hardware channel.

Description

HA implementation method, device and storage medium between main node and standby node

Technical Field

The embodiment of the invention relates to the technical field of communication, in particular to a High Availability (HA) realization method, device and storage medium for a main node and a standby node.

Background

To improve service reliability, communication devices usually adopt a master node and a standby node, with the service deployed on both. Normally only the service on the master node is active; when the master node or the service on it becomes abnormal, the standby node quickly switches to become the master and takes over the work of the original master node, so that the service is not interrupted.

On existing physical network devices (PNF), the master node and the standby node are usually hardware boards (physical CPUs), and a board going offline or becoming abnormal can be sensed quickly through a hardware channel. In a cloud network (VNF), however, the nodes are virtual machines or containers, and such a sensing channel is often unavailable when a node becomes abnormal.

Disclosure of Invention

In view of this, embodiments of the present invention provide an HA implementation method, apparatus and storage medium for a master node and a standby node, so as to solve the problem in the prior art that, when a node in a cloud network becomes abnormal, the abnormality often cannot be sensed through a hardware channel and HA therefore cannot be achieved.

The technical scheme adopted by the embodiment of the invention for solving the technical problems is as follows:

according to a first aspect of the embodiments of the present invention, a method for implementing an HA between a master node and a standby node is provided, including:

monitoring, by an arbitration node, the communication conditions between the arbitration node and the master node and standby node of a communication service, and between the master node and the standby node;

analyzing, according to the monitoring result, whether the states of the master node and the standby node are valid;

and performing coordination management on the master node and the standby node of the communication service according to the states of the master node and the standby node.

According to a second aspect of the embodiments of the present invention, there is provided an apparatus for implementing an HA between a master node and a standby node, the apparatus including: a memory, a processor and a computer program stored on the memory and executable on the processor, the computer program, when executed by the processor, implementing the steps of the method according to the first aspect.

According to a third aspect of embodiments of the present invention there is provided a storage medium storing one or more programs executable by one or more processors to perform the steps of the method according to the first aspect.

The method, the device and the storage medium for realizing the HA between the main node and the standby node of the embodiment of the invention judge the states of the main node and the standby node by monitoring the communication conditions between the arbitration node and the main node and the standby node of the communication service and between the main node and the standby node, and carry out coordination management on the main node and the standby node according to the states of the main node and the standby node, thereby realizing rapid main-standby switching without depending on a hardware channel when the main node or the standby node is abnormal in a cloud network.

Drawings

Fig. 1 is a flowchart of an HA implementation method between a master node and a standby node according to an embodiment of the present invention;

Fig. 2 is a logical structure diagram of HA between a master node and a standby node according to a first embodiment of the present invention;

Fig. 3 is a schematic diagram of switching between the master node and the standby node when the communication links of all nodes are normal in the first embodiment of the present invention;

Fig. 4 is a schematic diagram of operation when the communication link between the master node and the standby node is abnormal according to the first embodiment of the present invention;

Fig. 5 is a schematic diagram of operation when the communication links between the master node and both the standby node and the arbitration node are abnormal according to an embodiment of the present invention;

Fig. 6 is a schematic diagram of operation when the communication links between the standby node and both the master node and the arbitration node are abnormal according to an embodiment of the present invention;

Fig. 7 is a schematic diagram of operation when the communication links between every pair of nodes are abnormal according to an embodiment of the present invention;

Fig. 8 is a schematic diagram of the module structure of an HA implementation apparatus between a master node and a standby node according to a second embodiment of the present invention.

The implementation, functional features and advantages of the objects of the embodiments of the present invention will be further described with reference to the accompanying drawings.

Detailed Description

In order to make the technical problems, technical solutions and advantageous effects to be solved by the embodiments of the present invention clearer and more obvious, the embodiments of the present invention are described in further detail below with reference to the accompanying drawings and the embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the embodiments of the invention and are not limiting of the embodiments of the invention.

A first embodiment of the present invention provides a method for implementing HA between a master node and a standby node. Referring to Fig. 1, the method includes:

Step S101: an arbitration node monitors the communication conditions between the arbitration node and the master node and standby node of a communication service, and between the master node and the standby node;

Step S102: whether the states of the master node and the standby node are valid is analyzed according to the monitoring result;

Step S103: the master node and the standby node of the communication service are coordinated and managed according to the states of the master node and the standby node.

In practical applications, the HA logical structure of the master and standby nodes is shown in Fig. 2. To implement the logical structure of Fig. 2, a high-reliability auxiliary process is deployed in advance on a named node, so that the named node becomes the arbitration node. The named node is a virtual node in the network system; every node of the network system knows of its existence and can communicate with it. In this embodiment, the high-reliability auxiliary process may also be replaced by a thread or another execution entity; for convenience of description it is referred to as HAHelp. HAHelp assists the communication service in electing the master and standby nodes, switching the standby node to master, and similar functions. A high-reliability execution thread is then deployed on all nodes (which may include the arbitration node). This execution thread is usually located in the boot management or root process of the node, or may be a separate process; it is referred to as HAClient. HAClient monitors and manages the service state of its node.

The arbitration node uses HAHelp to elect the master and standby HAClients, and the nodes where those HAClients are located become the master and standby nodes. Specifically, HAHelp scans all candidate nodes, selects two of them as the master node and the standby node according to the running condition of each node within a given time limit, and records the locations of the master and standby nodes.

The arbitration node and the elected master and standby nodes monitor one another's communication in real time for keep-alive purposes; once the master node becomes abnormal, the standby node completes the standby-to-master switch and related actions with the assistance of the arbitration node and the partner node. Note that in this embodiment HA among the arbitration node, the master node and the standby node is implemented by HAHelp and HAClient.

In practical applications, the arbitration node's coordinated management of the master and standby nodes in this embodiment includes: when the state of the master node and/or the standby node is invalid, resetting the master node and/or the standby node, and notifying the standby node to switch to master or re-electing the master node and/or the standby node; and when the states of both the master node and the standby node are valid, keeping the master and standby nodes unchanged.
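The election step can be sketched as follows. This is only an illustration under stated assumptions, not the method of the embodiment: the `Candidate` type, the `health_score` metric, the `deadline_s` limit and the rule "pick the two healthiest nodes that answered in time" are all hypothetical, since the text only says that two nodes are chosen according to their running condition within a given time limit.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Candidate:
    """Hypothetical view of a candidate node as seen by HAHelp."""
    name: str
    health_score: float     # assumed metric: higher means healthier
    response_time_s: float  # how long the node took to answer the scan


def elect_master_and_standby(
    candidates: List[Candidate], deadline_s: float
) -> Optional[Tuple[str, str]]:
    """Pick (master, standby) among nodes that answered within deadline_s."""
    responders = [c for c in candidates if c.response_time_s <= deadline_s]
    if len(responders) < 2:
        return None  # not enough responsive nodes to form a pair
    # Healthiest first; ties are broken by name so the result is deterministic.
    ranked = sorted(responders, key=lambda c: (-c.health_score, c.name))
    return ranked[0].name, ranked[1].name


# HAHelp records the locations of the elected pair (here: a plain dict).
roles: Dict[str, str] = {}
pair = elect_master_and_standby(
    [Candidate("n1", 0.9, 0.2), Candidate("n2", 0.7, 0.1),
     Candidate("n3", 0.95, 5.0)],  # n3 misses the deadline
    deadline_s=1.0,
)
if pair is not None:
    roles["master"], roles["standby"] = pair
```

In this sketch `n3` is the healthiest node but is excluded because it did not answer within the time limit, so `n1` and `n2` are recorded as master and standby.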

In order to simplify the model, the embodiment is described by taking a pair of master and slave nodes as an example, but the method is also applicable to high-reliability management of a plurality of pairs of master and slave nodes, that is, one arbitration node can coordinate management of a plurality of pairs of master and slave nodes.

In a possible scheme, the step S102 of analyzing whether the states of the master node and the standby node are valid according to the monitoring result includes:

Step S1021: if the communication links between the arbitration node and the master node and the standby node are both normal, determining whether the states of the master node and the standby node are valid according to the running conditions of the master node and the standby node;

Step S1022: if the communication links between the arbitration node and the master node and the standby node are both abnormal, determining whether the states of the master node and the standby node are valid according to whether the communication link between the master node and the standby node is abnormal;

Step S1023: if at least one communication link between the arbitration node and the master node and the standby node is normal, determining whether the states of the master node and the standby node are valid according to the communication states of the arbitration node, the master node and the standby node.
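As a rough illustration, the choice among the three sub-steps can be written as a small classifier over the three link states. The `Branch` enum and `select_branch` function are hypothetical names, and the tie-break is one possible reading of the text: when every link is normal the running condition decides (S1021), when both arbitration links are down the master-standby link decides (S1022), and every other partial failure is handled from the communication states of the three nodes (S1023).

```python
from enum import Enum


class Branch(Enum):
    S1021 = "judge by running condition of master and standby"
    S1022 = "judge by the master-standby link"
    S1023 = "judge by the communication states of the three nodes"


def select_branch(arb_master_ok: bool, arb_standby_ok: bool,
                  master_standby_ok: bool) -> Branch:
    """Map the three monitored link states to the analysis sub-step."""
    if not arb_master_ok and not arb_standby_ok:
        # The arbitration node is cut off from both nodes.
        return Branch.S1022
    if arb_master_ok and arb_standby_ok and master_standby_ok:
        # Everything is connected: only service faults or resets matter.
        return Branch.S1021
    # At least one arbitration link works, but some link is down.
    return Branch.S1023
```

For example, `select_branch(True, True, False)` falls into S1023, which matches the scenario in which both nodes report the peer as unreachable while the arbitration node can still reach them.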

In a feasible scheme, step S1021 ("if the communication links between the arbitration node and the master node and the standby node are normal, determining whether the states of the master node and the standby node are valid according to the running conditions of the master node and the standby node") covers the following two scenarios:

first, if a service abnormality notification sent by the master node is received, or the master node is detected to have been reset, the state of the master node is confirmed to be invalid;

second, if a service abnormality notification sent by the standby node is received, or the standby node is detected to have been reset, the state of the standby node is confirmed to be invalid.

In one possible implementation, before step S1021 is performed, the method further includes:

a self-detectable abnormality occurs on the master node or the standby node, and the node sends a service abnormality notification to the arbitration node; or

an abnormality that the node cannot detect itself occurs on the master node or the standby node, and the cloud network forcibly resets that node.

In practical applications, if the links between the arbitration node and the master and standby nodes, and between the master node and the standby node, are all normal (as shown in Fig. 2), the current master HAClient remains valid as long as the master and standby HAClients can reach each other, even if HAHelp has crashed for a long time or is unreachable.

Referring to Fig. 3, if a detectable abnormality occurs on the master node (the HAClient itself is normal), the master HAClient detects the abnormality of the master service and actively notifies HAHelp to initiate a switchover; HAHelp resets the node where the master HAClient is located and then notifies the standby HAClient to switch to master. This switchover must be fast in order to meet the requirement of non-stop routing (NSR). If an undetectable abnormality occurs on the master node (the HAClient itself is abnormal), the cloud network quickly resets the node after detecting the abnormality, and HAHelp, on detecting that the master node has been reset, notifies the standby HAClient to switch to master.
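The two switchover paths of Fig. 3 can be sketched as an event handler on the HAHelp side. The `Event` enum, the function name and the action strings are hypothetical labels introduced only for illustration; the sketch assumes that in the undetectable case the cloud network has already reset the node before HAHelp reacts.

```python
from enum import Enum, auto
from typing import List


class Event(Enum):
    SERVICE_ABNORMAL_NOTICE = auto()  # master HAClient reported a service fault
    NODE_RESET_DETECTED = auto()      # cloud network already reset the master


def handle_master_event(event: Event) -> List[str]:
    """Return the ordered actions HAHelp takes for a master-side event."""
    actions: List[str] = []
    if event is Event.SERVICE_ABNORMAL_NOTICE:
        # Detectable fault: HAHelp itself resets the master's node first.
        actions.append("reset_master_node")
    # In both cases the standby HAClient is told to take over as master.
    actions.append("notify_standby_switch_to_master")
    return actions
```

Keeping the handler free of side effects (it only returns the list of actions) makes the ordering easy to check: the reset of the old master always precedes the promotion of the standby.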

In a possible scheme, step S1023 ("if at least one communication link between the arbitration node and the master node and the standby node is normal, determining whether the states of the master node and the standby node are valid according to the communication states of the three nodes") covers the following scenarios:

If the communication link between the arbitration node and the master node is abnormal, the state of the master node is confirmed to be invalid when a master-unreachable notification sent by the standby node is received; otherwise the states of the master node and the standby node are confirmed to be valid.

Referring to Fig. 5, in this case the communication link between the arbitration node and the master node is abnormal, while the communication link between the arbitration node and the standby node is normal. The standby HAClient notifies HAHelp that the master HAClient is unreachable. On receiving this notification, HAHelp checks whether the original master node still exists; if it does, HAHelp notifies the cloud network to reset the original master node, then notifies the standby node to switch to master and re-elects a standby node.

In practical applications, when the master HAClient finds that both HAHelp and the standby HAClient have been unreachable for a preset time, it can reset its own node, i.e. the master resets itself ("suicide").

If the communication link between the arbitration node and the standby node is abnormal, the state of the standby node is confirmed to be invalid when a standby-unreachable notification sent by the master node is received; otherwise the states of the master node and the standby node are confirmed to be valid.

Referring to Fig. 6, in this case the communication link between the arbitration node and the standby node is abnormal, while the communication link between the arbitration node and the master node is normal. The master HAClient notifies HAHelp that the standby HAClient is unreachable. On receiving this notification, HAHelp checks whether the original standby node still exists; if it does, HAHelp notifies the cloud network to reset the original standby node and then re-elects a standby node.

In practical applications, when the standby HAClient finds that neither HAHelp nor the master HAClient has been reachable for a preset time, it can reset its own node, i.e. the standby resets itself ("suicide").

If the communication links between the arbitration node and the master node and between the arbitration node and the standby node are both normal, the state of the standby node is confirmed to be invalid when a peer-unreachable notification sent by the master node or the standby node is received.

Referring to Fig. 4, if the links between the arbitration node and the master and standby nodes are normal but the communication link between the master node and the standby node is abnormal (i.e. they cannot reach each other), the master and standby HAClients each find that the peer is unreachable and send a master-standby unreachable notification to HAHelp, which also confirms their own connectivity; a HAClient that finds HAHelp unreachable as well resets its own node. On receiving a master-standby unreachable notification from either HAClient, HAHelp starts checking the connectivity of the master HAClient; if the master HAClient is reachable, HAHelp notifies the standby HAClient to reset its node.

Together with the second scenario of step S1021, it can be seen that in this embodiment, if HAHelp can reach the master HAClient, the standby node is reset regardless of whether HAHelp can reach the standby HAClient (if the standby node is unreachable, it can be reset through the cloud network), and a standby node is then re-elected.

In practical applications, if the communication link between the master and standby HAClients is normal and the communication link between HAHelp and one of them is also normal, HAHelp confirms that the master and standby nodes are valid and takes no action.
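The scenarios of Figs. 4 to 6 share one rule: once HAHelp receives a peer-unreachable notification, the master is kept if HAHelp can still reach it, and otherwise the standby is promoted. A minimal sketch of that rule follows; the function name, parameters and action strings are hypothetical, and the existence checks are modelled as plain booleans rather than real queries to the cloud network.

```python
from typing import List


def handle_peer_unreachable(arb_reaches_master: bool,
                            arb_reaches_standby: bool,
                            master_exists: bool = True,
                            standby_exists: bool = True) -> List[str]:
    """Actions HAHelp takes after a master/standby peer-unreachable notice."""
    actions: List[str] = []
    if arb_reaches_master:
        # Master is kept; the standby is treated as invalid.
        if arb_reaches_standby:
            actions.append("notify_standby_reset_self")
        elif standby_exists:
            # Standby cannot be told directly, so use the cloud network.
            actions.append("cloud_reset_standby")
        actions.append("reelect_standby")
    elif arb_reaches_standby:
        # Master is isolated: reset it via the cloud network if it still
        # exists, promote the standby, then elect a replacement standby.
        if master_exists:
            actions.append("cloud_reset_master")
        actions.append("notify_standby_switch_to_master")
        actions.append("reelect_standby")
    # If neither node is reachable no notification can arrive here; that
    # situation belongs to step S1022, so no action is returned.
    return actions
```

Resetting the isolated node before promoting or re-electing is what prevents two masters from being active at once.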

In a possible scheme, step S1022 ("if the communication links between the arbitration node and the master node and the standby node are both abnormal, determining whether the states of the master node and the standby node are valid according to whether the communication link between the master node and the standby node is abnormal") includes:

if the communication links between the arbitration node and both the master node and the standby node are abnormal, and one of the nodes is detected to have been reset within a preset time, the states of both the master node and the standby node are confirmed to be invalid; otherwise the arbitration node is reset so that the cloud network reselects an arbitration node.

In this situation HAHelp cannot reach either the master or the standby HAClient, and two cases are distinguished according to the communication link between the master and standby HAClients:

(I) The master and standby HAClients cannot reach each other (see Fig. 7). In this case the master and standby nodes are considered to have failed and must be re-elected.

In this case HAHelp cannot actively notify the master and standby nodes to reset, so the failed nodes initiate the "suicide" action themselves: a HAClient that detects that neither HAHelp nor the partner HAClient has been reachable for a preset time limit resets its own node. As long as one HAClient is working normally and resets its node successfully, HAHelp detects that the node has been reset, considers both the master and standby nodes invalid, immediately initiates re-election to elect new master and standby nodes, and notifies the cloud network to reset the other node.
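The self-reset ("suicide") rule on the HAClient side amounts to a watchdog that fires only after a continuous period of isolation. The sketch below is illustrative: `IsolationWatchdog`, `limit_s` and `observe` are hypothetical names, and timestamps are passed in explicitly so that the logic does not depend on a particular clock.

```python
from typing import Optional


class IsolationWatchdog:
    """Tracks how long a HAClient has been cut off from HAHelp and its peer."""

    def __init__(self, limit_s: float) -> None:
        self.limit_s = limit_s  # assumed preset time limit, in seconds
        self._isolated_since: Optional[float] = None

    def observe(self, now_s: float, arb_ok: bool, peer_ok: bool) -> bool:
        """Record one probe round; return True when the node should reset."""
        if arb_ok or peer_ok:
            # Any reachable party ends the isolation period.
            self._isolated_since = None
            return False
        if self._isolated_since is None:
            self._isolated_since = now_s
        return now_s - self._isolated_since >= self.limit_s


wd = IsolationWatchdog(limit_s=3.0)
decisions = [
    wd.observe(0.0, arb_ok=False, peer_ok=False),  # isolation starts
    wd.observe(2.0, arb_ok=False, peer_ok=False),  # still within the limit
    wd.observe(3.0, arb_ok=False, peer_ok=False),  # limit reached: reset
]
```

In a running HAClient the timestamps would typically come from a monotonic clock such as `time.monotonic()`, so that wall-clock adjustments cannot trigger or suppress a reset.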

If neither the master nor the standby HAClient can reset its node because of an undetected abnormality, HAHelp cannot determine the states of the master and standby nodes; this situation is called arbitration (HAHelp) failure. To prevent the system from remaining without a master node indefinitely, HAHelp needs to notify the cloud network to reset the master and standby nodes and re-elect. However, if the problem lies in HAHelp's own communication link, this would be a false detection. To reduce the probability of false detection, HAHelp first migrates; only if the communication links with the master node and the standby node still cannot be recovered after the number of migrations reaches a preset value does it notify the cloud network to perform the reset.

In practical applications, HAHelp migration means that the arbitration node actively resets itself and the cloud network selects a new arbitration node.

(II) The master and standby HAClients can reach each other. In this case the master and standby nodes are considered valid and should be kept unchanged. HAHelp, however, cannot perceive this because of its own connectivity problem. As described above, HAHelp migrates; if the master and standby nodes are normal, communication can be recovered after migration, and if the situation remains the same after several migrations, the master and standby nodes are forcibly reset and re-election is initiated.

Consider the scenario in which an undetected abnormality occurs on the master and standby HAClients at the same time. HAHelp cannot reach either HAClient, and neither node resets itself. HAHelp finds that the master and standby nodes exist but cannot confirm whether they are still connected to each other (in fact they are disconnected). HAHelp then migrates itself to another node (a node other than the master and standby nodes) and checks the connectivity of the master and standby HAClients again; if they are still unreachable, it continues to migrate. After the number of migrations reaches a preset value, HAHelp confirms that the master and standby HAClients are abnormal, resets the original master and standby nodes, and elects new master and standby nodes.

Therefore, after "resetting the arbitration node to cause the cloud network to reselect the arbitration node" is performed, the method further includes:

if the number of times the cloud network has selected an arbitration node reaches a preset value, and the communication links between the newly selected arbitration node and the master node and the standby node are still abnormal, the states of both the master node and the standby node are confirmed to be invalid.
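The bounded-migration rule can be sketched as a simple retry loop. The function name, the `links_recovered_after_migration` probe and the returned labels are hypothetical; the probe stands in for whatever connectivity check the newly selected arbitration node performs, and the preset value is passed as `max_migrations`.

```python
from typing import Callable


def arbitrate_with_migration(
    links_recovered_after_migration: Callable[[int], bool],
    max_migrations: int,
) -> str:
    """Migrate HAHelp up to max_migrations times before forcing a reset.

    links_recovered_after_migration(i) is a hypothetical probe that returns
    True if, after the i-th migration (1-based), the new arbitration node
    can reach the master and standby nodes again.
    """
    for attempt in range(1, max_migrations + 1):
        if links_recovered_after_migration(attempt):
            # The fault was on the arbitration side; the pair stays as is.
            return "resume_normal_monitoring"
    # Links never recovered: treat both nodes as invalid.
    return "reset_and_reelect_master_and_standby"
```

Trying several arbitration nodes before resetting anything is a trade-off: it delays recovery in the rare case where both HAClients are silently broken, but it avoids tearing down a healthy master-standby pair because of a fault local to the arbitration node.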

In a possible solution, the step "performing coordinated management on the master node and the standby node of the communication service according to the states of the master node and the standby node" includes:

and when the state of the main node and/or the standby node is invalid, resetting the main node and/or the standby node, informing the standby node of converting into the main node or re-electing the main node and/or the standby node.

Table 1 lists the correspondence between the communication connection status of each node in this embodiment and the way the arbitration node coordinates and manages the master and standby nodes; T in the table indicates that the communication connection is normal, and F indicates that it is abnormal.

TABLE 1

The HA implementation method between the master node and the standby node in this embodiment determines the states of the master node and the standby node by monitoring the communication conditions between the arbitration node and the master node and standby node of the communication service, and between the master node and the standby node, and performs coordinated management of the master node and the standby node according to those states. As a result, when the master node or the standby node becomes abnormal in a cloud network, fast master-standby switching is achieved without relying on a hardware channel, and the occurrence of dual master nodes can be avoided at the same time.

On the basis of the foregoing embodiment, a second embodiment of the present invention provides an apparatus for implementing an HA between a master node and a standby node, referring to fig. 8, where the apparatus includes: a memory 801, a processor 802 and a computer program 803 stored on the memory 801 and executable on the processor 802, the computer program 803 realizing the steps of the method according to the first embodiment when executed by the processor 802.

On the basis of the foregoing embodiment, a third embodiment of the present invention provides a storage medium, where the storage medium includes a stored program, and the program, when running, controls a device on which the storage medium is located to perform the operations according to the first embodiment.

It will be understood by those of ordinary skill in the art that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components in cooperation. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as is well known to those of ordinary skill in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media as known to those skilled in the art.

The preferred embodiments of the present invention have been described above with reference to the accompanying drawings, and are not intended to limit the scope of the embodiments of the invention. Any modifications, equivalents and improvements that may occur to those skilled in the art without departing from the scope and spirit of the embodiments of the present invention are intended to be within the scope of the claims of the embodiments of the present invention.

Claims

1. A method for realizing HA between a main node and a standby node comprises the following steps:

analyzing whether the states of the main node and the standby node are effective or not according to the monitoring result; it includes:

if the communication links between the arbitration node and the main node and the standby node are normal, determining whether the states of the main node and the standby node are valid according to the operating conditions of the main node and the standby node;

if the communication links between the arbitration node and the main and standby nodes are abnormal, determining whether the states of the main and standby nodes are valid according to whether the communication links between the main node and the standby nodes are abnormal;

if at least one communication link between the arbitration node and the main node and the standby node is normal, determining whether the states of the main node and the standby node are effective or not according to the communication states of the arbitration node, the main node and the standby node;

2. The method as claimed in claim 1, wherein the determining whether the states of the master node and the standby node are valid according to the operating conditions of the master node and the standby node if the communication links between the arbitration node and the master node and between the standby node are normal comprises:

if a service exception notification sent by the main node is received or the main node is detected to be reset, determining that the main node is invalid;

and if a service abnormity notification sent by the standby node is received or the standby node is detected to be reset, confirming that the state of the standby node is invalid.

3. The method as claimed in claim 2, wherein if the communication links between the arbitration node and the master node and the slave node are normal, before determining whether the states of the master node and the slave node are valid according to the operating conditions of the master node and the slave node, the method further comprises:

And if the main node or the standby node is abnormal, the cloud network forcibly resets the main node or the standby node.

4. The method as claimed in claim 1, wherein the determining whether the states of the master node and the standby node are valid according to the communication states of the arbitration node, the master node and the standby node if at least one communication link between the arbitration node and the master node is normal comprises:

if the communication link between the arbitration node and the main node is abnormal, when a main node unreachable notice sent by the standby node is received, determining that the state of the main node is invalid;

if the communication link between the arbitration node and the standby node is abnormal, when a standby node unreachable notice sent by the main node is received, determining that the state of the standby node is invalid;

5. The method for implementing HA between a master node and a slave node according to claim 1, wherein if the communication links between the arbitration node and the master node and the slave node are both abnormal, determining whether the states of the master node and the slave node are valid according to whether the communication links between the master node and the slave node are abnormal, comprises:

if the communication links between the arbitration node and the main node and the standby node are abnormal, if one of the nodes is detected to be reset within preset time, the states of the main node and the standby node are confirmed to be invalid, otherwise, the arbitration node is reset, so that the cloud network reselects the arbitration node.

6. The method for implementing HA between a master node and a slave node as claimed in claim 5, wherein after confirming that the states of both the master node and the slave node are invalid, the method further comprises: notifying the cloud network to reset another node;

after the resetting the arbitration node to cause the cloud network to reselect the arbitration node, the method further comprises:

and if the frequency of selecting the arbitration node by the cloud network reaches a preset value, if the communication link between the arbitration node selected by the cloud network and the main node or the communication link between the arbitration node selected by the cloud network and the standby node are still abnormal, the states of the main node and the standby node are confirmed to be invalid.

7. The method for implementing HA between a master node and a slave node as claimed in claim 1, wherein said method further comprises:

when the main node detects that the communication links between the main node and the corresponding standby node and the communication links between the main node and the arbitration node are abnormal, resetting the node per se;

and when the standby node detects that the communication links between the standby node and the corresponding main node and the arbitration node are abnormal, resetting the node per se.

8. The HA implementing method between the master node and the standby node as claimed in any one of claims 1 to 7, wherein the performing coordinated management on the master node and the standby node of the communication service according to the states of the master node and the standby node comprises:

and when the state of the main node and/or the standby node is invalid, resetting the main node and/or the standby node, and informing the standby node of converting into the main node or re-electing the main node and/or the standby node.

9. An apparatus for implementing HA between a master node and a standby node, the apparatus comprising: memory, processor and computer program stored on the memory and executable on the processor, which computer program, when executed by the processor, carries out the steps of the method according to any one of claims 1 to 7.

10. A storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement steps of a method as claimed in any one of claims 1 to 7.