WO2019128670A1

WO2019128670A1 - Method and apparatus for enabling self-recovery of management capability in distributed system

Info

Publication number: WO2019128670A1
Application number: PCT/CN2018/119528
Authority: WO
Inventors: 何东杰
Original assignee: 中国银联股份有限公司
Priority date: 2017-12-28
Filing date: 2018-12-06
Publication date: 2019-07-04
Also published as: TW201931821A; TWI701916B; CN108306760A

Abstract

The present invention relates to network technologies, and especially to a method for enabling self-recovery of a management capability in a distributed system, an apparatus for implementing the method, and a computer-readable storage medium containing a computer program implementing the method. In a method for enabling self-recovery of a management capability in a distributed system according to one aspect of the present invention, the distributed system comprises a management node group and a serving node group. The method contains the following steps: if it is monitored that a management node in a management node group has failed, then removing, from the management node group, the management node that failed; choosing, from the serving node group, a serving node having high availability as a new management node; and adding the new management node to the management node group.

Description

Method and apparatus for self-healing management capabilities in a distributed system

Technical field

The present invention relates to network technologies, and more particularly to a method for self-healing management capabilities in a distributed system, an apparatus for implementing the method, and a computer readable storage medium comprising the computer program implementing the method.

Background art

Distributed architecture has become a major trend in the development of information system architecture. Existing distributed architectures are generally designed with management nodes and service nodes. The high availability of the management nodes and the high availability of the service nodes together constitute the high availability of the entire system.

In a distributed system, the management nodes are responsible for supporting and guaranteeing the high availability of the service nodes, and generally take the form of a two-node cluster, a three-node cluster, or the like. The service nodes are protected by the management nodes to ensure high availability of the service, so that the overall service capability is not affected when a service node fails. The high availability of the management nodes is therefore the core of the entire system.

In existing technical solutions, when a management node fails, continuity of service is usually ensured by an active/standby switchover or by an active-active arrangement. However, after the switchover the original high availability no longer exists, and restoring it generally requires manual intervention, which is inefficient.

Summary of the invention

It is an object of the present invention to provide a method and apparatus for self-healing management capabilities in a distributed system that have the advantages of being easy to implement and having strong recovery capability.

In a method for self-healing management capabilities in a distributed system in accordance with an aspect of the present invention, the distributed system includes a management node group and a service node group, the method comprising the steps of:

If a management node in the management node group is detected to have failed, removing the failed management node from the management node group;

Selecting a service node with high availability from the service node group as a new management node;

Adding the new management node to the management node group.

Preferably, in the above method, the step of selecting a new management node comprises:

Causing each service node to send a request to the remaining nodes in the distributed architecture;

Causing each node that receives the request to return an accounting confirmation response based on a blockchain accounting confirmation mechanism;

A service node with high availability is selected as a new management node based on a confirmation response to the request sent by each service node.

Preferably, in the above method, the high availability is represented by a response success rate and/or an average response time of the sent requests within a set time period, and the service node having a higher response success rate and/or a shorter average response time within the set time period is selected as the new management node.

Preferably, in the above method, the step of selecting a new management node comprises:

Obtaining network communication data for each service node in the process of providing the service;

A service node with high availability is selected as a new management node according to network communication data of each service node.

Preferably, in the above method, the high availability is represented by an average of the response times between each service node and other nodes within a set time period, and the service node having the shortest average response time is selected as the new management node.

Preferably, in the above method, the step of selecting a new management node comprises:

Obtaining the resource usage of each service node in the process of providing the service;

A service node with high availability is selected as a new management node according to the resource usage of each service node.

Preferably, in the above method, the high availability is represented by an average resource utilization of each service node within a set time period and the service node having the lowest resource utilization is selected as a new management node.

In an apparatus for self-healing management capabilities in a distributed system in accordance with another aspect of the present invention, the distributed system includes a management node group and a service node group, the apparatus comprising:

a first module, configured to remove a failed management node from the management node group if a management node in the management node group is detected to be faulty;

a second module, configured to select a service node with high availability from the service node group as a new management node;

a third module, configured to add the new management node to the management node group.

In an apparatus for self-healing management capabilities in a distributed system in accordance with another aspect of the present invention, the distributed system includes a management node group and a service node group, the apparatus including a memory, a processor, and a computer program stored on the memory and executable on the processor to perform the method as described above.

It is still another object of the present invention to provide a computer readable storage medium having stored thereon a computer program that, when executed by a processor, implements the method as described above.

The present invention has many advantages over the prior art. For example, when the distributed system switches modes because a management node has failed, the method and apparatus according to the above aspects of the present invention can quickly and automatically restore the original high availability of the system, thereby greatly improving system maintenance efficiency and strengthening the guarantee of system high availability.

Description of the drawings

The above and/or other aspects and advantages of the present invention will be more clearly and easily understood from the following description taken in conjunction with the accompanying drawings. The drawings include:

FIG. 1 is a schematic diagram of the architecture of a distributed system.

FIG. 2 is a flow diagram of a method for self-healing management capabilities in a distributed system, in accordance with one embodiment of the present invention.

FIG. 3 is a flow chart of a method of selecting a new management node in accordance with another embodiment of the present invention.

FIG. 4 is a flow chart of a method of selecting a new management node in accordance with another embodiment of the present invention.

FIG. 5 is a flow chart of a method of selecting a new management node in accordance with another embodiment of the present invention.

FIG. 6 is a block diagram of an apparatus for self-healing management capabilities in a distributed system in accordance with another embodiment of the present invention.

FIG. 7 is a block diagram of an apparatus for self-healing management capabilities in a distributed system in accordance with another embodiment of the present invention.

Detailed description

The invention will now be described more fully hereinafter with reference to the accompanying drawings. However, the invention may be embodied in different forms and should not be construed as limited to the embodiments presented herein. The embodiments are provided so that this disclosure is thorough and complete and fully conveys the scope of the present invention to those skilled in the art.

In the present specification, terms such as "comprising" and "including" mean that, in addition to the units and steps that are directly and expressly stated, the present invention does not exclude other units and steps that are not directly or expressly stated.

FIG. 1 is a schematic diagram of the architecture of a distributed system. Illustratively, the distributed system 10 shown in FIG. 1 includes management nodes 110a and 110b (these nodes constitute a management node group) and service nodes 120a-120h (these nodes constitute a service node group). In the distributed system shown, any two nodes can communicate either over a direct connection or via a third-party node.

The management node group is in a high availability mode under normal conditions. The high availability modes described herein include an active-active mode, a multi-active mode, and an active/standby mode. In the active-active mode and the multi-active mode, every management node in the management node group (e.g., nodes 110a and 110b) is active. In the active/standby mode, one of the management nodes 110a and 110b (e.g., node 110a) is designated as the primary node and the remaining management nodes (e.g., node 110b) are designated as standby nodes.

When it is detected that a management node (for example, node 110a) in the management node group has failed, the failed management node is removed from the management node group in order to ensure normal provision of the service. In the above example, management node 110b then becomes the only available management node. At this point, high availability modes such as the active-active mode and the active/standby mode are no longer available, which affects the high availability of the entire distributed system 10.

In accordance with an aspect of the present invention, in order to restore the high availability of the distributed system 10, a suitable service node (e.g., service node 120a) may be selected from the service node group as a new management node to replace the failed management node, thereby enabling the management node group to enter a high availability mode again. For example, in the new active/standby mode, nodes 110b and 120a serve as the primary and standby nodes, respectively; in the active-active mode, nodes 110b and 120a are backups of each other.

FIG. 2 is a flow diagram of a method for self-healing management capabilities in a distributed system, in accordance with one embodiment of the present invention. Illustratively, the method of the present embodiment is described here by taking the distributed system shown in FIG. 1 as an example. It should be noted, however, that the method of the present embodiment is not limited to a distributed system of a specific architecture. It should also be noted that the steps of the method may be performed separately or jointly by hardware devices or software modules deployed on one or more nodes of the distributed system 10, or may be performed by a device or module that is independent of the nodes of the distributed system 10.

Referring to FIG. 2, in step S210, it is monitored whether there is a management node failure in the management node group. If a failed management node (for example, node 110a) is detected, the process proceeds to step S220, otherwise the monitoring is continued.
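The specification does not state how a failure is detected in step S210. Purely as an illustration, and assuming a heartbeat-timeout scheme that the source does not describe, the monitoring check might be sketched as follows. The function name `find_failed_nodes`, the `last_heartbeat` mapping, and the `timeout_s` parameter are hypothetical.

```python
from typing import Dict, List


def find_failed_nodes(last_heartbeat: Dict[str, float],
                      now: float,
                      timeout_s: float = 5.0) -> List[str]:
    """Return the management nodes whose last heartbeat is older than timeout_s.

    Assumption: a heartbeat timeout stands in for whatever detection
    mechanism step S210 actually uses; the source does not specify one.
    """
    return sorted(node for node, seen in last_heartbeat.items()
                  if now - seen > timeout_s)


# Hypothetical timestamps in seconds: 110a last reported at t=100, 110b at t=109.
heartbeats = {"110a": 100.0, "110b": 109.0}
failed = find_failed_nodes(heartbeats, now=110.0)
print(failed)
```

With these sample values only node 110a exceeds the timeout, so the flow would proceed to step S220 for that node.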

At step S220, the failed management node 110a is removed from the management node group. At this time, only the management node 110b is responsible for supporting and safeguarding the service nodes 120a-120h, and thus the high availability mode is unavailable.

Then, proceeding to step S230, a service node (e.g., node 120a) having high availability is selected from the group of service nodes as a new management node. The manner of selection will be described in detail below.

Next, proceeding to step S240, the new management node 120a is added to the management node group. As a result, the management node group can enter the high availability mode again. For example, in the active/standby mode, the nodes 110b and 120a serve as the primary node and the standby node, respectively; in the active-active mode, the nodes 110b and 120a are backups of each other.
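Steps S220 to S240 can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the function name `recover_management_group`, the use of plain lists for the two node groups, and the `select` callback that stands in for step S230 are all introduced here for the example.

```python
from typing import Callable, List


def recover_management_group(management_group: List[str],
                             service_group: List[str],
                             failed_node: str,
                             select: Callable[[List[str]], str]) -> str:
    """Apply steps S220-S240 in place and return the promoted node."""
    # S220: remove the failed management node from the management node group.
    management_group.remove(failed_node)
    # S230: choose a highly available service node; the criterion is supplied
    # by the caller because the source describes several alternatives.
    new_manager = select(service_group)
    # S240: add the new management node to the management node group.
    # The source does not say whether the promoted node also leaves the
    # service node group, so service_group is left untouched here.
    management_group.append(new_manager)
    return new_manager


managers = ["110a", "110b"]
services = ["120a", "120b", "120c"]
# Placeholder criterion for the example: pick the first service node.
promoted = recover_management_group(managers, services, "110a",
                                    lambda nodes: nodes[0])
print(managers, promoted)
```

Passing the selection criterion as a callback keeps the flow independent of which of the selection embodiments (FIG. 3, FIG. 4, or FIG. 5) is used.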

FIG. 3 is a flow chart of a method of selecting a new management node in accordance with another embodiment of the present invention. This embodiment can be used as a specific manner of implementing step S230 in the method shown in FIG. 2.

As shown in FIG. 3, in step S310, each service node is caused to send a request to the remaining nodes (including the management node and the service node) in the distributed architecture.

Then, proceeding to step S320, each node receiving the request returns an accounting confirmation response based on a blockchain accounting confirmation mechanism.

Next, proceeding to step S330, a service node having high availability is selected as a new management node according to a confirmation response to the request transmitted by each service node.

In step S330, preferably, the high availability may be represented by a response success rate and/or an average response time for the sent requests within a set time period, and the service node having a higher response success rate and/or a shorter average response time within the set time period is selected as the new management node. Illustratively, a score combining the response success rate and the average response time may be determined for each service node (the score may be, for example, a weighted sum of the reciprocal of the average response time and the response success rate), and the service node with the highest score is selected as the new management node.
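The weighted-sum score mentioned above can be sketched as follows. The weights, the `Confirmation` record, the choice to average the response time over successful confirmations only, and the function names are assumptions made for this example; the source gives only the general form of the score.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Confirmation:
    success: bool     # whether a confirmation response was received
    elapsed_s: float  # time taken to receive it, in seconds


def availability_score(responses: List[Confirmation],
                       w_time: float = 0.5,
                       w_success: float = 0.5) -> float:
    """Weighted sum of 1/(average response time) and response success rate."""
    ok = [r for r in responses if r.success]
    if not ok:
        return 0.0  # no successful confirmation in the time period
    success_rate = len(ok) / len(responses)
    # Assumption: the average is taken over successful confirmations only.
    avg_time = sum(r.elapsed_s for r in ok) / len(ok)
    # Guard against a zero average to avoid division by zero.
    return w_time * (1.0 / max(avg_time, 1e-9)) + w_success * success_rate


def pick_new_manager(per_node: Dict[str, List[Confirmation]]) -> str:
    """Return the service node with the highest availability score."""
    return max(per_node, key=lambda node: availability_score(per_node[node]))


samples = {
    "120a": [Confirmation(True, 0.5), Confirmation(True, 0.5)],
    "120b": [Confirmation(True, 0.5), Confirmation(False, 0.0)],
}
chosen = pick_new_manager(samples)
print(chosen)
```

Because the reciprocal of a time and a success rate have different scales, the weights would in practice need to be tuned to the time unit in use.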

FIG. 4 is a flow chart of a method of selecting a new management node in accordance with another embodiment of the present invention. This embodiment can be used as a specific manner of implementing step S230 in the method shown in FIG. 2.

As shown in FIG. 4, in step S410, network communication data in the process of providing a service by each service node is acquired.

Then, proceeding to step S420, a service node having high availability is selected as a new management node according to network communication data of each service node.

In step S420, preferably, high availability may be represented by an average of the response times between each service node and other nodes within a set time period, and the service node having the shortest average response time is selected as the new management node.
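A minimal sketch of this criterion is given below. The function name `pick_by_response_time`, the millisecond unit, and the per-node lists of samples are assumptions for illustration; the source does not specify how the network communication data are stored.

```python
from statistics import mean
from typing import Dict, List


def pick_by_response_time(response_times_ms: Dict[str, List[float]]) -> str:
    """Return the service node with the shortest average response time."""
    # Nodes with no samples in the time period are skipped.
    averages = {node: mean(times)
                for node, times in response_times_ms.items() if times}
    if not averages:
        raise ValueError("no response-time samples available")
    return min(averages, key=averages.get)


# Hypothetical samples collected within the set time period.
observed = {"120a": [12.0, 18.0], "120b": [9.0, 11.0], "120c": [30.0]}
fastest = pick_by_response_time(observed)
print(fastest)
```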

FIG. 5 is a flow chart of a method of selecting a new management node in accordance with another embodiment of the present invention. This embodiment can be used as a specific manner of implementing step S230 in the method shown in FIG. 2.

As shown in FIG. 5, in step S510, resource usage of each service node in the process of providing a service is acquired.

Then, proceeding to step S520, the service node with high availability is selected as the new management node according to the resource usage of each service node.

In step S520, preferably, high availability may be represented by the average resource utilization of each service node within a set time period, and the service node having the lowest average resource utilization is selected as the new management node.
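This criterion might be sketched as follows. The source does not define which resources are measured, so the example assumes that each sample is a pair of CPU and memory fractions and that utilization is their simple mean; the names `average_utilization` and `pick_by_utilization` are likewise hypothetical.

```python
from typing import Dict, List, Tuple

Sample = Tuple[float, float]  # (cpu_fraction, memory_fraction), each in [0, 1]


def average_utilization(samples: List[Sample]) -> float:
    """Average utilization over the set time period.

    Assumption: one sample's utilization is the mean of its CPU and
    memory fractions.
    """
    per_sample = [(cpu + mem) / 2 for cpu, mem in samples]
    return sum(per_sample) / len(per_sample)


def pick_by_utilization(usage: Dict[str, List[Sample]]) -> str:
    """Return the service node with the lowest average resource utilization."""
    candidates = {node: average_utilization(samples)
                  for node, samples in usage.items() if samples}
    if not candidates:
        raise ValueError("no resource usage samples available")
    return min(candidates, key=candidates.get)


# Hypothetical samples collected while the nodes were providing service.
usage = {
    "120a": [(0.80, 0.60), (0.70, 0.50)],
    "120b": [(0.20, 0.40), (0.30, 0.30)],
}
least_loaded = pick_by_utilization(usage)
print(least_loaded)
```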

As shown in FIG. 6, the apparatus 60 for self-healing management capabilities in a distributed system of the present embodiment includes a first module 610, a second module 620, and a third module 630. The first module 610 is configured to remove the failed management node from the management node group if a management node in the management node group is detected to have failed; the second module 620 is configured to select a service node with high availability from the service node group as a new management node; and the third module 630 is configured to add the new management node to the management node group.

The apparatus 70 shown in FIG. 7 includes a memory 710, a processor 720, and a computer program 730 stored on the memory 710 and executable on the processor 720, wherein the computer program 730, when run on the processor 720, performs the method of the embodiments described with reference to FIGS. 2-5.

According to an aspect of the invention, there is provided a computer readable storage medium having stored thereon a computer program that, when executed by a processor, implements the method of the embodiments described with reference to FIGS. 2-5.

The embodiments and examples set forth herein are provided to best illustrate the embodiments of the present invention and their specific applications, thereby enabling those skilled in the art to make and use the invention. However, those skilled in the art will appreciate that the above description and examples are provided for purposes of illustration and example only. The description is not intended to be exhaustive or to limit the invention.

In view of the above, the scope of the present disclosure is determined by the following claims.

Claims

1. A method for self-healing management capabilities in a distributed system, the distributed system comprising a management node group and a service node group, wherein the method comprises the following steps:

If a management node in the management node group is detected to have failed, removing the failed management node from the management node group;

Selecting a service node with high availability from the service node group as a new management node;

Adding the new management node to the management node group.

2. The method of claim 1, wherein the step of selecting a new management node comprises:

Causing each service node to send a request to the remaining nodes in the distributed architecture;

Causing each node that receives the request to return an accounting confirmation response based on a blockchain accounting confirmation mechanism;

Selecting a service node with high availability as the new management node based on the confirmation responses to the requests sent by each service node.

3. The method of claim 2, wherein the high availability is represented by a response success rate and/or an average response time of the sent requests within a set time period, and the service node having a higher response success rate and/or a shorter average response time within the set time period is selected as the new management node.

4. The method of claim 1, wherein the step of selecting a new management node comprises:

Obtaining network communication data of each service node in the process of providing the service;

Selecting a service node with high availability as the new management node according to the network communication data of each service node.

5. The method of claim 4, wherein the high availability is represented by an average of the response times between each service node and other nodes within a set time period, and the service node having the shortest average response time is selected as the new management node.

6. The method of claim 1, wherein the step of selecting a new management node comprises:

Obtaining the resource usage of each service node in the process of providing the service;

Selecting a service node with high availability as the new management node according to the resource usage of each service node.

7. The method of claim 6, wherein the high availability is represented by an average resource utilization of each service node within a set time period, and the service node having the lowest resource utilization is selected as the new management node.

8. An apparatus for self-healing management capabilities in a distributed system, the distributed system comprising a management node group and a service node group, wherein the apparatus comprises:

a first module, configured to remove a failed management node from the management node group if a management node in the management node group is detected to have failed;

a second module, configured to select a service node with high availability from the service node group as a new management node;

a third module, configured to add the new management node to the management node group.

9. The apparatus of claim 8, wherein the apparatus is deployed within a single node or a plurality of nodes of the distributed system.

10. An apparatus for self-healing management capabilities in a distributed system, the distributed system comprising a management node group and a service node group, the apparatus comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein running the computer program performs the method of any one of claims 1-7.

11. The apparatus of claim 10, wherein the apparatus is deployed within a single node or a plurality of nodes of the distributed system.

12. A computer readable storage medium having stored thereon a computer program, wherein the program, when executed by a processor, implements the method of any one of claims 1-7.