WO2019128670A1 - Method and apparatus for enabling self-recovery of management capability in distributed system

Info

Publication number
WO2019128670A1
Authority
WO
WIPO (PCT)
Prior art keywords
management node
node
service
distributed system
service node
Application number
PCT/CN2018/119528
Other languages
French (fr)
Chinese (zh)
Inventor
何东杰
Original Assignee
中国银联股份有限公司
Application filed by 中国银联股份有限公司
Publication of WO2019128670A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H04L41/0659 Management of faults, events, alarms or notifications using network fault recovery by isolating or reconfiguring faulty entities
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H04L41/0663 Performing the actions predefined by failover planning, e.g. switching to standby network elements
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/50 Network services
    • H04L67/60 Scheduling or organising the servicing of application requests, e.g. requests for application data transmissions using the analysis and optimisation of the required network resources

Definitions

  • the present invention relates to network technologies, and more particularly to a method for self-healing management capabilities in a distributed system, an apparatus for implementing the method, and a computer readable storage medium comprising the computer program implementing the method.
  • In a distributed system, the management node is responsible for supporting and guaranteeing the high availability of the service nodes, and is generally deployed as a two-node cluster, a three-node cluster, or the like.
  • The management node guarantees the high availability of the services provided by the service nodes, so that a failure of a service node does not affect the overall service capability.
  • the high availability of the management node is the core of the entire system.
  • the distributed system includes a management node group and a service node group, the method comprising the steps of:
  • the step of selecting a new management node comprises:
  • a service node with high availability is selected as a new management node based on a confirmation response to the request sent by each service node.
  • The high availability is represented by the response success rate and/or the average response time for the requests sent within a set time period, and the service node having a higher response success rate and/or a shorter average response time within the set time period is selected as the new management node.
  • the step of selecting a new management node comprises:
  • a service node with high availability is selected as a new management node according to network communication data of each service node.
  • The high availability is represented by the average response time between each service node and the other nodes within a set time period, and the service node with the shortest average response time is selected as the new management node.
  • the step of selecting a new management node comprises:
  • a service node with high availability is selected as a new management node according to the resource usage of each service node.
  • the high availability is represented by an average resource utilization of each service node within a set time period and the service node having the lowest resource utilization is selected as a new management node.
  • the distributed system includes a management node group and a service node group, the apparatus comprising:
  • a first module configured to remove a failed management node from the management node group if a failure of a management node in the management node group is detected;
  • a second module configured to select a service node with high availability from the service node group as a new management node; and
  • a third module configured to add the new management node to the management node group.
  • The distributed system includes a management node group and a service node group, and the apparatus includes a memory, a processor, and a computer program stored on the memory and executable on the processor to perform the method described above.
  • Compared with the prior art, the present invention has many advantages. For example, after the distributed system switches modes because a management node has failed, the method and apparatus according to the above aspects of the present invention can quickly and automatically restore the system's original high availability, which greatly improves the efficiency of system maintenance and strengthens the guarantee of the system's high availability.
  • FIG. 1 is a schematic diagram of the architecture of a distributed system.
  • FIG. 2 is a flow diagram of a method for self-healing management capabilities in a distributed system, in accordance with one embodiment of the present invention.
  • FIG. 3 is a flow chart of a method of selecting a new management node in accordance with another embodiment of the present invention.
  • FIG. 4 is a flow chart of a method of selecting a new management node in accordance with another embodiment of the present invention.
  • FIG. 5 is a flow chart of a method of selecting a new management node in accordance with another embodiment of the present invention.
  • FIG. 6 is a block diagram of an apparatus for self-healing management capabilities in a distributed system in accordance with another embodiment of the present invention.
  • FIG. 7 is a block diagram of an apparatus for self-healing management capabilities in a distributed system in accordance with another embodiment of the present invention.
  • FIG. 1 is a schematic diagram of the architecture of a distributed system.
  • the distributed system 10 shown in FIG. 1 includes management nodes 110a and 110b (these nodes constitute a management node group) and service nodes 120a-120h (these nodes constitute a service node group).
  • each node can directly implement a communication connection or implement a communication connection via a third party node.
  • the management node group is in high availability mode under normal conditions.
  • The high availability modes described here include an active-active mode, a multi-active mode, and an active/standby mode. In the active-active and multi-active modes, every management node in the management node group (e.g., nodes 110a and 110b) is active.
  • In the active/standby mode, one of the management nodes 110a and 110b (e.g., node 110a) is designated as the primary node and the remaining management nodes (e.g., node 110b) are designated as standby nodes.
  • When a failure of a management node (e.g., node 110a) in the management node group is detected, the failed management node is removed from the management node group to ensure that services continue to be provided normally.
  • In the above example, management node 110b becomes the only available management node. High availability modes such as the active-active mode and the active/standby mode are then no longer available, which affects the high availability of the entire distributed system 10.
  • To restore the high availability of the distributed system 10, a suitable service node (e.g., service node 120a) may be selected from the service node group as a new management node to replace the failed management node, so that the management node group enters the high availability mode again.
  • For example, in the new active/standby mode, nodes 110b and 120a serve as the primary node and the standby node, respectively; in the active-active mode, nodes 110b and 120a back each other up.
  • FIG. 2 is a flow diagram of a method for self-healing management capabilities in a distributed system, in accordance with one embodiment of the present invention.
  • The method of this embodiment is described here using the distributed system shown in FIG. 1 as an example. It should be noted, however, that the method is not limited to a distributed system of a specific architecture. It should also be noted that the steps of the method may be performed, individually or cooperatively, by hardware devices or software modules deployed on one or more nodes of the distributed system 20, or by a device or module in the distributed system 20 that is independent of the individual nodes.
  • In step S210, the management node group is monitored for a failed management node. If a failed management node (e.g., node 110a) is detected, the method proceeds to step S220; otherwise, monitoring continues.
  • In step S220, the failed management node 110a is removed from the management node group. At this point, only management node 110b is responsible for supporting and safeguarding the service nodes 120a-120h, so the high availability mode is unavailable.
  • In step S230, a service node with high availability (e.g., node 120a) is selected from the service node group as a new management node. The manner of selection is described in detail below.
  • In step S240, the new management node 120a is added to the management node group, so that the management node group can enter the high availability mode again.
  • For example, in the active/standby mode, nodes 110b and 120a serve as the primary node and the standby node, respectively; in the active-active mode, nodes 110b and 120a back each other up.
  • FIG. 3 is a flow chart of a method of selecting a new management node in accordance with another embodiment of the present invention. This embodiment can be used as a specific manner of implementing step S230 in the method shown in FIG. 2.
  • In step S310, each service node is caused to send requests to the remaining nodes (including management nodes and service nodes) in the distributed architecture.
  • In step S320, each node that receives a request returns an accounting confirmation response based on a blockchain accounting confirmation mechanism.
  • In step S330, a service node with high availability is selected as the new management node according to the confirmation responses to the requests sent by each service node.
  • The high availability may be represented by the response success rate and/or the average response time for the requests sent within a set time period, and the service node having a higher response success rate and/or a shorter average response time within the set time period may be selected as the new management node.
  • For example, a score combining the response success rate and the average response time may be determined for each service node (the score may be, for example, a weighted sum of the reciprocal of the average response time and the response success rate), and the service node with the highest score is selected as the new management node.
  • FIG. 4 is a flow chart of a method of selecting a new management node in accordance with another embodiment of the present invention. This embodiment can be used as a specific manner of implementing step S230 in the method shown in FIG. 2.
  • In step S410, network communication data of each service node in the course of providing services is acquired.
  • A service node with high availability is then selected as the new management node according to the network communication data of each service node.
  • The high availability may be represented by the average response time between each service node and the other nodes within a set time period, and the service node with the shortest average response time is selected as the new management node.
  • FIG. 5 is a flow chart of a method of selecting a new management node in accordance with another embodiment of the present invention. This embodiment can be used as a specific manner of implementing step S230 in the method shown in FIG. 2.
  • In step S510, the resource usage of each service node in the course of providing services is acquired.
  • In step S520, a service node with high availability is selected as the new management node according to the resource usage of each service node.
  • The high availability may be represented by the average resource utilization of each service node within a set time period, and the service node with the lowest resource utilization is selected as the new management node.
  • FIG. 6 is a block diagram of an apparatus for self-healing management capabilities in a distributed system in accordance with another embodiment of the present invention.
  • the apparatus 60 for self-restoring management capabilities in a distributed system of the present embodiment includes a first module 610, a second module 620, and a third module 630.
  • The first module 610 is configured to remove the failed management node from the management node group if a failure of a management node in the management node group is detected; the second module 620 is configured to select a service node with high availability from the service node group as a new management node; and the third module 630 is configured to add the new management node to the management node group.
  • FIG. 7 is a block diagram of an apparatus for self-healing management capabilities in a distributed system in accordance with another embodiment of the present invention.
  • The apparatus 70 shown in FIG. 7 includes a memory 70, a processor 720, and a computer program 730 stored on the memory 70 and executable on the processor 720, wherein the computer program 730, when run on the processor 720, performs the method of the embodiments described with reference to FIGS. 2-5.
  • a computer readable storage medium having stored thereon a computer program that, when executed by a processor, implements the method of the embodiment described with reference to Figures 2-5.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Hardware Redundancy (AREA)
  • Multi Processors (AREA)

Abstract

The present invention relates to network technologies, and in particular to a method for enabling self-recovery of a management capability in a distributed system, an apparatus for implementing the method, and a computer-readable storage medium containing a computer program that implements the method. In a method for enabling self-recovery of a management capability in a distributed system according to one aspect of the present invention, the distributed system comprises a management node group and a service node group. The method comprises the following steps: if a failure of a management node in the management node group is detected, removing the failed management node from the management node group; selecting, from the service node group, a service node having high availability as a new management node; and adding the new management node to the management node group.

Description

Method and apparatus for enabling self-recovery of management capability in a distributed system
Technical field
The present invention relates to network technologies, and in particular to a method for enabling self-recovery of management capability in a distributed system, an apparatus for implementing the method, and a computer-readable storage medium containing a computer program that implements the method.
Background art
Distributed architecture has become the main trend in the development of information system architectures. Existing distributed architectures are generally designed as management nodes plus service nodes; the high availability of the management nodes together with the high availability of the service nodes constitutes the high availability of the entire system.
In a distributed system, the management nodes are responsible for supporting and guaranteeing the high availability of the service nodes, and are generally deployed as a two-node cluster, a three-node cluster, or the like. The management nodes guarantee the high availability of the services provided by the service nodes, so that a failure of a service node does not affect the overall service capability. Clearly, the high availability of the management nodes is the core of the entire system.
In existing technical solutions, when a management node fails, service continuity is usually ensured by an active/standby switchover or by switching over an active-active service. After the switchover, however, the original high availability no longer exists and generally has to be restored manually, which is inefficient.
Summary of the invention
An object of the present invention is to provide a method and an apparatus for enabling self-recovery of management capability in a distributed system, which have advantages such as ease of implementation and strong recovery capability.
In a method for enabling self-recovery of management capability in a distributed system according to one aspect of the present invention, the distributed system includes a management node group and a service node group, and the method comprises the following steps:
if a failure of a management node in the management node group is detected, removing the failed management node from the management node group;
selecting a service node with high availability from the service node group as a new management node; and
adding the new management node to the management node group.
Preferably, in the above method, the step of selecting a new management node comprises:
causing each service node to send requests to the remaining nodes in the distributed architecture;
causing each node that receives a request to return an accounting confirmation response based on a blockchain accounting confirmation mechanism; and
selecting a service node with high availability as the new management node according to the confirmation responses to the requests sent by each service node.
Preferably, in the above method, the high availability is represented by the response success rate and/or the average response time for the requests sent within a set time period, and the service node having a higher response success rate and/or a shorter average response time within the set time period is selected as the new management node.
Preferably, in the above method, the step of selecting a new management node comprises:
acquiring network communication data of each service node in the course of providing services; and
selecting a service node with high availability as the new management node according to the network communication data of each service node.
Preferably, in the above method, the high availability is represented by the average response time between each service node and the other nodes within a set time period, and the service node with the shortest average response time is selected as the new management node.
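The response-time criterion can be illustrated with a minimal Python sketch. The function name `select_by_response_time`, the dictionary layout of the samples, and the policy of skipping nodes without samples are assumptions made for illustration; the source does not specify how the network communication data is stored or aggregated beyond taking an average over a set time period.

```python
from statistics import mean
from typing import Dict, List


def select_by_response_time(samples: Dict[str, List[float]]) -> str:
    """Return the service node with the shortest average response time.

    `samples` maps a service node ID to the response times (in seconds)
    observed between that node and other nodes during the set time period.
    Nodes with no samples are skipped; that policy is an assumption.
    """
    averages = {node: mean(times) for node, times in samples.items() if times}
    if not averages:
        raise ValueError("no response-time samples available")
    # The node whose average response time is smallest becomes the candidate.
    return min(averages, key=averages.get)


if __name__ == "__main__":
    observed = {"120a": [0.02, 0.04], "120b": [0.01, 0.02], "120c": []}
    print(select_by_response_time(observed))
```

In this sketch a node that reported no traffic is never promoted, on the reasoning that nothing is known about its responsiveness; a real deployment might treat that case differently.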
Preferably, in the above method, the step of selecting a new management node comprises:
acquiring the resource usage of each service node in the course of providing services; and
selecting a service node with high availability as the new management node according to the resource usage of each service node.
Preferably, in the above method, the high availability is represented by the average resource utilization of each service node within a set time period, and the service node with the lowest resource utilization is selected as the new management node.
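A rough Python sketch of the resource-utilization criterion follows. The source only speaks of "resource usage" and "average resource utilization"; modelling each sample as a (CPU, memory) pair and averaging the two components with equal weight is an assumption, as are the names `Sample`, `average_utilization`, and `select_by_utilization`.

```python
from typing import Dict, List, Tuple

# Hypothetical sample format: (cpu_fraction, memory_fraction), each in [0, 1].
Sample = Tuple[float, float]


def average_utilization(samples: List[Sample]) -> float:
    """Average utilization over the set time period.

    Each sample is collapsed to the mean of its CPU and memory fractions;
    this equal weighting is an assumption, not something the source defines.
    """
    per_sample = [(cpu + mem) / 2 for cpu, mem in samples]
    return sum(per_sample) / len(per_sample)


def select_by_utilization(usage: Dict[str, List[Sample]]) -> str:
    """Return the service node with the lowest average resource utilization."""
    averages = {
        node: average_utilization(samples)
        for node, samples in usage.items()
        if samples  # nodes without samples are skipped (assumed policy)
    }
    if not averages:
        raise ValueError("no resource-usage samples available")
    return min(averages, key=averages.get)
```

For example, with `{"120a": [(0.8, 0.6), (0.6, 0.4)], "120b": [(0.2, 0.4)]}` the lightly loaded node `"120b"` is the one that would be promoted.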
In an apparatus for enabling self-recovery of management capability in a distributed system according to another aspect of the present invention, the distributed system includes a management node group and a service node group, and the apparatus comprises:
a first module, configured to remove a failed management node from the management node group if a failure of a management node in the management node group is detected;
a second module, configured to select a service node with high availability from the service node group as a new management node; and
a third module, configured to add the new management node to the management node group.
In an apparatus for enabling self-recovery of management capability in a distributed system according to another aspect of the present invention, the distributed system includes a management node group and a service node group, and the apparatus comprises a memory, a processor, and a computer program stored on the memory and executable on the processor to perform the method described above.
A further object of the present invention is to provide a computer-readable storage medium on which a computer program is stored, the program implementing the method described above when executed by a processor.
Compared with the prior art, the present invention has many advantages. For example, after the distributed system switches modes because a management node has failed, the method and apparatus according to the above aspects of the present invention can quickly and automatically restore the system's original high availability, which greatly improves the efficiency of system maintenance and strengthens the guarantee of the system's high availability.
Brief description of the drawings
The above and/or other aspects and advantages of the present invention will become clearer and easier to understand from the following description of the various aspects taken in conjunction with the accompanying drawings, in which identical or similar elements are denoted by the same reference numerals. The drawings include:
FIG. 1 is a schematic diagram of the architecture of a distributed system.
FIG. 2 is a flowchart of a method for enabling self-recovery of management capability in a distributed system according to one embodiment of the present invention.
FIG. 3 is a flowchart of a method of selecting a new management node according to another embodiment of the present invention.
FIG. 4 is a flowchart of a method of selecting a new management node according to another embodiment of the present invention.
FIG. 5 is a flowchart of a method of selecting a new management node according to another embodiment of the present invention.
FIG. 6 is a block diagram of an apparatus for enabling self-recovery of management capability in a distributed system according to another embodiment of the present invention.
FIG. 7 is a block diagram of an apparatus for enabling self-recovery of management capability in a distributed system according to another embodiment of the present invention.
Detailed description
The present invention is described more fully below with reference to the accompanying drawings, in which illustrative embodiments of the invention are shown. The invention may, however, be embodied in different forms and should not be construed as limited to the embodiments set forth herein. The embodiments are provided so that this disclosure is thorough and complete and conveys the scope of protection of the invention more fully to those skilled in the art.
In this specification, terms such as "comprising" and "including" indicate that, in addition to the elements and steps directly and explicitly stated in the description and claims, the technical solution of the present invention does not exclude other elements and steps that are not directly or explicitly stated.
FIG. 1 is a schematic diagram of the architecture of a distributed system. By way of example, the distributed system 10 shown in FIG. 1 includes management nodes 110a and 110b (which form a management node group) and service nodes 120a-120h (which form a service node group). In the distributed system shown, the nodes may be communicatively connected to one another directly or via a third-party node.
Under normal conditions, the management node group is in a high availability mode. The high availability modes described here include an active-active mode, a multi-active mode, and an active/standby mode. In the active-active and multi-active modes, every management node in the management node group (e.g., nodes 110a and 110b) is active; in the active/standby mode, one of the management nodes 110a and 110b (e.g., node 110a) is designated as the primary node and the remaining management nodes (e.g., node 110b) are designated as standby nodes.
When a failure of a management node (e.g., node 110a) in the management node group is detected, the failed management node is removed from the management node group to ensure that services continue to be provided normally. In the above example, management node 110b becomes the only available management node. High availability modes such as the active-active mode and the active/standby mode are then no longer available, which affects the high availability of the entire distributed system 10.
According to one aspect of the present invention, to restore the high availability of the distributed system 10, a suitable service node (e.g., service node 120a) may be selected from the service node group as a new management node to replace the failed management node, so that the management node group enters the high availability mode again. For example, in the new active/standby mode, nodes 110b and 120a serve as the primary node and the standby node, respectively; in the active-active mode, nodes 110b and 120a back each other up.
FIG. 2 is a flowchart of a method for enabling self-recovery of management capability in a distributed system according to one embodiment of the present invention. By way of example, the method of this embodiment is described here using the distributed system shown in FIG. 1. It should be noted, however, that the method is not limited to a distributed system of a specific architecture. It should also be noted that the steps of the method may be performed, individually or cooperatively, by hardware devices or software modules deployed on one or more nodes of the distributed system 20, or by a device or module in the distributed system 20 that is independent of the individual nodes.
Referring to FIG. 2, in step S210, the management node group is monitored for a failed management node. If a failed management node (e.g., node 110a) is detected, the method proceeds to step S220; otherwise, monitoring continues.
In step S220, the failed management node 110a is removed from the management node group. At this point, only management node 110b is responsible for supporting and safeguarding the service nodes 120a-120h, so the high availability mode is unavailable.
The method then proceeds to step S230, in which a service node with high availability (e.g., node 120a) is selected from the service node group as a new management node. The manner of selection is described in detail below.
The method then proceeds to step S240, in which the new management node 120a is added to the management node group, so that the management node group can enter the high availability mode again. For example, in the active/standby mode, nodes 110b and 120a serve as the primary node and the standby node, respectively; in the active-active mode, nodes 110b and 120a back each other up.
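Steps S210 to S240 can be pictured as a single monitoring pass. The Python sketch below is illustrative only: the `Cluster` class, the `is_failed` and `select_candidate` callbacks, and the decision to take the promoted node out of the service node group are assumptions that the source does not spell out.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Cluster:
    """Hypothetical in-memory view of the two node groups."""
    management_group: List[str] = field(default_factory=list)
    service_group: List[str] = field(default_factory=list)


def recover_management_capability(
    cluster: Cluster,
    is_failed: Callable[[str], bool],
    select_candidate: Callable[[List[str]], str],
) -> List[str]:
    """Run one monitoring pass and return the service nodes promoted in it."""
    promoted: List[str] = []
    # S210: check every management node for failure.
    failed = [node for node in cluster.management_group if is_failed(node)]
    for node in failed:
        # S220: remove the failed management node from the management group.
        cluster.management_group.remove(node)
        if not cluster.service_group:
            continue  # nothing left to promote (handling assumed, not in source)
        # S230: select a highly available service node.
        candidate = select_candidate(cluster.service_group)
        # S240: add it to the management group. Removing it from the service
        # group is an assumption; the source does not say whether it stays.
        cluster.service_group.remove(candidate)
        cluster.management_group.append(candidate)
        promoted.append(candidate)
    return promoted


if __name__ == "__main__":
    cluster = Cluster(["110a", "110b"], ["120a", "120b", "120c"])
    recover_management_capability(
        cluster,
        is_failed=lambda node: node == "110a",
        select_candidate=lambda nodes: nodes[0],
    )
    print(cluster.management_group)
```

Passing the selection rule in as a callback keeps the flow independent of how high availability is measured, which mirrors the way the embodiments of FIGS. 3 to 5 are presented as alternative implementations of step S230.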
FIG. 3 is a flowchart of a method of selecting a new management node according to another embodiment of the present invention. This embodiment may serve as a specific implementation of step S230 in the method shown in FIG. 2.
As shown in FIG. 3, in step S310, each service node is caused to send requests to the remaining nodes (both management nodes and service nodes) in the distributed architecture.
The method then proceeds to step S320, in which each node that receives a request returns a bookkeeping confirmation response based on the blockchain bookkeeping confirmation mechanism.
The method then proceeds to step S330, in which a service node with high availability is selected as the new management node according to the confirmation responses to the requests sent by each service node.
In step S330, high availability is preferably expressed as the response success rate and/or the average response time for the requests sent within a set time period, and a service node having a higher response success rate and/or average response time within the set time period may be selected as the new management node. By way of example, a score combining the response success rate and the average response time may be determined for each service node (the score may be, for example, a weighted sum of the reciprocal of the average response time and the response success rate, so that a shorter average response time yields a higher score), and the service node with the highest score is selected as the new management node.
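The example score in step S330 can be sketched as follows. The `Reply` record, the equal default weights, and the choice to average the response time over successful replies only are assumptions made for illustration; the description fixes none of them.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Reply:
    ok: bool        # whether a confirmation response came back
    seconds: float  # response time of the request, in seconds


def score(replies: List[Reply], w_time: float = 0.5, w_rate: float = 0.5) -> float:
    """Weighted sum of 1/(average response time) and response success rate."""
    if not replies:
        return 0.0
    succeeded = [r for r in replies if r.ok]
    rate = len(succeeded) / len(replies)
    if not succeeded:
        return 0.0  # no confirmed reply: rate is 0 and no time is measurable
    # Assumption: the average is taken over successful replies only.
    avg = sum(r.seconds for r in succeeded) / len(succeeded)
    avg = max(avg, 1e-9)  # guard against division by zero
    return w_time * (1.0 / avg) + w_rate * rate


def select_best(stats: Dict[str, List[Reply]]) -> Optional[str]:
    """Return the service node with the highest score, or None if empty."""
    if not stats:
        return None
    return max(stats, key=lambda node: score(stats[node]))
```

Because the reciprocal of the response time depends on the time unit, the two weights would in practice need to be chosen so that neither term dominates the sum.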
FIG. 4 is a flowchart of a method of selecting a new management node according to another embodiment of the present invention. This embodiment may serve as a specific implementation of step S230 in the method shown in FIG. 2.
As shown in FIG. 4, in step S410, the network communication data of each service node in the course of providing services is acquired.
The method then proceeds to step S420, in which a service node with high availability is selected as the new management node according to the network communication data of each service node.
In step S420, high availability is preferably expressed as the average response time between each service node and the other nodes within a set time period, and the service node with the shortest average response time is selected as the new management node.
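The selection rule of step S420 reduces to taking a minimum over per-node averages. The sketch below assumes the network communication data has already been reduced to a list of response-time samples per service node for the set time period; the `pick_fastest` name and the input layout are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Optional


def pick_fastest(response_times: Dict[str, List[float]]) -> Optional[str]:
    """Return the service node with the shortest mean response time."""
    # Nodes with no samples in the window are skipped (an assumption).
    averages = {
        node: mean(samples)
        for node, samples in response_times.items()
        if samples
    }
    if not averages:
        return None
    return min(averages, key=averages.get)
```

For example, with samples `{"120a": [0.02, 0.04], "120b": [0.01, 0.02]}` the averages are 0.03 and 0.015, so node 120b would be selected.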
FIG. 5 is a flowchart of a method of selecting a new management node according to another embodiment of the present invention. This embodiment may serve as a specific implementation of step S230 in the method shown in FIG. 2.
As shown in FIG. 5, in step S510, the resource usage of each service node in the course of providing services is acquired.
The method then proceeds to step S520, in which a service node with high availability is selected as the new management node according to the resource usage of each service node.
In step S520, high availability is preferably expressed as the average resource utilization of each service node within a set time period, and the service node with the lowest resource utilization is selected as the new management node.
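Step S520 can be sketched in the same way. The description does not say which resources are measured or how they are combined, so the sketch assumes each sample is a pair of CPU and memory fractions and that the two are weighted equally; these choices and all names are illustrative only.

```python
from typing import Dict, List, Optional, Tuple

# Assumption: one sample is (cpu_fraction, memory_fraction), each in 0.0-1.0.
Sample = Tuple[float, float]


def average_utilization(samples: List[Sample]) -> float:
    """Mean over the window of the per-sample mean of CPU and memory use."""
    return sum((cpu + mem) / 2 for cpu, mem in samples) / len(samples)


def pick_least_loaded(usage: Dict[str, List[Sample]]) -> Optional[str]:
    """Return the service node with the lowest average resource utilization."""
    candidates = {
        node: average_utilization(samples)
        for node, samples in usage.items()
        if samples  # nodes without samples in the window are skipped
    }
    if not candidates:
        return None
    return min(candidates, key=candidates.get)
```

Choosing the least-loaded node favors a candidate with spare capacity to take on management duties in addition to, or instead of, its service workload.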
FIG. 6 is a block diagram of an apparatus for enabling self-recovery of management capability in a distributed system according to another embodiment of the present invention.
As shown in FIG. 6, the apparatus 60 of this embodiment for enabling self-recovery of management capability in a distributed system includes a first module 610, a second module 620, and a third module 630. The first module 610 is configured to remove a failed management node from the management node group if a failure of a management node in the management node group is detected; the second module 620 is configured to select a service node with high availability from the service node group as the new management node; and the third module 630 is configured to add the new management node to the management node group.
FIG. 7 is a block diagram of an apparatus for enabling self-recovery of management capability in a distributed system according to another embodiment of the present invention.
The apparatus 70 shown in FIG. 7 includes a memory 710, a processor 720, and a computer program 730 stored on the memory 710 and executable on the processor 720, wherein the computer program 730, when run on the processor 720, can carry out the methods of the embodiments described above with reference to FIGS. 2-5.
According to one aspect of the present invention, a computer-readable storage medium is provided on which a computer program is stored, the program implementing the methods of the embodiments described with reference to FIGS. 2-5 when executed by a processor.
The embodiments and examples set forth herein are provided to best explain embodiments in accordance with the present technology and its particular applications, and thereby to enable those skilled in the art to make and use the invention. Those skilled in the art will recognize, however, that the foregoing description and examples are provided for purposes of illustration and example only. The description as set forth is not intended to cover every aspect of the invention or to limit the invention to the precise forms disclosed.
In view of the above, the scope of the present disclosure is determined by the following claims.

Claims (12)

  1. A method for enabling self-recovery of management capability in a distributed system, the distributed system comprising a management node group and a service node group, characterized in that the method comprises the following steps:
    if a failure of a management node in the management node group is detected, removing the failed management node from the management node group;
    selecting a service node with high availability from the service node group as a new management node; and
    adding the new management node to the management node group.
  2. The method of claim 1, wherein the step of selecting a new management node comprises:
    causing each service node to send requests to the remaining nodes in the distributed architecture;
    causing each node that receives a request to return a bookkeeping confirmation response based on the blockchain bookkeeping confirmation mechanism; and
    selecting a service node with high availability as the new management node according to the confirmation responses to the requests sent by each service node.
  3. The method of claim 2, wherein the high availability is expressed as the response success rate and/or the average response time for the requests sent within a set time period, and a service node having a higher response success rate and/or average response time within the set time period is selected as the new management node.
  4. The method of claim 1, wherein the step of selecting a new management node comprises:
    acquiring the network communication data of each service node in the course of providing services; and
    selecting a service node with high availability as the new management node according to the network communication data of each service node.
  5. The method of claim 4, wherein the high availability is expressed as the average response time between each service node and the other nodes within a set time period, and the service node with the shortest average response time is selected as the new management node.
  6. The method of claim 1, wherein the step of selecting a new management node comprises:
    acquiring the resource usage of each service node in the course of providing services; and
    selecting a service node with high availability as the new management node according to the resource usage of each service node.
  7. The method of claim 6, wherein the high availability is expressed as the average resource utilization of each service node within a set time period, and the service node with the lowest resource utilization is selected as the new management node.
  8. An apparatus for enabling self-recovery of management capability in a distributed system, the distributed system comprising a management node group and a service node group, characterized in that the apparatus comprises:
    a first module configured to remove a failed management node from the management node group if a failure of a management node in the management node group is detected;
    a second module configured to select a service node with high availability from the service node group as a new management node; and
    a third module configured to add the new management node to the management node group.
  9. The apparatus of claim 8, wherein the apparatus is deployed in a single node or in multiple nodes of the distributed system.
  10. An apparatus for enabling self-recovery of management capability in a distributed system, the distributed system comprising a management node group and a service node group, the apparatus comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the method according to any one of claims 1-7 is performed.
  11. The apparatus of claim 10, wherein the apparatus is deployed in a single node or in multiple nodes of the distributed system.
  12. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1-7.
PCT/CN2018/119528 2017-12-28 2018-12-06 Method and apparatus for enabling self-recovery of management capability in distributed system WO2019128670A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201711456829.2 2017-12-28
CN201711456829.2A CN108306760A (en) 2017-12-28 2017-12-28 For making the self-healing method and apparatus of managerial ability in a distributed system

Publications (1)

Publication Number Publication Date
WO2019128670A1 true WO2019128670A1 (en) 2019-07-04

Family

ID=62867775

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/119528 WO2019128670A1 (en) 2017-12-28 2018-12-06 Method and apparatus for enabling self-recovery of management capability in distributed system

Country Status (3)

Country Link
CN (1) CN108306760A (en)
TW (1) TWI701916B (en)
WO (1) WO2019128670A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108306760A (en) * 2017-12-28 2018-07-20 中国银联股份有限公司 For making the self-healing method and apparatus of managerial ability in a distributed system
CN109218077A (en) * 2018-08-14 2019-01-15 阿里巴巴集团控股有限公司 Prediction technique, device, electronic equipment and the storage medium of target device
JP6726367B2 (en) * 2018-12-13 2020-07-22 アリババ・グループ・ホールディング・リミテッドAlibaba Group Holding Limited Changing the primary node in a distributed system
MX2019008861A (en) 2018-12-13 2019-09-11 Alibaba Group Holding Ltd Achieving consensus among network nodes in a distributed system.
KR102157452B1 (en) 2018-12-13 2020-09-18 알리바바 그룹 홀딩 리미티드 Performing a recovery process for network nodes in a distributed system
US10944624B2 (en) 2019-06-28 2021-03-09 Advanced New Technologies Co., Ltd. Changing a master node in a blockchain system
CN110351133B (en) * 2019-06-28 2021-09-17 创新先进技术有限公司 Method and device for main node switching processing in block chain system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105406980A (en) * 2015-10-19 2016-03-16 浪潮(北京)电子信息产业有限公司 Multi-node backup method and multi-node backup device
CN107105032A (en) * 2017-04-20 2017-08-29 腾讯科技(深圳)有限公司 node device operation method and node device
CN107453929A (en) * 2017-09-22 2017-12-08 中国联合网络通信集团有限公司 Group system is from construction method, device and group system
CN108306760A (en) * 2017-12-28 2018-07-20 中国银联股份有限公司 For making the self-healing method and apparatus of managerial ability in a distributed system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103152434A (en) * 2013-03-27 2013-06-12 江苏辰云信息科技有限公司 Leader node replacing method of distributed cloud system

Also Published As

Publication number Publication date
TW201931821A (en) 2019-08-01
TWI701916B (en) 2020-08-11
CN108306760A (en) 2018-07-20

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18897071

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18897071

Country of ref document: EP

Kind code of ref document: A1