TWI701916B - Method and device for self-recovering management ability in distributed system - Google Patents

Method and device for self-recovering management ability in distributed system

Info

Publication number
TWI701916B
Authority
TW
Taiwan
Prior art keywords
management node
node
management
distributed system
node group
Prior art date
Application number
TW107144779A
Other languages
Chinese (zh)
Other versions
TW201931821A (en)
Inventor
何東傑
Original Assignee
大陸商中國銀聯股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 大陸商中國銀聯股份有限公司
Publication of TW201931821A
Application granted
Publication of TWI701916B

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H04L41/0659 Management of faults, events, alarms or notifications using network fault recovery by isolating or reconfiguring faulty entities
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H04L41/0663 Performing the actions predefined by failover planning, e.g. switching to standby network elements
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/50 Network services
    • H04L67/60 Scheduling or organising the servicing of application requests, e.g. requests for application data transmissions using the analysis and optimisation of the required network resources

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Hardware Redundancy (AREA)
  • Multi Processors (AREA)

Abstract

The present invention relates to network technology, and in particular to a method for self-recovery of management capability in a distributed system, a device implementing the method, and a computer-readable storage medium containing a computer program that implements the method. In a method for self-recovery of management capability in a distributed system according to one aspect of the invention, the distributed system includes a management node group and a service node group, and the method comprises the following steps: if a management node in the management node group is detected to have failed, removing the failed management node from the management node group; selecting a service node with high availability from the service node group as a new management node; and adding the new management node to the management node group.

Description

Method and device for self-recovery of management capability in a distributed system

The present invention relates to network technology, and in particular to a method for self-recovery of management capability in a distributed system, a device implementing the method, and a computer-readable storage medium containing a computer program that implements the method.

Distributed architecture has become the main trend in the development of information system architecture. Existing distributed architectures are generally designed around management nodes plus service nodes; the high availability of the management nodes together with the high availability of the service nodes constitutes the high availability of the whole system.

In a distributed system, the management nodes are responsible for supporting and ensuring the high availability of the service nodes, and are generally deployed as a two-node cluster, a three-node cluster, or the like. The management nodes safeguard the high availability of the service, so that the failure of a service node does not affect the overall service capability. Clearly, the high availability of the management nodes is the core of the whole system.

In existing technical solutions, when a management node fails, service continuity is usually ensured by an active-standby switchover or by a switchover between dual-active services. After the switchover, however, the original high availability no longer exists. It generally has to be restored manually, which is inefficient.

An object of the present invention is to provide a method and device for self-recovery of management capability in a distributed system that are easy to implement and recover robustly.

In a method for self-recovery of management capability in a distributed system according to one aspect of the invention, the distributed system includes a management node group and a service node group, and the method comprises the following steps:
if a management node in the management node group is detected to have failed, removing the failed management node from the management node group;
selecting a service node with high availability from the service node group as a new management node; and
adding the new management node to the management node group.

Preferably, in the above method, the step of selecting a new management node comprises:
causing each service node to send requests to the other nodes in the distributed architecture;
causing each node that receives a request to return an accounting confirmation response based on a blockchain accounting confirmation mechanism; and
selecting a service node with high availability as the new management node according to the confirmation responses to the requests sent by each service node.

Preferably, in the above method, the high availability is expressed by the response success rate and/or the average response time for the requests sent within a set time period, and a service node with a higher response success rate and/or a shorter average response time within the set time period is selected as the new management node.

Preferably, in the above method, the step of selecting a new management node comprises:
obtaining the network communication data of each service node while it provides services; and
selecting a service node with high availability as the new management node according to the network communication data of each service node.

Preferably, in the above method, the high availability is expressed by the average of the response times between each service node and the other nodes within a set time period, and the service node with the shortest average response time is selected as the new management node.

Preferably, in the above method, the step of selecting a new management node comprises:
obtaining the resource usage of each service node while it provides services; and
selecting a service node with high availability as the new management node according to the resource usage of each service node.

Preferably, in the above method, the high availability is expressed by the average resource utilization of each service node within a set time period, and the service node with the lowest resource utilization is selected as the new management node.

In a device for self-recovery of management capability in a distributed system according to another aspect of the invention, the distributed system includes a management node group and a service node group, and the device comprises:
a first module for removing a failed management node from the management node group if a management node in the management node group is detected to have failed;
a second module for selecting a service node with high availability from the service node group as a new management node; and
a third module for adding the new management node to the management node group.

In a device for self-recovery of management capability in a distributed system according to a further aspect of the invention, the distributed system includes a management node group and a service node group, and the device comprises a memory, a processor, and a computer program that is stored in the memory and executable on the processor to perform the method described above.

A further object of the invention is to provide a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method described above.

Compared with the prior art, the invention has a number of advantages. For example, after a distributed system switches modes because a management node has failed, the method and device according to the above aspects can quickly and automatically restore the system's original high availability, which greatly improves the efficiency of system maintenance and strengthens the assurance of the system's high availability.
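The three steps of the method can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the `Cluster` class, the `is_failed` failure detector, and the `select_candidate` callback are hypothetical names introduced here, and the sketch assumes that a promoted node leaves the service node group, which the text does not specify.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class Cluster:
    """Hypothetical in-memory model of the two node groups."""
    management: List[str] = field(default_factory=list)
    service: List[str] = field(default_factory=list)


def recover_management_group(
    cluster: Cluster,
    is_failed: Callable[[str], bool],
    select_candidate: Callable[[Iterable[str]], str],
) -> List[str]:
    """Run one recovery pass and return the service nodes that were promoted."""
    promoted: List[str] = []
    # Step 1: remove every management node that the failure detector reports.
    failed = [node for node in cluster.management if is_failed(node)]
    for node in failed:
        cluster.management.remove(node)
    # Steps 2 and 3: for each removed node, select one service node and add it
    # to the management node group.
    for _ in failed:
        if not cluster.service:
            break  # nothing left to promote; the group stays degraded
        candidate = select_candidate(cluster.service)
        # Assumption of this sketch: a promoted node stops being a service node.
        cluster.service.remove(candidate)
        cluster.management.append(candidate)
        promoted.append(candidate)
    return promoted


cluster = Cluster(management=["110a", "110b"], service=["120a", "120b", "120c"])
promoted = recover_management_group(
    cluster,
    is_failed=lambda node: node == "110a",      # pretend node 110a has failed
    select_candidate=lambda nodes: min(nodes),  # placeholder selection rule
)
print(promoted, cluster.management)
```

Passing the selection rule in as a callback keeps the recovery flow independent of which high-availability criterion is used.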

The present invention is described more fully below with reference to the accompanying drawings, which illustrate exemplary embodiments of the invention. The invention may, however, be implemented in different forms and should not be construed as limited to the embodiments set forth herein. The embodiments are provided so that this disclosure is thorough and complete and conveys the scope of protection of the invention more fully to those skilled in the art.

In this specification, terms such as "comprising" and "including" mean that, in addition to the units and steps directly and explicitly stated in the specification and claims, the technical solution of the invention does not exclude other units and steps that are not directly or explicitly stated.

FIG. 1 is a schematic diagram of the architecture of a distributed system. By way of example, the distributed system 10 shown in FIG. 1 includes management nodes 110a and 110b (which form the management node group) and service nodes 120a-120h (which form the service node group). In the system shown, the nodes may be communicatively connected either directly or via a third-party node.

Under normal conditions the management node group is in a high-availability mode. The high-availability modes referred to here include a dual-active mode, a multi-active mode, and an active-standby mode. In the dual-active and multi-active modes, every management node in the group (for example, nodes 110a and 110b) is active. In the active-standby mode, one of the management nodes 110a and 110b (for example, node 110a) is designated the primary node and the remaining management nodes (for example, node 110b) are designated standby nodes.

When a management node in the management node group (for example, node 110a) is detected to have failed, the failed node is removed from the group so that services continue to be provided normally. In the example above, management node 110b then becomes the only available management node. At that point high-availability modes such as the dual-active mode and the active-standby mode are no longer available, which impairs the high availability of the entire distributed system 10.

According to one aspect of the invention, to restore the high availability of the distributed system 10, a suitable service node (for example, service node 120a) can be picked from the service node group as a new management node to replace the failed one, so that the management node group re-enters a high-availability mode. In the new active-standby mode, for example, nodes 110b and 120a serve as the primary node and the standby node respectively; in the dual-active mode, nodes 110b and 120a back each other up.

FIG. 2 is a flowchart of a method for self-recovery of management capability in a distributed system according to one embodiment of the invention. By way of example, the method of this embodiment is described using the distributed system shown in FIG. 1, but the method is not limited to distributed systems of a particular architecture. The steps of the method may be executed, individually or cooperatively, by hardware devices or software modules deployed on one or more nodes of the distributed system 10, or by a device or module in the distributed system 10 that is independent of the nodes.

Referring to FIG. 2, in step S210 the management node group is monitored for failed management nodes. If a failed management node (for example, node 110a) is detected, the method proceeds to step S220; otherwise monitoring continues.

In step S220, the failed management node 110a is removed from the management node group. From then on, only management node 110b supports and safeguards service nodes 120a-120h, so the high-availability mode is unavailable.

The method then proceeds to step S230, in which a service node with high availability (for example, node 120a) is selected from the service node group as the new management node. The ways of making this selection are described in detail below.

The method then proceeds to step S240, in which the new management node 120a is added to the management node group. The management node group can thereby re-enter a high-availability mode. In the active-standby mode, for example, nodes 110b and 120a serve as the primary node and the standby node respectively; in the dual-active mode, nodes 110b and 120a back each other up.

FIG. 3 is a flowchart of a method for selecting a new management node according to another embodiment of the invention. This embodiment can serve as a specific way of implementing step S230 of the method shown in FIG. 2.

As shown in FIG. 3, in step S310 each service node is caused to send requests to the other nodes in the distributed architecture (including management nodes and service nodes).

The method then proceeds to step S320, in which each node that receives a request returns an accounting confirmation response based on a blockchain accounting confirmation mechanism.

The method then proceeds to step S330, in which a service node with high availability is selected as the new management node according to the confirmation responses to the requests sent by each service node.

In step S330, the high availability is preferably expressed by the response success rate and/or the average response time for the requests sent within a set time period, and a service node with a higher response success rate and/or a shorter average response time within the set time period can be selected as the new management node. By way of example, a score combining the response success rate and the average response time can be determined for each service node (the score may be, for example, a weighted sum of the response success rate and the reciprocal of the average response time), and the service node with the highest score is selected as the new management node.

FIG. 4 is a flowchart of a method for selecting a new management node according to another embodiment of the invention. This embodiment can serve as a specific way of implementing step S230 of the method shown in FIG. 2.

As shown in FIG. 4, in step S410 the network communication data of each service node while it provides services is obtained.

The method then proceeds to step S420, in which a service node with high availability is selected as the new management node according to the network communication data of each service node.

In step S420, the high availability is preferably expressed by the average of the response times between each service node and the other nodes within a set time period, and the service node with the shortest average response time is selected as the new management node.
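The selection criteria described for FIGS. 3 and 4 can be sketched as follows. This is a hedged illustration rather than the claimed implementation: the `Probe` record, the function names, and the default weights are assumptions introduced here, and the blockchain accounting confirmation is not modeled; it is reduced to whether a confirmation arrived and how long it took. The text gives neither the weights nor the time unit, so both are left to the caller.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional


@dataclass
class Probe:
    """One request sent by a service node within the set time period."""
    latency_s: Optional[float]  # None if no confirmation response came back


def availability_score(
    probes: List[Probe], w_success: float = 1.0, w_speed: float = 1.0
) -> float:
    """Weighted sum of the success rate and the reciprocal of the mean response time."""
    answered = [p.latency_s for p in probes if p.latency_s is not None]
    if not probes or not answered:
        return 0.0  # no confirmed responses in the window
    success_rate = len(answered) / len(probes)
    # Latencies are assumed to be strictly positive.
    return w_success * success_rate + w_speed * (1.0 / mean(answered))


def select_by_score(window: Dict[str, List[Probe]], **weights: float) -> str:
    """FIG. 3 style: pick the service node with the highest score."""
    return max(window, key=lambda node: availability_score(window[node], **weights))


def select_by_mean_response_time(window: Dict[str, List[float]]) -> str:
    """FIG. 4 style: pick the service node with the shortest mean response time."""
    return min(window, key=lambda node: mean(window[node]))


probe_window = {
    "120a": [Probe(0.10), Probe(0.10), Probe(0.10), Probe(0.10)],
    "120b": [Probe(0.05), Probe(None), Probe(None), Probe(None)],
}
# With a small speed weight, the reliable node 120a outranks the fast but flaky 120b.
chosen = select_by_score(probe_window, w_success=1.0, w_speed=0.01)
```

Because the reciprocal of a latency in seconds can be much larger than a success rate between 0 and 1, the choice of weights decides which term dominates; with equal weights the same data would favor node 120b.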
FIG. 5 is a flowchart of a method for selecting a new management node according to another embodiment of the invention. This embodiment can serve as a specific way of implementing step S230 of the method shown in FIG. 2.

As shown in FIG. 5, in step S510 the resource usage of each service node while it provides services is obtained.

The method then proceeds to step S520, in which a service node with high availability is selected as the new management node according to the resource usage of each service node.

In step S520, the high availability is preferably expressed by the average resource utilization of each service node within a set time period, and the service node with the lowest resource utilization is selected as the new management node.

FIG. 6 is a block diagram of a device for self-recovery of management capability in a distributed system according to another embodiment of the invention.

As shown in FIG. 6, the device 60 of this embodiment includes a first module 610, a second module 620, and a third module 630. The first module 610 removes a failed management node from the management node group if a management node in the group is detected to have failed; the second module 620 selects a service node with high availability from the service node group as a new management node; and the third module 630 adds the new management node to the management node group.

FIG. 7 is a block diagram of a device for self-recovery of management capability in a distributed system according to another embodiment of the invention.

The device 70 shown in FIG. 7 includes a memory 710, a processor 720, and a computer program 730 that is stored in the memory 710 and executable on the processor 720. The computer program 730, when run on the processor 720, performs the method of the embodiments described above with reference to FIGS. 2-5.

According to one aspect of the invention, a computer-readable storage medium is provided that stores a computer program which, when executed by a processor, implements the method of the embodiments described with reference to FIGS. 2-5.

The embodiments and examples presented herein are provided to best explain embodiments according to the present technology and its particular applications, and thereby to enable those skilled in the art to make and use the invention. Those skilled in the art will recognize, however, that the foregoing description and examples are presented for the purposes of illustration and example only. The description as set forth is not intended to cover every aspect of the invention or to limit the invention to the precise form disclosed.

In view of the foregoing, the scope of the present disclosure is determined by the following claims.
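The resource-based criterion of FIG. 5 (steps S510 and S520) can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and the sample data are hypothetical, and the text does not say how CPU, memory, and other resources are combined into one utilization figure, so each reading is taken to be a single value between 0 and 1 supplied by the caller.

```python
from statistics import mean
from typing import Dict, List


def select_by_resource_utilization(samples: Dict[str, List[float]]) -> str:
    """Return the service node with the lowest mean utilization in the window.

    `samples` maps a node id to utilization readings in [0, 1] collected
    during the set time period. Nodes without readings are skipped.
    """
    eligible = {node: readings for node, readings in samples.items() if readings}
    if not eligible:
        raise ValueError("no utilization samples available")
    return min(eligible, key=lambda node: mean(eligible[node]))


usage = {
    "120a": [0.70, 0.80, 0.75],
    "120b": [0.20, 0.30, 0.25],
    "120c": [],  # no samples in the window, so this node is skipped
}
chosen_node = select_by_resource_utilization(usage)
```

Skipping nodes without samples is a design choice of this sketch; an implementation could equally treat missing data as disqualifying or as a sign of failure.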

120a~120h‧‧‧service nodes
110a, 110b‧‧‧management nodes
S210~S240‧‧‧steps
S310~S330‧‧‧steps
S410, S420‧‧‧steps
S510, S520‧‧‧steps
610‧‧‧first module
620‧‧‧second module
630‧‧‧third module
720‧‧‧processor
730‧‧‧computer program

The above and/or other aspects and advantages of the present invention will become clearer and easier to understand from the following description of the various aspects in conjunction with the accompanying drawings, in which identical or similar units are denoted by the same reference numerals. The drawings include:
FIG. 1 is a schematic diagram of the architecture of a distributed system.
FIG. 2 is a flowchart of a method for self-recovery of management capability in a distributed system according to one embodiment of the invention.
FIG. 3 is a flowchart of a method for selecting a new management node according to another embodiment of the invention.
FIG. 4 is a flowchart of a method for selecting a new management node according to another embodiment of the invention.
FIG. 5 is a flowchart of a method for selecting a new management node according to another embodiment of the invention.
FIG. 6 is a block diagram of a device for self-recovery of management capability in a distributed system according to another embodiment of the invention.
FIG. 7 is a block diagram of a device for self-recovery of management capability in a distributed system according to another embodiment of the invention.

Claims (7)

1. A method for self-recovery of management capability in a distributed system, the distributed system including a management node group and a service node group, characterized in that the method comprises the following steps: if a management node in the management node group is detected to have failed, removing the failed management node from the management node group; selecting a service node with high availability from the service node group as a new management node; and adding the new management node to the management node group, wherein the high availability is characterized by a weighted sum of the response success rate for the requests sent within a set time period and the reciprocal of the average response time, and the service node with the highest weighted sum is selected as the new management node.

2. The method according to claim 1, wherein the step of selecting a new management node comprises: causing each service node to send requests to the other nodes in the distributed architecture; causing each node that receives a request to return an accounting confirmation response based on a blockchain accounting confirmation mechanism; and selecting a service node with high availability as the new management node according to the confirmation responses to the requests sent by each service node.

3. A device for self-recovery of management capability in a distributed system, the distributed system including a management node group and a service node group, characterized in that the device comprises: a first module for removing a failed management node from the management node group if a management node in the management node group is detected to have failed; a second module for selecting a service node with high availability from the service node group as a new management node; and a third module for adding the new management node to the management node group, wherein the high availability is characterized by a weighted sum of the response success rate for the requests sent within a set time period and the reciprocal of the average response time, and the service node with the highest weighted sum is selected as the new management node.

4. The device according to claim 3, wherein the device is deployed in a single node or in multiple nodes of the distributed system.

5. A device for self-recovery of management capability in a distributed system, the distributed system including a management node group and a service node group, the device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the computer program is run to perform the method according to claim 1 or 2.

6. The device according to claim 5, wherein the device is deployed in a single node or in multiple nodes of the distributed system.

7. A computer-readable storage medium storing a computer program, characterized in that the program, when executed by a processor, implements the method according to claim 1 or 2.
TW107144779A 2017-12-28 2018-12-12 Method and device for self-recovering management ability in distributed system TWI701916B (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
CN201711456829.2 2017-12-28
CN201711456829.2A CN108306760A (en) 2017-12-28 2017-12-28 For making the self-healing method and apparatus of managerial ability in a distributed system
??201711456829.2 2017-12-28

Publications (2)

Publication Number Publication Date
TW201931821A TW201931821A (en) 2019-08-01
TWI701916B true TWI701916B (en) 2020-08-11

Family

ID=62867775

Family Applications (1)

Application Number Title Priority Date Filing Date
TW107144779A TWI701916B (en) 2017-12-28 2018-12-12 Method and device for self-recovering management ability in distributed system

Country Status (3)

Country Link
CN (1) CN108306760A (en)
TW (1) TWI701916B (en)
WO (1) WO2019128670A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108306760A (en) * 2017-12-28 2018-07-20 中国银联股份有限公司 For making the self-healing method and apparatus of managerial ability in a distributed system
CN109218077A (en) * 2018-08-14 2019-01-15 阿里巴巴集团控股有限公司 Prediction technique, device, electronic equipment and the storage medium of target device
RU2716558C1 (en) 2018-12-13 2020-03-12 Алибаба Груп Холдинг Лимитед Performing modification of primary node in distributed system
CA3050560C (en) 2018-12-13 2020-06-30 Alibaba Group Holding Limited Performing a recovery process for a network node in a distributed system
RU2723072C1 (en) 2018-12-13 2020-06-08 Алибаба Груп Холдинг Лимитед Achieving consensus between network nodes in distributed system
CN110351133B (en) * 2019-06-28 2021-09-17 创新先进技术有限公司 Method and device for main node switching processing in block chain system
US10944624B2 (en) * 2019-06-28 2021-03-09 Advanced New Technologies Co., Ltd. Changing a master node in a blockchain system

Citations (3)

Publication number Priority date Publication date Assignee Title
CN103152434A (en) * 2013-03-27 2013-06-12 江苏辰云信息科技有限公司 Leader node replacing method of distributed cloud system
CN107105032A (en) * 2017-04-20 2017-08-29 腾讯科技(深圳)有限公司 node device operation method and node device
CN107453929A (en) * 2017-09-22 2017-12-08 中国联合网络通信集团有限公司 Group system is from construction method, device and group system

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
CN105406980B (en) * 2015-10-19 2018-06-05 浪潮(北京)电子信息产业有限公司 A kind of multinode backup method and device
CN108306760A (en) * 2017-12-28 2018-07-20 中国银联股份有限公司 For making the self-healing method and apparatus of managerial ability in a distributed system

Also Published As

Publication number Publication date
TW201931821A (en) 2019-08-01
WO2019128670A1 (en) 2019-07-04
CN108306760A (en) 2018-07-20

Similar Documents

Publication Publication Date Title
TWI701916B (en) Method and device for self-recovering management ability in distributed system
US9800087B2 (en) Multi-level data center consolidated power control
US10429914B2 (en) Multi-level data center using consolidated power control
CN105187249B (en) A kind of fault recovery method and device
US9063787B2 (en) System and method for using cluster level quorum to prevent split brain scenario in a data grid cluster
WO2018072618A1 (en) Method for allocating stream computing task and control server
CN111290834A (en) Method, device and equipment for realizing high availability of service based on cloud management platform
CN107508694B (en) Node management method and node equipment in cluster
CN103647830A (en) Dynamic management method for multilevel configuration files in cluster management system
CN112217847B (en) Micro service platform, realization method thereof, electronic equipment and storage medium
CN102394914A (en) Cluster brain-split processing method and device
CN109245926B (en) Intelligent network card, intelligent network card system and control method
CN109361777B (en) Synchronization method, synchronization system and related device for distributed cluster node states
CN114338670B (en) Edge cloud platform and network-connected traffic three-level cloud control platform with same
CN110519354A (en) Distributed object storage system and service processing method and storage medium thereof
CN107071189B (en) Connection method of communication equipment physical interface
CN113326100B (en) Cluster management method, device, equipment and computer storage medium
CN114327911A (en) Remote multi-activity implementation method of distributed service cluster and distributed service system
CN115766405A (en) Fault processing method, device, equipment and storage medium
CN111416726B (en) Resource management method, sending end equipment and receiving end equipment
KR101909264B1 (en) System and method for fault recovery of controller in separated SDN controller
WO2011135628A1 (en) Cluster reconfiguration method, cluster reconfiguration device and cluster reconfiguration program
CN111984376B (en) Protocol processing method, device, equipment and computer readable storage medium
CN116991591B (en) Data scheduling method, device and storage medium
US11947431B1 (en) Replication data facility failure detection and failover automation