Distributed architecture has become the dominant trend in information system design. Existing distributed architectures generally combine management nodes with service nodes, and the high availability of the whole system rests on the high availability of both.
In a distributed system, the management nodes are responsible for supporting and ensuring the high availability of the service nodes, and are themselves generally deployed as a two-node cluster, a three-node cluster, or the like. Because the management nodes safeguard the service nodes, the failure of a service node does not affect the overall service capability. The high availability of the management nodes is therefore the core of the entire system.
In existing technical solutions, when a management node fails, service continuity is usually ensured by an active-standby switchover or by a switchover between dual-active services. After the switchover, however, the original high availability no longer exists. It generally has to be restored manually, which is inefficient.
An object of the present invention is to provide a method and a device for self-recovery of management capabilities in a distributed system, which have the advantages of convenient implementation and strong recovery capability.
In a method for self-recovering management capabilities in a distributed system according to an aspect of the present invention, the distributed system includes a management node group and a service node group, and the method includes the following steps:
if it is detected that a management node in the management node group has failed, removing the failed management node from the management node group;
selecting a service node with high availability from the service node group as a new management node; and
adding the new management node to the management node group.
Preferably, in the above method, the step of selecting a new management node includes:
causing each service node to send requests to the remaining nodes in the distributed architecture;
causing each node that receives a request to return an accounting confirmation response based on a blockchain accounting confirmation mechanism; and
selecting a service node with high availability as the new management node according to the confirmation responses to the requests sent by each service node.
Preferably, in the above method, the high availability is expressed by the response success rate and/or the average response time for the sent requests within a set time period, and a service node with a higher response success rate and/or a shorter average response time within the set time period is selected as the new management node.
Preferably, in the above method, the step of selecting a new management node includes:
obtaining the network communication data of each service node in the process of providing services; and
selecting a service node with high availability as the new management node according to the network communication data of each service node.
Preferably, in the above method, the high availability is expressed by the average response time between each service node and the other nodes within a set time period, and the service node with the shortest average response time is selected as the new management node.
Preferably, in the above method, the step of selecting a new management node includes:
obtaining the resource usage of each service node in the process of providing services; and
selecting a service node with high availability as the new management node according to the resource usage of each service node.
Preferably, in the above method, the high availability is expressed by the average resource utilization of each service node within a set time period, and the service node with the lowest average resource utilization is selected as the new management node.
In a device for self-recovering management capabilities in a distributed system according to another aspect of the present invention, the distributed system includes a management node group and a service node group, and the device includes:
a first module configured to remove a failed management node from the management node group if it is detected that a management node in the management node group has failed;
a second module configured to select a service node with high availability from the service node group as a new management node; and
a third module configured to add the new management node to the management node group.
In a device for self-recovering management capabilities in a distributed system according to another aspect of the present invention, the distributed system includes a management node group and a service node group, and the device includes a memory, a processor, and a computer program that is stored on the memory and executable on the processor to perform the method described above.
Another object of the present invention is to provide a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, the method described above is realized.
Compared with the prior art, the present invention has many advantages. For example, after a distributed system switches modes because a management node has failed, the method and device according to the above aspects of the present invention can quickly and automatically restore the original high availability of the system, thereby greatly improving the efficiency of system maintenance and strengthening the assurance of the system's high availability.
Hereinafter, the present invention is explained more fully with reference to the accompanying drawings, in which exemplary embodiments of the present invention are illustrated. The present invention can, however, be implemented in different forms and should not be interpreted as being limited to the embodiments given herein. These embodiments are provided so that the disclosure is thorough and complete and conveys the scope of protection of the present invention more fully to those skilled in the art.
In this specification, terms such as "comprising" and "including" mean that, in addition to the units and steps that are directly and explicitly stated in the specification and claims, the technical solution of the present invention does not exclude other units and steps that are not directly or explicitly stated.
FIG. 1 is a schematic diagram of a distributed system architecture. Illustratively, the distributed system 10 shown in FIG. 1 includes management nodes 110a and 110b (which form a management node group) and service nodes 120a-120h (which form a service node group). In the distributed system shown, the nodes can be communicatively connected either directly or via a third-party node.
The management node group is normally in a high availability mode. The high availability modes referred to here include a dual-active mode, a multi-active mode, and an active-standby mode. In the dual-active and multi-active modes, every management node in the management node group (for example, nodes 110a and 110b) is active; in the active-standby mode, one of the management nodes 110a and 110b (for example, node 110a) is designated as the master node and the remaining management node (for example, node 110b) is designated as the standby node.
When it is detected that a management node (for example, node 110a) in the management node group has failed, the failed management node is removed from the management node group in order to ensure that services continue to be provided normally. In the above example, the management node 110b becomes the only available management node. At this point, high availability modes such as the dual-active mode and the active-standby mode are no longer available, which affects the high availability of the entire distributed system 10.
According to one aspect of the present invention, in order to restore the high availability of the distributed system 10, a suitable service node (for example, the service node 120a) can be selected from the service node group as a new management node to replace the failed management node, so that the management node group enters the high availability mode again. For example, in the new active-standby mode, the nodes 110b and 120a serve as the master node and the standby node, respectively; in the dual-active mode, the nodes 110b and 120a serve as backups for each other.
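The relationship between group membership and the high availability modes described above can be illustrated with a minimal Python sketch. The names `HAMode`, `is_highly_available`, and `assign_roles` are hypothetical, as is the convention that the first node in the list becomes the master; the specification does not prescribe any of them.

```python
from enum import Enum


class HAMode(Enum):
    ACTIVE_STANDBY = "active-standby"
    DUAL_ACTIVE = "dual-active"
    MULTI_ACTIVE = "multi-active"


def is_highly_available(management_group):
    """A group with fewer than two nodes cannot offer any of the HA modes."""
    return len(management_group) >= 2


def assign_roles(mode, management_group):
    """Map each management node to its role under the given HA mode."""
    if not is_highly_available(management_group):
        raise ValueError("a high availability mode needs at least two management nodes")
    if mode is HAMode.ACTIVE_STANDBY:
        # Assumption: the first listed node acts as master, the rest as standby.
        master, *standbys = management_group
        roles = {master: "master"}
        roles.update({node: "standby" for node in standbys})
        return roles
    # Dual-active and multi-active: every node in the group is active.
    return {node: "active" for node in management_group}
```

With node 110a removed, `is_highly_available(["110b"])` is false; once 120a joins, `assign_roles(HAMode.ACTIVE_STANDBY, ["110b", "120a"])` yields 110b as master and 120a as standby, matching the example in the text.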
FIG. 2 is a flowchart of a method for self-recovering management capabilities in a distributed system according to an embodiment of the present invention. Exemplarily, the method of this embodiment is described using the distributed system shown in FIG. 1. It should be noted, however, that the method is not limited to a distributed system with a specific architecture. It should also be noted that the steps of the method can be executed, individually or cooperatively, by hardware devices or software modules deployed on one or more nodes of the distributed system 10, or by a device or module in the distributed system 10 that is independent of the individual nodes.
Referring to Fig. 2, in step S210, it is monitored whether any management node in the management node group has failed. If a failed management node (such as node 110a) is detected, step S220 is entered, otherwise the monitoring is continued.
In step S220, the failed management node 110a is removed from the management node group. At this point, only the management node 110b is responsible for supporting and safeguarding the service nodes 120a-120h, so the high availability mode is unavailable.
The method then proceeds to step S230, in which a service node with high availability (for example, node 120a) is selected from the service node group as the new management node. The manner of selection is described in detail below.
The method then proceeds to step S240, in which the new management node 120a is added to the management node group. The management node group can thus enter the high availability mode again. For example, in the active-standby mode, the nodes 110b and 120a serve as the master node and the standby node, respectively; in the dual-active mode, the nodes 110b and 120a serve as backups for each other.
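One pass through steps S210 to S240 might be sketched in Python as follows. This is only an illustration under stated assumptions: `recover_management_group`, `is_failed`, and `select_node` are hypothetical names, the node groups are modeled as plain lists of identifiers, and removing the promoted node from the service node group is a choice of this sketch that the specification does not state.

```python
def recover_management_group(management_group, service_group, is_failed, select_node):
    """Run steps S210-S240 once and return the list of promoted nodes."""
    promoted = []
    # S210: monitor the management node group for failed nodes.
    failed = [node for node in management_group if is_failed(node)]
    for node in failed:
        # S220: remove the failed management node from the group.
        management_group.remove(node)
        if not service_group:
            continue  # no candidate available; HA cannot be restored here
        # S230: select a service node with high availability.
        candidate = select_node(service_group)
        # Assumption of this sketch: a promoted node leaves the service group.
        service_group.remove(candidate)
        # S240: add the new management node to the management node group.
        management_group.append(candidate)
        promoted.append(candidate)
    return promoted
```

The selection strategy is passed in as `select_node`, so any of the selection methods of the later embodiments can be plugged in without changing the recovery flow.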
Fig. 3 is a flowchart of a method for selecting a new management node according to another embodiment of the present invention. This embodiment can be used as a specific way to implement step S230 in the method shown in FIG. 2.
As shown in FIG. 3, in step S310, each service node is caused to send requests to the remaining nodes (including management nodes and service nodes) in the distributed architecture.
The method then proceeds to step S320, in which each node that receives a request returns an accounting confirmation response based on a blockchain accounting confirmation mechanism.
The method then proceeds to step S330, in which a service node with high availability is selected as the new management node according to the confirmation responses to the requests sent by each service node.
In step S330, preferably, the high availability can be expressed by the response success rate and/or the average response time for the sent requests within a set time period, and a service node with a higher response success rate and/or a shorter average response time within the set time period can be selected as the new management node. Exemplarily, a score combining the response success rate and the average response time can be determined for each service node (the score can be, for example, a weighted sum of the reciprocal of the average response time and the response success rate), and the service node with the highest score can be selected as the new management node.
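The example score in step S330, a weighted sum of the reciprocal of the average response time and the response success rate, can be sketched in Python. The function names, the default weights of 0.5, and the use of seconds as the time unit are assumptions of this sketch; the specification does not fix them.

```python
def availability_score(success_rate, avg_response_time, w_rate=0.5, w_time=0.5):
    """Weighted sum of the success rate and the reciprocal of the average response time."""
    if avg_response_time <= 0:
        raise ValueError("average response time must be positive")
    return w_rate * success_rate + w_time * (1.0 / avg_response_time)


def select_by_score(stats):
    """Pick the node with the highest score.

    stats maps a node id to (success_rate, avg_response_time_in_seconds),
    both measured over the set time period.
    """
    return max(stats, key=lambda node: availability_score(*stats[node]))
```

Because the reciprocal term grows quickly for small response times, the weights would in practice have to be tuned to the unit in which time is measured; otherwise the response time can dominate the success rate.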
Fig. 4 is a flowchart of a method for selecting a new management node according to another embodiment of the present invention. This embodiment can be used as a specific way to implement step S230 in the method shown in FIG. 2.
As shown in FIG. 4, in step S410, network communication data of each service node in the process of providing services is acquired.
Then, step S420 is entered, and a service node with high availability is selected as a new management node according to the network communication data of each service node.
In step S420, preferably, the high availability can be expressed by the average response time between each service node and the other nodes within a set time period, and the service node with the shortest average response time is selected as the new management node.
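The selection rule of step S420 can be illustrated with a short Python sketch. The name `select_by_response_time` and the input layout are hypothetical, and skipping nodes that have no samples in the period is an assumption of this sketch rather than something the specification states.

```python
from statistics import mean


def select_by_response_time(samples):
    """Pick the node with the shortest average response time.

    samples maps a node id to the list of response times (in seconds)
    observed between that node and other nodes during the set time period.
    """
    # Assumption: nodes with no observations in the period are not candidates.
    averages = {node: mean(times) for node, times in samples.items() if times}
    if not averages:
        raise ValueError("no response time data available")
    return min(averages, key=averages.get)
```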
Fig. 5 is a flowchart of a method for selecting a new management node according to another embodiment of the present invention. This embodiment can be used as a specific way to implement step S230 in the method shown in FIG. 2.
As shown in FIG. 5, in step S510, the resource usage of each service node in the process of providing services is obtained.
The method then proceeds to step S520, in which a service node with high availability is selected as the new management node according to the resource usage of each service node.
In step S520, preferably, the high availability can be expressed by the average resource utilization of each service node within a set time period, and the service node with the lowest average resource utilization is selected as the new management node.
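The selection rule of step S520 can be sketched in Python as well. The specification does not say which resources are measured or how they are combined, so the `cpu` and `mem` keys and the equal weighting of resources below are assumptions of this sketch, as are the function names.

```python
from statistics import mean


def average_utilization(samples):
    """Average utilization over the period.

    samples is a list of dicts such as {"cpu": 0.4, "mem": 0.6}, each value
    a fraction between 0 and 1. Resources are weighted equally (assumption).
    """
    return mean(mean(list(sample.values())) for sample in samples)


def select_by_utilization(usage):
    """Pick the node with the lowest average resource utilization."""
    averages = {node: average_utilization(s) for node, s in usage.items() if s}
    if not averages:
        raise ValueError("no resource usage data available")
    return min(averages, key=averages.get)
```

A lightly loaded node is preferred on the reasoning that it has the most headroom to take on management duties in addition to, or instead of, its service workload.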
Fig. 6 is a block diagram of an apparatus for self-recovering management capabilities in a distributed system according to another embodiment of the present invention.
As shown in FIG. 6, the apparatus 60 of this embodiment for self-recovering management capabilities in a distributed system includes a first module 610, a second module 620, and a third module 630. The first module 610 is configured to remove a failed management node from the management node group if it is detected that a management node in the management node group has failed; the second module 620 is configured to select a service node with high availability from the service node group as the new management node; and the third module 630 is configured to add the new management node to the management node group.
Fig. 7 is a block diagram of an apparatus for self-recovering management capabilities in a distributed system according to another embodiment of the present invention.
The device 70 shown in FIG. 7 includes a memory 710, a processor 720, and a computer program 730 that is stored on the memory 710 and executable on the processor 720, wherein the computer program 730, when run on the processor 720, performs the method of the embodiments described above with reference to FIGS. 2-5.
According to one aspect of the present invention, a computer-readable storage medium is provided, on which a computer program is stored, and when the program is executed by a processor, the method according to the embodiments described in FIGS. 2-5 is implemented.
The embodiments and examples presented herein are provided in order to best illustrate the embodiments according to the present technology and its specific applications, and thereby enable those skilled in the art to implement and use the present invention. However, those skilled in the art will know that the above description and examples are provided only for ease of description and examples. The presented description is not intended to cover every aspect of the invention or to limit the invention to the precise form disclosed.
In view of the foregoing, the scope of the present disclosure is determined by the following claims.