WO2019011018A1 - Fault processing method and device for a node in a cluster - Google Patents

Fault processing method and device for a node in a cluster

Info

Publication number
WO2019011018A1
Authority
WO
WIPO (PCT)
Prior art keywords: node, cluster, nodes, sub, fault
Prior art date
Application number
PCT/CN2018/082663
Other languages
English (en)
French (fr)
Inventor
曾艳
于璠
王胤文
帅煜韬
岳晓明
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority to EP18832228.3A (published as EP3627767B1)
Priority to CA3066853A (published as CA3066853A1)
Publication of WO2019011018A1
Priority to US16/732,749 (published as US11115263B2)


Classifications

    • H ELECTRICITY
      • H04 ELECTRIC COMMUNICATION TECHNIQUE
        • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
          • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
            • H04L41/06 Management of faults, events, alarms or notifications
              • H04L41/0686 Additional information in the notification, e.g. enhancement of specific meta-data
              • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
                • H04L41/065 ... involving logical or physical relationship, e.g. grouping and hierarchies
              • H04L41/0654 Management of faults, events, alarms or notifications using network fault recovery
                • H04L41/0668 ... by dynamic selection of recovery network elements, e.g. replacement by the most appropriate element after failure
              • H04L41/0677 Localisation of faults
            • H04L41/12 Discovery or management of network topologies
            • H04L41/40 ... using virtualisation of network functions or resources, e.g. SDN or NFV entities
          • H04L43/00 Arrangements for monitoring or testing data switching networks
            • H04L43/08 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
              • H04L43/0805 ... by checking availability
                • H04L43/0817 ... by checking functioning
              • H04L43/0823 Errors, e.g. transmission errors
                • H04L43/0829 Packet loss
            • H04L43/10 Active monitoring, e.g. heartbeat, ping or trace-route
            • H04L43/20 ... the monitoring system or the monitored elements being virtualised, abstracted or software-defined entities, e.g. SDN or NFV
    • G PHYSICS
      • G06 COMPUTING; CALCULATING OR COUNTING
        • G06F ELECTRIC DIGITAL DATA PROCESSING
          • G06F11/00 Error detection; Error correction; Monitoring
            • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
              • G06F11/14 Error detection or correction of the data by redundancy in operation
                • G06F11/1402 Saving, restoring, recovering or retrying
                  • G06F11/1415 Saving, restoring, recovering or retrying at system level
                    • G06F11/142 Reconfiguring to eliminate the error
                      • G06F11/1425 Reconfiguring to eliminate the error by reconfiguration of node membership
              • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
                • G06F11/0706 ... the processing taking place on a specific hardware platform or in a specific software environment
                  • G06F11/0709 ... in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
                • G06F11/0751 Error or fault detection not based on redundancy
                  • G06F11/0754 Error or fault detection not based on redundancy by exceeding limits

Definitions

  • the present application relates to the field of computer technologies, and in particular, to a fault processing method and device for a node in a cluster.
  • ICT: Information and Communications Technology.
  • SDN (Software Defined Network) is an implementation of network virtualization.
  • A distributed SDN implementation, or cloudification of the product, inevitably has to solve problems such as distributed cluster management.
  • the SDN service has higher requirements on the communication capabilities of the cluster nodes, and the decentralized architecture of the Akka cluster ensures the communication capability between the nodes. Therefore, some companies have adopted the Akka cluster to build the clustering capabilities of SDN controllers.
  • the nodes in the Akka cluster are divided into leader nodes and non-leader nodes.
  • The leader node is responsible for letting nodes join the cluster or kicking nodes out of the cluster; its other functions are the same as those of ordinary nodes.
  • Under Akka's fault detection mechanism, if a link fails or a node randomly loses packets, Akka cannot tell whether the link is faulty or the node is dropping packets; heartbeat detection only shows that the nodes at the two ends cannot communicate with each other. To ensure reliability, the nodes at both ends of the link are both kicked out, so there is a high probability that a healthy node is kicked out by mistake.
  • With Akka's fault detection and processing mechanism, in a poor network, or when a link or node fails (the node hangs or loses packets severely), the fault handling strategy of the Akka cluster is very likely to cause such false kicks, leaving no more than half of the cluster nodes; since services usually require more than half of the cluster nodes to operate normally, the cluster is very likely to become unavailable, which increases service migration and fault recovery overhead and seriously affects users of the SDN.
  • To address this, industry solutions usually modify the gossip algorithm itself or improve fault detection.
  • For example, when the heartbeat between node i and node j is unreachable, node i typically delegates other nodes to ping node j; if any of the delegated nodes can reach node j, node j is considered reachable. This method increases the time overhead of fault detection and the amount of data synchronized by the gossip algorithm.
  • The embodiment of the present invention provides a fault processing method and device for a node in a cluster, which can accurately locate the faulty node and the faulty link when a link in the cluster fails, eliminate the fault in the cluster at a small cost, and reduce the probability that the cluster restarts and that the service becomes unavailable.
  • an embodiment of the present application provides a method for processing a fault in a node in a cluster.
  • The method includes: acquiring fault detection topology information of the cluster, where one node in the cluster is fault-detected by at least one other node in the cluster, and fault detection is usually performed by the detection node sending a detection packet (usually a heartbeat packet) to the detected node.
  • The fault detection topology information includes the fault detection relationships between the detection nodes and the detected nodes in the cluster. A fault indication message is obtained from a detection node, where the fault indication message is used to indicate that a detected node is unreachable to the detection node. The sub-clusters in the cluster are determined according to the fault detection topology information and the fault indication message, where nodes belonging to different sub-clusters are mutually unreachable, and the working cluster is determined according to the sub-clusters of the cluster. The cluster in the embodiment of the present application can be a decentralized cluster.
  • Acquiring the fault detection topology information of the cluster may include: receiving fault detection topology information sent by other nodes in the cluster; or calculating the fault detection topology information based on a preset rule.
  • The fault detection topology information and the fault indication messages of the cluster are combined to determine the sub-clusters into which the cluster splits after a fault, the working cluster is determined according to the sub-clusters, and the optimal sub-cluster can be selected as the working cluster, for example the sub-cluster that contains the largest number of nodes. This contrasts with directly deleting the two nodes at both ends of the faulty link, or using additional detection to determine the node to be deleted.
  • Any node in the cluster can take the role of a detection node, the role of a detected node, or both roles in different fault detection relationships.
  • By determining the sub-clusters from the fault detection topology information and the fault indication messages, the available sub-clusters generated after a cluster fault can be determined at a small cost, the available nodes in the cluster are retained to the greatest extent, and the number of available nodes in the cluster is increased, which ensures high availability, reduces the probability of a cluster restart and of service unavailability, and reduces the cost of fault recovery and service migration.
  • The working cluster may be determined based on the number of nodes, for example, the sub-cluster with the largest number of nodes is determined to be the working cluster.
  • The working cluster may be determined according to the seed nodes, for example, the sub-cluster with the most seed nodes is determined to be the working cluster.
  • The working cluster may be determined according to the nodes running the main service, for example, the sub-cluster with the most nodes running the main service is determined to be the working cluster.
  • The working cluster may be determined based on health status or available resource status, for example, the sub-cluster whose health status meets a preset condition and that has the largest number of nodes is determined to be the working cluster, or the sub-cluster whose available resource status meets a preset condition is determined to be the working cluster.
  • Several of the above considerations may be combined, for example, the sub-cluster that contains a seed node and has the largest number of nodes is determined to be the working cluster, or the sub-cluster whose health status meets a preset condition and that has the most nodes running the main service is determined to be the working cluster.
  • the embodiment of the present application can determine the faulty link and the faulty node more reasonably based on the more comprehensive information, so as to determine the working cluster more reasonably.
  • When determining the sub-clusters in the cluster according to the fault detection topology information and the fault indication messages, the faulty node and the faulty link determined from the fault indication messages and the fault detection topology information can be marked on a network topology diagram formed by the network connection relationships of the nodes in the cluster, thereby determining the faulty links and faulty nodes in the cluster, and the mutually unreachable sub-clusters are determined according to the determined faulty links and faulty nodes.
  • Based on graph theory and the network topology diagram, the faulty link and the faulty node can be determined more reasonably from more comprehensive information, so that the working cluster is determined more reasonably.
  • If, according to the fault indication messages, all of the detection nodes that detect a node find the node unreachable, the node is a faulty node; if only some of the detection nodes that detect a node find it unreachable while the others can still reach it, the links between the unreachable detection nodes and the node are faulty links.
  • The sub-cluster that can work normally at the lowest cost can be selected as the working cluster, where the more attributes of the original cluster the sub-cluster inherits (such as the number of nodes, the number of seed nodes, and whether the main service keeps running), the smaller the cost.
  • Determining the working cluster may include any one or more of the following methods: determining that the sub-cluster with the largest number of nodes is the working cluster; determining that the sub-cluster that includes a seed node and has the largest number of nodes is the working cluster; determining that the sub-cluster with the most seed nodes is the working cluster; determining that the sub-cluster with the most nodes running the main service is the working cluster; and determining the working cluster based on health status or available resource status.
  • In this way, the working cluster can be selected so that it can operate normally, is the healthiest, affects existing services at the minimum cost, and so on.
  • Determining the sub-clusters in the cluster according to the fault detection topology information and the fault indication message includes: determining, according to the fault detection topology information, a topology graph based on the fault detection relationships between the nodes; deleting from the topology graph the edges corresponding to the fault indication messages; determining the connected subgraphs of the topology graph after the deletion; and determining the sub-clusters according to the connected subgraphs (a sketch of this step follows below).
  • The sub-cluster corresponding to a connected component or a strongly connected component of the topology graph is determined to be the working cluster.
  • By combining a topology graph that reflects the fault detection relationships among the cluster nodes, the embodiment of the present application can determine the sub-clusters conveniently and accurately through the connection relationships in the graph.
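  • As an illustration of this step (not the claimed implementation), the following Python sketch treats the fault detection relationships as an undirected graph, deletes the edges reported by fault indication messages, and returns the connected subgraphs as candidate sub-clusters; the node names and data shapes are assumptions of the example.

```python
from collections import defaultdict

def sub_clusters(nodes, detection_edges, fault_edges):
    """nodes: iterable of node ids; detection_edges / fault_edges: (detector, detected) pairs."""
    adj = defaultdict(set)
    for a, b in detection_edges:
        if (a, b) in fault_edges or (b, a) in fault_edges:
            continue                      # drop edges reported as unreachable
        adj[a].add(b)
        adj[b].add(a)
    seen, clusters = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:                      # plain DFS over the remaining edges
            n = stack.pop()
            if n in component:
                continue
            component.add(n)
            stack.extend(adj[n] - component)
        seen |= component
        clusters.append(component)
    return clusters                       # each connected subgraph is one sub-cluster

# example: a 4-node ring where both detection edges into 'c' are reported faulty
print(sub_clusters(['a', 'b', 'c', 'd'],
                   [('a', 'b'), ('b', 'c'), ('c', 'd'), ('d', 'a')],
                   {('b', 'c'), ('d', 'c')}))
# -> [{'a', 'b', 'd'}, {'c'}]: node 'c' splits off as its own sub-cluster
```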
  • the working cluster may further include a faulty link, that is, the detecting node and the detected node in the fault indication information are included in the working cluster.
  • a faulty link can be eliminated by deleting one or more nodes in the working cluster.
  • the node to be deleted can be determined according to a preset rule.
  • The method further includes: determining that the unreachable node in the working cluster that is pointed to by the most fault indication messages is the node to be deleted; and sending a first indication message to the other nodes in the cluster, where the first indication message is used to indicate the node to be deleted.
  • the elimination of the faulty link in the working cluster can be further implemented.
  • Compared with the prior art, which directly deletes the two nodes at both ends of the faulty link or uses additional detection to determine the node to be deleted, determining the node to be deleted as the unreachable node pointed to by the most fault indication messages allows the embodiment of the present application to retain the available nodes in the cluster to the greatest extent and increase the number of available nodes in the cluster, ensuring high availability, reducing the probability of a cluster restart and of service unavailability, and reducing the cost of fault recovery and service migration.
  • the execution body of this embodiment may be any one of the nodes in the cluster.
  • the node is referred to as a control node.
  • the deleted node can be the control node itself.
  • If the control node itself is on the faulty link, it also needs to determine whether it is the node to be deleted, so that the information seen by each node is consistent and the accuracy of locating the faulty node and the faulty link is improved.
  • Determining that the unreachable node in the working cluster pointed to by the most fault indication messages is the node to be deleted includes: among the unreachable nodes in the working cluster that are pointed to by the most fault indication messages, determining the one with the worst health status as the node to be deleted, where the health status is determined based on the time the node takes to respond to heartbeat messages. In this way, the unreachable node can be deleted at a small cost according to the number of fault indication messages pointing to it and its health status, reducing the probability of the service becoming unavailable.
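  • A minimal sketch of this selection rule, assuming fault indication messages are available as (detection node, detected node) pairs and that a larger health score means a less healthy node (both are assumptions of the example, not requirements of the application):

```python
from collections import Counter

def node_to_delete(fault_indications, health):
    """fault_indications: (detector, detected) pairs; health: node -> phi-like score."""
    pointed = Counter(detected for _, detected in fault_indications)
    if not pointed:
        return None
    # most fault indications first, then the least healthy (largest score)
    return max(pointed, key=lambda n: (pointed[n], health.get(n, 0.0)))

indications = [('a', 'c'), ('b', 'c'), ('b', 'd'), ('a', 'd')]
print(node_to_delete(indications, {'c': 2.5, 'd': 9.1}))   # -> 'd' (tie broken by worst health)
```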
  • In a second aspect, a method for processing a fault of a node in a cluster is provided.
  • The method includes: acquiring fault detection topology information of the cluster, where a node in the cluster is fault-detected by at least one other node in the cluster, and the fault detection topology information includes the fault detection relationships between the detection nodes and the detected nodes in the cluster; obtaining a fault indication message from a detection node, where the fault indication message is used to indicate that a detected node is unreachable to the detection node; and determining, according to the fault detection topology information, that the unreachable node pointed to by the most fault indication messages is the node to be deleted.
  • The control node only needs to receive the fault detection topology information and fault indication messages sent by the other nodes in the cluster to determine the node to be deleted. Compared with the prior art, the control node does not need to directly detect the other nodes, which reduces the burden on the control node; when any node in the cluster is selected as the control node, the impact on the original services of that node is reduced.
  • Determining that the unreachable node in the working cluster pointed to by the most fault indication messages is the node to be deleted includes: among the unreachable nodes in the working cluster that are pointed to by the most fault indication messages, determining the one with the worst health status as the node to be deleted, where the health status is determined based on the time the node takes to respond to heartbeat messages. In this way, the unreachable node can be deleted at a small cost according to the number of fault indication messages pointing to it and its health status, reducing the probability of the service becoming unavailable.
  • After the node to be deleted is determined, the sub-clusters formed by the deleted nodes and by the nodes that are not deleted in the cluster are determined according to the fault detection topology information, where nodes belonging to different sub-clusters are mutually unreachable, and the working cluster is determined according to the sub-clusters of the cluster.
  • When a link in the cluster is faulty, the faulty node and the faulty link are accurately located according to the topology diagram, the unreachable node in the cluster is deleted at a small cost, and the probability of a cluster restart and of service unavailability, as well as the overhead of fault recovery and service migration, are reduced.
  • When determining a sub-cluster, the sub-cluster can be determined according to the method provided in the first aspect.
  • an embodiment of the present application provides a fault processing device.
  • The fault handling device has the function of implementing the behavior of the node in the above methods. This function can be implemented by hardware, or by hardware executing the corresponding software.
  • the hardware or software includes one or more units corresponding to the functions described above.
  • The fault handling device includes a transceiver, a processor, and a memory, where the transceiver is configured to communicate with other nodes, and the memory is configured to store data and programs.
  • The processor executes the computer-executable instructions stored in the memory, so that the fault handling device performs the fault processing method in the first aspect and the optional implementations of the first aspect, or in the second aspect and the optional implementations of the second aspect.
  • The embodiment of the present application provides a computer-readable storage medium, configured to store the computer-readable instructions used by the foregoing fault handling device, which include the program designed for the first aspect and its optional implementations or for the second aspect and its optional implementations.
  • The embodiment of the present application provides a computer program product, configured to store the computer-readable instructions used by the foregoing fault handling device, which include the program designed for the first aspect and its optional implementations or for the second aspect and its optional implementations.
  • FIG. 1 is a schematic diagram of a network architecture
  • FIG. 2 is a schematic diagram of signaling interaction of a method for processing a node in a cluster according to an embodiment of the present disclosure
  • FIG. 3 is a schematic flowchart of another method for processing a fault in a node in a cluster according to an embodiment of the present disclosure
  • Figure 4 is an example provided by an embodiment of the present application.
  • FIG. 5 is another example provided by an embodiment of the present application.
  • Figure 6 is still another example provided by the embodiment of the present application.
  • FIG. 7 is still another example provided by the embodiment of the present application.
  • FIG. 8 is still another example provided by the embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of a fault processing apparatus according to an embodiment of the present application.
  • FIG. 10 is a schematic structural diagram of a fault processing device according to an embodiment of the present disclosure.
  • The method and device for processing a fault of a node in a cluster provided in the embodiments of the present application may be applied to a decentralized cluster to deal with node faults, specifically, a node in the cluster deletes the faulty node, further selects the working cluster, and so on.
  • The cluster includes a plurality of nodes, where the node responsible for letting nodes join the cluster or kicking nodes out of the cluster is the control node 10, and the other nodes are ordinary nodes 20.
  • The ordinary nodes 20 in FIG. 1 include ordinary node 201, ordinary node 202, ordinary node 203, ordinary node 204, and the like.
  • the control node 10 is a node that meets a preset rule among the plurality of nodes included in the cluster.
  • Generally, the reachable node with the smallest IP+Port among the nodes in the cluster is selected as the control node; if the control node fails, or other circumstances require the control node to be re-determined, a new control node is selected from the remaining reachable nodes according to the same rule.
  • any member node in the Akka cluster may become the leader of the cluster (ie, the control node), which is a deterministic result based on the Gossip Convergence process.
  • Leader is just a role, and the leader may change in each round of gossip convergence.
  • the role of the Leader is to make member nodes (ie, ordinary nodes) enter or leave the cluster.
  • a member node starts in the joining state. Once all nodes see the node that newly joins the Akka cluster, the leader sets the state of the node to up.
  • When a node is in the unreachable state, gossip convergence cannot be reached and the leader cannot perform any operation on the cluster; therefore the state of an unreachable node must be changed: it must become reachable again or be set to the down state (that is, removed from the cluster). If a node in the down state wants to join the Akka cluster again, it needs to restart and rejoin the cluster (through the joining state).
  • the nodes in the cluster identify faults through fault detection for fault handling.
  • The fault detection relationship between two nodes may be one-way detection or two-way detection: if only one of the two nodes detects the other, the detection is one-way; if the two nodes detect each other, the detection is two-way. For example, as shown in FIG. 1, if the control node 10 detects ordinary node 202 but ordinary node 202 does not detect the control node 10, the detection between control node 10 and ordinary node 202 is one-way; if the control node 10 and ordinary node 201 can detect each other, the detection between control node 10 and ordinary node 201 is two-way.
  • the cluster shown in FIG. 1 is only an example.
  • the cluster in the embodiment of the present application may include more or fewer nodes, and the nodes in the cluster may also include other connection manners.
  • The application provides a fault processing method, apparatus, and device for nodes in a cluster.
  • The sub-clusters into which the cluster splits after a fault are determined, the working cluster is determined according to the sub-clusters, and the optimal sub-cluster can be selected as the working cluster, for example the sub-cluster that contains the largest number of nodes.
  • Compared with directly deleting the two nodes at both ends of the faulty link, or determining the node to be deleted through additional detection, the embodiment of the present application can retain the available nodes in the cluster to the greatest extent at a small cost, increase the number of available nodes in the cluster, ensure high availability, reduce the probability of a cluster restart and of service unavailability, and reduce the cost of fault recovery and service migration.
  • FIG. 2 is a schematic diagram of signaling interaction of a fault processing method of a node in a cluster according to an embodiment of the present disclosure. As shown in FIG. 2, the method specifically includes:
  • The control node acquires the fault detection topology information of the cluster, where the topology information includes the fault detection relationships between the detection nodes and the detected nodes in the cluster.
  • One node in the cluster is fault detected by at least one other node in the cluster.
  • The fault detection topology information of the cluster can be determined according to the node detection mechanism. Specifically, each node sends a fault detection topology message to the control node, and each fault detection topology message is used to indicate the one or more nodes that are fault-detected by one node.
  • The control node receives the multiple fault detection topology messages sent by the other nodes in the cluster and combines them to obtain the fault detection relationships among all nodes in the cluster.
  • The fault detection topology information of the cluster can also be derived according to a preset rule, for example, the nodes in the cluster are numbered and the preset rule is that each node detects the two nodes whose numbers immediately follow its own. Once the control node knows the node numbers in the cluster and the preset rule, it can derive the fault detection relationships between the nodes and thus obtain the fault detection topology information.
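  • For illustration, the following sketch derives detection pairs from one possible preset rule; the specific rule used here (each node detects the next two nodes by number, wrapping around) is an assumption of the example:

```python
def detection_pairs(node_ids):
    """Return (detector, detected) pairs: each node detects the next two nodes."""
    n = len(node_ids)
    pairs = []
    for i, node in enumerate(node_ids):
        for offset in (1, 2):
            pairs.append((node, node_ids[(i + offset) % n]))
    return pairs

print(detection_pairs([301, 302, 303, 304]))
# [(301, 302), (301, 303), (302, 303), (302, 304), (303, 304), (303, 301), (304, 301), (304, 302)]
```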
  • The control node acquires a fault indication message from a detection node, where the fault indication message is used to indicate that the detected node is unreachable to the detection node.
  • When a detection node finds that a detected node is unreachable, it may send a fault indication message to the nodes connected to it (including the control node), where the detection node is itself and the detected node is the node it detects.
  • Each node in the cluster can maintain a global cluster node list according to the fault indication messages, and the list can reflect the state of each node in the cluster, where the state of a node includes whether the node has been detected as unreachable and by which node it was detected as unreachable.
  • Unreachable can also be regarded as an additional flag on a node's state, used to indicate that some node in the cluster cannot communicate normally with that node.
  • the control node determines, according to the fault detection topology information and the fault indication message, the sub-cluster in the cluster, where the nodes belonging to the different sub-cluster are mutually unreachable.
  • Because a fault indication message can only indicate that the detected node is unreachable, it cannot be determined whether the node is unreachable because the link between the detection node and the detected node has failed or because the detected node itself has failed. Therefore, on the basis of the fault detection topology information formed by the fault detection relationships among all the nodes, the fault detection topology map containing the current fault indication messages can be obtained by combining the fault indication messages, and, based on graph theory, it is possible to determine the sub-clusters into which the cluster splits after a failure.
  • A fault detection relationship topology map may be formed from the fault detection topology information, where each directed edge in the map indicates a detection relationship between a detection node and a detected node.
  • the edge between the detection node corresponding to the fault indication message and the detected node is marked as a fault edge, thereby obtaining a fault detection relationship topology diagram including the current fault edge.
  • On this basis, the faulty links and faulty nodes in the cluster can be determined: the detection link corresponding to a fault edge is a faulty link, and if all of the edges pointing to the same node are fault edges, that node is a faulty node. According to the determined faulty nodes and fault edges, it can be determined whether the cluster has split into mutually unreachable sub-clusters.
  • The control node then determines the working cluster according to the sub-clusters.
  • the working cluster can be determined according to one or more of the following methods:
  • The sub-cluster that contains the node running the main service is determined to be the working cluster.
  • the main service can refer to the key service in all services or the main service in the active/standby service.
  • The nodes in the cluster can be divided into seed nodes and non-seed nodes.
  • A seed node is statically configured in the configuration file and is a contact point through which ordinary nodes join the cluster.
  • Non-seed nodes (ordinary nodes) join the cluster dynamically through the seed nodes.
  • control node can determine the working cluster according to the following conditions.
  • The sub-cluster with the largest number of nodes that contains at least one seed node is retained as the working cluster. If there is exactly one sub-cluster with the largest number of nodes that contains at least one seed node, that sub-cluster is determined to be the working cluster. If there are at least two such sub-clusters, with the same number of nodes and each containing seed nodes, the sub-cluster containing more seed nodes is retained as the working cluster.
  • In other words, the sub-cluster with the largest number of nodes and the most seed nodes is the working cluster. If there are at least two sub-clusters with the same largest number of nodes and the same number of seed nodes, the sub-cluster that contains the seed node with the smallest ip+port is determined to be the working cluster (a sketch of this selection follows below).
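  • The following sketch illustrates this kind of selection; the "ip:port" string representation and comparing it as a plain string are simplifications assumed for the example:

```python
def choose_working_cluster(sub_clusters, seed_nodes):
    """sub_clusters: list of sets of 'ip:port' strings; seed_nodes: set of 'ip:port'."""
    def rank(cluster):
        seeds = cluster & seed_nodes
        # more nodes first, then more seed nodes, then holding the overall smallest seed
        return (len(cluster), len(seeds), min(seed_nodes) in cluster)
    candidates = [c for c in sub_clusters if c & seed_nodes]   # must contain a seed node
    return max(candidates, key=rank) if candidates else None

seeds = {'10.0.0.1:2550', '10.0.0.2:2550'}
parts = [{'10.0.0.1:2550', '10.0.0.3:2550'}, {'10.0.0.2:2550', '10.0.0.4:2550'}]
print(choose_working_cluster(parts, seeds))
# both parts tie on node count and seed count; the first holds the smallest seed, so it wins
```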
  • After the control node determines the working cluster, there may still be a faulty link in the working cluster.
  • the faulty link does not cause the cluster to split, that is, the cluster is not divided into multiple sub-clusters.
  • the control node can also eliminate the faulty link in the working cluster by making a decision, and specifically includes the following steps:
  • The unreachable node in the working cluster that is pointed to by the most fault indication messages is determined, that is, the node in the working cluster that most often appears as the detected node in fault indication messages.
  • The number of times fault indication messages point to an unreachable node can be determined in at least the following ways.
  • The received fault indication messages can be counted directly to determine the number of times the node appears as the detected node.
  • Each node in the cluster can maintain a global cluster node list, which reflects the state of each node in the cluster, where the state of a node includes whether the node is reachable and, for unreachable nodes, the number of times the node has been found unreachable; the more times a node has been found unreachable, the more fault indication messages point to it.
  • Each node in the cluster may receive at least one fault indication message. If, combined with the fault detection relationships of all nodes in the cluster, it is determined that the detection node and the detected node are in a unidirectional fault detection relationship, the detected node corresponding to the message is marked as unreachable once; for example, a set of unreachable nodes can be maintained, the unreachable node is added to the set, and its unreachable count is recorded. If the detection node and the detected node are in a bidirectional fault detection relationship, both the detection node and the detected node corresponding to the fault indication message are added to the unreachable node set, and their unreachable counts are recorded separately.
  • If the detection node or the detected node in a fault indication message is the node itself, the node also adds itself to the unreachable node set and records its unreachable count, so that the unreachable node sets and the corresponding unreachable counts maintained by all nodes in the cluster are consistent.
  • the determination of the unreachable node is described in more detail below.
  • each node in the cluster obtains fault detection topology information after the cluster starts normally.
  • A node in the cluster receives a fault indication message, which is used to indicate an unreachable event, that is, to indicate that the detected node is unreachable to the detection node; the corresponding detected node is marked as unreachable in the global cluster node list, and its unreachable count is increased by one.
  • each node in the cluster obtains fault detection topology information after the cluster starts normally.
  • The detection node and the detected node corresponding to the fault indication message are both marked as unreachable in the global cluster node list, and each time such a fault indication message is received, the unreachable counts of the detection node and the detected node are each increased by one.
  • If a link is intermittently good and bad, heartbeat detection marks it as reachable while it is good and as unreachable while it is bad, and each time it is detected as unreachable the unreachable count is increased by one.
  • If a node determines, according to the fault indication message, that the detection node or the detected node includes itself, the node marks itself as unreachable in the global cluster node list that it maintains.
  • node 303 experiences a severe packet loss.
  • Take node 303 as an example:
  • each node in the cluster obtains fault detection topology information.
  • the nodes in the cluster receive the fault indication message.
  • Combining the fault detection topology information, each node in the cluster determines: if the link between the detection node and the detected node is a bidirectional link, the corresponding detection node and detected node are marked as unreachable in the global cluster node list, and the unreachable counts of the detection node and the detected node are each increased by one; if the link between the detection node and the detected node is a unidirectional link, the corresponding detected node is marked as unreachable in the global cluster node list, and the unreachable count of the detected node is increased by one.
  • If a node determines, according to the fault indication message, that the detection node or the detected node includes itself, and the link between the detection node and the detected node is bidirectional, the node marks itself as unreachable in the global cluster node list that it maintains and increases its own unreachable count by one.
  • In general, when node i receives a message that node j has been found unreachable by node k (k ≠ i), the unreachable count of node j is increased by one; when node i receives a message that node j has been found unreachable by node i itself, node i determines, according to the fault detection topology information, whether the link between node i and node j is bidirectional: if it is bidirectional, node i is on the faulty link, and the unreachable counts of node i and node j are each increased by one; if it is unidirectional, the unreachable count of node j is increased by one.
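  • A sketch of the counting rule just described, as each node i might maintain it; the data shapes (a Counter of unreachable counts and a callable answering whether two nodes detect each other) are assumptions of the example:

```python
from collections import Counter

def update_unreachable(counts, self_id, detector, detected, bidirectional):
    """counts: Counter of node -> times found unreachable."""
    if detector == self_id and bidirectional(detector, detected):
        # the receiving node itself sits on the suspected link: count both ends
        counts[detector] += 1
        counts[detected] += 1
    else:
        counts[detected] += 1
    return counts

counts = Counter()
bidir = lambda a, b: True             # assume mutual detection for the example
update_unreachable(counts, 'i', 'k', 'j', bidir)   # reported by another node: only j counted
update_unreachable(counts, 'i', 'i', 'j', bidir)   # reported by i itself: both i and j counted
print(counts)                          # Counter({'j': 2, 'i': 1})
```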
  • the list of global cluster nodes maintained on each node can be in the form shown in Table 3:
  • The health status of a node may be represented by its Phi value. The Phi calculation fits a normal distribution to the history of the most recent n samples (for example, 1000 heartbeat intervals) and then evaluates the health of the current heartbeat according to the current heartbeat response time; the larger the Phi value, the less healthy the heartbeat. When the Phi value exceeds a set threshold, the node is considered unreachable.
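  • An illustrative, accrual-style Phi computation under the normal-distribution assumption described above; the constants, the clamping, and the sample data are choices of the sketch, not values fixed by the application:

```python
import math
import statistics

def phi(time_since_last_heartbeat_ms, history_ms):
    """Suspicion level: fit a normal distribution to recent heartbeat intervals and
    return -log10 of the probability that a heartbeat arrives even later than now."""
    mean = statistics.mean(history_ms)
    stdev = max(statistics.stdev(history_ms), 1e-3)   # avoid division by zero
    z = (time_since_last_heartbeat_ms - mean) / stdev
    p_later = 0.5 * math.erfc(z / math.sqrt(2))        # P(interval > t) under the fit
    p_later = max(p_later, 1e-30)                      # clamp to keep phi finite
    return -math.log10(p_later)

history = [1000, 950, 1050, 900, 1100]   # recent heartbeat intervals in ms
print(phi(1050, history))    # ~0.6: heartbeat only slightly late, node looks healthy
print(phi(1500, history))    # ~10: long silence, node strongly suspected unreachable
```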
  • The health status of a node can also be determined by the node's unreachable count in the global cluster node list; the higher the unreachable count, the more unhealthy the node.
  • S260 Send an indication message to other nodes in the cluster, where the indication message is used to indicate the node to be deleted.
  • After determining the node to be deleted, the control node needs to notify the other nodes in the cluster, so that the other nodes delete the node to be deleted, or mark it as deleted (for example, in the down state), and the cluster can keep working normally.
  • the nodes in the cluster can determine whether they are control nodes. When they are the control nodes, steps S250-S260 are performed.
  • If an indication message including an unreachable event is still received, that is, there may still be an unreachable node, steps S250-S260 are performed again, and the iteration continues until no indication message including an unreachable event is received.
  • FIG. 3 is a schematic flowchart of another method for processing a fault in a node in a cluster according to an embodiment of the present disclosure.
  • the node can generate a topology map based on the fault detection information, and can perform fault processing in combination with the topology map.
  • the method specifically includes:
  • the fault detection topology information includes a fault detection relationship between the detected node and the detected node in the cluster.
  • One node in the cluster is fault detected by at least one other node in the cluster.
  • The fault detection topology information includes the fault detection relationships between the nodes in the cluster; according to these fault detection relationships, the fault detection relationship topology map of the cluster can be determined.
  • each node in the fault detection relationship topology map may be connected by a directed edge, that is, a fault detection relationship topology diagram reflects the direction of fault detection.
  • each node in the fault detection relationship topology map may also be connected by an undirected edge, that is, the fault detection relationship topology diagram does not reflect the direction of fault detection.
  • The fault detection relationship topology map may be a directed graph or an undirected graph. If the cluster contains both two-way and one-way fault detection relationships, the topology map may also mix the two: nodes in a one-way fault detection relationship are connected by directed edges, and nodes in a two-way fault detection relationship are connected by undirected edges.
  • the fault detection relationship topology map may be determined.
  • the node may determine the fault detection relationship topology map of the nodes in the Akka cluster according to the received fault detection topology information based on the heartbeat detection mechanism.
  • If the received fault detection topology information includes both the information that node a is the detection node and node b is the detected node, and the information that node b is the detection node and node a is the detected node, it is determined that node a and node b are connected and are in a bidirectional detection relationship. If the received fault detection topology information includes only the information that node a is the detection node and node b is the detected node, or only the information that node b is the detection node and node a is the detected node, it is determined that node a and node b are connected and are in a unidirectional fault detection relationship.
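  • A sketch of classifying the detection relationships into bidirectional and unidirectional edges from (detection node, detected node) pairs; the pair representation is an assumption of the example:

```python
def build_detection_graph(detection_pairs):
    """Return (bidirectional node pairs, unidirectional directed edges)."""
    pairs = set(detection_pairs)
    bidirectional, unidirectional = set(), set()
    for a, b in pairs:
        if (b, a) in pairs:
            bidirectional.add(frozenset((a, b)))   # store once per node pair
        else:
            unidirectional.add((a, b))
    return bidirectional, unidirectional

bi, uni = build_detection_graph([('a', 'b'), ('b', 'a'), ('a', 'c')])
print(bi)    # {frozenset({'a', 'b'})}
print(uni)   # {('a', 'c')}
```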
  • A fault indication message indicates that the detected node is unreachable to the detection node; the edge between the detection node and the detected node is deleted from the topology map based on the fault detection relationships between the nodes, and the fault detection relationship topology map after the deletion is determined.
  • The connected subgraphs are independent of each other, and there are no interconnected nodes between different connected subgraphs; each connected subgraph corresponds to one sub-cluster.
  • The detection relationships between the detection nodes and the detected nodes in the cluster are relatively rich, so they can reflect the network topology to a certain extent; in other words, the fault detection relationships in the cluster cover all the nodes in the cluster, and the link state of the cluster can be judged from the fault detection relationship topology map.
  • S330 may further include: determining the faulty node and the faulty link in the cluster according to the fault detection topology information and the fault indication message, deleting the faulty node and the faulty link from the network topology map of the cluster, determining the connected subgraphs of the network topology map after the deletion, and determining the sub-clusters according to those connected subgraphs, where the network topology map includes the network connection information between all nodes of the cluster.
  • The fault detection relationship topology map is generally not completely equivalent to the network topology map, that is, the fault detection relationships in the cluster do not span all the links in the cluster, so determining the sub-clusters according to the network topology map can be more accurate.
  • For a node failure, all connections to that node in the network topology map are considered disconnected; for a link failure in the network topology map, only the link corresponding to the fault is considered disconnected. According to the determined faulty nodes and fault edges, combined with the network topology map, it can be determined whether the cluster has split into mutually unreachable sub-clusters, and into which sub-clusters it has split.
  • The connected component or the strongly connected component of the fault detection relationship topology map or of the network topology map, after the fault edges are deleted, may be determined, and the sub-cluster corresponding to that connected component or strongly connected component is determined to be the working cluster.
  • If the fault detection relationship topology map or the network topology map is a directed graph, then after the edges corresponding to the fault indication messages are deleted, the largest of the strongly connected subgraphs is called a strongly connected component; if the map is an undirected graph, then after the edges corresponding to the fault indication messages are deleted, the largest of the connected subgraphs is called a connected component.
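  • For the directed case, strongly connected components can be computed, for example, with Kosaraju's algorithm; the sketch below is illustrative and not the claimed implementation:

```python
from collections import defaultdict

def strongly_connected_components(nodes, edges):
    fwd, rev = defaultdict(list), defaultdict(list)
    for a, b in edges:
        fwd[a].append(b)
        rev[b].append(a)

    order, seen = [], set()
    def dfs(n, graph, out):
        stack = [(n, iter(graph[n]))]
        seen.add(n)
        while stack:
            node, it = stack[-1]
            for nxt in it:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:
                stack.pop()
                out.append(node)       # record finishing order

    for n in nodes:                    # first pass: finishing order on the graph
        if n not in seen:
            dfs(n, fwd, order)
    seen, components = set(), []
    for n in reversed(order):          # second pass: DFS on the reversed graph
        if n not in seen:
            comp = []
            dfs(n, rev, comp)
            components.append(set(comp))
    return components

print(strongly_connected_components(
    ['a', 'b', 'c', 'd'],
    [('a', 'b'), ('b', 'a'), ('b', 'c'), ('c', 'd')]))
# -> components {'a', 'b'}, {'c'}, {'d'}; the largest one is the strongly connected component
```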
  • If the fault detection relationship topology map or the network topology map, after the edges corresponding to the fault indication messages are deleted, has multiple connected components or strongly connected components, the seed nodes, the nodes running the main service, the health status, the available resource status, and the like may be further considered to identify the working cluster. For details, refer to the related description of S240 in the foregoing embodiment.
  • After the control node determines the working cluster, there may still be a faulty link in the working cluster.
  • the faulty link does not cause cluster splitting, that is, the cluster is not divided into multiple sub-clusters.
  • the control node can also eliminate the faulty link in the working cluster by decision. For details, refer to the description in steps S250-S260 in the foregoing embodiment.
  • control node may also determine the node to be deleted in combination with the fault detection relationship topology map or the network topology map.
  • The fault detection relationship topology map of the nodes in the cluster includes the edges corresponding to the fault detection relationships between the nodes in the cluster, and the network topology map includes the edges corresponding to the network connection relationships between the nodes.
  • the degree value of the node can be determined according to the fault detection relationship topology map or the network topology map.
  • The control node may delete from the cluster the unreachable node with the highest degree value. Further, the control node may delete from the cluster the unreachable node in the unreachable node set with the highest in-degree formed by fault edges; the more fault edges an unreachable node is involved in, the greater the probability that the node is a faulty node.
  • The in-degree formed by fault edges of an unreachable node refers to the number of fault edges pointing to that node in the fault detection relationship topology map or the network topology map.
  • There may be multiple unreachable nodes with the same highest fault-edge in-degree; in that case, the most unhealthy node, determined according to the health status of the unreachable nodes, is the node to be deleted.
  • The control node may also determine the orphan points formed after the node to be deleted and other nodes to be kicked out are removed as further nodes to be deleted, where an orphan point refers to a node whose in-degree and out-degree are both 0 in the fault detection relationship topology map or the network topology map.
  • The control node determines the fault detection relationship topology map or the network topology map of the cluster after the node to be deleted has been removed and, according to that map, determines the nodes whose out-degree and in-degree are both 0 as the nodes to continue deleting (a sketch of this rule follows below).
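  • An illustrative sketch of this deletion rule: kick out the unreachable node with the most fault edges pointing at it, then keep removing any remaining node left with no detection edges at all (so both its in-degree and out-degree are 0); the edge representation is an assumption of the example:

```python
from collections import Counter, defaultdict

def nodes_to_delete(detection_edges, fault_edges):
    """detection_edges / fault_edges: (detector, detected) pairs; fault_edges is a subset."""
    fault_in = Counter(b for _, b in fault_edges)
    doomed = {max(fault_in, key=fault_in.get)} if fault_in else set()
    while True:
        degree = defaultdict(int)
        for a, b in detection_edges:
            if a in doomed or b in doomed:
                continue
            degree[a] += 1
            degree[b] += 1
        survivors = {n for e in detection_edges for n in e} - doomed
        orphans = {n for n in survivors if degree[n] == 0}
        if not orphans:
            return doomed
        doomed |= orphans        # orphan points are deleted in the next round too

edges = [('a', 'b'), ('b', 'c'), ('c', 'b'), ('a', 'c')]
faults = [('b', 'c'), ('a', 'c')]            # two fault edges point at 'c'
print(nodes_to_delete(edges, faults))        # {'c'}; no orphan points remain afterwards
```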
  • If one link is faulty, only one node on the link (the node with the larger phi value or the higher unreachable count) is kicked out; if the remaining nodes have no faults with other nodes, they can run normally.
  • If n links are faulty in parallel with no intersection (n is a positive integer), one node per link is kicked out (the node with the larger phi value or the higher unreachable count, n nodes in total); if the remaining nodes have no faults with other nodes, they can run normally.
  • If n intersecting links are faulty, the intersection node is kicked out (a node involved in more faulty links is kicked out with priority; when the number of links involved is the same, the node with the larger phi value or the higher unreachable count is kicked out); if the remaining nodes have no faults with other nodes, they can run normally.
  • the node to be deleted may also be deleted first.
  • The working cluster may then be determined.
  • The working cluster is the sub-cluster with the largest number of nodes that contains at least one seed node.
  • A sub-cluster here is a cluster formed either only by nodes that do not include the node to be deleted or only by nodes that include the node to be deleted.
  • The topology information of the nodes in the cluster includes the fault detection relationships between the nodes in the cluster. According to the fault detection relationships, the unreachable node pointed to by the most fault indication messages is determined to be the node to be deleted; for example, the degree value of an unreachable node can be determined from a topology map based on the fault detection relationships between the nodes.
  • The control node may delete from the cluster the unreachable node in the unreachable node set with the highest in-degree formed by fault edges. Further, the control node may notify each node in the cluster to delete the node to be deleted from the cluster.
  • There may be multiple unreachable nodes in the set with the same highest fault-edge in-degree.
  • In that case, the most unhealthy unreachable node, determined according to the health status of the unreachable nodes, is the node to be deleted.
  • If there are multiple equally unhealthy nodes, the node with the largest ip+port is determined to be the node to be deleted; this protects the control node, because the control node is generally the node with the smallest ip+port.
  • The deleted nodes form at least one sub-cluster, and the nodes that are not deleted also form at least one sub-cluster.
  • Each node in the cluster can determine, according to the topology map and the deleted nodes, the sub-cluster it is in and whether that sub-cluster is the working cluster. If its sub-cluster is the working cluster, each node in the working cluster stays in the working state; if it determines that its sub-cluster is not the working cluster, each node in that sub-cluster can be powered off.
  • The sub-cluster with the larger number of nodes and at least one seed node is retained as the working cluster.
  • If there is a unique sub-cluster that has the largest number of nodes and contains at least one seed node, that sub-cluster is determined to be the working cluster.
  • If at least two sub-clusters have the same, largest number of nodes and all contain seed nodes, the sub-cluster containing more seed nodes is retained as the working cluster. In other words, if there are multiple sub-clusters with the largest number of nodes and at least one seed node, the sub-cluster with the largest number of nodes and the largest number of seed nodes is determined to be the working cluster.
  • If the tie still cannot be broken, the sub-cluster containing the seed node with the smallest ip+port is retained as the working cluster. A selection sketch following these rules is given below.
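  • The following Python sketch is illustrative only; the tuple of tie-breakers mirrors the rules just described, and ip+port strings (with string order standing in for numeric ip+port order) are an assumed representation:

```python
def choose_working_cluster(sub_clusters, seed_nodes):
    """sub_clusters: list of sets of node ids (e.g. "10.0.0.1:2551").
    seed_nodes: set of node ids statically configured as seed nodes.
    """
    candidates = [sc for sc in sub_clusters if sc & seed_nodes]
    if not candidates:
        return None  # no sub-cluster contains a seed node
    smallest_seed = min(seed_nodes)  # smallest ip+port among the seed nodes
    return max(
        candidates,
        key=lambda sc: (len(sc), len(sc & seed_nodes), smallest_seed in sc),
    )

clusters = [{"10.0.0.1:2551", "10.0.0.2:2551"}, {"10.0.0.3:2551"}]
print(choose_working_cluster(clusters, {"10.0.0.1:2551", "10.0.0.3:2551"}))
```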
  • In a specific implementation, when each node in the cluster receives the indication message sent by the leader node to indicate the node to be deleted, it uses the fault detection relationship topology map or the network topology map of the cluster to determine the sub-cluster it belongs to, and whether that sub-cluster is the working cluster.
  • When a node determines, from the fault detection relationship topology map or the network topology map, that its sub-cluster has the largest number of nodes and contains a seed node, it determines that its sub-cluster is the working cluster; when it determines that its sub-cluster is not the one with the largest number of nodes, it determines that its sub-cluster is a non-working cluster.
  • When a node determines that its sub-cluster is one of at least two sub-clusters that have the same, largest number of nodes and all contain seed nodes, it further determines whether its sub-cluster is the one containing the most seed nodes among them: if so, its sub-cluster is the working cluster; if not, its sub-cluster is a non-working cluster.
  • When a node determines that its sub-cluster is one of at least two sub-clusters that have the same, largest number of nodes and the same number of seed nodes, it further determines whether the seed node with the smallest ip+port is in its sub-cluster: if so, its sub-cluster is the working cluster; if not, its sub-cluster is a non-working cluster.
  • In one example, as shown in FIG. 4, a 3-node cluster is taken as an example, and node 411 is the control node. Each node is detected (monitored) by the other two nodes in the cluster, and the detecting nodes send a heartbeat message every second. If the link between node 412 and node 413 fails, then during fault detection node 412 informs node 411 that node 413 is unreachable, and node 413 informs node 411 that node 412 is unreachable.
  • Under the existing Akka fault handling mechanism, both node 412 and node 413 are kicked out of the cluster by node 411, so that node 411, node 412, and node 413 end up forming three separate network partitions.
  • Under the delegation-based fault handling mechanism, node 412 finds node 413 unreachable and delegates node 411 to ping node 413; node 411 can reach node 413, so node 413 is marked reachable. Node 413 finds node 412 unreachable, delegates node 411 to ping node 412, finds node 412 reachable, and node 412 is marked reachable. As a result, no node is kicked out, the link fault between node 412 and node 413 persists, and the service data keeps suffering packet loss.
  • Based on this embodiment of the present application, when the situation in FIG. 4 occurs, the fault detection information shown in FIG. 4 (for example, the fault detection relationship topology map) is first determined; after the faulty link is excluded, a single sub-cluster is formed and determined to be the working cluster. Since a faulty link still exists in that sub-cluster, the node to be deleted is then determined: the fault indication messages point most often to node 412 and node 413, and one of node 412 and node 413 is further determined to be the node to be deleted. After node 412 or node 413 is deleted, the working cluster is "node 411, node 413" or "node 411, node 412".
  • In another example, as shown in FIG. 5, node 503 suffers severe packet loss. During fault detection, node 501 considers node 503 unreachable and informs node 502; node 503 considers node 501 unreachable and informs node 502; node 502 considers node 503 unreachable and informs node 501; and node 503 considers node 502 unreachable and informs node 501.
  • Under the existing Akka fault handling mechanism, each node promotes itself to leader and kicks out the other nodes, so the cluster splits into three independent clusters, each containing one node. When the leader node is on a faulty link, multiple leaders appear and their decisions are inconsistent, which also makes the cluster unavailable with high probability.
  • Based on this embodiment, when the situation in FIG. 5 occurs, the fault detection information shown in FIG. 5 (for example, the topology map based on the fault detection relationships) is first determined; after the faulty links are excluded, two sub-clusters are formed, one comprising node 501 and node 502 and the other comprising node 503. The sub-cluster comprising node 501 and node 502 is determined to be the working cluster; it contains no faulty link, and the cluster runs normally.
  • In yet another example, as shown in FIG. 6, a 5-node cluster is taken as an example, and node 604 becomes faulty with severe packet loss. During fault detection, the following is observed:
  • Node 603 considers node 604 to be unreachable, informing node 601, node 602, and node 605;
  • Node 605 considers node 604 to be unreachable, informing node 601, node 602, and node 603;
  • Node 604 considers node 603 to be unreachable, informing node 601, node 602, and node 605;
  • Node 604 considers node 605 to be unreachable, informing node 601, node 602, and node 603.
  • Under the existing Akka fault handling mechanism, node 601 and node 602 consider node 603, node 604, and node 605 unreachable; the leader node (node 601) kicks out node 603, node 604, and node 605, leaving node 601 and node 602 and forming four network partitions: "node 601, node 602", "node 603", "node 604", and "node 605". No partition contains more than half of the cluster nodes.
  • Based on this embodiment, when the situation in FIG. 6 occurs, the fault detection information shown in FIG. 6 (for example, the fault detection relationship topology map) is first determined; after the faulty links are excluded, two sub-clusters are formed, one comprising node 601, node 602, node 603, and node 605, and the other comprising node 604. The sub-cluster comprising node 601, node 602, node 603, and node 605 is determined to be the working cluster; it contains no faulty link, and the cluster runs normally.
  • In still another example, as shown in FIG. 7, the links between node 704 and node 703 and between node 704 and node 705 fail. During fault detection, the following occurs:
  • Node 703 considers node 704 to be unreachable, informing node 701, node 702, and node 705;
  • Node 705 considers node 704 to be unreachable, informing node 701, node 702, and node 703;
  • Node 704 considers node 703 to be unreachable, informing node 701 and node 702, which propagate the information that node 703 is unreachable to node 705;
  • Node 704 considers node 705 to be unreachable, informing node 701 and node 702, which propagate the information that node 705 is unreachable to node 703.
  • Under the existing Akka fault handling mechanism, node 701 and node 702 consider node 703, node 704, and node 705 unreachable. The leader node (node 701) kicks out node 703, node 704, and node 705, leaving node 701 and node 702 and forming four network partitions: "node 701, node 702", "node 703", "node 704", and "node 705". No partition contains more than half of the cluster nodes.
  • Based on this embodiment, when the situation in FIG. 7 occurs, the fault detection information shown in FIG. 7 (for example, the fault detection relationship topology map) is first determined; after the faulty links are excluded, a single sub-cluster is formed and determined to be the working cluster. Since faulty links still exist in that sub-cluster, the node to be deleted is then determined: the fault indication messages point most often to node 704, so node 704 is further determined to be the node to be deleted. After node 704 is deleted, the working cluster comprises node 701, node 702, node 703, and node 705, and the cluster runs normally. A counting sketch for this selection is given below.
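  • As a rough illustration of how the most-pointed-to unreachable node can be identified from the fault indication messages (a sketch only; the message representation is assumed for illustration):

```python
from collections import Counter

def most_pointed_unreachable(fault_indications):
    """fault_indications: list of (detector, detected) pairs, one per fault
    indication message stating that `detector` finds `detected` unreachable.
    Returns the detected node reported unreachable most often.
    """
    counts = Counter(detected for _, detected in fault_indications)
    return counts.most_common(1)[0][0]

# FIG. 7-style reports: node 704 is pointed to by the most fault indications.
reports = [(703, 704), (705, 704), (704, 703), (704, 705)]
print(most_pointed_unreachable(reports))
```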
  • The delegation-based method also has shortcomings. When node i delegates other nodes to ping node j, it must be guaranteed that node i is connected to the b delegated nodes and that, under normal conditions, those b nodes are connected to node j, which an actual deployment does not necessarily satisfy; fault discovery takes a long time, the amount of data synchronized by gossip is large, and the existing network fault problem is not solved.
  • In a further example, as shown in FIG. 8, the links between node 802 and node 803 and between node 804 and node 805 fail, with phi(node 802) = 0.8, phi(node 803) = 0.85, phi(node 804) = 0.9, and phi(node 805) = 0.82. Based on this embodiment, the fault detection information shown in FIG. 8 (for example, the fault detection relationship topology map) is first determined; after the faulty links are excluded, a single sub-cluster is formed and determined to be the working cluster. Since faulty links still exist in that sub-cluster, the nodes to be deleted are determined iteratively: the fault indication messages point to node 802, node 803, node 804, and node 805 with equal degree, so the least healthy node, node 804, is deleted. A fault still remains after node 804 is deleted; the fault indication messages now point most often to node 802 and node 803 with equal degree, so the less healthy of the two, node 803, is deleted. The resulting working cluster comprises node 801, node 802, and node 805, more than half of the cluster nodes remain, and the cluster runs normally. The phi health score used here is sketched below.
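  • The phi value referred to above is the health score of an accrual-style failure detector, which the description characterizes as learning a normal distribution from roughly the last n (for example, 1000) heartbeat samples and scoring the current heartbeat delay against it; the larger phi is, the less healthy the heartbeat. The sketch below is a simplified, assumed approximation of that idea, not necessarily the exact formula used by Akka:

```python
import math
import statistics

def phi(time_since_last_heartbeat, intervals):
    """Simplified accrual failure detector score (illustrative approximation).

    intervals: recent heartbeat inter-arrival times (the "last n samples");
    a normal distribution is fitted to them, and phi grows as the current
    silence becomes less likely under that distribution.
    """
    mean = statistics.mean(intervals)
    std = statistics.pstdev(intervals) or 1e-6
    # P(interval > t) under the fitted normal, via the complementary CDF.
    z = (time_since_last_heartbeat - mean) / std
    p_later = 0.5 * math.erfc(z / math.sqrt(2))
    return -math.log10(max(p_later, 1e-12))

samples = [1.0, 1.1, 0.9, 1.0, 1.05]  # heartbeats roughly every second
print(phi(1.0, samples))   # small phi: heartbeat on time, healthy
print(phi(3.0, samples))   # large phi: long silence, likely unreachable
```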
  • FIG. 9 is a schematic structural diagram of a fault processing apparatus according to an embodiment of the present application. The apparatus 900 is suitable for use in a cluster. This embodiment corresponds to the foregoing method embodiments shown in FIG. 2 and FIG. 3, and the two can be understood with reference to each other. The apparatus 900 specifically includes:
  • a first obtaining unit 901, configured to obtain fault detection topology information of the cluster, where one node in the cluster is fault-detected by at least one other node in the cluster, and the fault detection topology information includes the fault detection relationships between detecting nodes and detected nodes in the cluster;
  • a second obtaining unit 902, configured to obtain a fault indication message from a detecting node, where the fault indication message is used to indicate that the detected node is unreachable from the detecting node;
  • a processing unit 903, configured to determine, according to the fault detection topology information and the fault indication message, the sub-clusters in the cluster, where nodes belonging to different sub-clusters are mutually unreachable;
  • the processing unit 903 is further configured to determine a working cluster according to the sub-clusters of the cluster.
  • Specifically, the processing unit 903 is further configured to perform any one or more of the following manners: determining the sub-cluster with the largest number of nodes as the working cluster; determining the sub-cluster that contains a seed node and has the largest number of nodes as the working cluster; determining the sub-cluster containing the most seed nodes as the working cluster; determining the sub-cluster with the most nodes running the main service as the working cluster; and determining the working cluster based on the health status or the available resource status.
  • Optionally, the processing unit 903 is further configured to: determine, according to the fault detection topology information, a fault detection relationship topology map between nodes, delete the edges corresponding to the fault indication messages from the fault detection relationship topology map, determine the connected sub-graphs of the topology map after the deletion, and determine the sub-clusters according to the connected sub-graphs.
  • In another implementation, the processing unit 903 is further configured to: determine, according to the fault detection topology information and the fault indication messages, the faulty nodes and faulty links in the cluster, delete the faulty nodes and faulty links from a network topology map of the cluster, determine the connected sub-graphs of the network topology map after the deletion, and determine the sub-clusters according to those connected sub-graphs, where the network topology map contains the network connection information between all nodes of the cluster. A connected sub-graph sketch is given below.
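  • As a rough sketch of this step (illustrative only; the edge representation is an assumption), the connected sub-graphs that remain after the fault edges are removed can be computed with a plain breadth-first search over an undirected view of the map:

```python
from collections import deque

def sub_clusters(nodes, detection_edges, fault_edges):
    """nodes: iterable of node ids; detection_edges: (detector, detected) pairs;
    fault_edges: the subset of pairs reported faulty. Returns the connected
    sub-graphs (sub-clusters) left after the fault edges are removed.
    """
    faulty = {frozenset(e) for e in fault_edges}
    adj = {n: set() for n in nodes}
    for u, v in detection_edges:
        if frozenset((u, v)) not in faulty:
            adj[u].add(v)
            adj[v].add(u)  # treat detection as an undirected connection
    seen, clusters = set(), []
    for start in nodes:
        if start in seen:
            continue
        component, queue = set(), deque([start])
        while queue:
            n = queue.popleft()
            if n in component:
                continue
            component.add(n)
            queue.extend(adj[n] - component)
        seen |= component
        clusters.append(component)
    return clusters

# A FIG. 6-like case: the links towards node 604 are faulty.
edges = [(601, 602), (601, 603), (602, 605), (603, 605), (603, 604), (605, 604)]
faults = [(603, 604), (605, 604)]
print(sub_clusters([601, 602, 603, 604, 605], edges, faults))
```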
  • Further, the processing unit 903 is further configured to determine, among the unreachable nodes in the working cluster, the unreachable node pointed to by the most fault indication messages as the node to be deleted. The apparatus may further include a sending unit 904, configured to send a first indication message to the other nodes in the cluster, where the first indication message is used to indicate the node to be deleted. In addition, the node to be deleted may be the apparatus itself.
  • Optionally, the processing unit 903 is further configured to determine, among the unreachable nodes in the working cluster, the unreachable node that is pointed to by the most fault indication messages and has the worst health state as the node to be deleted, where the health state is determined based on the time the node takes to respond to heartbeat messages.
  • Optionally, the first obtaining unit 901 is further configured to receive fault detection topology information sent by the other nodes in the cluster, or to derive the fault detection topology information based on a preset rule (for example, each node detecting the two nodes numbered after it), as sketched below.
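  • As an illustration of such a preset rule (the description mentions, as an example, numbering the nodes and having each node detect the two nodes after its own number; the function below is a hypothetical sketch of that rule with wrap-around):

```python
def detection_topology(node_ids):
    """Derive (detector, detected) pairs from a preset rule: each node
    detects the two nodes that follow it in numbering order (wrapping around).
    """
    ordered = sorted(node_ids)
    n = len(ordered)
    return [
        (ordered[i], ordered[(i + k) % n])
        for i in range(n)
        for k in (1, 2)
    ]

# Every node can compute the same topology locally, without exchanging it.
print(detection_topology([411, 412, 413]))
```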
  • FIG. 10 is a schematic structural diagram of a fault processing device according to an embodiment of the present application. The device 1000 is applicable to the cluster in the embodiments of the present invention, where some or all of the nodes in the cluster may run in the same fault processing device; for example, the nodes in the cluster may be virtual machines, and each fault processing device may run one or more virtual machines. Alternatively, each fault processing device may correspond to one node in the cluster.
  • the device 1000 can include a transceiver 1001, a processor 1002, and a memory 1003.
  • the processor 1002, the transceiver 1001, and the memory 1003 can be connected via the bus 1004 and complete communication with each other.
  • the transceiver 1001 is configured to interact with other nodes, and may include a receiving unit and a sending unit; the memory 1003 is configured to store programs and data.
  • the processor 1002 performs the functions of the fault handling device in the embodiment of the method of the present application by executing a program stored in the memory 1003.
  • In the apparatus 900 described in FIG. 9, the functions of the second obtaining unit and the sending unit may be implemented by the transceiver in this embodiment of the present invention, and the function of the processing unit is implemented by the processor in this embodiment of the present invention. When the first obtaining unit is configured to receive the fault detection topology information sent by the other nodes in the cluster, the function of the first obtaining unit is implemented by the transceiver in this embodiment of the present invention; alternatively, when the first obtaining unit is configured to derive the fault detection topology information based on a preset rule, the function of the first obtaining unit is likewise implemented by the transceiver in this embodiment of the present invention.
  • It should be noted that the processor 1002 described in this application may be a single processor or a collective name for multiple processing elements. For example, the processor 1002 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention.
  • the memory 1003 may be a storage device or a collective name of a plurality of storage elements, and is used to store executable program code or parameters, data, and the like required for the operation of the access network management device.
  • the memory 1003 may include a random access memory (RAM), and may also include a non-volatile memory such as a disk memory, a flash memory, or the like.
  • The processor and the memory may be integrated into a single processing circuit.
  • a cluster provided by an embodiment of the present invention includes at least one fault processing device described in any one of the foregoing embodiments.
  • the disclosed system, apparatus, and method may be implemented in other manners.
  • the device embodiments described above are merely illustrative.
  • The division into modules or units is merely a logical function division; in actual implementation there may be other division manners, for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device or unit, and may be in an electrical, mechanical or other form.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the above integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
  • the integrated unit if implemented in the form of a software functional unit and sold or used as a standalone product, may be stored in a computer readable storage medium.
  • Based on such an understanding, the software product stored in the computer readable storage medium includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform all or part of the steps of the methods described in the embodiments of the present application.
  • The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Environmental & Geological Engineering (AREA)
  • General Health & Medical Sciences (AREA)
  • Cardiology (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)
  • Hardware Redundancy (AREA)

Abstract

Embodiments of this application relate to a method and device for handling faults of nodes in a cluster. The method includes: obtaining fault detection topology information of the cluster, the fault detection topology information containing the fault detection relationships between all nodes in the cluster; obtaining a fault indication message used to indicate that a detecting node cannot reach a detected node; determining, according to the fault detection topology information and the fault indication message, sub-clusters in the cluster, where nodes belonging to different sub-clusters are mutually unreachable; and determining a working cluster according to the sub-clusters of the cluster. The embodiments of this application can retain the available nodes in the cluster to the greatest extent at a small cost, increase the number of available nodes in the cluster, ensure high availability, reduce the probability of cluster restarts and service unavailability, and reduce the overhead of fault recovery and service migration.

Description

集群中节点的故障处理方法及设备
本申请要求于2017年07月12日提交中国专利局、申请号为201710564995.8、申请名称为“集群中节点的故障处理方法及设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及计算机技术领域,尤其涉及一种集群中节点的故障处理方法及设备。
背景技术
随着分布式计算和云计算技术在信息领域的发展,传统通信技术(communication technology,CT)领域逐渐向信息通信技术(Information Communications Technology,ICT)转型和发展。ICT是信息技术与通信技术相融合而形成的一个新的概念和新的技术领域。CT向ICT的转型和发展过程中,不可避免地会遇到很多复杂且困难的问题需要解决,如CT领域中复杂网络的运营使得网络成本居高不下,在CT向ICT转型过程中,解决复杂网络问题是一个非常重要和具有挑战的问题。
为了推动CT向ICT的转型,SDN(Software Defined Network,软件定义网络)逐渐发展起来。SDN是Emulex网络一种新型网络创新架构,是网络虚拟化的一种实现方式。而SDN实现分布式或产品云化不可避免的需要解决分布式集群管理等问题。
SDN业务对集群节点的通信能力要求较高,而Akka集群的去中心化架构保证了节点之间的通信能力。因此一些公司采用了Akka集群来构建SDN控制器的集群能力。
Akka集群中节点分为leader节点和非leader节点。其中leader节点的功能是负责节点加入集群或踢出集群,其他功能与普通节点一致。
Akka的故障检测机制是,如果一条链路发生故障或某个节点随机丢包发生故障时,由于Akka无法感知具体是链路故障还是节点丢包故障,只是通过心跳检测到节点之间无法连通,为了保证可靠性,将链路两端的节点都踢出,从而导致高概率将正确的节点踢出。Akka的故障检测与处理机制,导致在网络差的情况下,链路或节点故障(挂掉或严重丢包)场景下,Akka集群的故障处理策略高概率会产生误踢,使得剩下的集群的节点数未过半,而业务很多时候需要集群节点过半才能正常运行,最终导致集群高概率不可用,增加了业务迁移、故障恢复开销,并且对用户使用SDN带来了严重的影响。
集群架构中,关于上述问题,业界技术往往从gossip算法本身或从故障检测角度考虑进行修改,业界一般采用的方法是当节点i与节点j之间心跳不通时,节点i委托其他节点去ping节点j,若委托节点中任意一个节点可与节点j之间ping通,则认为节点j为reachable。该方法增大了故障检测的时间开销和gossip算法同步的数据量。
发明内容
本申请实施例提供了一种集群中节点的故障处理方法及设备,可以实现在集群中链路出现故障时,准确定位到故障节点和故障链路,以较小的代价消除集群中的故障,降低了集群重启,业务不可用的概率。
第一方面,本申请实施例提供了一种集群中节点的故障处理方法。该方法包括:获取集群的故障检测拓扑信息,该集群中的一个节点被集群中的至少一个其他节点执行故障检测,故障检测通常通过检测节点向被检测节点发送检测报文(通常为心跳报文)实现,该故障检测拓扑信息包含所述集群中检测节点与被检测节点之间的故障检测关系;从检测节点获取故障指示消息,该故障指示消息用于指示检测节点到被检测节点不可达;根据故障检测拓扑信息,以及故障指示消息,确定集群中的子集群,其中,属于不同子集群中的节点互不可达;根据集群的子集群,确定工作集群,其中,本申请实施例中的集群可以为去中心化集群。获取集群的故障检测拓扑信息具体可以包括:接收所述集群中其他节点发送的故障检测拓扑信息;或者,基于预设规则进行推算故障检测拓扑信息。通过本申请实施例可以实现在集群中链路出现故障时,结合集群的故障检测拓扑信息以及故障指示消息,确定集群排除故障链路后形成的子集群,根据子集群确定工作集群,能够选择最优的子集群作为工作集群,例如,可以选择包含节点数最多的子集群,相对于直接删除故障链路两端的两个节点,或者通过额外的检测来确定要删除的节点。
容易理解的是,集群中任意一个节点既可以作为检测节点的角色,也可以作为被检测节点的角色,也可以在不同故障检测关系中同时承担检测节点和被检测节点的角色。
本申请实施例通过故障检测拓扑信息以及故障指示消息来确定子集群,可以以较小的代价,确定集群发生故障后所产生的可用子集群,最大程度的保留集群中的可用节点,提高了集群中可用节点的数量,确保了高可用性,降低了集群重启,业务不可用的概率,降低了故障恢复、业务迁移的开销。
在一种可选的实现方式中,在确定工作集群时,可以采用如下列举方式:
根据节点数量确定工作集群。例如,确定节点数量最多的子集群为工作集群;
根据种子节点确定工作集群。例如,确定种子节点最多的子集群为工作集群;
根据运行主业务的节点确定工作集群。例如,确定运行主业务的节点最多的子集群为工作集群;
根据健康状态或者可用资源状态确定工作集群。例如,确定健康状态满足预设条件的节点数量最多的子集群为工作集群;或者,确定可用资源状态满足预设条件的节点数量最多的子集群为工作集群。
此外,以上列举的方式可以多个考虑因素同时采用,例如,确定包含种子节点,且节点数量最多子集群为工作集群;或者,确定健康状态满足预设条件,且运行主业务的节点数量最多的集群未工作集群等
在一种可选地实现方式中,根据故障检测拓扑信息,以及所述故障指示消息,确定集群中的子集群时,根据故障检测拓扑信息所形成的故障检测关系拓扑图,标记出故障指示消息所对应的故障边,从而确定集群中的故障链路和/或故障节点,根据所确定的故障链路和故障节点确定节点互不可达的子集群。本申请实施例根据故障检测关 系拓扑图,基于图论,可以根据更全面信息,更合理的确定出故障链路和故障节点,以便更合理的确定出工作集群。
在一种可选地实现方式中,根据故障检测拓扑信息,以及故障指示消息,确定集群中的子集群时,根据集群中节点的网络连接关系所形成的网络拓扑图,标记出根据故障指示消息以及故障检测拓扑信息所确定的故障节点和故障链路,从而确定集群中的故障链路和故障节点,根据所确定的故障链路和故障节点确定节点互不可达的子集群。本申请实施例根据网络拓扑图,基于图论,可以根据更全面信息,更合理的确定出故障链路和故障节点,以便更合理的确定出工作集群。
在一种可选地实现中,若一节点被多个节点所检测,且根据故障指示消息所有检测该节点的检测节点到该节点均不可达,则该节点为故障节点;若一节点被多个节点所检测,且根据故障指示消息检测该节点的检测节点有部分到该节点不可达,而部分检测节点到该节点仍然可达,则不可达的检测节点到该节点之间的链路为故障链路。
在一个可选地实现中,前述根据所述集群的子集群,确定工作集群时,可以选择子集群中代价最小且保证能够正常工作的子集群作为工作集群,其中,子集群继承集群的属性越多,代价越小,该属性可以包括节点的数量、种子节点的数量以及运行的主业务等等。具体地可以根据所述集群的子集群,确定工作集群包括下述任意一种或多种方式:确定节点数量最多的子集群为工作集群;确定包含种子节点,且节点数量最多的子集群为工作集群;确定包含种子节点最多的子集群为工作集群;确定运行主业务的节点最多的子集群为工作集群;以及,基于健康状态或者可用资源状态确定工作集群。通过本申请实施例,可以实现选择工作集群的选择,以满足工作集群能够正常运行,最健康或者影响现有业务的代价最小等等条件。
在另一个可选地实现中,前述根据故障检测拓扑信息,以及故障指示消息,确定集群中的子集群包括:根据故障检测拓扑信息,确定基于节点之间的故障检测关系的拓扑图,从拓扑图中删除故障指示消息所对应的边,确定删除后的拓扑图的连通子图,根据连通子图确定子集群,另外,可以根据拓扑图的连通分量或者强连通分量对应的子集群确定为工作集群。通过本申请实施例可以实现结合能够体现集群汇总节点的故障检测关系的拓扑图来实现子集群的确定,通过拓扑图中的连接关系,可以更方便准确的确定出子集群。
在再一个可选地实现中,在确定工作集群后,该工作集群中可能还包括故障链路,也就是故障指示信息中的检测节点和被检测节点都包含在该工作集群中,此时,可以通过删除工作集群中的一个或多个节点来消除故障链路。其中,可以根据预设规则确定要删除的节点。具体地,该方法还包括:确定工作集群中的不可达节点中故障指示消息指向最多的不可达节点为要删除的节点;向集群中其他节点发送第一指示消息,该第一指示消息用于指示所述要删除的节点。通过本申请实施例,可以进一步地实现工作集群中故障链路的消除,通过确定故障指示消息指向最多的不可达节点为要删除的节点,相对于现有技术中直接删除链路两端的两个节点,或者通过额外的检测来确定要删除的节点,本申请实施例可以以较小的代价,最大程度的保留集群中的可用节点,提高了集群中可用节点的数量,确保了高可用性,降低了集群重启,业务不可用的概率,降低了故障恢复、业务迁移的开销。
本实施例的执行主体可以是集群中的任意一个节点,在一些实施例中,将该节点称之为控制节点。在一个可选地实现中,删除的节点可以为控制节点本身。通过本申请实施例,若控制节点自己在故障链路上,需要确定自己是否为要删除的节点,以保证每个节点看到的信息一致,提高故障节点和故障链路定位的准确性。
在另一个可选地实现中,前述确定所述工作集群中的不可达节点中故障指示消息指向最多的不可达节点为要删除的节点包括:确定所述工作集群中的不可达节点中故障指示消息指向最多的不可达节点,且健康状态最差的一个为要删除的节点,该健康状态基于节点对心跳报文的响应的时间所确定。通过本申请实施例,可以根据故障指示消息指向次数以及健康状态,以较小的代价删除不可达节点,降低业务不可用的概率。
第二方面,一种集群中节点的故障处理方法。该方法包括:获取集群的故障检测拓扑信息,集群中的一个节点被所述集群中的至少一个其他节点执行故障检测,所述故障检测拓扑信息包含所述集群中检测节点与被检测节点之间的故障检测关系;获取故障指示消息,该故障指示消息用于指示检测节点到被检测节点不可达;根据所述故障检测拓扑信息确定故障指示消息指向最多的不可达节点为要删除的节点。通过本申请实施例,控制节点仅需要接收集群中其他节点所发送的故障检测消息即可确定要删除的节点,相对于现有技术,控制节点无需对其他节点进行直接检测,从而减少了控制节点的负担。当集群中选择任意节点作为控制节点时,减小了对该节点的原有业务的影响。
在另一个可选地实现中,前述确定所述工作集群中的不可达节点中故障指示消息指向最多的不可达节点为要删除的节点包括:确定所述工作集群中的不可达节点中故障指示消息指向最多的不可达节点,且健康状态最差的一个为要删除的节点,该健康状态基于节点对心跳报文的响应的时间所确定。通过本申请实施例,可以根据故障指示消息指向次数以及健康状态,以较小的代价删除不可达节点,降低业务不可用的概率。
在一种可能的实现方式中,在确定的要删除的节点之后,根据故障检测拓扑信息,确定集群中删除的节点和未被删除的节点形成的子集群,其中,属于不同子集群中的节点互不可达,;根据集群的子集群,确定工作集群。通过本申请实施例可以实现在集群中链路出现故障时,根据拓扑图准确定位到故障节点和故障链路,以较小的代价删除集群中的不可达节点,降低了集群重启,业务不可用的概率,降低了故障恢复、业务迁移的开销。
在一些可能的实现方式中,在确定子集群时,可以根据第一方面所提供的方法确定子集群。
第三方面,本申请实施例提供了一种故障处理设备。该故障处理设备具有实现上述方法实际中节点的行为的功能。该功能可以通过硬件实现,也可以通过硬件执行相应的软件实现。所述硬件或软件包括一个或多个与上述功能相对应的单元。
在一个可选地实现中,该故障处理设备包括:收发器、处理器以及存储器,收发器用于与其他节点进行通信,存储器用于用户存储数据和程序。当故障处理设备运行时,处理器执行存储器存储的计算机执行指令,以使故障处理设备执行如第一方面以及第一方 面的各种可选方式中或者第二方面以及第二方面的各种可选方式中的故障处理方法。
第四方面,本申请实施例提供了一种计算机可读存储介质,用于储存为上述故障处理设备所用的计算机可读指令,其包含用于执行上述第一方面以及可选地实现中或者第二方面以及第二方面的各种可选方式中所设计的程序。
第五方面,本申请实施例提供了一种计算机程序产品,用于储存为上述故障处理设备所用的计算机可读指令,其包含用于执行上述第一方面以及可选地实现中或者第二方面以及第二方面的各种可选方式中所设计的程序。
附图说明
图1为一种网络架构示意图;
图2为本申请实施例提供的一种集群中节点的故障处理方法信令交互示意图;
图3为本申请实施例提供的另一种集群中节点的故障处理方法流程示意图;
图4为本申请实施例提供的一个示例;
图5为本申请实施例提供的另一个示例;
图6为本申请实施例提供的又一个示例;
图7为本申请实施例提供的再一个示例;
图8为本申请实施例提供的再一个示例;
图9为本申请实施例提供的一种故障处理装置结构示意图;
图10为本申请实施例提供的一种故障处理设备结构示意图。
具体实施方式
本申请实施例提供的集群中节点的故障处理方法及设备,可以应用于去中心化集群应对节点故障的处理,具体地,集群中的节点对故障节点的删除,以及进一步地选择工作集群等等。
本申请实施例提供的集群中节点的故障处理方法适用于如图1所示的场景中。如图1所示,集群包括多个节点,其中,负责节点加入集群或踢出集群的节点为控制节点10,其他节点为普通节点20,例如,图1中的普通节点20包括普通节点201、普通节点202、普通节点203以及普通节点204等等。控制节点10为集群中包括的多个节点中符合预设规则的节点,例如,一般选择集群中的多个节点中IP+Port最小的可达(reachable)节点作为控制节点,如果控制节点出现故障或者其他情况的影响需要重新确定控制节点,则依据此规则在剩余的多个可达节点中选择控制节点。
以Akka集群为例,Akka集群中任何一个成员节点都有可能成为集群的Leader(即,控制节点),这是基于Gossip收敛(Convergence)过程得到的确定性结果。Leader只是一种角色,在各轮Gossip收敛过程中Leader是不断变化的。Leader的职责是使成员节点(即,普通节点)进入或离开集群。
一个成员节点开始于joining状态,一旦所有节点都看到了该新加入Akka集群的节点,则Leader会设置这个节点的状态为up。
如果一个节点安全离开Akka集群,可预期地它的状态会变为leaving状态,当Leader看到该节点为leaving状态,会将其状态修改为exiting,然后当所有节点看到该 节点状态为exiting,则Leader将该节点删除,状态修改为removed状态。
如果一个节点处于不可达(unreachable)状态,基于Gossip协议Leader是无法执行任何操作收敛(Convergence)到该节点的,所以unreachable状态的节点的状态是必须被改变的,它必须变成reachable状态或者down状态(即,从集群中删除)。如果down状态的节点想再次加入到Akka集群,它需要重新启动,并且重新加入集群(经由joining状态)。
为了集群能够正常运行,集群中的节点通过故障检测来识别故障,以便进行故障处理。其中,节点与节点之间的故障检测关系可包括单向检测和双向检测。两个节点之间只有其中一个检测的为单向检测,两个节点之间互相检测为双向检测。例如,结合图1所示,若控制节点10检测普通节点202,而普通节点202检测控制节点10,则控制节点10与202之间为单向检测。若控制节点10与普通节点201之间可以相互检测,则控制节点10与普通节点201之间为双向检测。
应该知道的是,图1所示的集群仅为示例,本申请实施例中的集群可以包括更多或更少的节点,集群中的节点也可以包括其他的连接方式。
本申请提供了集群中节点的故障处理方法、装置及设备。通过结合集群的故障检测拓扑信息以及故障指示消息,确定集群排除故障链路后形成的子集群,根据子集群确定工作集群,能够选择最优的子集群作为工作集群,例如,可以选择包含节点数最多的子集群,相对于直接删除故障链路两端的两个节点,或者通过额外的检测来确定要删除的节点,本申请实施例可以以较小的代价,最大程度的保留集群中的可用节点,提高了集群中可用节点的数量,确保了高可用性,降低了集群重启,业务不可用的概率,降低了故障恢复、业务迁移的开销。
下面结合附图,通过具体的实施例及其应用场景对本申请实施例提供的一种集群中节点的故障处理方法、装置及设备进行详细地说明。
图2为本申请实施例提供的一种集群中节点的故障处理方法信令交互示意图。如图2所示该方法具体包括:
S210,控制节点获取集群的故障检测拓扑信息,该拓扑信息包含该集群中检测节点与被检测节点之间的故障检测关系。其中,集群中的一个节点被集群中的至少一个其他节点进行故障检测。
集群的故障检测拓扑信息可以根据节点检测机制确定。具体的,每个节点向控制节点发送故障检测拓扑消息,每个故障检测拓扑消息用于指示一个节点所故障检测的一个或多个或节点。控制节点接收集群中其他节点发送的多个故障检测拓扑消息,结合自身所故障检测的所有节点,可以得到集群中所有节点之间的故障检测关系。
另外,集群中的故障检测拓扑信息还可以根据预设规则推算。例如,对集群中的节点进行编号,并确定预设规则为每个节点检测本节点编号后两位的两个节点。则控制节点在确定了集群中的节点编号以及预设规则后,既可以推算出节点之间的故障检测关系,从而获取故障检测拓扑信息。
S220,控制节点从检测节点获取故障指示消息,该故障指示消息用于指示检测节点到被检测不可达。
节点在确定与其所检测的节点之间的链路不可达时,可以向与其连接的节点(包 括控制节点)发送故障指示消息,该消息中,检测节点为其本身,被检测节点为其所检测的节点。
集群中的每个节点上可以根据故障指示消息维护一个全局集群节点列表,该列表可以体现集群中各个节点的状态,其中,节点的状态包括节点是否被检测为不可达,以及被哪个节点标检测为不可达。
其中,不可达还可以认为是一个节点状态的一个额外标志,该标志用于标识集群中的节点不能与该节点正常通讯。
S230,控制节点根据故障检测拓扑信息,以及故障指示消息,确定集群中的子集群,其中,属于不同子集群中的节点互不可达。
由于故障检测消息仅能确定被检测节点不可达,而无法确定是由于检测节点到被检测节点之间的链路故障导致不可达,还是由于被检测节点故障导致的不可达。因此,在获取了基于全部节点之间的故障检测关系所构成的故障检测拓扑信息的基础上,根据故障指示消息,可以结合获得包含了当前的故障指示消息的故障检测拓扑图。基于图论的相关理论,可以确定在发生故障后集群分裂成的子集群。
在一种该实现方式中,可以在基于故障检测关系的拓扑信息形成故障检测关系拓扑图,该故障检测关系拓扑图中的每一条有向边表示检测节点与被检测节点之间的检测关系。当获取了故障指示消息后,将该故障指示消息所对应的检测节点与被检测节点之间的边标记为故障边,从而得到了包含当前故障边的故障检测关系拓扑图。基于该拓扑图,可以判断集群中的故障链路以及故障节点。例如,若指向同一节点的单向边中,既存在故障边,也存在正常的边,则所述故障边对应的被检测的链路为故障链路;若指向同一节点的单向边均为故障边,则该节点为故障节点。根据所确定的故障节点和故障边,即可判断集群是否分裂为了互不可达的子集群。
可以理解的,在一些情况下,集群中虽然出现了故障链路或者故障节点,但是集群中的所有节点仍然相互可达,即集群并未分裂为子集群。在这种情况下,所确定的子集群即为原集群本身。
S240,根据集群的子集群,确定工作集群。
在确定工作集群时,可依据三条原则:业务可用性、集群可扩展性和业务可靠性。
控制节点在排除出故障的链路后,确定集群中的所有节点构成一个子集群时,该子集群即为工作集群。
控制节点在排除出故障的链路后,确定集群中所有节点构成多个子集群时,可以根据下述一种或多种方式确定工作集群:
确定节点数量最多的子集群为工作集群;
确定包含种子节点,且节点数量最多的子集群为工作集群;
确定包含种子节点最多的子集群为工作集群;
确定运行主业务的节点最多的子集群为工作集群,主业务可以是指所有业务中的重点业务,也可以是指主备业务中的主业务;
以及,基于健康状态或者可用资源状态确定工作集群;等等。
需要说明的是,集群可分为种子节点和非种子节点,种子节点通过在配置文件中静态配置,是普通节点加入集群的联系点。非种子节点(普通节点),通过种子节点动 态加入集群。
举例来说,控制节点可以根据如下条件来确定工作集群。
保留节点数量多,且至少含有一个种子节点的子集群作为工作集群。若节点数最多且包含至少一个种子节点的子集群有唯一的一个,则确定子集群中节点数最多且包含至少一个种子节点的子集群为工作集群。若节点数量最多的子集群包括至少两个,该至少两个子集群节点数相同,且都包含种子节点,则保留至少两个子集群中种子节点多的子集群作为工作集群。换句话说,若节点数最多且包含至少一个种子节点的子集群有多个,则确定节点数最多且包含的种子节点的数量最多的子集群为工作集群。若节点数量最多且包含种子节点最多的子集群包括至少两个,该至少两个子集群节点数及包含的种子节点数相同,则确定种子节点中ip+port最小节点所在子集群为工作集群。
可以保证只有一个子集群存活,也就是仅保证工作集群中的节点工作,其他节点可以关机、下线或离开集群等等,以便集群能够正常的工作。
接下来,控制节点在确定工作集群后,工作集群中可能还存在故障链路,一般故障链路不会造成集群脑裂,也就是不会使集群分为多个子集群。控制节点还可以通过决策来消除工作集群中的故障链路,具体还可以包括如下步骤:
S250,确定工作集群中的不可达节点中故障指示消息指向最多的不可达节点为要删除的节点。
其中,故障指示消息指向最多的不可达节点即为,工作集群中的节点在故障消息中作为被检测节点的节点。
故障指示消息指向为不可达节点的次数至少可以有如下多种统计方式。
可以直接对接收到的故障指示消息进行统计,确定出节点作为被检测节点的次数。
还可以是,集群中的每个节点上可以维护一个全局集群节点列表,该列表可以体现集群中各个节点的状态,其中,节点的状态包括节点是否可达,还可以进一步统计不可达节点不可达的次数,不可达节点的不可达的次数越多,即可认为故障指示消息指向最多。
例如,集群中各个节点可以接收至少一个故障指示消息;若结合集群中所有节点的故障检测关系确定检测节点与被检测节点之间为单向故障检测,则将该指示消息对应的被检测节点标记为不可达一次,例如,可建立不可达节点集合,将不可达节点添加在不可达节点集合中,并记录不可达次数;若检测节点与被检测节点之间为双向故障检测,则将该故障指示消息对应的检测节点和被检测节点添加在不可达节点集合中,并分别记录不可达次数。若故障指示消息中的检测节点或被检测节点为其本身,则将其本身添加在不可达节点集合中,并记录不可达次数,以便集群中所有节点统计的不可达节点以及其对应的不可达次数一致。
下面对不可达节点的确定进行更详细的介绍。
方式一,若集群中只包含单向链路,在集群正常启动后,集群中的各个节点获取故障检测拓扑信息。集群中的节点接收故障指示消息,该故障指示消息用于指示unreachable事件,即,用于指示检测节点到被检测节点不可达,则将全局集群节点列表中对应的被检测节点标记为不可达,并且每被一个节点检测到一次不可达,则不可 达次数加1。
方式二,若集群中只包含双向链路,在集群正常启动后,集群中的各个节点获取故障检测拓扑信息。集群中节点接收故障指示消息,则在全局集群节点列表中将故障指示消息对应的检测节点和被检测节点标记为不可达,并且每接收到一次故障指示消息,则检测节点和被检测节点的不可达次数都分别1。例如,在丢包场景下,通过心跳检测链路时好时坏,坏的时候被检测到unreachable,不可达次数加1,好的时候又被检测到reachable,再坏的时候又被检测到unreachable,不可达次数再加1,会不停重复上述过程,直至问题解决。其中,节点根据指示消息确定被检测节点和检测检点包括其自身时,该节点在自身维护的全局集群节点列表中将其自身标记为不可达。
例如,结合图3所示,若节点303出现严重丢包。以节点303为例:
第一次:节点301被节点303发现为unreachable,由于节点301与节点303之间为双向链路,则节点303中unreachable次数统计情况如下表1所示:
表1
节点 unreachable次数
节点301 1
节点302 0
节点303 1
第二次:节点302被节点303标记为unreachable,由于节点302与节点303之间为双向链路,则节点303中unreachable次数统计情况如下表2所示:
表2
节点 unreachable次数
节点301 1
节点302 1
节点303 2
方式三,若集群中既包含双向链路又包含单向链路,在集群正常启动后,集群中的各个节点获取故障检测拓扑信息。集群中的节点接收故障指示消息。集群中的每个节点结合故障检测拓扑信息判断,若检测节点与被检测节点之间的链路为双向链路,将全局集群节点列表中对应的检测节点和被检测节点标记为不可达,以及检测节点和被检测节点的不可达次数加1;若检测节点与被检测节点为单向链路,则将全局集群节点列表中对应的被检测节点标记为不可达,对被检测节点的不可达次数加1。其中,节点根据指示消息确定被检测节点和检测检点包括其自身时,且检测节点和被检测节点之间为双向链路时,该节点在自身维护的全局集群节点列表中将其自身标记为不可达,自身的不可达次数加1。
例如,节点i收到节点j被节点k发现为unreachable时(i≠k),将节点j的unreachable次数加1;节点i收到节点j被节点i发现为unreachable时,根据故障检测拓扑信息,确定节点i与节点j之间是否为双向链路,若为双向链路,则节点i在故障链路上,将节点i和节点j的unreachable次数分别加1;若为单向链路,将节点j的unreachable 次数加1。
每个节点上维护的全局集群节点列表的形式可以如表3所示:
表3
节点 unreachable次数
节点1
节点2
节点3
另外,在确定要删除的节点时,还可以根据健康状态来确定,删除最不健康的一个。健康状态可以根据节点对心跳报文的响应时间所确定。例如,节点的健康状态可以是节点的Phi值,其中,Phi值计算原理是根据历史最近采样n次(如1000次)学习得到一个正态分布函数,然后根据当前心跳响应时间评估当前心跳的健康状态,phi值越大,心跳越不健康。当phi值超过设定的阈值时,则认为节点为unreachable。
节点的健康状态还可以通过全局集群节点列表中节点的不可达次数来确定,其中,节点的不可达次数越多,节点越不健康。
S260,向集群中其他节点发送指示消息,该指示消息用于指示要删除的节点。
控制节点在确定要删除的节点后,需要通知集群中的其他节点,以使得其他节点将该要删除的节点删除,或者将该要删除的节点标记为删除状态(例如,down状态),确保集群能够正常的工作。
另外,由于只有控制节点才有权限对集群中的节点进行删除的操作,所以集群中的节点可以判断自己是否为控制节点,在自己是控制节点时,再执行步骤S250-S260。
另外在本身请实施例中,若通过执行S210-S230后,还会接收到包括unreachable事件的指示消息,也就是,可能还存在不可达节点,则继续执行S250-S260,如此迭代,直至不会再收到包括unreachable事件的指示消息。
图3为本申请实施例提供的另一种集群中节点的故障处理方法流程示意图。节点可根据故障检测信息生成拓扑图,可以结合拓扑图进行故障处理。如图3所示该方法具体包括:
S310,获取集群的故障检测拓扑信息。其中,故障检测拓扑信息包含集群中检测节点与被检测节点之间的故障检测关系。其中,集群中的一个节点被集群中的至少一个其他节点进行故障检测。
故障检测拓扑信息的获取可以参见前述结合图2所示的实施例中的S210。此处不再赘述。
S320,从检测节点获取故障指示消息,该故障指示消息用于指示检测节点到被检测不可达。
故障指示消息的获取可以参见前述结合图2所示的实施例中的S220。此处不再赘述。
S330,根据故障检测拓扑信息,确定节点之间的故障检测关系拓扑图,从故障检测关系拓扑图中删除故障指示消息所对应的边,确定删除后的故障检测关系拓扑图的连通子图,根据连通子图确定子集群。
根据集群中节点的故障检测拓扑信息包括的集群中各个节点之间的故障检测关系。根据各个节点之间的故障检测关系可以确定集群的故障检测关系拓扑图。
在具体实施过程中,故障检测关系拓扑图中的各个节点之间可以通过有向边连接,也即故障检测关系拓扑图中体现故障检测的方向。或者,故障检测关系拓扑图中的各个节点之间也可以通过无向边连接,也即故障检测关系拓扑图中不体现故障检测的方向。
例如,若集群中节点与节点之间只包含双向故障检测关系或者仅包含单向故障检测关系,则在确定故障检测关系拓扑图时,即可以为有向图,也可以是无向图。若集群中节点之间即包含双向故障检测关系又包含单向故障检测关系,则在确定故障检测关系拓扑图时,可以为有向图。故障检测关系拓扑图中的节点通过有向边连接的为有向图,拓扑图中的节点通过无向边连接的为无向图。
另外,还可以在获取到集群的故障检测拓扑信息后,即确定故障检测关系拓扑图。例如,以Akka集群中的心跳检测机制为例,节点可根据接收到的基于心跳检测机制的故障检测拓扑信息,来确定Akka集群中节点的故障检测关系拓扑图。例如,若接收的故障检测拓扑信息中包括节点a为检测节点、节点b为被检测节点,以及节点b为检测节点、节点a为被检测节点的故障检测拓扑信息时,则确定节点a与节点b连接,且节点a与节点b之间为双向检测关系;若接收的故障检测拓扑信息中只包括节点a为检测节点、节点b为被检测节点的信息时,则确定节点a与节点b连接,且节点a与节点b之间为单向检测关系,若根据故障检测拓扑信息形成的故障检测关系拓扑图为有向图,节点a和节点b之间的连接关系对于节点b来说为入度;若接收的集群的故障检测拓扑信息中只包括节点b为检测节点、节点a为被检测节点的指示消息时,则确定节点a与节点b连接,且节点a与节点b之间为单向故障检测关系。
接下来,故障指示消息指示了检测节点到被检测节点不可达,在基于节点之间故障检测关系拓扑图中删除该检测节点到被检测节点之间的边,确定删除后的故障检测关系拓扑图的连通子图,每个连通子图相互独立,连通子图与连通子图之间不存在相互连接的节点。每个连通子图对应一个子集群。
在本发明实施例具体实施过程中,通常集群中的检测节点与被检测节点之间的检测关系比较丰富,使得检测关系能在一定程度上体现网络拓扑关系,换句话说,集群中故障检测关系能够检测集群中所有的节点,根据故障检测关系拓扑图即可实现集群的链路状态的判断。
在另一种实现方式中,S330还可以替换为根据故障检测拓扑信息,以及故障指示消息,确定集群中的故障节点和故障链路,从集群的网络拓扑图中删除故障节点和故障链路,确定删除后的网络拓扑图的连通子图,根据删除后的网络拓扑图的连通子图确定所述子集群,其中,网络拓扑图包含了集群的所有节点之间的网络连接信息。
在该种实现方式中,由于故障检测关系拓扑图一般不完全等同于网络拓扑图,也就是说,集群中的故障检测关系并没有遍及集群中的所有链路,所以在根据网络拓扑图在确定子集群时,首先需要确定是节点出现故障,还是链路出现故障。例如,若指向同一节点的单向边中,既存在故障边,也存在正常的边,则所述故障边对应的被检测的链路为故障链路;若指向同一节点的单向边均为故障边,则该节点为故障节点。 对于节点出现故障,在网络拓扑图中,即可认为与该节点的所有连接均断开,对于链路出现故障,在网络拓扑图中,即可认为仅有该故障连链路对应的边断开。根据所确定的故障节点和故障边,结合网络拓扑图即可判断集群是否分裂为了互不可达的子集群,以及具体分裂成了哪些子集群。
S340,根据集群的子集群,确定工作集群。
确定工作集群的原则以及方式可参见前述实施例中S240中的相关描述。
另外,在本发明实施例中,在确定工作集群时,可以确定删除后的故障检测关系拓扑图或网络拓扑图的连通子图中的连通分量或强连通分量,确定该连通分量或强连通分量对应的子集群为工作集群。
其中,若故障检测关系拓扑图或网络拓扑图为有向图,则删除故障指示消息对应的边后,形成的连通子图中最大的一个称为强连通分量。若故障检测关系拓扑图或网络拓扑图为无向图,则删除故障指示消息对应的边后,形成的连通子图中最大的一个称为连通分量。
若删除故障指示消息对应的边后的故障检测关系拓扑图或网络拓扑图具有多个连通分量或强连通分量,则可以进一步结合种子节点、运行主业务的节点,健康状态以及可用资源状态等等确定工作集群。具体可参见前述实施例中S240中的相关描述。
接下来,控制节点在确定工作集群后,工作集群中可能还存在故障链路,一般该故障链路不会造成集群脑裂,也就是不会使集群分为多个子集群。控制节点还可以通过决策来消除工作集群中的故障链路,具体还可以参见前述实施例中步骤S250-S260中的描述。
另外,控制节点还可以结合故障检测关系拓扑图或网络拓扑图确定要删除的节点。
具体地,集群中节点的故障检测关系拓扑图包括集群中各个节点之间的故障检测关系对应的边,在网络拓扑图中包括各个节点之间的网络连接关系对应的边。结合图论,根据故障检测关系拓扑图或网络拓扑图,可以确定节点的度值。控制节点可以从集群中删除度值最高的不可达节点,进一步的,控制节点可以从集群中删除不可达节点集合中故障边构成的入度值最高的不可达节点。其中,不可达节点涉及的故障链路越多,该节点为故障节点的概率越大。其中,不可达节点的故障链路构成的入度值是指,在故障检测关系拓扑图或网络拓扑图中不可达节点作为故障链路指向的节点的次数。
另外,故障边入度值最高的不可达节点可以包括多个故障边入度值相同的不可达节点,此时,可以进一步根据不可达节点的健康状态,确定最不健康的一个不可达节点为要删除的节点。
其中,控制节点在删除一个节点后,还可以将删除该节点后形成的孤点以及其他需踢出的节点确定为要删除的节点,其中,孤点是指在故障检测关系拓扑图或网络拓扑图中入度和出度都为0的节点。
具体地,控制节点确定删除要删除的节点后的集群中的故障检测关系拓扑图或网络拓扑图,根据删除要删除的节点后集群中的故障检测关系拓扑图或网络拓扑图确定所有边出度和入度都为0的节点为继续要删除的节点。
通过本申请实施例可以实现,一条链路出现故障,只踢出这条链路中的一个节点 (phi值大或unreachable次数最多的节点),剩下的节点若与其他节点链路未出现故障,则可以正常运行。n条链路(平行,无交集)出现故障,每条链路踢出一个节点(phi值大或unreachable次数最多的节点,共n个节点),剩下的节点若与其他节点链路未出现故障,则可以正常运行,n为正整数;n条链路(存在交集)出现故障,踢出交点(交点涉及故障链路数多优先踢;链路数相同的交点,选择phi值大或unreachable次数最多的节点),剩下的节点若与其他节点链路未出现故障,则可以正常运行。
在另一个实施例中,还可以先删除要删除的节点,在控制节点删除要删除的节点后,可以确定工作集群,该工作集群为子集群中节点数最多且包含至少一个种子节点的子集群,该子集群为不包含要删除的节点或者仅包含要删除的节点的集群中节点的集群。
集群中节点的拓扑信息包括集群中各个节点之间的故障检测关系,根据该故障检测关系,确定故障指示消息指向最多的不可达节点为要删除的节点。例如,可以根据基于节点之间的故障检测关系的拓扑图,可以确定不可达节点的度值。控制节点可以从集群中删除不可达节点集合中故障边构成的入度值最高的不可达节点,进一步的,控制节点可以通知集群中的各个节点从集群中删除要删除的节点。
另外,不可达节点集合中故障边构成的入度值最高的节点可以包括多个故障边构成的入度值相同的不可达节点,此时,可以进一步根据不可达节点的健康状态,确定最不健康的一个不可达节点为要删除的要删除的节点。
集群中多个故障边构成的入度值最高,最不健康的节点包括多个,则确定ip+port最大的节点为要删除的节点。以此,保护控制节点,因为一般控制节点为ip+port最小的节点。
其中,控制节点删除要删除的节点后,被删除的节点会形成至少一个子集群,未被删除的节点也会形成至少一个子集群。集群中的各个节点可根据拓扑图以及被删除的节点确定自己所在的子集群,并确定自己所在的集群是否为工作集群,如果自己所在的集群是工作集群,则该工作集群中的各个节点处于工作状态;如果确定自己所在的集群不是工作集群,则该集群中的各个节点可以处于关机状态。
在确定工作集群时,可依据三条原则:业务可用性、集群可扩展性和业务可靠性。
具体地:保留节点数量多,且至少含有一个种子节点的子集群作为工作集群。
若节点数最多且包含至少一个种子节点的子集群有唯一的一个,则确定子集群中节点数最多且包含至少一个种子节点的一个子集群为工作集群。
若节点数量最多的子集群包括至少两个,该至少两个子集群节点数相同,且都包含种子节点,则保留至少两个子集群中种子节点多的集群作为工作集群。换句话说,若节点数最多且包含至少一个种子节点的子集群有多个,则确定节点数最多且包含的种子节点的数量最多的一个子集群为工作集群。
若节点数量最多且包含种子节点最多的子集群包括至少两个,该至少两个子集群节点数及包含的种子节点数相同,则保留种子节点中ip+port最小节点所在集群。
保证只有一个子集群存活,以便集群能够正常的工作。
在具体实现时,集群中各个节点在接收到leader节点发送的用于指示要删除的节点的指示消息时,结合集群中节点的故障检测关系拓扑图或网络拓扑图,判断自己所 在的子集群,以及该子集群是否为工作集群。
当根据故障检测关系拓扑图或网络拓扑图确定自己所在的子集群为节点数最多且包含种子节点的子集群时,确定自己所在的子集群为工作集群;当根据故障检测关系拓扑图或网络拓扑图确定自己所在的子集群不是节点数最多的子集群时,确定自己所在的子集群为非工作集群。
当根据故障检测关系拓扑图或网络拓扑图确定自己所在的子集群,是节点数量最多的子集群的至少两个中的一个时,该至少两个节点数量最多的子集群节点数相同,且都包含种子节点。则进一步判断自己所在的集群是否为至少两个节点数量最多的子集群中包含的种子节点最多集群:若是,则确定自己所在的子集群为工作集群;若不是,确定自己所在的子集群为非工作集群。
当根据故障检测关系拓扑图或网络拓扑图确定自己所在的子集群,是节点数量最多且包含种子节点最多的至少两个子集群中的一个,该至少两个子集群节点数及包含的种子节点数相同,则进一步判断种子节点中ip+port最小节点是否在自己所在的子集群中:若是,则确定自己所在的子集群为工作集群;若不是,确定自己所在的子集群为非工作集群。
在一个示例中,如图4所示,以3节点集群为例,节点411为控制节点。每个节点被集群中其他两个节点检测(监控),检测节点之间每隔一秒发送一次心跳报文。若节点412与节点413之间的链路出现故障,在故障检测时,节点412告知节点411“节点413为unreachable”,节点413告知节点411“节点412为unreachable”。
依据Akka现有故障处理机制,节点412和节点413均会被节点411踢出集群,使得最终节点411、节点412和节点413形成三个独立网络分区。
对于委托方法的故障处理机制,在出现上述情况后,节点412发现节点413为unreachable,则委托节点411去ping节点413,节点411能连通节点413,于是节点413被标记为reachable;节点413发现节点412为unreachable,则委托节点411去ping节点412,发现节点412为reachable,则节点412被标记为reachable。由此,不会踢出任何节点;节点412和节点413的链路故障一直持续,业务数据会一直存在丢包现象。
基于本申请实施例,在如图4所示的情况出现时,首先确定集群中如图4所示的故障检测信息(例如,故障检测关系拓扑图),然后确定排除故障链路后,形成一个子集群,确定该子集群为工作集群,由于该子集群中依然存在故障链路,故继续确定要删除节点,其中,故障指示消息指向最多节点为节点412和节点413,进一步确定节点412和节点413中的一个为要删除的节点。删除节点412或节点413,后工作集群为“节点411,节点413”,或者,“节点411,节点412”。
在另一个示例中,如图5所示,若节点503出现严重丢包。
故障检测:节点501认为节点503为unreachable,并告知节点502;节点503认为节点501为unreachable,并告知节点502;节点502认为节点503为unreachable,并告知节点501;节点503认为节点502为unreachable,并告知节点501。
依据Akka现有故障处理机制,每个节点均升为leader,踢出其他节点,集群分为三个独立集群,每个集群包含一个节点。leader节点在故障链路上,会出现多leader 情况,多leader决策不一致,也会导致集群高概率不可用。
基于本申请实施例,在如图5所示的情况出现时,首先确定集群中如图5所示的故障检测信息(例如,基于故障检测关系的拓扑图),然后确定排除故障链路后,形成两个子集群,一个包括节点501和节点502,另一个包括节点503。确定该两个子集群中包括节点501和节点502的子集群为工作集群,该工作集群中不包含故障链路,集群正常运行。
在又一个示例中,如图6所示,以5节点集群为例,若节点604出现故障,严重丢包。
在故障检测过程中,设节点603和节点605监控到节点604为unreachable,会检测到如下情况:
节点603认为节点604为unreachable,告知节点601、节点602和节点605;
节点605认为节点604为unreachable,告知节点601、节点602和节点603;
节点604认为节点603为unreachable,告知节点601、节点602和节点605;
节点404认为节点605为unreachable,告知节点601、节点602和节点603。
依据Akka现有故障处理机制,节点601和节点602认为节点603、节点604、节点605为unreachable,Leader节点(节点601)踢出节点603、节点604和节点605,剩余节点601和节点602,形成四个网络分区:“节点601,节点602”,“节点603”,“节点604”以及“节点605”任一网络分区节点数未过半。
基于本申请实施例,在如图6所示的情况出现时,首先确定集群中如图6所示的故障检测信息(例如,故障检测关系拓扑图),然后确定排除故障链路后,形成两个子集群,一个包括节点601、节点602、节点603和节点605,另一个包括节点604。确定该两个子集群中包括节点601、节点602、节点603和节点605的子集群为工作集群,该工作集群中不包含故障链路,集群正常运行。
在再一个示例中,如图7所示,若节点704与节点703,以及节点704与节点705之间的链路出现故障。在故障检测过程中,会出现如下情况:
节点703认为节点704为unreachable,告知节点701、节点702和节点705;
节点705认为节点704为unreachable,告知节点701、节点702和节点703;
节点704认为节点703为unreachable,告知节点701和节点702,节点701和节点702将节点703为unreachable的信息传播给节点705;
节点704认为节点705为unreachable,告知节点701和节点702,节点701和节点702将节点705为unreachable的信息传播给节点703。
依据Akka故障处理机制,则节点701和节点702认为节点703、节点704、节点705为unreachable。Leader节点(节点501)踢出节点703、节点704和节点705,剩余节点701和节点702,形成四个网络分区:“节点701,节点702”,“节点703”,“节点704”以及“节点705”。任一网络分区节点数都未过半。
基于本申请实施例,在如图7所示的情况出现时,首先确定集群中如图7所示的故障检测信息(例如,故障检测关系拓扑图),然后确定排除故障链路后,形成一个子集群,确定该子集群为工作集群,由于该子集群中依然存在故障链路,故继续确定要删除节点,其中,故障指示消息指向最多节点为节点704,进一步确定节点704为 要删除的节点。删除节点704,后工作集群包括节点701、节点702、节点703和节点705,集群正常运行。
通过上述分析可知,Akka现有策略的不足。高概率误踢正常节点,导致在要求集群节点过半的业务情况下,集群高概率不可用。
基于委托方法存在不足。节点i委托其他节点去ping节点j时,必须要保证节点i与委托的b个节点是相通的,以及这b个节点与节点j在正常情况下是相通的,但实际场景不一定满足;发现故障时间长,且gossip同步数据量较大;不能解决现有的网络故障问题。
如图8所示,若节点802与节点803,以及节点804和节点805之间的链路出现故障。且,Phi(节点802)=0.8,phi(节点803)=0.85,phi(节点804)=0.9,phi(节点805)=0.82
基于本申请实施例,在如图8所示的情况出现时,首先确定集群中如图8所示的故障检测信息(例如,故障检测关系拓扑图),然后确定排除故障链路后,形成一个子集群,确定该子集群为工作集群,由于该子集群中依然存在故障链路,故继续确定要删除节点,其中,故障指示消息指向最多节点为节点802、节点803、节点804以及节点805,由于度值都相同,所以删除最不健康的节点,即节点804,删除节点804后,集群中还存在故障,节点801确定故障指示消息指向最多节点为节点802和节点803。由于度值都相同,所以删除最不健康的节点,即节点803。删除节点704,后工作集群包括节点801、节点802和节点805,集群节点过半,集群正常运行。
图9为本申请实施例提供的一种故障处理装置结构示意图。该装置900适用于集群中。其中,本申请实施例与前述结合图2和图3所示的方法实施例对应,可相互参照理解。该装置900具体包括:
第一获取单元901用于获取集群的故障检测拓扑信息,该集群中的一个节点被集群中的至少一个其他节点进行故障检测,该故障检测拓扑信息包含集群中检测节点与被检测节点之间的故障检测关系;
第二获取单元902用于从检测节点获取故障指示消息,该故障指示消息用于指示检测节点到被检测节点不可达;
处理单元903用于根据故障检测拓扑信息,以及故障指示消息,确定集群中的子集群,其中,属于不同子集群中的节点互不可达;
处理单元903还用于根据集群的子集群,确定工作集群。
具体地,处理单元903还用于执行下述任意一种或多种方式:
确定节点数量最多的子集群为工作集群;
确定包含种子节点,且节点数量最多的子集群为工作集群;
确定包含种子节点最多的子集群为工作集群;
确定运行主业务的节点最多的子集群为工作集群;
以及,基于健康状态或者可用资源状态确定工作集群。
可选地,处理单元903还用于,根据所述故障检测拓扑信息,确定节点之间的故障检测关系拓扑图,从所述故障检测关系拓扑图中删除所述故障指示消息所对应的边,确定删除后的故障检测关系拓扑图的连通子图,根据所述连通子图确定所述子集群。
在另一种实现方式中,处理单元903还用于,根据所述故障检测拓扑信息,以及 所述故障指示消息,确定所述集群中的故障节点和故障链路,从所述集群的网络拓扑图中删除所述故障节点和故障链路,确定删除后的网络拓扑图的连通子图,根据所述删除后的网络拓扑图的连通子图确定所述子集群,其中,所述网络拓扑图包含了所述集群的所有节点之间的网络连接信息。
进一步地,处理单元903还用于,确定所述工作集群中的不可达节点中故障指示消息指向最多的不可达节点为要删除的节点;
该设备还可以包括发送单元904,用于向集群中其他节点发送第一指示消息,所述第一指示消息用于指示所述要删除的节点。
另外,所述要删除的节点可以为其本身。
可选地,处理单元904还用于确定所述工作集群中的不可达节点中故障指示消息指向最多的不可达节点,且健康状态最差的一个为要删除的节点,所述健康状态基于节点对心跳报文的响应的时间所确定。
可选地,第一获取单元901还用于接收所述集群中其他节点发送的故障检测拓扑信息;或者,基于预设规则进行推算所述故障检测拓扑信息。
图10为本申请实施例提供的一种故障处理设备结构示意图。该设备1000适用于本发明实施例的集群中,其中,该集群中的节点可以部分或全部运行在同一个故障处理设备中,例如,该集群中的节点可以为虚拟机,每个故障处理设备中可以运行一个或多个虚拟机。也可以是每个故障处理设备对应一个集群中的节点。如图10所示,该设备1000可以包括收发器1001、处理器1002、存储器1003。该处理器1002、收发器1001和存储器1003可以通过总线1004连接并完成相互间的通信。收发器1001用于与其他节点进行交互,可包括接收单元和发送单元;存储器1003用来存储程序以及数据。处理器1002通过执行存储器1003中存储的程序,执行本申请方法实施例中故障处理设备的功能。
另外,在图9所描述的装置900中,第二获取单元以及发送模块的功能可由本发明实施例中的收发器来实现,确定模块的功能由本发明实施例中的处理器来实现。其中,第一获取单元用于接收所述集群中其他节点发送的故障检测拓扑信息时,该第一获取单元的功能由本发明实施例中的收发器来实现。或者,第一获取单元用于基于预设规则进行推算所述故障检测拓扑信息时,该第一获取单元的功能由本发明实施例中的收发器来实现。
需要说明的是,本申请中描述的处理器1002可以是一个处理器,也可以是多个处理元件的统称。例如,该处理器1002可以是中央处理器(Central Processing Unit,CPU),也可以是特定集成电路(Application Specific Integrated Circuit,ASIC),或者是被配置成实施本发明实施例的一个或多个集成电路。
存储器1003可以是一个存储装置,也可以是多个存储元件的统称,且用于存储可执行程序代码或接入网管理设备运行所需要参数、数据等。且存储器1003可以包括随机存储器(random access memory,RAM),也可以包括非易失性存储器(non-volatile memory),例如磁盘存储器,闪存(Flash)等。其中,处理器可存储器可集成为处理电路。
本发明实施例提供的一种集群,该集群包括至少一个前述任意一个实施例中所描 述的故障处理设备。
通过以上的实施方式的描述,所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将装置的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。上述描述的系统,装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统,装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述模块或单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)或处理器(processor)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、磁碟或者光盘等各种可以存储程序代码的介质。
以上所述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以所述权利要求的保护范围为准。

Claims (16)

  1. 一种集群中节点的故障处理方法,其特征在于,所述方法包括:
    获取所述集群的故障检测拓扑信息,所述集群中的一个节点被所述集群中的至少一个其他节点执行故障检测,所述故障检测拓扑信息包含所述集群中检测节点与被检测节点之间的故障检测关系;
    从所述检测节点接收故障指示消息,所述故障指示消息用于指示所述检测节点到被检测节点不可达;
    根据所述故障检测拓扑信息,以及所述故障指示消息,确定所述集群中的子集群,其中,属于不同子集群中的节点互不可达;
    根据所述集群的子集群,确定工作集群。
  2. 根据权利要求1所述的方法,其特征在于,所述根据所述集群的子集群,确定工作集群包括下述任意一种方式:
    确定节点数量最多的子集群为工作集群;
    确定包含种子节点,且节点数量最多的子集群为工作集群,其中,所述种子节点为预配置的节点,非种子节点通过所述种子节点加入集群;
    确定包含种子节点最多的子集群为工作集群;
    确定运行主业务的节点最多的子集群为工作集群;
    以及,基于所述子集群中的节点的健康状态或者可用资源状态确定工作集群,其中节点的健康状态基于所述节点对检测报文的响应时间确定。
  3. 根据权利要求1或2所述的方法,其特征在于,所述根据所述故障检测拓扑信息,以及所述故障指示消息,确定所述集群中的子集群包括:
    根据所述故障检测拓扑信息,确定节点之间的故障检测关系拓扑图,从所述故障检测关系拓扑图中删除所述故障指示消息所对应的边,确定删除后的故障检测关系拓扑图的连通子图,根据所述删除后的故障检测关系拓扑图的连通子图确定所述子集群。
  4. 根据权利要求1或2所述的方法,其特征在于,所述根据所述故障检测拓扑信息,以及所述故障指示消息,确定所述集群中的子集群包括:
    根据所述故障检测拓扑信息,以及所述故障指示消息,确定所述集群中的故障节点和故障链路,从所述集群的网络拓扑图中删除所述故障节点和/或故障链路,确定删除后的网络拓扑图的连通子图,根据所述删除后的网络拓扑图的连通子图确定所述子集群,其中,所述网络拓扑图包含了所述集群的所有节点之间的网络连接信息。
  5. 根据权利要求1-4任意一项所述的方法,其特征在于,所述方法还包括,
    确定所述工作集群中的不可达节点中被最多故障指示消息指向的不可达节点为要删除的节点,所述不可达节点为故障指示消息所指向的被检测节点;
    向所述工作集群中其他节点发送第一指示消息,所述第一指示消息用于指示所述要删除的节点。
  6. 根据权利要求5所述的方法,其特征在于,所述确定所述工作集群中的不可达节点中故障指示消息指向最多的不可达节点为要删除的节点包括:
    确定所述工作集群中的不可达节点中被最多故障指示消息指向的不可达节点,且健康状态最差的一个为要删除的节点。
  7. 根据权利要求1-6任意一项所述的方法,其特征在于,所述获取所述集群的故障检测拓扑信息具体包括:
    接收所述集群中其他节点发送的故障检测关系,根据接收到的故障检测关系确定所述故障检测拓扑信息;
    或者,基于预设规则推算所述故障检测拓扑信息。
  8. 一种故障处理设备,所述设备适用于集群中,其特征在于,包括:
    第一获取单元,用于获取所述集群的故障检测拓扑信息,所述集群中的一个节点被所述集群中的至少一个其他节点执行故障检测,所述故障检测拓扑信息包含所述集群中检测节点与被检测节点之间的故障检测关系;
    第二获取单元,用于从所述检测节点获取故障指示消息,所述故障指示消息用于指示所述检测节点到被检测节点不可达;
    处理单元,用于根据所述故障检测拓扑信息,以及所述故障指示消息,确定所述集群中的子集群,其中,属于不同子集群中的节点互不可达;
    所述处理单元还用于根据所述集群的子集群,确定工作集群。
  9. 根据权利要求8所述的设备,其特征在于,所述处理单元还用于执行下述任意一种方式:
    确定节点数量最多的子集群为工作集群;
    确定包含种子节点,且节点数量最多的子集群为工作集群,其中,所述种子节点为预配置,非种子节点通过所述种子节点加入集群;
    确定包含种子节点最多的子集群为工作集群;
    确定运行主业务的节点最多的子集群为工作集群;
    以及,基于所述子集群中的节点的健康状态或者可用资源状态确定工作集群,其中节点的健康状态基于所述节点对检测报文的响应时间确定。
  10. 根据权利要求8或9所述的设备,其特征在于,所述处理单元还用于,根据所述故障检测拓扑信息,确定节点之间的故障检测关系拓扑图,从所述故障检测关系拓扑图中删除所述故障指示消息所对应的边,确定删除后的故障检测关系拓扑图的连通子图,根据所述删除后的故障检测关系拓扑图的连通子图确定所述子集群。
  11. 根据权利要求8或9所述的设备,其特征在于,所述处理单元还用于,根据所述故障检测拓扑信息,以及所述故障指示消息,确定所述集群中的故障节点和故障链路,从所述集群的网络拓扑图中删除所述故障节点和/或故障链路,确定删除后的网络拓扑图的连通子图,根据所述删除后的网络拓扑图的连通子图确定所述子集群,其中,所述网络拓扑图包含了所述集群的所有节点之间的网络连接信息。
  12. 根据权利要求8-11任意一项所述的设备,其特征在于,
    所述处理单元还用于,确定所述工作集群中的不可达节点中被最多故障指示消息指向的不可达节点为要删除的节点,所述不可达节点为故障指示消息所指向的被检测节点;
    还包括发送单元,用于向所述工作集群中其他节点发送第一指示消息,所述第一指示消息用于指示所述要删除的节点。
  13. 根据权利要求12所述的设备,其特征在于,所述处理单元还用于,确定所述 工作集群中的不可达节点中被最多故障指示消息指向的不可达节点,且健康状态最差的一个为要删除的节点。
  14. 根据权利要求8-13任意一项所述的设备,其特征在于,所述第一获取单元还用于接收所述集群中其他节点发送的故障检测关系,根据接收到的故障检测关系确定所述故障检测拓扑信息;
    或者,基于预设规则推算所述故障检测拓扑信息。
  15. 一种计算机可读存储介质,包括计算机可读指令,当计算机读取并执行所述计算机可读指令时,使得计算机执行如权利要求1-7任意一项所述的方法。
  16. 一种计算机程序产品,包括计算机可读指令,当计算机读取并执行所述计算机可读指令,使得计算机执行如权利要求1-7任意一项所述的方法。
PCT/CN2018/082663 2017-07-12 2018-04-11 集群中节点的故障处理方法及设备 WO2019011018A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
EP18832228.3A EP3627767B1 (en) 2017-07-12 2018-04-11 Fault processing method and device for nodes in cluster
CA3066853A CA3066853A1 (en) 2017-07-12 2018-04-11 Intra-cluster node troubleshooting method and device
US16/732,749 US11115263B2 (en) 2017-07-12 2020-01-02 Intra-cluster node troubleshooting method and device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710564995.8 2017-07-12
CN201710564995.8A CN109257195B (zh) 2017-07-12 2017-07-12 集群中节点的故障处理方法及设备

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/732,749 Continuation US11115263B2 (en) 2017-07-12 2020-01-02 Intra-cluster node troubleshooting method and device

Publications (1)

Publication Number Publication Date
WO2019011018A1 true WO2019011018A1 (zh) 2019-01-17

Family

ID=65001087

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/082663 WO2019011018A1 (zh) 2017-07-12 2018-04-11 集群中节点的故障处理方法及设备

Country Status (5)

Country Link
US (1) US11115263B2 (zh)
EP (1) EP3627767B1 (zh)
CN (1) CN109257195B (zh)
CA (1) CA3066853A1 (zh)
WO (1) WO2019011018A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113590573A (zh) * 2021-06-25 2021-11-02 济南浪潮数据技术有限公司 一种分布式集群的请求的路由方法和装置
CN113805788A (zh) * 2020-06-12 2021-12-17 华为技术有限公司 一种分布式存储系统及其异常处理方法和相关装置
EP4047481A4 (en) * 2019-11-27 2023-01-04 Huawei Technologies Co., Ltd. METHOD AND DEVICE FOR RECOMMENDED TROUBLESHOOTING ACTIONS AND STORAGE MEDIA

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11410174B2 (en) * 2018-08-07 2022-08-09 International Business Machines Corporation Custom blockchain for IoT devices
CN110247818A (zh) * 2019-05-21 2019-09-17 中国平安财产保险股份有限公司 一种数据监控方法、装置、存储介质和服务器
CN111737079B (zh) * 2020-05-20 2024-04-09 山东鲸鲨信息技术有限公司 一种集群网络的监控方法和装置
CN111698132B (zh) * 2020-06-12 2022-03-01 北京字节跳动网络技术有限公司 用于控制集群中心跳事件的方法、装置、设备和介质
CN111786887A (zh) * 2020-06-30 2020-10-16 中国工商银行股份有限公司 由控制设备执行的数据转发方法、装置、计算设备和介质
CN112367191B (zh) * 2020-10-22 2023-04-07 深圳供电局有限公司 一种5g网络切片下服务故障定位方法
CN112468317A (zh) * 2020-11-05 2021-03-09 苏州浪潮智能科技有限公司 一种集群拓扑更新方法、系统、设备及计算机存储介质
CN112445809A (zh) * 2020-11-25 2021-03-05 浪潮云信息技术股份公司 一种分布式数据库节点存活状态检测模块及方法
CN112468596B (zh) * 2020-12-02 2022-07-05 苏州浪潮智能科技有限公司 一种集群仲裁方法、装置、电子设备及可读存储介质
US11995562B2 (en) * 2020-12-03 2024-05-28 International Business Machines Corporation Integrating documentation knowledge with log mining for system diagnosis
CN112910981B (zh) * 2021-01-27 2022-07-26 联想(北京)有限公司 一种控制方法及装置
US20220385488A1 (en) * 2021-05-31 2022-12-01 Nutanix, Inc. System and method for reconciling consumption data
US20230030168A1 (en) * 2021-07-27 2023-02-02 Dell Products L.P. Protection of i/o paths against network partitioning and component failures in nvme-of environments
US12045667B2 (en) * 2021-08-02 2024-07-23 International Business Machines Corporation Auto-split and auto-merge clusters
CN113660339B (zh) * 2021-08-18 2023-08-04 北京百度网讯科技有限公司 用于去中心化集群的方法和装置
CN113794593B (zh) * 2021-09-14 2023-05-26 新华三信息安全技术有限公司 一种集群故障处理方法及装置
CN114285722B (zh) * 2021-12-10 2023-08-25 苏州浪潮智能科技有限公司 一种分布式存储集群节点通信告警方法、装置、设备及介质
US12068907B1 (en) * 2023-01-31 2024-08-20 PagerDuty, Inc. Service dependencies based on relationship network graph
CN116545766B (zh) * 2023-06-27 2023-12-15 积至网络(北京)有限公司 基于链式安全的验证方法、系统及设备

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130041924A1 (en) * 2010-05-04 2013-02-14 International Business Machines Corporation Event impact analysis
CN104378232A (zh) * 2014-11-10 2015-02-25 东软集团股份有限公司 主备集群组网模式下的脑裂发现、恢复方法及装置
CN104796273A (zh) * 2014-01-20 2015-07-22 中国移动通信集团山西有限公司 一种网络故障根源诊断的方法和装置
CN106209400A (zh) * 2015-04-30 2016-12-07 华为技术有限公司 一种定位故障的方法和设备

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6438705B1 (en) * 1999-01-29 2002-08-20 International Business Machines Corporation Method and apparatus for building and managing multi-clustered computer systems
US7327683B2 (en) * 2000-03-16 2008-02-05 Sri International Method and apparatus for disseminating topology information and for discovering new neighboring nodes
US8195976B2 (en) * 2005-06-29 2012-06-05 International Business Machines Corporation Fault-tolerance and fault-containment models for zoning clustered application silos into continuous availability and high availability zones in clustered systems during recovery and maintenance
US8634289B2 (en) * 2009-12-31 2014-01-21 Alcatel Lucent Efficient protection scheme for MPLS multicast
US9098392B1 (en) * 2011-04-29 2015-08-04 Symantec Corporation Systems and methods for changing fencing modes in clusters
CN102594596B (zh) * 2012-02-15 2014-08-20 华为技术有限公司 识别集群网络中可用分区的方法、装置及集群网络系统
US9176799B2 (en) * 2012-12-31 2015-11-03 Advanced Micro Devices, Inc. Hop-by-hop error detection in a server system
US9292371B1 (en) * 2013-12-11 2016-03-22 Symantec Corporation Systems and methods for preventing failures of nodes in clusters
US9450852B1 (en) * 2014-01-03 2016-09-20 Juniper Networks, Inc. Systems and methods for preventing split-brain scenarios in high-availability clusters
CN105450717A (zh) * 2014-09-29 2016-03-30 中兴通讯股份有限公司 集群脑裂处理方法和装置
US10320703B2 (en) * 2015-09-30 2019-06-11 Veritas Technologies Llc Preventing data corruption due to pre-existing split brain
CN106254103B (zh) * 2016-07-28 2019-08-16 北京国电通网络技术有限公司 一种rtmp集群系统可动态配置方法及装置
CN106656682B (zh) 2017-02-27 2019-10-25 网宿科技股份有限公司 集群心跳检测方法、系统及装置

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130041924A1 (en) * 2010-05-04 2013-02-14 International Business Machines Corporation Event impact analysis
CN104796273A (zh) * 2014-01-20 2015-07-22 中国移动通信集团山西有限公司 一种网络故障根源诊断的方法和装置
CN104378232A (zh) * 2014-11-10 2015-02-25 东软集团股份有限公司 主备集群组网模式下的脑裂发现、恢复方法及装置
CN106209400A (zh) * 2015-04-30 2016-12-07 华为技术有限公司 一种定位故障的方法和设备

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3627767A4

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP4047481A4 (en) * 2019-11-27 2023-01-04 Huawei Technologies Co., Ltd. METHOD AND DEVICE FOR RECOMMENDED TROUBLESHOOTING ACTIONS AND STORAGE MEDIA
US11743113B2 (en) 2019-11-27 2023-08-29 Huawei Technologies Co., Ltd. Fault rectification operation recommendation method and apparatus, and storage medium
CN113805788A (zh) * 2020-06-12 2021-12-17 华为技术有限公司 一种分布式存储系统及其异常处理方法和相关装置
CN113805788B (zh) * 2020-06-12 2024-04-09 华为技术有限公司 一种分布式存储系统及其异常处理方法和相关装置
CN113590573A (zh) * 2021-06-25 2021-11-02 济南浪潮数据技术有限公司 一种分布式集群的请求的路由方法和装置

Also Published As

Publication number Publication date
US20200145283A1 (en) 2020-05-07
EP3627767B1 (en) 2022-02-23
CA3066853A1 (en) 2019-01-17
CN109257195A (zh) 2019-01-22
EP3627767A4 (en) 2020-05-13
EP3627767A1 (en) 2020-03-25
US11115263B2 (en) 2021-09-07
CN109257195B (zh) 2021-01-15

Similar Documents

Publication Publication Date Title
WO2019011018A1 (zh) 集群中节点的故障处理方法及设备
US10560315B2 (en) Method and device for processing failure in at least one distributed cluster, and system
US9106548B2 (en) Network fault localization
WO2016107369A1 (zh) 一种管理数据传输通道的方法及装置
US10728099B2 (en) Method for processing virtual machine cluster and computer system
US10277454B2 (en) Handling failure of stacking system
US10868709B2 (en) Determining the health of other nodes in a same cluster based on physical link information
WO2016029749A1 (zh) 一种通信故障的检测方法、装置及系统
US10680944B2 (en) Arbitrating mastership between redundant control planes of a virtual node
CN108632099B (zh) 一种链路聚合的故障检测方法及装置
WO2016082443A1 (zh) 集群仲裁方法和多集群配合系统
CN111176888B (zh) 云存储的容灾方法、装置及系统
US10530634B1 (en) Two-channel-based high-availability
WO2017032223A1 (zh) 虚拟机部署方法及装置
WO2015168947A1 (zh) 路径切换的方法和设备
US10367711B2 (en) Protecting virtual computing instances from network failures
US11418382B2 (en) Method of cooperative active-standby failover between logical routers based on health of attached services
US11108623B2 (en) Rapid owner selection
JP5503600B2 (ja) 故障管理システムおよび故障管理方法
US20130262664A1 (en) Computer system and subsystem management method
US10374941B2 (en) Determining aggregation information
US9654363B2 (en) Synthetic loss measurements using session numbers
US20140325279A1 (en) Target failure based root cause analysis of network probe failures
US10516625B2 (en) Network entities on ring networks
CN112751755B (zh) 一种设备虚拟化方法、装置、系统、设备及存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18832228

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 3066853

Country of ref document: CA

ENP Entry into the national phase

Ref document number: 2018832228

Country of ref document: EP

Effective date: 20191219

NENP Non-entry into the national phase

Ref country code: DE