WO2016192408A1 - Fault detection method and apparatus for a node in a cluster system - Google Patents

Info

Publication number
WO2016192408A1
WO2016192408A1 PCT/CN2016/073606 CN2016073606W WO2016192408A1 WO 2016192408 A1 WO2016192408 A1 WO 2016192408A1 CN 2016073606 W CN2016073606 W CN 2016073606W WO 2016192408 A1 WO2016192408 A1 WO 2016192408A1
Authority
WO
WIPO (PCT)
Prior art keywords
node
heartbeat
nodes
neighbor
neighboring
Prior art date
Application number
PCT/CN2016/073606
Other languages
English (en)
French (fr)
Inventor
胡琳
伍湘平
彭佩星
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2016192408A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/10 Active monitoring, e.g. heartbeat, ping or trace-route

Definitions

  • The embodiments of the present invention relate to communication technologies, and in particular, to a fault detection method and apparatus for a node in a cluster system.
  • A distributed cluster system usually includes a central node and multiple common nodes. When the central node or a common node fails, the reliability of the distributed cluster system is greatly affected. Therefore, effective node fault detection is very important.
  • FIG. 1 is a schematic diagram of a method for detecting a fault of a node in the prior art.
  • As shown in FIG. 1, the cluster system includes common nodes B, C, D, and E and a central node M.
  • The central node (M) can also periodically send heartbeat messages to the common nodes (B, C, D, E) to inform them of its role as the central node and of whether it is in a normal state. If a common node (for example, B) does not receive the heartbeat message sent by the central node (M) within the detection period, it determines that the central node (M) is faulty and initiates a re-election of the central node. If the election succeeds, the common nodes perceive the new central node and send their heartbeat messages to it, and the cluster continues to perform fault detection.
  • Embodiments of the present invention provide a fault detection method and apparatus for a node in a cluster system, which are used to solve the problem in the prior art that node fault detection requires multiple heartbeat periods and therefore takes a long time, thereby improving the efficiency of node fault detection.
  • an embodiment of the present invention provides a method for detecting a fault of a node in a cluster system, including:
  • the first node determines whether a first heartbeat message sent by a second node is received within a preset time; the first node is a neighbor node of the second node, the first heartbeat message is a heartbeat message sent by the second node in parallel to each neighbor node of the second node, and the number of all neighbor nodes of the second node is two or more; the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods;
  • if the first node does not receive the first heartbeat message sent by the second node, the first node sends a request message to the other neighbor nodes, that is, all neighbor nodes of the second node other than the first node, where the request message is used to query whether the other neighbor nodes received the first heartbeat message;
  • the first node receives a response message that is sent by each of the other neighbor nodes and carries a receiving state, where the receiving state is used to indicate whether the first heartbeat message was received;
  • if the first node determines, according to the receiving state carried in the response message sent by each of the other neighbor nodes, that none of the other neighbor nodes received the first heartbeat message, the first node determines that the second node is faulty.
  • the method further includes:
  • the first node generates first voting information, and receives second voting information sent by each of the other neighboring nodes, where the first voting information includes a node identifier corresponding to the node that is elected by the first node;
  • the second voting information includes a node identifier corresponding to the node elected by the neighbor node that sends the second voting information;
  • the first node counts, according to the node identifier in the first voting information and the node identifier in the second voting information sent by each of the other neighbor nodes, the number of votes obtained by each of the elected nodes, and determines the node with the highest number of votes as the third node; the third node is the node that replaces the second node and sends heartbeat messages in parallel to all neighbor nodes of the third node; all neighbor nodes of the third node include the neighbor nodes of the third node itself and the neighbor nodes of the second node.
  • the method further includes:
  • if the first node determines, according to the receiving state carried in the response message sent by each of the other neighbor nodes, that at least one of the other neighbor nodes received the first heartbeat message, the first node determines that the link between each node that did not receive the first heartbeat message and the second node is faulty; the nodes that did not receive the first heartbeat message include the first node and those of the other neighbor nodes that did not receive the first heartbeat message.
  • the method further includes:
  • the first node re-determines a neighbor node of the first node according to a neighboring node of the third node and a node other than the third node among the other neighboring nodes.
  • an embodiment of the present invention provides a method for detecting a fault of a node in a cluster system, where the method includes:
  • the second node sends the first heartbeat message to the first node and the other neighbor nodes in parallel;
  • the first node is a neighbor node of the second node, the other neighbor nodes are the nodes other than the first node among all neighbor nodes of the second node, and the number of the other neighbor nodes is one or more;
  • the first node determines whether the first heartbeat message is received within a preset time; the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods;
  • if the first node does not receive the first heartbeat message, the first node sends a request message to each of the other neighbor nodes, where the request message is used to query whether each of the other neighbor nodes received the first heartbeat message;
  • the first node receives a response message that is sent by each of the other neighboring nodes and carries a receiving state, where the receiving state is used to indicate whether the first heartbeat packet is received.
  • the method further includes:
  • the first node generates first voting information, and receives second voting information sent by each of the other neighboring nodes, where the first voting information includes a node identifier corresponding to the node that is elected by the first node;
  • the second voting information includes a node identifier corresponding to the node elected by the neighbor node that sends the second voting information;
  • the first node counts the number of votes obtained by each node of all the elected nodes according to the node identifier in the first voting information and the node identifier in the second voting information sent by each of the other neighboring nodes. And the node with the highest number of votes is the third node; the third node is a node that replaces the second node and sends a heartbeat message to all neighbor nodes of the third node in parallel; the third node All neighbor nodes include the neighbor node of the third node itself and the neighbor node of the second node.
  • the method further includes:
  • the first node determines that the link between each node that did not receive the first heartbeat message and the second node is faulty; the nodes that did not receive the first heartbeat message include the first node and those of the other neighbor nodes that did not receive the first heartbeat message.
  • in the second aspect or the second possible implementation manner of the second aspect, a third possible implementation manner of the second aspect further includes:
  • the first node re-determines a neighbor node of the first node according to a neighboring node of the third node and a node other than the third node among the other neighboring nodes.
  • an embodiment of the present invention provides a fault detection apparatus for a node in a cluster system, including:
  • a determining module, configured to determine whether a first heartbeat message sent by the second node is received within a preset time; the first node is a neighbor node of the second node, the first heartbeat message is a heartbeat message sent by the second node in parallel to each neighbor node of the second node, and the number of all neighbor nodes of the second node is two or more; the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods;
  • a sending module, configured to: when the determining module determines that the receiving module has not received the first heartbeat message sent by the second node, send a request message to the other neighbor nodes, that is, all neighbor nodes of the second node other than the first node, where the request message is used to query whether the other neighbor nodes received the first heartbeat message;
  • the receiving module is further configured to receive a response message that is sent by each of the other neighbor nodes and carries a receiving state, where the receiving state is used to indicate whether the first heartbeat message was received;
  • a determining module, configured to determine, according to the receiving state carried in the response message that is sent by each of the other neighbor nodes and received by the receiving module, whether none of the other neighbor nodes received the first heartbeat message;
  • the determining module is further configured to determine that the second node is faulty if the determining module determines that none of the other neighbor nodes received the first heartbeat message.
  • the apparatus further includes:
  • a generating module configured to generate first voting information, where the first voting information includes a node identifier corresponding to the node elected by the first node;
  • the receiving module is further configured to receive second voting information that is sent by each of the other neighboring nodes, where the second voting information includes a node identifier corresponding to a node that is elected by the neighboring node that sends the second voting information;
  • the determining module is further configured to count, according to the node identifier in the first voting information and the node identifier in the second voting information sent by each of the other neighbor nodes, the number of votes obtained by each of the elected nodes, and to determine the node with the highest number of votes as the third node;
  • the third node is the node that replaces the second node and sends heartbeat messages in parallel to all neighbor nodes of the third node; all neighbor nodes of the third node include the neighbor nodes of the third node itself and the neighbor nodes of the second node.
  • the determining module is further configured to determine that the link between each node that did not receive the first heartbeat message and the second node is faulty; the nodes that did not receive the first heartbeat message include the first node and those of the other neighbor nodes that did not receive the first heartbeat message.
  • the determining module is further configured to re-determine the neighbor node of the first node according to the neighboring node of the third node and the node other than the third node of the other neighboring nodes.
  • an embodiment of the present invention provides a fault detection system for a node in a cluster system, including a first node, a second node, and other neighbor nodes, where the first node is a neighbor node of the second node, the other neighbor nodes are the nodes other than the first node among all neighbor nodes of the second node, and the number of the other neighbor nodes is one or more, where:
  • the second node is configured to send a first heartbeat message to the first node and the other neighbor nodes in parallel;
  • the first node is configured to determine whether the first heartbeat packet is received within a preset time; the preset time is greater than or equal to one heartbeat period, and is less than two heartbeat periods;
  • the first node is further configured to: when it does not receive the first heartbeat message, separately send a request message to each of the other neighbor nodes, where the request message is used to query whether each of the other neighbor nodes received the first heartbeat message; and receive a response message that is sent by each of the other neighbor nodes and carries a receiving state, where the receiving state is used to indicate whether the first heartbeat message was received;
  • the first node is further configured to determine that the second node is faulty if it determines, according to the receiving state carried in the response message sent by each of the other neighbor nodes, that none of the other neighbor nodes received the first heartbeat message.
  • the method further includes:
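  • the first node is further configured to generate first voting information and to receive second voting information sent by each of the other neighbor nodes, where the first voting information includes a node identifier corresponding to the node elected by the first node, and the second voting information includes a node identifier corresponding to the node elected by the neighbor node that sends the second voting information;
  • the first node is further configured to count, according to the node identifiers in the first voting information and in the second voting information, the number of votes obtained by each of the elected nodes, and to determine the node with the largest number of votes as the third node;
  • the third node is the node that replaces the second node and sends heartbeat messages in parallel to all neighbor nodes of the third node; all neighbor nodes of the third node include the neighbor nodes of the third node itself and the neighbor nodes of the second node.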
  • the first node is further configured to determine that the link between each node that did not receive the first heartbeat message and the second node is faulty; the nodes that did not receive the first heartbeat message include the first node and those of the other neighbor nodes that did not receive the first heartbeat message.
  • the first node is further configured to re-determine the neighbor node of the first node according to the neighboring node of the third node and the node other than the third node of the other neighboring nodes.
  • The first node determines whether the first heartbeat message sent by the second node is received within a preset time, where the first node is a neighbor node of the second node, the first heartbeat message is a heartbeat message sent by the second node in parallel to each neighbor node of the second node, and the number of all neighbor nodes of the second node is two or more; the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods. If the first node does not receive the first heartbeat message itself, it asks the other neighbor nodes of the second node whether they received the first heartbeat message, and if it determines that none of them received it, it determines that the second node is faulty.
  • Because the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods, fault detection using the technical solution provided by the present invention avoids the prior-art need for multiple heartbeat periods to detect whether a node is faulty, which shortens the fault detection period and thereby improves the efficiency of node fault detection.
  • FIG. 1 is a schematic structural diagram of a method for detecting a fault of a node in a cluster system in the prior art
  • FIG. 2 is a schematic flowchart of Embodiment 1 of a method for detecting a fault of a node in a cluster system according to the present invention
  • FIG. 3 is a schematic diagram 1 of an adjacency relationship between nodes in a cluster system
  • FIG. 4 is a schematic diagram 2 of an adjacency relationship between nodes in a cluster system
  • FIG. 5 is a schematic flowchart of Embodiment 2 of a method for detecting a fault of a node in a cluster system according to the present invention
  • FIG. 6A is a schematic diagram of an adjacency relationship between nodes before a node failure is detected in a cluster system
  • FIG. 6B is a schematic diagram of the re-determined adjacency relationship between nodes after a node failure is detected in the cluster system
  • FIG. 7 is a schematic flowchart of Embodiment 3 of a method for detecting a fault of a node in a cluster system according to the present invention
  • FIG. 8 is a schematic flowchart of Embodiment 4 of a method for detecting a fault of a node in a cluster system according to the present invention
  • FIG. 9 is a schematic structural diagram of Embodiment 1 of a fault detecting apparatus for a node in a cluster system according to the present invention.
  • FIG. 10 is a schematic structural diagram of Embodiment 1 of a fault detection system for a node in a cluster system according to the present invention
  • FIG. 10 is a schematic structural diagram of Embodiment 1 of a node according to the present invention
  • FIG. 11 is a schematic structural diagram of Embodiment 1 of a node according to the present invention.
  • The embodiments of the present invention are applicable to a cluster system, and specifically to the scenario of fault detection of a node in a distributed cluster system.
  • The distributed cluster system includes at least two nodes, each of which may be, for example, a computer.
  • The cluster system in this embodiment differs from the existing cluster system in that all nodes are given the same function, that is, all nodes have the same ability to send and receive heartbeat messages. Therefore, in the cluster system of this embodiment, there is no distinction between a central node and common nodes, and no central node is needed to manage the common nodes.
  • The technical solutions of the following embodiments are all described by using a computer as the execution subject.
  • FIG. 2 is a schematic flowchart diagram of Embodiment 1 of a method for detecting a fault of a node in a cluster system according to the present invention.
  • the method according to the embodiment of the present invention is applicable to a distributed cluster system. This embodiment is described by taking a computer as an execution subject as an example. As shown in FIG. 2, the method in this embodiment may include:
  • Step 201 The first node determines whether the first heartbeat message sent by the second node is received within the preset time; the first node is a neighbor node of the second node, the first heartbeat message is a heartbeat message sent by the second node in parallel to each neighbor node of the second node, and the number of all neighbor nodes of the second node is two or more; the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods.
  • The second node determines the first node according to the information of all nodes in the cluster system and a preset rule of the cluster system, where the first node is a neighbor node of the second node, and a neighbor node of the second node is a node associated with the second node.
  • FIG. 3 is a schematic diagram 1 of the adjacency relationship between nodes in a cluster system.
  • As shown in FIG. 3, node E can determine, according to the information of all nodes and the preset rule of the cluster system, that it has four neighbor nodes, namely nodes A, B, C, and D.
  • In this case, the first node may be any one of nodes A, B, C, and D.
  • The first node detects whether the second node is faulty by determining whether the first heartbeat message sent by the second node is received within the preset time. It should be noted that the second node sends heartbeat messages to all its neighbor nodes in parallel, so the first heartbeat message is a heartbeat message that the second node sends to each of its neighbor nodes at the same time. In addition, the second node may send the first heartbeat message to all its neighbor nodes in parallel once per heartbeat period. Therefore, the first node may determine whether the first heartbeat message sent by the second node is received within a time that is greater than or equal to one heartbeat period and less than two heartbeat periods.
  • For example, assume that the heartbeat period is 5 s, that is, the second node sends a heartbeat message to all its neighbor nodes in parallel every 5 s.
  • The first node then determines whether the first heartbeat message sent by the second node is received within a time that is greater than or equal to 5 s and less than 10 s.
  • the heartbeat period can be set according to the experience or the actual situation. The specific value of the heartbeat period is not limited in this embodiment.
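  • The timing rule of Step 201 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patented implementation: the function names `preset_time_is_valid` and `heartbeat_missed`, and the use of seconds as the time unit, are introduced here and do not appear in the source.

```python
def preset_time_is_valid(preset_time_s: float, heartbeat_period_s: float) -> bool:
    """True when one heartbeat period <= preset time < two heartbeat periods."""
    return heartbeat_period_s <= preset_time_s < 2 * heartbeat_period_s


def heartbeat_missed(last_heartbeat_at: float, now: float, preset_time_s: float) -> bool:
    """True when no heartbeat has arrived within the preset time.

    `last_heartbeat_at` and `now` are timestamps in seconds on the same clock
    (hypothetical parameters chosen for this sketch).
    """
    return (now - last_heartbeat_at) >= preset_time_s
```

  • With a 5 s heartbeat period, any preset time from 5 s up to but not including 10 s satisfies the rule, so a missed heartbeat is noticed before a second period has fully elapsed.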
  • The second node may periodically send the first heartbeat message to the first node through a single physical network. However, when fault detection is based on a single physical network and that network fails (for example, the management plane network fails while the service plane network is normal), it is often impossible to tell whether the second node has failed, the link between the second node and the first node has failed, or the second node and the first node have failed simultaneously, so the fault detection result is inaccurate.
  • To address this, the first heartbeat message may be sent over at least two physical networks.
  • For example, the first heartbeat message may be sent over two planes, such as the management plane and the service plane.
  • The first heartbeat message may also be sent over three planes, such as the management plane, the service plane, and the signaling plane.
  • Sending the first heartbeat message over multiple physical networks to detect whether a node is faulty improves detection accuracy. It should be noted that when at least two physical networks are used, they are isolated from each other. This avoids the situation in which nodes cannot communicate because a device shared by multiple networks fails, and helps improve detection accuracy.
  • Step 202 If the first node does not receive the first heartbeat message sent by the second node, the first node sends a request message to the other neighbor nodes, that is, all neighbor nodes of the second node other than the first node. The request message is used to query whether the other neighbor nodes received the first heartbeat message.
  • In the prior art, common nodes cannot be added to the cluster system indefinitely because of the performance limit of the central node, which affects the scalability of the cluster system.
  • In this embodiment, if the first node does not receive the first heartbeat message sent by the second node within the preset time, it can be determined that the second node may be faulty. Because the second node sends the first heartbeat message to all its neighbor nodes in parallel, the first node sends a request message to the neighbor nodes of the second node other than itself to ask whether they received the first heartbeat message sent by the second node.
  • The first node only needs to send the request message to the other neighbor nodes of the second node, and nodes that are not neighbors of the second node do not send heartbeat messages to the second node. This reduces the number of heartbeat messages processed by the second node, which lightens the load on the second node and makes the cluster system more scalable.
  • FIG. 4 is a schematic diagram 2 of an adjacency relationship between nodes in a cluster system.
  • As shown in FIG. 4, the neighbor nodes of node E are X, A, D, C, and G, and node E sends a heartbeat message to all its neighbor nodes X, A, D, C, and G in each heartbeat period. Assume that node E is the second node and node A is the first node. If the first node A does not receive the first heartbeat message sent by node E within the preset time (for example, one heartbeat period), the first node A sends a request message to the other neighbor nodes X, D, C, and G to query whether they received the first heartbeat message.
  • Step 203 The first node receives a response message that is sent by another neighboring node and carries a receiving state, where the receiving state is used to indicate whether the first heartbeat packet is received.
  • After receiving the request message sent by the first node, each of the other neighbor nodes carries its receiving state for the first heartbeat message in a response message and sends the response message to the first node.
  • Step 204 If the first node determines, according to the receiving state carried in the response message received from each of the other neighbor nodes, that none of the other neighbor nodes received the first heartbeat message, the first node determines that the second node is faulty.
  • After receiving the request message sent by the first node, each of the other neighbor nodes returns a response message carrying its receiving state to the first node. The first node determines, according to the response messages received from the other neighbor nodes, whether each of them received the first heartbeat message. If none of the other neighbor nodes received the first heartbeat message sent by the second node, the first node determines that the second node has failed.
  • The neighbor relationship between nodes is bidirectional, that is, nodes that form a neighbor relationship can send heartbeat messages to each other. Therefore, every neighbor node of the second node performs steps 201 to 204 separately.
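  • The decision made in steps 201 to 204, together with the link-fault case described in the possible implementation manners above, can be sketched as a small classification function. This is a minimal Python sketch under stated assumptions: the names `Verdict`, `classify`, and `nodes_with_faulty_link` are hypothetical, and the receiving states are modelled as a plain dictionary from node identifier to a boolean instead of real response messages.

```python
from enum import Enum
from typing import Dict, List


class Verdict(Enum):
    NODE_OK = "node_ok"        # the first node itself received the heartbeat
    NODE_FAULT = "node_fault"  # no neighbor of the second node received it
    LINK_FAULT = "link_fault"  # at least one other neighbor did receive it


def classify(first_node_received: bool, peer_states: Dict[str, bool]) -> Verdict:
    """Classify the second node from the first node's point of view.

    `peer_states` maps each other neighbor node to its reported receiving
    state (True = first heartbeat message received). The source requires at
    least one other neighbor, so the dictionary is expected to be non-empty
    whenever `first_node_received` is False.
    """
    if first_node_received:
        return Verdict.NODE_OK
    if any(peer_states.values()):
        return Verdict.LINK_FAULT
    return Verdict.NODE_FAULT


def nodes_with_faulty_link(first_node: str, peer_states: Dict[str, bool]) -> List[str]:
    """Nodes whose link to the second node is considered faulty (LINK_FAULT case)."""
    return sorted([first_node] + [n for n, ok in peer_states.items() if not ok])
```

  • For the FIG. 4 example, node A would call `classify(False, {"X": False, "D": False, "C": False, "G": False})` and conclude that node E is faulty, because none of X, D, C, and G received the first heartbeat message.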
  • In this embodiment, the first node determines whether the first heartbeat message sent by the second node is received within the preset time, where the first node is a neighbor node of the second node, the first heartbeat message is a heartbeat message sent by the second node in parallel to each neighbor node of the second node, and the number of all neighbor nodes of the second node is two or more; the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods. If the first node does not receive the first heartbeat message, it asks the other neighbor nodes of the second node whether they received the first heartbeat message, and if it determines that none of them received it, it determines that the second node has failed.
  • Because the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods, fault detection using the technical solution provided by the present invention avoids the prior-art need for multiple heartbeat periods to detect whether a node is faulty, which shortens the fault detection period and thereby improves the efficiency of node fault detection.
  • FIG. 5 is a schematic flowchart diagram of Embodiment 2 of a method for detecting a fault of a node in a cluster system according to the present invention.
  • the method in this embodiment may include:
  • Step 501 The first node generates first voting information, and receives second voting information sent by each other neighboring node.
  • The first voting information includes a node identifier corresponding to the node elected by the first node, and the second voting information includes a node identifier corresponding to the node elected by the neighbor node that sends the second voting information.
  • After the neighbor nodes of the second node determine that the second node has failed, all of these neighbor nodes need to recalculate their respective neighbor nodes.
  • Any one of the neighbor nodes of the second node may serve as the first node. The first node generates the first voting information, which includes the node identifier corresponding to the node elected by the first node and the voting basis.
  • The first node also receives the second voting information sent by each of the other neighbor nodes, where the second voting information includes the node identifier corresponding to the node elected by the neighbor node that sends the second voting information, and the voting basis.
  • The voting basis may involve various factors, such as load, node number, node cache age, and node network bandwidth. For example, the first node can determine which node bears the least load, carry the node identifier of that node in the first voting information, and send it to the other neighbor nodes. Similarly, the other neighbor nodes can send their second voting information to the first node in a similar manner.
  • Step 502 The first node counts, according to the node identifier in the first voting information and the node identifier in the second voting information sent by each of the other neighbor nodes, the number of votes obtained by each of the elected nodes, and determines the node with the highest number of votes as the third node; the third node is the node that replaces the second node and sends heartbeat messages in parallel to all neighbor nodes of the third node; all neighbor nodes of the third node include the neighbor nodes of the third node itself and the neighbor nodes of the second node.
  • After receiving the second voting information sent by each other neighbor node, the first node can determine the third node according to the node identifier in the first voting information generated by itself and the node identifiers in the received second voting information.
  • Specifically, the first node may count the number of votes obtained by each of the elected nodes according to the node identifiers carried in the first voting information and the second voting information, and take the node with the most votes as the third node.
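  • The vote counting described above can be illustrated with a minimal sketch. It is not taken from the patent text: the function name `elect_third_node` is hypothetical, and breaking ties by the smallest node identifier is an assumption, since the description does not say how ties are resolved.

```python
from collections import Counter


def elect_third_node(first_vote, second_votes):
    """Return the node identifier that obtained the most votes.

    first_vote:   node identifier elected by the first node.
    second_votes: node identifiers elected by the other neighbor nodes.
    Tie-breaking by smallest identifier is an assumption, not from the source.
    """
    tally = Counter([first_vote, *second_votes])
    highest = max(tally.values())
    # Among the nodes sharing the highest count, pick the smallest identifier.
    return min(node for node, count in tally.items() if count == highest)


# The first node votes for "A"; four other neighbor nodes vote A, D, A, C.
third = elect_third_node("A", ["A", "D", "A", "C"])
```

  • Because every neighbor node of the second node counts the same set of votes, a deterministic tie-break of this kind would let all of them arrive at the same third node without further coordination.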
  • The third node takes over the neighbor nodes of the failed second node, that is, it takes over the relationships between the second node and other nodes. The third node therefore replaces the second node and sends heartbeat messages in parallel to all neighbor nodes of the third node, where all neighbor nodes of the third node include the neighbor nodes of the second node in addition to the third node's own neighbor nodes.
  • Step 503 The first node re-determines the neighbor nodes of the first node according to the neighbor nodes of the third node and the nodes other than the third node among the other neighbor nodes.
  • After all the neighbor nodes of the second node determine the third node by voting, if the first node is the third node, the first node takes over the neighbor relationships of the second node, and the other neighbor nodes may re-determine their respective neighbor nodes according to the neighbor relationships that result after the first node takes over the neighbor nodes of the second node.
  • If the first node is not the third node, the first node re-determines its neighbor nodes according to the neighbor nodes of the third node and the nodes other than the third node among the other neighbor nodes.
  • FIG. 6A is a schematic diagram of the adjacency relationships between nodes before a node failure is detected in a cluster system.
  • FIG. 6B is a schematic diagram of re-determining an adjacent relationship between nodes after detecting a node failure in the cluster system.
  • Node E is the second node and node A is the first node. After the first node A determines that the second node E has failed, the first node A generates the first voting information and receives the second voting information sent by nodes X, D, C, and G respectively. The first node A determines the third node according to the node identifier in the first voting information and the node identifiers in the second voting information, so that the third node replaces the second node and sends heartbeat messages in parallel to all neighbor nodes of the third node.
  • As shown in FIG. 6B, if the first node A is determined to be the third node by voting, the first node A replaces the second node E and sends heartbeat messages in parallel to all neighbor nodes of the first node A.
  • The first node A needs to re-determine its neighbor nodes with reference to the other neighbor nodes X, D, C, and G. The nodes X, D, C, and G wait for the first node A to determine its own neighbor nodes and then re-determine their respective neighbor nodes according to the neighbor nodes determined by the first node A.
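  • As a hedged illustration of the takeover in this example, the following sketch computes the neighbor set of the third node as the union of its own neighbor nodes and those of the failed second node. The function name is hypothetical, node A's prior neighbor "B" is invented for illustration (FIG. 6A is not reproduced here), and removing the failed node and the third node itself from the result is an assumption rather than something the description states.

```python
def neighbors_after_takeover(third, failed, third_neighbors, failed_neighbors):
    """Neighbor set of the third node after it takes over for the failed node.

    The union follows the description; excluding the failed node and the
    third node itself from the result is an assumption.
    """
    return (set(third_neighbors) | set(failed_neighbors)) - {third, failed}


# Node E's neighbors (A, X, D, C, G) follow the example above.
# Node A's prior neighbor "B" is hypothetical.
new_neighbors = neighbors_after_takeover(
    third="A",
    failed="E",
    third_neighbors={"E", "B"},
    failed_neighbors={"A", "X", "D", "C", "G"},
)
```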
  • The first node determines whether the first heartbeat message sent by the second node is received within a preset time, where the first node is a neighbor node of the second node, the first heartbeat message is a heartbeat message sent by the second node in parallel to each neighbor node of the second node, and the number of all neighbor nodes of the second node is two or more. The preset time is greater than or equal to one heartbeat period and less than two heartbeat periods. If the first node does not receive the first heartbeat message, the first node queries whether the other neighbor nodes of the second node received the first heartbeat message, and if it determines that none of the other neighbor nodes of the second node received the first heartbeat message, it determines that the second node has failed.
  • Because the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods, fault detection using the technical solution provided by the present invention avoids the prior-art situation in which multiple heartbeat periods are needed to detect whether a node has failed. This shortens the fault detection cycle and thereby improves the efficiency of node fault detection.
  • In addition, after the second node is determined to have failed, the respective neighbor nodes are re-determined and fault detection then continues, which improves the accuracy of fault detection.
  • If the first node determines, according to the receiving state carried in the response message sent by each of the other neighbor nodes, that at least one other neighbor node received the first heartbeat message, the first node determines that the link between each node that did not receive the first heartbeat message and the second node has failed.
  • When the first node does not receive the first heartbeat message sent by the second node, it sends a request message to each other neighbor node to query whether that neighbor node received the first heartbeat message. If the response messages sent by the other neighbor nodes show that at least one other neighbor node received the first heartbeat message, the first node may determine that the second node is normal and that the fault may lie in the link between the second node and the first node and in the links between the second node and any other neighbor nodes that did not receive the first heartbeat message. The nodes that did not receive the first heartbeat message include the first node and those of the other neighbor nodes that did not receive the first heartbeat message.
  • When the first node determines that at least one other neighbor node received the first heartbeat message, the first node determines that the link between each node that did not receive the first heartbeat message and the second node has failed, which makes the fault detection more comprehensive.
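  • The link-failure determination can be sketched as follows. This is an illustrative sketch only: the function name `suspect_links` and the dictionary representation of the receiving states are assumptions, not part of the described method.

```python
def suspect_links(first_node, responses):
    """Return the nodes whose link to the second node is considered failed.

    responses maps each other neighbor node's identifier to its receiving
    state (True if that node received the first heartbeat message).
    An empty list means no other neighbor received the heartbeat, which is
    the node-failure case rather than a link failure.
    """
    if not any(responses.values()):
        return []
    # The first node itself missed the heartbeat, so its link is suspect too.
    return [first_node] + sorted(n for n, got in responses.items() if not got)


# Node A missed the heartbeat; X and C received it, D and G did not.
links = suspect_links("A", {"X": True, "D": False, "C": True, "G": False})
```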
  • FIG. 7 is a schematic flowchart diagram of Embodiment 3 of a method for detecting a fault of a node in a cluster system according to the present invention.
  • the method according to the embodiment of the present invention is applicable to a distributed cluster system.
  • A computer is again taken as the execution subject by way of example.
  • the method in this embodiment may include:
  • Step 701 The second node sends the first heartbeat message to the first node and the other neighbor nodes in parallel, where the first node is a neighbor node of the second node, the other neighbor nodes are all the neighbor nodes of the second node except the first node, and the number of other neighbor nodes is one or more.
  • The second node may determine all of its neighbor nodes based on the information of the nodes included in the cluster system and according to a preset rule in the cluster system, where the first node is any neighbor node of the second node, and a neighbor node of the second node is a node associated with the second node.
  • After determining all of its neighbor nodes, the second node sends the first heartbeat message to the first node and the other neighbor nodes in parallel.
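  • One way to picture the parallel sending in Step 701 is the sketch below, which uses a thread pool so that a slow neighbor does not delay the others. The function name and the `send` callable are hypothetical stand-ins for whatever transport a real cluster would use, and the node labels are borrowed from the FIG. 6 example purely for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def send_heartbeats_in_parallel(sender_id, neighbor_ids, send):
    """Call send(sender_id, neighbor_id) for every neighbor concurrently.

    send is a hypothetical transport callable supplied by the caller.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(neighbor_ids))) as pool:
        # list() waits for every send to finish and re-raises any exception.
        list(pool.map(lambda neighbor: send(sender_id, neighbor), neighbor_ids))


# Record each delivery instead of using a real network.
delivered = []
send_heartbeats_in_parallel(
    "E", ["A", "X", "D", "C", "G"], lambda s, n: delivered.append((s, n))
)
```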
  • Step 702 The first node determines whether the first heartbeat message is received within a preset time; the preset time is greater than or equal to one heartbeat period, and is less than two heartbeat periods.
  • The second node may send the first heartbeat message to all its neighbor nodes in parallel once per heartbeat period. Therefore, the first node may determine whether the first heartbeat message sent by the second node is received within a time that is greater than or equal to one heartbeat period and less than two heartbeat periods. For example, if the heartbeat period is 5 s, the second node sends a heartbeat message to its neighbor nodes in parallel every 5 s. For the first heartbeat message sent by the second node at the 5th second, the first node determines whether that message is received within a time that is greater than or equal to 5 s and less than 10 s.
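  • The constraint on the preset time can be expressed as a short check. The sketch below is illustrative only: both function names are hypothetical, and measuring the elapsed time from the last heartbeat received is an assumption, since the description does not fix the reference point of the timer.

```python
def is_valid_preset_time(preset_time, heartbeat_period):
    """True if one heartbeat period <= preset_time < two heartbeat periods."""
    return heartbeat_period <= preset_time < 2 * heartbeat_period


def heartbeat_overdue(last_received_at, now, preset_time):
    """True if no heartbeat has arrived within preset_time seconds.

    Timing from the last received heartbeat is an assumption.
    """
    return now - last_received_at > preset_time


# With a 5 s heartbeat period, any preset time in [5 s, 10 s) is acceptable.
period = 5.0
preset = 7.5
valid = is_valid_preset_time(preset, period)
overdue = heartbeat_overdue(last_received_at=0.0, now=8.0, preset_time=preset)
```

  • Keeping the preset time below two heartbeat periods means a single missed heartbeat is enough to trigger the query to the other neighbor nodes.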
  • the heartbeat period can be set according to the experience or the actual situation. The specific value of the heartbeat period is not limited in this embodiment.
  • The second node may periodically send the first heartbeat message to the first node through one physical network. However, when fault detection is based on a single physical network and that network fails (for example, the management plane network fails while the service plane network is normal), it is often impossible to tell whether the second node has failed, whether the link between the second node and the first node has failed, or whether the second node and the first node have failed at the same time, so the fault detection result is inaccurate.
  • The first heartbeat message may be sent by using at least two networks.
  • For example, the first heartbeat message may be sent through two planes, such as the management plane and the service plane.
  • The first heartbeat message may also be sent through three planes, for example, the management plane, the service plane, and the signaling plane.
  • Sending the first heartbeat message over multiple physical networks to detect whether a node is faulty can improve the accuracy of detection. It should be noted that, if at least two physical networks are used, the at least two physical networks are isolated from each other. This avoids the situation in which multiple networks share some devices and the failure of a shared device prevents the nodes from communicating normally, which helps improve the accuracy of detection.
  • Step 703 In the case that the first node does not receive the first heartbeat message, the first node sends a request message to each of the other neighbor nodes, where the request message is used to query whether each other neighbor node received the first heartbeat message.
  • If the first node does not receive the first heartbeat message sent by the second node within the preset time, it may preliminarily determine that the second node may be faulty. Because the second node sends the first heartbeat message to all its neighbor nodes in parallel, the first node sends a request message to the neighbor nodes of the second node other than itself to ask whether those neighbor nodes received the first heartbeat message sent by the second node.
  • Step 704 The first node receives a response message that is sent by each other neighboring node and carries a receiving state, where the receiving state is used to indicate whether the first heartbeat packet is received.
  • After receiving the request message sent by the first node, each other neighbor node carries its receiving state for the first heartbeat message in a response message and sends the response message to the first node.
  • Step 705 If the first node determines, according to the receiving states carried in the received response messages, that none of the other neighbor nodes received the first heartbeat message, the first node determines that the second node has failed.
  • After receiving the request message sent by the first node, each other neighbor node returns a response message carrying its receiving state to the first node. According to the received response messages, the first node determines whether each other neighbor node received the first heartbeat message; if it determines that none of the other neighbor nodes received the first heartbeat message sent by the second node, it determines that the second node has failed.
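  • Steps 703 to 705 can be sketched as a small request and response exchange. The sketch is illustrative only: the `QueryResponse` type, both function names, and the `ask` callable are hypothetical, and the sketch assumes that every other neighbor node replies (the description does not say how missing replies are handled).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryResponse:
    responder: str  # identifier of the other neighbor node that replies
    received: bool  # receiving state: True if it got the first heartbeat message


def query_other_neighbors(other_neighbors, ask):
    """Steps 703 and 704: query each other neighbor and collect its response.

    ask(node_id) is a hypothetical transport callable that returns that
    node's receiving state.
    """
    return [QueryResponse(node, ask(node)) for node in other_neighbors]


def second_node_failed(responses):
    """Step 705: the second node is deemed failed if nobody got the heartbeat."""
    return bool(responses) and not any(r.received for r in responses)


# Simulated receiving states of the other neighbor nodes.
states = {"X": False, "D": False, "C": False, "G": False}
responses = query_other_neighbors(states, states.__getitem__)
failed = second_node_failed(responses)
```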
  • The second node sends the first heartbeat message to the first node and the other neighbor nodes in parallel, and the first node determines whether the first heartbeat message sent by the second node is received within the preset time, where the first node is a neighbor node of the second node, the first heartbeat message is a heartbeat message sent by the second node in parallel to each neighbor node of the second node, and the number of all neighbor nodes of the second node is two or more. The preset time is greater than or equal to one heartbeat period and less than two heartbeat periods. If the first node does not receive the first heartbeat message, the first node queries whether the other neighbor nodes of the second node received the first heartbeat message, and if it determines that none of the other neighbor nodes of the second node received the first heartbeat message, it determines that the second node has failed.
  • Because the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods, fault detection using the technical solution provided by the present invention avoids the prior-art situation in which multiple heartbeat periods are needed to detect whether a node has failed. This shortens the fault detection cycle and thereby improves the efficiency of node fault detection.
  • FIG. 8 is a schematic flowchart diagram of Embodiment 4 of a method for detecting a fault of a node in a cluster system according to the present invention.
  • the method in this embodiment may include:
  • Step 801 The first node generates first voting information, and receives second voting information sent by each other neighboring node.
  • The first voting information includes the node identifier corresponding to the node elected by the first node, and the second voting information includes the node identifier corresponding to the node elected by the neighbor node that sends the second voting information.
  • After the neighbor nodes of the second node determine that the second node has failed, all of these neighbor nodes need to recalculate their respective neighbor nodes.
  • any one of the neighbor nodes of the second node may be used as the first node, and the first node needs to generate the first voting information, where the first voting information includes the node identifier corresponding to the node elected by the first node and the voting basis.
  • the first node also receives the second voting information sent by each of the other neighboring nodes, where the second voting information includes the node identifier corresponding to the node elected by the neighboring node that sends the second voting information, and the voting basis.
  • The voting basis may relate to various factors, such as load, node number, node cache age, and node network bandwidth. For example, the first node can determine which node bears the least load, carry the node identifier of that node in the first voting information, and send the first voting information to the other neighbor nodes. Similarly, the other neighbor nodes can send the second voting information to the first node in a similar manner.
  • Step 802 The first node counts, according to the node identifier in the first voting information and the node identifier in the second voting information sent by each other neighbor node, the number of votes obtained by each of the elected nodes, and takes the node with the highest number of votes as the third node. The third node is the node that replaces the second node and sends heartbeat messages in parallel to all neighbor nodes of the third node; all neighbor nodes of the third node include the neighbor nodes of the third node itself and the neighbor nodes of the second node.
  • After receiving the second voting information sent by each other neighbor node, the first node may determine the third node according to the node identifier in the first voting information generated by itself and the node identifiers in the received second voting information.
  • Specifically, the first node may count the number of votes obtained by each of the elected nodes according to the node identifiers carried in the first voting information and the second voting information, and take the node with the most votes as the third node.
  • The third node takes over the neighbor nodes of the failed second node, that is, it takes over the relationships between the second node and other nodes. The third node therefore replaces the second node and sends heartbeat messages in parallel to all neighbor nodes of the third node, where all neighbor nodes of the third node include the neighbor nodes of the second node in addition to the third node's own neighbor nodes.
  • Step 803 The first node re-determines the neighbor node of the first node according to the neighbor node of the third node and the node other than the third node among the other neighbor nodes.
  • After all the neighbor nodes of the second node determine the third node by voting, if the first node is the third node, the first node takes over the neighbor relationships of the second node, and the other neighbor nodes may re-determine their respective neighbor nodes according to the neighbor relationships that result after the first node takes over the neighbor nodes of the second node.
  • If the first node is not the third node, the first node re-determines its neighbor nodes according to the neighbor nodes of the third node and the nodes other than the third node among the other neighbor nodes.
  • The second node sends the first heartbeat message to the first node and the other neighbor nodes in parallel, and the first node determines whether the first heartbeat message sent by the second node is received within the preset time, where the first node is a neighbor node of the second node, the first heartbeat message is a heartbeat message sent by the second node in parallel to each neighbor node of the second node, and the number of all neighbor nodes of the second node is two or more. The preset time is greater than or equal to one heartbeat period and less than two heartbeat periods. If the first node does not receive the first heartbeat message, the first node queries whether the other neighbor nodes of the second node received the first heartbeat message, and if it determines that none of the other neighbor nodes of the second node received the first heartbeat message, it determines that the second node has failed.
  • Because the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods, fault detection using the technical solution provided by the present invention avoids the prior-art situation in which multiple heartbeat periods are needed to detect whether a node has failed. This shortens the fault detection cycle and thereby improves the efficiency of node fault detection. In addition, after the second node is determined to have failed, the respective neighbor nodes are re-determined and fault detection then continues, which improves the accuracy of fault detection.
  • If the first node determines, according to the receiving state carried in the response message sent by each of the other neighbor nodes, that at least one other neighbor node received the first heartbeat message, the first node determines that the link between each node that did not receive the first heartbeat message and the second node has failed.
  • When the first node does not receive the first heartbeat message sent by the second node, it sends a request message to each other neighbor node to query whether that neighbor node received the first heartbeat message. If the response messages sent by the other neighbor nodes show that at least one other neighbor node received the first heartbeat message, the first node may determine that the second node is normal and that the fault may lie in the link between the second node and the first node and in the links between the second node and any other neighbor nodes that did not receive the first heartbeat message. The nodes that did not receive the first heartbeat message include the first node and those of the other neighbor nodes that did not receive the first heartbeat message.
  • the first node re-determines a neighbor node of the first node according to a neighboring node of the third node and a node other than the third node among the other neighboring nodes.
  • When the first node determines that at least one other neighbor node received the first heartbeat message, the first node determines that the link between each node that did not receive the first heartbeat message and the second node has failed, which makes the fault detection more comprehensive.
  • FIG. 9 is a schematic structural diagram of Embodiment 1 of a fault detecting apparatus for a node in a cluster system according to the present invention.
  • The fault detecting apparatus 10 of a node in a cluster system according to an embodiment of the present invention includes a determining module 11, a sending module 12, a receiving module 13, a determining module 14, and a generating module 15.
  • The determining module 11 is configured to determine whether the receiving module 13 receives, within the preset time, the first heartbeat message sent by the second node, where the first node is a neighbor node of the second node, the first heartbeat message is a heartbeat message sent by the second node in parallel to each neighbor node of the second node, the number of all neighbor nodes of the second node is two or more, and the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods.
  • In a case where the determining module 11 determines that the receiving module 13 has not received the first heartbeat message sent by the second node, the sending module 12 is configured to send a request message to the other neighbor nodes, that is, all neighbor nodes of the second node except the first node, where the request message is used to query whether the other neighbor nodes received the first heartbeat message. The receiving module 13 is further configured to receive a response message that is sent by each other neighbor node and that carries a receiving state, where the receiving state indicates whether the first heartbeat message was received.
  • The determining module 14 is configured to determine, according to the receiving state carried in the response message that is sent by each of the other neighbor nodes and received by the receiving module 13, whether the other neighbor nodes received the first heartbeat message. In a case where the determining module 14 determines that none of the other neighbor nodes received the first heartbeat message, the determining module 14 is further configured to determine that the second node has failed.
  • The determining module determines whether the receiving module receives, within a preset time, the first heartbeat message sent by the second node, where the first heartbeat message is a heartbeat message sent by the second node in parallel to each neighbor node of the second node, the number of all neighbor nodes of the second node is two or more, and the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods. If the receiving module does not receive the first heartbeat message, the sending module sends a request message to the other neighbor nodes of the second node to query whether they received the first heartbeat message, and if the determining module determines that none of the other neighbor nodes of the second node received the first heartbeat message, it determines that the second node has failed.
  • Because the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods, fault detection using the technical solution provided by the present invention avoids the prior-art situation in which multiple heartbeat periods are needed to detect whether a node has failed. This shortens the fault detection cycle and thereby improves the efficiency of node fault detection.
  • The generating module 15 is configured to generate first voting information, where the first voting information includes the node identifier corresponding to the node elected by the first node.
  • The receiving module 13 is further configured to receive second voting information sent by each of the other neighbor nodes, where the second voting information includes the node identifier corresponding to the node elected by the neighbor node that sends the second voting information.
  • The determining module 14 is further configured to count, according to the node identifier in the first voting information and the node identifier in the second voting information sent by each of the other neighbor nodes, the number of votes obtained by each of the elected nodes, and to take the node with the highest number of votes as the third node. The third node is the node that replaces the second node and sends heartbeat messages in parallel to all neighbor nodes of the third node; all neighbor nodes of the third node include the neighbor nodes of the third node itself and the neighbor nodes of the second node.
  • In a case where the determining module 14 determines, according to the receiving state carried in the response message that is sent by each of the other neighbor nodes and received by the receiving module 13, that at least one other neighbor node received the first heartbeat message, the determining module 14 is further configured to determine that the link between each node that did not receive the first heartbeat message and the second node has failed, where the nodes that did not receive the first heartbeat message include the first node and those of the other neighbor nodes that did not receive the first heartbeat message.
  • the determining module 14 is further configured to re-determine the neighbor node of the first node according to the neighboring node of the third node and the node other than the third node of the other neighboring nodes.
  • The fault detection apparatus of a node in a cluster system of this embodiment may be used to implement the technical solution of the fault detection method of a node in a cluster system provided by any embodiment of the present invention. The implementation principle and the technical effect are similar, and details are not described herein again.
  • The fault detection system 20 of a node in a cluster system according to an embodiment of the present invention includes a first node 21, a second node 22, and other neighbor nodes 23. The first node 21 is a neighbor node of the second node 22, the other neighbor nodes 23 are all neighbor nodes of the second node 22 except the first node 21, and the number of the other neighbor nodes is one or more.
  • The second node 22 is configured to send a first heartbeat message to the first node and the other neighbor nodes in parallel. The first node 21 is configured to determine whether the first heartbeat message is received within a preset time, where the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods.
  • In a case where the first node does not receive the first heartbeat message, the first node 21 is further configured to send a request message to each of the other neighbor nodes, where the request message is used to query whether each of the other neighbor nodes received the first heartbeat message.
  • The first node 21 is further configured to receive a response message that is sent by each of the other neighbor nodes and that carries a receiving state, where the receiving state indicates whether the first heartbeat message was received. In a case where it is determined, according to the receiving state carried in the response message sent by each of the other neighbor nodes, that none of the other neighbor nodes received the first heartbeat message, the first node 21 is further configured to determine that the second node has failed.
  • The determining module determines whether the receiving module receives, within a preset time, the first heartbeat message sent by the second node, where the first heartbeat message is a heartbeat message sent by the second node in parallel to each neighbor node of the second node, the number of all neighbor nodes of the second node is two or more, and the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods. If the receiving module does not receive the first heartbeat message, the sending module sends a request message to the other neighbor nodes of the second node to query whether they received the first heartbeat message, and if the determining module determines that none of the other neighbor nodes of the second node received the first heartbeat message, it determines that the second node has failed.
  • Because the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods, fault detection using the technical solution provided by the present invention avoids the prior-art situation in which multiple heartbeat periods are needed to detect whether a node has failed. This shortens the fault detection cycle and thereby improves the efficiency of node fault detection.
  • The first node 21 is further configured to:
  • generate first voting information, where the first voting information includes the node identifier corresponding to the node elected by the first node;
  • receive second voting information sent by each of the other neighbor nodes, where the second voting information includes the node identifier corresponding to the node elected by the neighbor node that sends the second voting information; and
  • determine the third node according to the node identifiers in the first voting information and the second voting information, where the third node is the node that replaces the second node and sends heartbeat messages in parallel to all neighbor nodes of the third node, and all neighbor nodes of the third node include the neighbor nodes of the third node itself and the neighbor nodes of the second node.
  • In a case where the first node determines, according to the receiving state carried in the response message sent by each of the other neighbor nodes, that at least one of the other neighbor nodes received the first heartbeat message, the first node 21 is further configured to determine that the link between each node that did not receive the first heartbeat message and the second node has failed, where the nodes that did not receive the first heartbeat message include the first node and those of the other neighbor nodes that did not receive the first heartbeat message.
  • The first node 21 is further configured to re-determine the neighbor nodes of the first node according to the neighbor nodes of the third node and the nodes other than the third node among the other neighbor nodes.
  • FIG. 11 is a schematic structural diagram of Embodiment 1 of a node according to the present invention.
  • The node 600 of this embodiment includes a processor 601, a user interface 603, a network interface 604, a memory 605, a transmitter 606, and a receiver 607.
  • The memory 605 can include an operating system 6051, an application 6052, and the like.
  • the processor 601 can be a Central Processing Unit (CPU).
  • Memory 605 is used to store executable instructions.
  • the processor 601 can execute executable instructions stored in the memory 605.
  • The receiver 607 is configured to receive the first heartbeat message sent by the second node, and the processor 601 is configured to determine whether the receiver 607 receives, within a preset time, the first heartbeat message sent by the second node, where the first heartbeat message is a heartbeat message sent by the second node in parallel to each neighbor node of the second node, the number of all neighbor nodes of the second node is two or more, and the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods.
  • If the processor 601 determines that the receiver 607 has not received the first heartbeat message sent by the second node, the transmitter 606 is configured to send a request message to the other neighbor nodes, that is, all neighbor nodes of the second node except the first node, where the request message is used to query whether the other neighbor nodes received the first heartbeat message, and the first node is a neighbor node of the second node. The receiver 607 is further configured to receive a response message that is sent by each other neighbor node and that carries a receiving state, where the receiving state indicates whether the first heartbeat message was received.
  • The processor 601 is further configured to determine, according to the receiving states carried in the response messages received by the receiver 607, that the second node has failed if none of the other neighbor nodes received the first heartbeat message.
  • the node provided in this embodiment may be used to perform the technical solution of the fault detection method of the node in the cluster system provided by any embodiment of the present invention.
  • the implementation principle and technical effects are similar, and details are not described herein again.
  • the processor 601 is further configured to generate first voting information, where the first voting information includes a node identifier corresponding to the node that is elected by the first node;
  • the receiver 607 is further configured to receive second voting information sent by each of the other neighboring nodes, where the second voting information includes a node identifier corresponding to a node that is elected by the neighboring node that sends the second voting information;
  • the processor 601 is further configured to count, according to the node identifier in the first voting information and the node identifier in the second voting information sent by each of the other neighbor nodes, the number of votes obtained by each of the elected nodes,
  • and to take the node with the highest number of votes as the third node;
  • the third node is a node that replaces the second node and sends a heartbeat message to all neighbor nodes of the third node in parallel; All neighbor nodes of the third node include the neighbor node of the third node itself and the neighbor node of the second node.
  • if the processor 601 determines, from the reception status carried in the response message that the receiver 607 received from each of the other neighbor nodes, that at least one of the other neighbor nodes received the first heartbeat message, the processor 601 is further configured to determine that the link between each node that did not receive the first heartbeat message and the second node is faulty; the nodes that did not receive the first heartbeat message include the first node and those of the other neighbor nodes that did not receive it.
  • the processor 601 is further configured to re-determine the neighbor node of the first node according to the neighboring node of the third node and the node other than the third node of the other neighboring nodes.
  • the node provided in this embodiment may be used to perform the technical solution of the fault detection method of the node in the cluster system provided by any embodiment of the present invention.
  • the implementation principle and technical effects are similar, and details are not described herein again.
  • the aforementioned program can be stored in a computer readable storage medium.
  • when executed, the program performs the steps of the foregoing method embodiments; the foregoing storage medium includes any medium that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Cardiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hardware Redundancy (AREA)
  • Measuring And Recording Apparatus For Diagnosis (AREA)

Abstract

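Embodiments of the present invention provide a method and an apparatus for detecting a node fault in a cluster system. The method includes: a first node determines whether it has received, within a preset time, a first heartbeat message sent by a second node, where the first node is a neighbor node of the second node and the first heartbeat message is a heartbeat message that the second node sends in parallel to each of its neighbor nodes; if the first node has not received the heartbeat message sent by the second node, it sends a request message to the other neighbor nodes, that is, all neighbor nodes of the second node except the first node; the first node receives response messages carrying a reception status from the other neighbor nodes; and if the first node determines from the reception statuses that none of the other neighbor nodes received the heartbeat message, the first node determines that the second node is faulty. The method and apparatus provided by the embodiments can improve the efficiency of node fault detection.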

Description

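Method and apparatus for detecting a node fault in a cluster system
This application claims priority to Chinese Patent Application No. 201510306800.0, filed with the Chinese Patent Office on June 5, 2015 and entitled "Method and apparatus for detecting a node fault in a cluster system" (集群系统中节点的故障检测方法和装置), which is incorporated herein by reference in its entirety.
Technical Field
Embodiments of the present invention relate to communications technologies, and in particular to a method and an apparatus for detecting a node fault in a cluster system.
Background
A distributed cluster system usually includes one central node and multiple ordinary nodes. A fault in the central node or in an ordinary node greatly affects the reliability of the distributed cluster system, so detecting node faults effectively is very important.
FIG. 1 is a schematic diagram of a node fault detection method in the prior art. As shown in FIG. 1, the ordinary nodes (B, C, D, E) send heartbeat messages to the central node (M) according to a heartbeat period, and the central node (M) detects whether an ordinary node is faulty based on the consecutive heartbeat messages received within a detection period, where one detection period may contain multiple heartbeat periods. The central node (M) may also periodically send heartbeat messages to the ordinary nodes (B, C, D, E) to inform them of the role it holds and whether it is in a normal state. If the ordinary nodes (B, C, D, E) receive no heartbeat message from the central node (M) within the detection period, they determine that the central node (M) is faulty and initiate re-election of the central node. If the election succeeds, the ordinary nodes learn of the new central node and send their heartbeat messages to it, and the cluster resumes fault detection.
In the prior art, however, when a node fault is detected by checking whether heartbeat messages are received within a detection period, the heartbeat period cannot be changed for a cluster of fixed size, so the length of the detection period cannot be changed either. A node fault can therefore be detected only after multiple heartbeat periods, which lengthens the detection cycle and makes node fault detection inefficient.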
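Summary
Embodiments of the present invention provide a method and an apparatus for detecting a node fault in a cluster system, to solve the prior-art problem that a node fault can be detected only after multiple heartbeat periods, which makes the detection cycle long, thereby improving the efficiency of node fault detection.
According to a first aspect, an embodiment of the present invention provides a method for detecting a node fault in a cluster system, including:
a first node determines whether it has received, within a preset time, a first heartbeat message sent by a second node, where the first node is a neighbor node of the second node, the first heartbeat message is a heartbeat message that the second node sends in parallel to each neighbor node of the second node, the second node has two or more neighbor nodes in total, and the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods;
if the first node has not received the first heartbeat message sent by the second node, the first node sends a request message to the other neighbor nodes, that is, all neighbor nodes of the second node except the first node, where the request message is used to ask whether the other neighbor nodes have received the first heartbeat message;
the first node receives, from the other neighbor nodes, response messages carrying a reception status, where the reception status indicates whether the first heartbeat message was received; and
if the first node determines, from the reception status carried in the response message received from each of the other neighbor nodes, that none of the other neighbor nodes received the first heartbeat message, the first node determines that the second node is faulty.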
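With reference to the first aspect, in a first possible implementation of the first aspect, after the first node determines that the second node is faulty, the method further includes:
the first node generates first voting information and receives second voting information sent by each of the other neighbor nodes, where the first voting information includes the node identifier of the node elected by the first node, and the second voting information includes the node identifier of the node elected by the neighbor node that sent the second voting information; and
the first node counts, according to the node identifier in the first voting information and the node identifier in the second voting information sent by each of the other neighbor nodes, the number of votes obtained by each elected node, and takes the node with the most votes as a third node, where the third node replaces the second node and sends heartbeat messages in parallel to all neighbor nodes of the third node, and all neighbor nodes of the third node include the third node's own neighbor nodes and the neighbor nodes of the second node.
With reference to the first aspect or the first possible implementation of the first aspect, in a second possible implementation of the first aspect, the method further includes:
if the first node determines, from the reception status carried in the response message received from each of the other neighbor nodes, that at least one of the other neighbor nodes received the first heartbeat message, the first node determines that the link between each node that did not receive the first heartbeat message and the second node is faulty, where the nodes that did not receive the first heartbeat message include the first node and those of the other neighbor nodes that did not receive it.
With reference to the first aspect or either of the first and second possible implementations of the first aspect, in a third possible implementation of the first aspect, the method further includes:
the first node re-determines its neighbor nodes according to the neighbor nodes of the third node and the nodes, other than the third node, among the other neighbor nodes.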
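According to a second aspect, an embodiment of the present invention provides a method for detecting a node fault in a cluster system, the method including:
a second node sends a first heartbeat message in parallel to a first node and to other neighbor nodes, where the first node is a neighbor node of the second node, the other neighbor nodes are all neighbor nodes of the second node except the first node, and there are one or more other neighbor nodes;
the first node determines whether it has received the first heartbeat message within a preset time, where the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods;
if the first node has not received the first heartbeat message, the first node sends a request message to each of the other neighbor nodes, where the request message is used to ask each of the other neighbor nodes whether it has received the first heartbeat message;
the first node receives, from each of the other neighbor nodes, a response message carrying a reception status, where the reception status indicates whether the first heartbeat message was received; and
if the first node determines, from the reception status carried in the received response messages, that none of the other neighbor nodes received the first heartbeat message, the first node determines that the second node is faulty.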
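With reference to the second aspect, in a first possible implementation of the second aspect, after the first node determines that the second node is faulty, the method further includes:
the first node generates first voting information and receives second voting information sent by each of the other neighbor nodes, where the first voting information includes the node identifier of the node elected by the first node, and the second voting information includes the node identifier of the node elected by the neighbor node that sent the second voting information; and
the first node counts, according to the node identifier in the first voting information and the node identifier in the second voting information sent by each of the other neighbor nodes, the number of votes obtained by each elected node, and takes the node with the most votes as a third node, where the third node replaces the second node and sends heartbeat messages in parallel to all neighbor nodes of the third node, and all neighbor nodes of the third node include the third node's own neighbor nodes and the neighbor nodes of the second node.
With reference to the second aspect or the first possible implementation of the second aspect, in a second possible implementation of the second aspect, the method further includes:
if the first node determines, from the reception status carried in the response message received from each of the other neighbor nodes, that at least one of the other neighbor nodes received the first heartbeat message, the first node determines that the link between each node that did not receive the first heartbeat message and the second node is faulty, where the nodes that did not receive the first heartbeat message include the first node and those of the other neighbor nodes that did not receive it.
With reference to the second aspect or either of the first and second possible implementations of the second aspect, in a third possible implementation of the second aspect, the method further includes:
the first node re-determines its neighbor nodes according to the neighbor nodes of the third node and the nodes, other than the third node, among the other neighbor nodes.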
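According to a third aspect, an embodiment of the present invention provides an apparatus for detecting a node fault in a cluster system, including:
a judging module, configured to determine whether a first heartbeat message sent by a second node is received within a preset time, where the first node is a neighbor node of the second node, the first heartbeat message is a heartbeat message that the second node sends in parallel to each neighbor node of the second node, the second node has two or more neighbor nodes in total, and the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods;
where, if the judging module determines that a receiving module has not received the first heartbeat message sent by the second node,
a sending module is configured to send a request message to the other neighbor nodes, that is, all neighbor nodes of the second node except the first node, where the request message is used to ask whether the other neighbor nodes have received the first heartbeat message;
the receiving module is further configured to receive, from the other neighbor nodes, response messages carrying a reception status, where the reception status indicates whether the first heartbeat message was received;
a determining module is configured to determine, from the reception status carried in the response message that the receiving module received from each of the other neighbor nodes, whether none of the other neighbor nodes received the first heartbeat message; and
if the determining module determines that none of the other neighbor nodes received the first heartbeat message, the determining module is further configured to determine that the second node is faulty.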
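With reference to the third aspect, in a first possible implementation of the third aspect, the apparatus further includes the following for use after the determining module determines that the second node is faulty:
a generating module, configured to generate first voting information, where the first voting information includes the node identifier of the node elected by the first node;
the receiving module is further configured to receive second voting information sent by each of the other neighbor nodes, where the second voting information includes the node identifier of the node elected by the neighbor node that sent the second voting information; and
the determining module is further configured to count, according to the node identifier in the first voting information and the node identifier in the second voting information sent by each of the other neighbor nodes, the number of votes obtained by each elected node, and to take the node with the most votes as a third node, where the third node replaces the second node and sends heartbeat messages in parallel to all neighbor nodes of the third node, and all neighbor nodes of the third node include the third node's own neighbor nodes and the neighbor nodes of the second node.
With reference to the third aspect or the first possible implementation of the third aspect, in a second possible implementation of the third aspect,
if the determining module determines, from the reception status carried in the response message that the receiving module received from each of the other neighbor nodes, that at least one of the other neighbor nodes received the first heartbeat message,
the determining module is further configured to determine that the link between each node that did not receive the first heartbeat message and the second node is faulty, where the nodes that did not receive the first heartbeat message include the first node and those of the other neighbor nodes that did not receive it.
With reference to the third aspect or either of the first and second possible implementations of the third aspect, in a third possible implementation of the third aspect,
the determining module is further configured to re-determine the neighbor nodes of the first node according to the neighbor nodes of the third node and the nodes, other than the third node, among the other neighbor nodes.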
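According to a fourth aspect, an embodiment of the present invention provides a system for detecting a node fault in a cluster system, including a first node, a second node, and other neighbor nodes, where the first node is a neighbor node of the second node, the other neighbor nodes are all neighbor nodes of the second node except the first node, and there are one or more other neighbor nodes, where:
the second node is configured to send a first heartbeat message in parallel to the first node and the other neighbor nodes;
the first node is configured to determine whether it has received the first heartbeat message within a preset time, where the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods;
if the first node has not received the first heartbeat message, the first node is further configured to send a request message to each of the other neighbor nodes, where the request message is used to ask each of the other neighbor nodes whether it has received the first heartbeat message; and the first node is further configured to receive, from each of the other neighbor nodes, a response message carrying a reception status, where the reception status indicates whether the first heartbeat message was received; and
if the first node determines, from the reception status carried in the response message received from each of the other neighbor nodes, that none of the other neighbor nodes received the first heartbeat message, the first node is further configured to determine that the second node is faulty.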
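With reference to the fourth aspect, in a first possible implementation of the fourth aspect, after the first node determines that the second node is faulty,
the first node is further configured to:
generate first voting information and receive second voting information sent by each of the other neighbor nodes, where the first voting information includes the node identifier of the node elected by the first node, and the second voting information includes the node identifier of the node elected by the neighbor node that sent the second voting information;
and count, according to the node identifier in the first voting information and the node identifier in the second voting information sent by each of the other neighbor nodes, the number of votes obtained by each elected node, and take the node with the most votes as a third node, where the third node replaces the second node and sends heartbeat messages in parallel to all neighbor nodes of the third node, and all neighbor nodes of the third node include the third node's own neighbor nodes and the neighbor nodes of the second node.
With reference to the fourth aspect or the first possible implementation of the fourth aspect, in a second possible implementation of the fourth aspect,
if the first node determines, from the reception status carried in the response message received from each of the other neighbor nodes, that at least one of the other neighbor nodes received the first heartbeat message,
the first node is further configured to determine that the link between each node that did not receive the first heartbeat message and the second node is faulty, where the nodes that did not receive the first heartbeat message include the first node and those of the other neighbor nodes that did not receive it.
With reference to the fourth aspect or either of the first and second possible implementations of the fourth aspect, in a third possible implementation of the fourth aspect,
the first node is further configured to re-determine its neighbor nodes according to the neighbor nodes of the third node and the nodes, other than the third node, among the other neighbor nodes.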
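In the method and apparatus for detecting a node fault in a cluster system provided by the embodiments of the present invention, the first node determines whether it has received, within a preset time, the first heartbeat message sent by the second node, where the first node is a neighbor node of the second node, the first heartbeat message is a heartbeat message that the second node sends in parallel to each of its neighbor nodes, the second node has two or more neighbor nodes in total, and the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods. If the first node itself has not received the first heartbeat message, it asks the other neighbor nodes of the second node whether they received it, and if it determines that none of them did, it determines that the second node is faulty. Because the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods, fault detection with the technical solution of the present invention no longer needs multiple heartbeat periods to detect whether a node is faulty, as the prior art does. This shortens the fault detection cycle and improves the efficiency of node fault detection.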
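Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the following briefly introduces the accompanying drawings needed to describe the embodiments or the prior art. The drawings described below show some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of a node fault detection method in a cluster system in the prior art;
FIG. 2 is a schematic flowchart of Embodiment 1 of the method for detecting a node fault in a cluster system according to the present invention;
FIG. 3 is a first schematic diagram of adjacency between nodes in a cluster system;
FIG. 4 is a second schematic diagram of adjacency between nodes in a cluster system;
FIG. 5 is a schematic flowchart of Embodiment 2 of the method for detecting a node fault in a cluster system according to the present invention;
FIG. 6A is a schematic diagram of adjacency between nodes before a node fault is detected in the cluster system;
FIG. 6B is a schematic diagram of the re-determined adjacency between nodes after a node fault is detected in the cluster system;
FIG. 7 is a schematic flowchart of Embodiment 3 of the method for detecting a node fault in a cluster system according to the present invention;
FIG. 8 is a schematic flowchart of Embodiment 4 of the method for detecting a node fault in a cluster system according to the present invention;
FIG. 9 is a schematic structural diagram of Embodiment 1 of the apparatus for detecting a node fault in a cluster system according to the present invention;
FIG. 10 is a schematic structural diagram of Embodiment 1 of the system for detecting a node fault in a cluster system according to the present invention;
FIG. 11 is a schematic structural diagram of Embodiment 1 of a node according to the present invention.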
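Detailed Description
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the following describes the technical solutions in the embodiments clearly and completely with reference to the accompanying drawings. The described embodiments are some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
The embodiments of the present invention apply to cluster systems, and specifically to the scenario of node fault detection in a distributed cluster system. The distributed cluster system includes at least two nodes; a node may be, for example, a computer. Optionally, the cluster system of this embodiment differs from existing cluster systems in that all nodes are given the same function, that is, all nodes have the same capability to receive and send heartbeat messages. The cluster system of this embodiment therefore makes no distinction between a central node and ordinary nodes, and no central node is needed to manage ordinary nodes. Optionally, the technical solutions of the following embodiments are all described with a computer as the executing entity.
FIG. 2 is a schematic flowchart of Embodiment 1 of the method for detecting a node fault in a cluster system according to the present invention. The method in this embodiment applies to a distributed cluster system and is described with a computer as the executing entity. As shown in FIG. 2, the method of this embodiment may include: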
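Step 201: A first node determines whether it has received, within a preset time, a first heartbeat message sent by a second node. The first node is a neighbor node of the second node, the first heartbeat message is a heartbeat message that the second node sends in parallel to each neighbor node of the second node, and the second node has two or more neighbor nodes in total. The preset time is greater than or equal to one heartbeat period and less than two heartbeat periods.
In this embodiment, the second node determines the first node according to the information about all nodes in the cluster system and a rule preset in the cluster system. The first node is any one of the second node's neighbor nodes, and a neighbor node of the second node is a node that has an association with the second node. FIG. 3 is a first schematic diagram of adjacency between nodes in a cluster system. As shown in FIG. 3, node E can determine, according to the information about all nodes and the preset rule, that it has four neighbor nodes: A, B, C, and D. The first node may be any one of A, B, C, and D. The first node detects whether the second node is faulty by determining whether it has received, within the preset time, the first heartbeat message sent by the second node. Note that the second node sends heartbeat messages to all of its neighbor nodes in parallel, so the first heartbeat message is a heartbeat message that the second node sends to each of its neighbor nodes in parallel at the same moment. In addition, the second node may send the first heartbeat message to all of its neighbor nodes in parallel according to the heartbeat period, so the first node can determine whether it has received the first heartbeat message within a time that is greater than or equal to one heartbeat period and less than two heartbeat periods. For example, assume a heartbeat period of 5 s, that is, the second node sends a heartbeat message to all of its neighbor nodes in parallel every 5 s. For the first heartbeat message that the second node sends at the 5 s mark, the first node determines whether it receives that message within a time greater than or equal to 5 s and less than 10 s. The heartbeat period may be set based on experience or actual conditions; this embodiment does not limit its specific value.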
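The timing rule of step 201 can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the patent's implementation: the function names `validate_preset_time` and `heartbeat_missed` are hypothetical, and measuring the wait from the arrival of the previous heartbeat is an assumption made for the example.

```python
def validate_preset_time(heartbeat_period: float, preset_time: float) -> None:
    """Check that one period <= preset time < two periods, as step 201 requires."""
    if heartbeat_period <= 0:
        raise ValueError("heartbeat_period must be positive")
    if not (heartbeat_period <= preset_time < 2 * heartbeat_period):
        raise ValueError("preset_time must lie in [period, 2 * period)")


def heartbeat_missed(last_received_at: float, now: float,
                     heartbeat_period: float, preset_time: float) -> bool:
    """Return True once the preset time has elapsed without a new heartbeat.

    Assumption: last_received_at is when the previous heartbeat arrived,
    and both timestamps are seconds on the same clock.
    """
    validate_preset_time(heartbeat_period, preset_time)
    return now - last_received_at >= preset_time
```

With a 5 s heartbeat period and a preset time of 6 s, a check made 4 s after the last heartbeat reports no miss, while a check made 6.5 s after it does. A preset time of 10 s is rejected because it is not less than two periods.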
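In addition, the second node may periodically send the first heartbeat message to the first node over a single physical network. With fault detection based on a single physical network, however, when a network fails (for example, the management plane network fails while the service plane network is normal), it is often impossible to tell whether the second node is faulty, whether the link between the second node and the first node is faulty, or whether the second node and the first node are both faulty, so the detection result is inaccurate. To solve this problem, the first heartbeat message is preferably sent over at least two networks in this embodiment. For example, it may be sent over two planes, such as the management plane and the service plane, or over three planes, such as the management plane, the service plane, and the signaling plane. Sending the first heartbeat message over multiple physical networks to detect whether a node is faulty can improve detection accuracy. Note that when there are at least two physical networks, they are isolated from one another. This avoids the situation in which the networks share some device and a fault in the shared device prevents the nodes from communicating normally, which helps improve detection accuracy.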
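One way to use several isolated planes is to record, per plane, whether the heartbeat arrived and to treat the heartbeat as missed only when no plane delivered it. The source does not spell out how results from different planes are combined, so the rule below and the name `classify_planes` are assumptions for illustration only.

```python
from typing import Dict


def classify_planes(received: Dict[str, bool]) -> str:
    """Combine per-plane reception flags into one observation.

    Assumed rule (not stated in the source):
      - heartbeat seen on every plane -> "ok"
      - seen on some planes only      -> "plane_fault" (network problem, sender alive)
      - seen on no plane              -> "missed" (go on to query the other neighbors)
    """
    if not received:
        raise ValueError("at least one plane is required")
    if all(received.values()):
        return "ok"
    if any(received.values()):
        return "plane_fault"
    return "missed"
```

Under this rule, a heartbeat that arrives on the service plane but not on the management plane points at the management network rather than at the sending node.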
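Step 202: If the first node has not received the first heartbeat message sent by the second node, the first node sends a request message to the other neighbor nodes, that is, all neighbor nodes of the second node except the first node. The request message is used to ask whether the other neighbor nodes have received the first heartbeat message.
In the prior art, when the period at which ordinary nodes send heartbeats to the central node is fixed, the performance limits of the central node mean that ordinary nodes cannot be added to the cluster system without bound, which limits the scalability of the cluster system. To address this, in this embodiment, if the first node does not receive the first heartbeat message from the second node within the preset time, it can preliminarily conclude that the second node may be faulty. Because the second node sends the first heartbeat message to all of its neighbor nodes in parallel, the first node sends a request message to the second node's neighbor nodes other than itself to ask whether they received the first heartbeat message sent by the second node. Thus, when the first node does not receive the first heartbeat message from the second node, it can send request messages to the second node's other neighbor nodes. Moreover, nodes that are not neighbors of the second node no longer send heartbeat messages to the second node. This reduces the number of heartbeat messages the second node has to process, lightens its load, and gives the cluster system good scalability.
For example, FIG. 4 is a second schematic diagram of adjacency between nodes in a cluster system. As shown in FIG. 4, the neighbor nodes of node E are X, A, D, C, and G, and node E sends a heartbeat message to all of them in every heartbeat period. Assume node E is the second node and node A is the first node. If, in some heartbeat period, first node A does not receive the first heartbeat message sent by second node E, first node A sends request messages to the other neighbor nodes X, D, C, and G to ask whether they received the first heartbeat message.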
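The FIG. 4 example can be written out as a short sketch. The message layout (a dictionary with `type`, `from`, `to`, and `about` fields) and the function name `build_requests` are hypothetical; the source only says that the request asks whether the first heartbeat message was received.

```python
from typing import Dict, List, Set


def build_requests(first_node: str, second_node: str,
                   neighbors_of_second: Set[str]) -> List[Dict[str, str]]:
    """Build one request per neighbor of the second node, excluding the first node."""
    if first_node not in neighbors_of_second:
        raise ValueError("the first node must be a neighbor of the second node")
    others = sorted(neighbors_of_second - {first_node})
    return [
        {"type": "heartbeat_query", "from": first_node, "to": peer, "about": second_node}
        for peer in others
    ]


# FIG. 4: node E has neighbors X, A, D, C and G; node A missed E's heartbeat.
requests = build_requests("A", "E", {"X", "A", "D", "C", "G"})
print([r["to"] for r in requests])  # ['C', 'D', 'G', 'X']
```

Sorting the recipients is only there to make the output deterministic; the order in which requests are sent is not constrained by the text.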
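Step 203: The first node receives, from the other neighbor nodes, response messages carrying a reception status, where the reception status indicates whether the first heartbeat message was received.
In this embodiment, after receiving the request message sent by the first node, each of the other neighbor nodes places its reception status, that is, whether it received the first heartbeat message, in a response message and sends it to the first node.
Step 204: If the first node determines, from the reception status carried in the response message received from each of the other neighbor nodes, that none of the other neighbor nodes received the first heartbeat message, the first node determines that the second node is faulty.
In this embodiment, every other neighbor node returns a response message carrying its reception status after receiving the request message from the first node. Based on these response messages, the first node determines whether the other neighbor nodes received the first heartbeat message. If none of them received the first heartbeat message sent by the second node, the first node can determine that the second node is faulty.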
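The decision in step 204 reduces to a single check over the collected reception statuses. The sketch below assumes that a response has arrived from every other neighbor node; the function name `second_node_is_faulty` is hypothetical, and lost or late responses are not handled.

```python
from typing import Dict


def second_node_is_faulty(reception_status: Dict[str, bool]) -> bool:
    """Step 204: the second node is faulty if no other neighbor received the heartbeat.

    reception_status maps each other neighbor's identifier to True if that
    neighbor reported receiving the first heartbeat message.
    """
    if not reception_status:
        raise ValueError("at least one response is required")
    return not any(reception_status.values())
```

For the FIG. 4 topology, responses of `False` from X, D, C, and G lead node A to conclude that node E is faulty, whereas a single `True` prevents that conclusion.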
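Note that adjacency between nodes is bidirectional, that is, nodes in a neighbor relationship can send heartbeat messages to each other. All neighbor nodes of the second node therefore perform steps 201 to 204 independently.
In the method for detecting a node fault in a cluster system provided by this embodiment of the present invention, the first node determines whether it has received, within a preset time, the first heartbeat message sent by the second node, where the first node is a neighbor node of the second node, the first heartbeat message is a heartbeat message that the second node sends in parallel to each of its neighbor nodes, the second node has two or more neighbor nodes in total, and the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods. If the first node itself has not received the first heartbeat message, it asks the other neighbor nodes of the second node whether they received it, and if it determines that none of them did, it determines that the second node is faulty. Because the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods, fault detection with the technical solution of the present invention no longer needs multiple heartbeat periods to detect whether a node is faulty, as the prior art does. This shortens the fault detection cycle and improves the efficiency of node fault detection.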
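FIG. 5 is a schematic flowchart of Embodiment 2 of the method for detecting a node fault in a cluster system according to the present invention. Building on the embodiment shown in FIG. 2, this embodiment describes in detail how the nodes re-determine their neighbor nodes after the first node determines that the second node is faulty. As shown in FIG. 5, the method of this embodiment may include:
Step 501: The first node generates first voting information and receives second voting information sent by each of the other neighbor nodes. The first voting information includes the node identifier of the node elected by the first node, and the second voting information includes the node identifier of the node elected by the neighbor node that sent the second voting information.
In this embodiment, after the neighbor nodes of the second node determine that the second node is faulty, all of them need to recompute their respective neighbor nodes. For ease of description, any neighbor node of the second node can be taken as the first node. The first node generates first voting information, which contains the node identifier of the node elected by the first node and the basis for the vote. The first node also receives second voting information from each of the other neighbor nodes, which contains the node identifier of the node elected by the sending neighbor node and the basis for the vote. In practice, the basis for a vote depends on several factors, such as load, node number, how recent the node's cache is, and the node's network bandwidth. For example, the first node may determine which node bears the lightest load and send the node identifier of that node to the other neighbor nodes in the first voting information. The other neighbor nodes can likewise send their second voting information to the first node in a similar way.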
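A vote based on the lightest load, the example given above, might look like the following sketch. The function name `cast_vote`, the load values, and the tie-break by smaller node identifier are assumptions; the source lists load as one possible basis among several and does not say how ties are resolved.

```python
from typing import Dict


def cast_vote(loads: Dict[str, float]) -> str:
    """Return the identifier of the candidate with the lightest load.

    Assumption: ties are broken by the smaller node identifier so that
    every voter with the same view of the loads elects the same node.
    """
    if not loads:
        raise ValueError("no candidates to vote for")
    return min(loads, key=lambda node: (loads[node], node))
```

For instance, with loads `{"A": 0.7, "C": 0.2, "D": 0.2}` the vote goes to node C, because C and D share the lightest load and C has the smaller identifier.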
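Step 502: The first node counts, according to the node identifier in the first voting information and the node identifier in the second voting information sent by each of the other neighbor nodes, the number of votes obtained by each elected node, and takes the node with the most votes as a third node. The third node replaces the second node and sends heartbeat messages in parallel to all neighbor nodes of the third node. All neighbor nodes of the third node include the third node's own neighbor nodes and the neighbor nodes of the second node.
In this embodiment, after receiving the second voting information from each of the other neighbor nodes, the first node can determine the third node from the node identifier in the first voting information that it generated and the node identifiers in the received second voting information. Specifically, the votes are tallied by node identifier across the first and second voting information, the number of votes obtained by each elected node is counted, and the node with the most votes is taken as the third node. The third node takes over the neighbor nodes of the faulty second node, that is, it takes over the associations between the second node and other nodes. The third node therefore replaces the second node and sends heartbeat messages in parallel to all neighbor nodes of the third node, which include both the third node's own neighbor nodes and the neighbor nodes of the second node.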
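The tally in step 502 and the resulting neighbor set can be sketched as below. The helper names `elect_third_node` and `takeover_neighbors` are hypothetical. Two details are assumptions that the source does not state: ties are broken by the smaller node identifier, and the third node and the faulty second node are removed from the merged neighbor set.

```python
from collections import Counter
from typing import Dict, Set


def elect_third_node(first_vote: str, second_votes: Dict[str, str]) -> str:
    """Count the votes and return the node identifier with the most votes.

    first_vote is the node elected by the first node; second_votes maps each
    other neighbor to the node it elected. Ties go to the smaller identifier
    (an assumption).
    """
    tally = Counter([first_vote, *second_votes.values()])
    best = max(tally.values())
    return min(node for node, count in tally.items() if count == best)


def takeover_neighbors(third: str, third_neighbors: Set[str],
                       failed: str, failed_neighbors: Set[str]) -> Set[str]:
    """Merge the third node's own neighbors with those of the faulty second node."""
    return (set(third_neighbors) | set(failed_neighbors)) - {third, failed}
```

With the FIG. 6A nodes, if A votes for A and the other neighbors vote X→A, D→C, C→A, G→C, then A wins three votes to two. If A previously had neighbors E and B, it afterwards heartbeats to B, X, D, C, and G.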
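Step 503: The first node re-determines its neighbor nodes according to the neighbor nodes of the third node and the nodes, other than the third node, among the other neighbor nodes.
In this embodiment, after all neighbor nodes of the second node have determined the third node by voting, if the first node is the third node, the first node takes over the second node's adjacency, and the other neighbor nodes can recompute their respective neighbor nodes according to the adjacency that results after the first node takes over the second node's neighbor nodes. If the first node is not the third node, the first node waits until the third node has re-determined its adjacency and then re-determines its own neighbor nodes according to the neighbor nodes of the third node and the nodes, other than the third node, among the other neighbor nodes.
For example, FIG. 6A is a schematic diagram of adjacency between nodes before a node fault is detected in the cluster system, and FIG. 6B is a schematic diagram of the re-determined adjacency after the node fault is detected. As shown in FIG. 6A, assume node E is the second node and node A is the first node. After first node A determines that second node E is faulty, it generates first voting information and receives second voting information from nodes X, D, C, and G. First node A determines the third node from the node identifiers in the first and second voting information, so that the third node replaces the second node and sends heartbeat messages in parallel to all neighbor nodes of the third node. As shown in FIG. 6B, if the vote determines that first node A is the third node, first node A replaces the second node and sends heartbeat messages in parallel to all of its neighbor nodes. First node A then needs to re-determine its own neighbor nodes by way of the other neighbor nodes X, D, C, and G, and nodes X, D, C, and G wait until first node A has determined its neighbor nodes and then re-determine their own neighbor nodes according to the neighbor nodes determined by first node A.
In the method for detecting a node fault in a cluster system provided by this embodiment of the present invention, the first node determines whether it has received, within a preset time, the first heartbeat message sent by the second node, where the first node is a neighbor node of the second node, the first heartbeat message is a heartbeat message that the second node sends in parallel to each of its neighbor nodes, the second node has two or more neighbor nodes in total, and the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods. If the first node itself has not received the first heartbeat message, it asks the other neighbor nodes of the second node whether they received it, and if it determines that none of them did, it determines that the second node is faulty. Because the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods, fault detection with the technical solution of the present invention no longer needs multiple heartbeat periods to detect whether a node is faulty, as the prior art does. This shortens the fault detection cycle and improves the efficiency of node fault detection. In addition, after the second node is determined to be faulty, the nodes re-determine their respective neighbor nodes and then continue fault detection, which improves the accuracy of fault detection.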
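Optionally, if the first node determines, from the reception status carried in the response message received from each of the other neighbor nodes, that at least one of the other neighbor nodes received the first heartbeat message, the first node determines that the link between each node that did not receive the first heartbeat message and the second node is faulty.
Specifically, after the first node fails to receive the first heartbeat message sent by the second node and sends a request message to each of the other neighbor nodes to ask whether it received the first heartbeat message, if the response messages show that at least one of the other neighbor nodes did receive the first heartbeat message, the first node can determine that the second node is normal and that the fault probably lies in the link between the second node and the first node and in the links between the second node and the other nodes that did not receive the first heartbeat message. The nodes that did not receive the first heartbeat message include the first node and those of the other neighbor nodes that did not receive it.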
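The two outcomes, node fault and link fault, can be separated with one function. The sketch assumes that the first node itself missed the heartbeat and that every other neighbor has responded; the name `classify_fault` and the string labels are hypothetical.

```python
from typing import Dict, List, Tuple


def classify_fault(first_node: str,
                   reception_status: Dict[str, bool]) -> Tuple[str, List[str]]:
    """Classify the fault once the first node has missed the heartbeat.

    Returns ("node_fault", []) if no other neighbor received the heartbeat.
    Otherwise returns ("link_fault", nodes), where nodes lists every node whose
    link to the second node is suspected: the first node plus each other
    neighbor that reported not receiving the heartbeat.
    """
    if not reception_status:
        raise ValueError("at least one response is required")
    if not any(reception_status.values()):
        return "node_fault", []
    missed = [first_node] + sorted(
        node for node, got in reception_status.items() if not got
    )
    return "link_fault", missed
```

For example, if node A missed the heartbeat and only node D reports the same, the result is a link fault on the links from the second node to A and to D.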
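In the method for detecting a node fault in a cluster system provided by this embodiment of the present invention, when the first node determines that at least one of the other neighbor nodes received the first heartbeat message, it determines that the link between each node that did not receive the first heartbeat message and the second node is faulty, which makes fault detection more comprehensive.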
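FIG. 7 is a schematic flowchart of Embodiment 3 of the method for detecting a node fault in a cluster system according to the present invention. The method in this embodiment applies to a distributed cluster system and is again described with a computer as the executing entity. As shown in FIG. 7, the method of this embodiment may include:
Step 701: A second node sends a first heartbeat message in parallel to a first node and to other neighbor nodes. The first node is a neighbor node of the second node, the other neighbor nodes are all neighbor nodes of the second node except the first node, and there are one or more other neighbor nodes.
In this embodiment, the second node can determine all of its neighbor nodes according to the information about the nodes in the cluster system and a rule preset in the cluster system. The first node is any one of the second node's neighbor nodes, and a neighbor node of the second node is a node that has an association with the second node. After determining all of its neighbor nodes, the second node sends the first heartbeat message in parallel to the first node and the other neighbor nodes.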
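Step 702: The first node determines whether it has received the first heartbeat message within a preset time. The preset time is greater than or equal to one heartbeat period and less than two heartbeat periods.
In this embodiment, the second node may send the first heartbeat message to all of its neighbor nodes in parallel according to the heartbeat period, so the first node can determine whether it has received the first heartbeat message within a time that is greater than or equal to one heartbeat period and less than two heartbeat periods. For example, assume a heartbeat period of 5 s, that is, the second node sends a heartbeat message to its neighbor nodes in parallel every 5 s. For the first heartbeat message that the second node sends at the 5 s mark, the first node determines whether it receives that message within a time greater than or equal to 5 s and less than 10 s. The heartbeat period may be set based on experience or actual conditions; this embodiment does not limit its specific value.
In addition, the second node may periodically send the first heartbeat message to the first node over a single physical network. With fault detection based on a single physical network, however, when a network fails (for example, the management plane network fails while the service plane network is normal), it is often impossible to tell whether the second node is faulty, whether the link between the second node and the first node is faulty, or whether the second node and the first node are both faulty, so the detection result is inaccurate. To solve this problem, the first heartbeat message is preferably sent over at least two networks in this embodiment. For example, it may be sent over two planes, such as the management plane and the service plane, or over three planes, such as the management plane, the service plane, and the signaling plane. Sending the first heartbeat message over multiple physical networks to detect whether a node is faulty can improve detection accuracy. Note that when there are at least two physical networks, they are isolated from one another. This avoids the situation in which the networks share some device and a fault in the shared device prevents the nodes from communicating normally, which helps improve detection accuracy.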
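Step 703: If the first node has not received the first heartbeat message, the first node sends a request message to each of the other neighbor nodes. The request message is used to ask each of the other neighbor nodes whether it has received the first heartbeat message.
In this embodiment, if the first node does not receive the first heartbeat message from the second node within the preset time, it can preliminarily conclude that the second node may be faulty. Because the second node sends the first heartbeat message to all of its neighbor nodes in parallel, the first node sends a request message to the second node's neighbor nodes other than itself to ask whether they received the first heartbeat message sent by the second node.
Step 704: The first node receives, from each of the other neighbor nodes, a response message carrying a reception status, where the reception status indicates whether the first heartbeat message was received.
In this embodiment, after receiving the request message sent by the first node, each of the other neighbor nodes places its reception status, that is, whether it received the first heartbeat message, in a response message and sends it to the first node.
Step 705: If the first node determines, from the reception status carried in the received response messages, that none of the other neighbor nodes received the first heartbeat message, the first node determines that the second node is faulty.
In this embodiment, every other neighbor node returns a response message carrying its reception status after receiving the request message from the first node. Based on these response messages, the first node determines whether the other neighbor nodes received the first heartbeat message. If none of them received the first heartbeat message sent by the second node, the first node can determine that the second node is faulty.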
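Steps 701 to 705 can be traced end to end with a small in-memory simulation. Everything here is a simplified model under stated assumptions: the function name `run_detection`, the `link_up` map, and the result labels are hypothetical, message exchange is replaced by dictionary lookups, and the "link_fault" branch corresponds to the optional implementation rather than to step 705 itself.

```python
from typing import Dict, Set


def run_detection(first_node: str, neighbors_of_second: Set[str],
                  second_node_up: bool, link_up: Dict[str, bool]) -> str:
    """Simulate one heartbeat round as seen by the first node.

    link_up maps a neighbor to False if its link to the second node is broken;
    neighbors missing from the map are assumed to have a working link.
    """
    if first_node not in neighbors_of_second or len(neighbors_of_second) < 2:
        raise ValueError("the second node needs the first node plus one other neighbor")
    # Step 701: the second node sends the heartbeat to all neighbors in parallel.
    delivered = {n: second_node_up and link_up.get(n, True) for n in neighbors_of_second}
    # Step 702: the first node checks whether the heartbeat arrived in time.
    if delivered[first_node]:
        return "ok"
    # Steps 703-704: query the other neighbors and collect their reception statuses.
    statuses = {n: got for n, got in delivered.items() if n != first_node}
    # Step 705: nobody received it -> the second node is faulty.
    if not any(statuses.values()):
        return "node_fault"
    return "link_fault"
```

Running it with the second node down yields "node_fault", while a healthy second node with only the link to the first node broken yields "link_fault".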
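In the method for detecting a node fault in a cluster system provided by this embodiment of the present invention, the second node sends the first heartbeat message in parallel to the first node and the other neighbor nodes, and the first node determines whether it has received, within a preset time, the first heartbeat message sent by the second node, where the first node is a neighbor node of the second node, the first heartbeat message is a heartbeat message that the second node sends in parallel to each of its neighbor nodes, the second node has two or more neighbor nodes in total, and the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods. If the first node itself has not received the first heartbeat message, it asks the other neighbor nodes of the second node whether they received it, and if it determines that none of them did, it determines that the second node is faulty. Because the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods, fault detection with the technical solution of the present invention no longer needs multiple heartbeat periods to detect whether a node is faulty, as the prior art does. This shortens the fault detection cycle and improves the efficiency of node fault detection.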
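FIG. 8 is a schematic flowchart of Embodiment 4 of the method for detecting a node fault in a cluster system according to the present invention. Building on the embodiment shown in FIG. 7, this embodiment describes in detail how the nodes re-determine their neighbor nodes after the first node determines that the second node is faulty. As shown in FIG. 8, the method of this embodiment may include:
Step 801: The first node generates first voting information and receives second voting information sent by each of the other neighbor nodes. The first voting information includes the node identifier of the node elected by the first node, and the second voting information includes the node identifier of the node elected by the neighbor node that sent the second voting information.
In this embodiment, after the neighbor nodes of the second node determine that the second node is faulty, all of them need to recompute their respective neighbor nodes. For ease of description, any neighbor node of the second node can be taken as the first node. The first node generates first voting information, which contains the node identifier of the node elected by the first node and the basis for the vote. The first node also receives second voting information from each of the other neighbor nodes, which contains the node identifier of the node elected by the sending neighbor node and the basis for the vote. In practice, the basis for a vote depends on several factors, such as load, node number, how recent the node's cache is, and the node's network bandwidth. For example, the first node may determine which node bears the lightest load and send the node identifier of that node to the other neighbor nodes in the first voting information. The other neighbor nodes can likewise send their second voting information to the first node in a similar way.
Step 802: The first node counts, according to the node identifier in the first voting information and the node identifier in the second voting information sent by each of the other neighbor nodes, the number of votes obtained by each elected node, and takes the node with the most votes as a third node. The third node replaces the second node and sends heartbeat messages in parallel to all neighbor nodes of the third node. All neighbor nodes of the third node include the third node's own neighbor nodes and the neighbor nodes of the second node.
In this embodiment, after receiving the second voting information from each of the other neighbor nodes, the first node can determine the third node from the node identifier in the first voting information that it generated and the node identifiers in the received second voting information. Specifically, the votes are tallied by node identifier across the first and second voting information, the number of votes obtained by each elected node is counted, and the node with the most votes is taken as the third node. The third node takes over the neighbor nodes of the faulty second node, that is, it takes over the associations between the second node and other nodes. The third node therefore replaces the second node and sends heartbeat messages in parallel to all neighbor nodes of the third node, which include both the third node's own neighbor nodes and the neighbor nodes of the second node.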
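Step 803: The first node re-determines its neighbor nodes according to the neighbor nodes of the third node and the nodes, other than the third node, among the other neighbor nodes.
In this embodiment, after all neighbor nodes of the second node have determined the third node by voting, if the first node is the third node, the first node takes over the second node's adjacency, and the other neighbor nodes can recompute their respective neighbor nodes according to the adjacency that results after the first node takes over the second node's neighbor nodes. If the first node is not the third node, the first node waits until the third node has re-determined its adjacency and then re-determines its own neighbor nodes according to the neighbor nodes of the third node and the nodes, other than the third node, among the other neighbor nodes.
In the method for detecting a node fault in a cluster system provided by this embodiment of the present invention, the second node sends the first heartbeat message in parallel to the first node and the other neighbor nodes, and the first node determines whether it has received, within a preset time, the first heartbeat message sent by the second node, where the first node is a neighbor node of the second node, the first heartbeat message is a heartbeat message that the second node sends in parallel to each of its neighbor nodes, the second node has two or more neighbor nodes in total, and the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods. If the first node itself has not received the first heartbeat message, it asks the other neighbor nodes of the second node whether they received it, and if it determines that none of them did, it determines that the second node is faulty. Because the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods, fault detection with the technical solution of the present invention no longer needs multiple heartbeat periods to detect whether a node is faulty, as the prior art does. This shortens the fault detection cycle and improves the efficiency of node fault detection. In addition, after the second node is determined to be faulty, the nodes re-determine their respective neighbor nodes and then continue fault detection, which improves the accuracy of fault detection.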
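Optionally, if the first node determines, from the reception status carried in the response message received from each of the other neighbor nodes, that at least one of the other neighbor nodes received the first heartbeat message, the first node determines that the link between each node that did not receive the first heartbeat message and the second node is faulty.
Specifically, after the first node fails to receive the first heartbeat message sent by the second node and sends a request message to each of the other neighbor nodes to ask whether it received the first heartbeat message, if the response messages show that at least one of the other neighbor nodes did receive the first heartbeat message, the first node can determine that the second node is normal and that the fault probably lies in the link between the second node and the first node and in the links between the second node and the other nodes that did not receive the first heartbeat message. The nodes that did not receive the first heartbeat message include the first node and those of the other neighbor nodes that did not receive it.
Optionally, the first node re-determines its neighbor nodes according to the neighbor nodes of the third node and the nodes, other than the third node, among the other neighbor nodes.
In the method for detecting a node fault in a cluster system provided by this embodiment of the present invention, when the first node determines that at least one of the other neighbor nodes received the first heartbeat message, it determines that the link between each node that did not receive the first heartbeat message and the second node is faulty, which makes fault detection more comprehensive.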
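FIG. 9 is a schematic structural diagram of Embodiment 1 of the apparatus for detecting a node fault in a cluster system according to the present invention. As shown in FIG. 9, the apparatus 10 provided by this embodiment includes a judging module 11, a sending module 12, a receiving module 13, a determining module 14, and a generating module 15.
The judging module 11 is configured to determine whether the receiving module 13 has received, within a preset time, a first heartbeat message sent by a second node, where the first node is a neighbor node of the second node, the first heartbeat message is a heartbeat message that the second node sends in parallel to each neighbor node of the second node, the second node has two or more neighbor nodes in total, and the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods. If the judging module 11 determines that the receiving module 13 has not received the first heartbeat message sent by the second node, the sending module 12 is configured to send a request message to the other neighbor nodes, that is, all neighbor nodes of the second node except the first node, where the request message is used to ask whether the other neighbor nodes have received the first heartbeat message. The receiving module 13 is further configured to receive, from the other neighbor nodes, response messages carrying a reception status, where the reception status indicates whether the first heartbeat message was received. The determining module 14 is configured to determine, from the reception status carried in the response message that the receiving module 13 received from each of the other neighbor nodes, whether none of the other neighbor nodes received the first heartbeat message. If the determining module 14 determines that none of the other neighbor nodes received the first heartbeat message, the determining module 14 is further configured to determine that the second node is faulty.
In the apparatus for detecting a node fault in a cluster system provided by this embodiment of the present invention, the judging module determines whether the receiving module has received, within a preset time, the first heartbeat message sent by the second node, where the first heartbeat message is a heartbeat message that the second node sends in parallel to each of its neighbor nodes, the second node has two or more neighbor nodes in total, and the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods. If the receiving module has not received the first heartbeat message, the sending module sends a request message to the other neighbor nodes of the second node to ask whether they received it, and if the determining module determines that none of them did, it determines that the second node is faulty. Because the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods, fault detection with the technical solution of the present invention no longer needs multiple heartbeat periods to detect whether a node is faulty, as the prior art does. This shortens the fault detection cycle and improves the efficiency of node fault detection.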
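Optionally, the generating module 15 is configured to generate first voting information, where the first voting information includes the node identifier of the node elected by the first node;
the receiving module 13 is further configured to receive second voting information sent by each of the other neighbor nodes, where the second voting information includes the node identifier of the node elected by the neighbor node that sent the second voting information; and
the determining module 14 is further configured to count, according to the node identifier in the first voting information and the node identifier in the second voting information sent by each of the other neighbor nodes, the number of votes obtained by each elected node, and to take the node with the most votes as a third node, where the third node replaces the second node and sends heartbeat messages in parallel to all neighbor nodes of the third node, and all neighbor nodes of the third node include the third node's own neighbor nodes and the neighbor nodes of the second node.
Optionally, if the determining module 14 determines, from the reception status carried in the response message that the receiving module 13 received from each of the other neighbor nodes, that at least one of the other neighbor nodes received the first heartbeat message,
the determining module 14 is further configured to determine that the link between each node that did not receive the first heartbeat message and the second node is faulty, where the nodes that did not receive the first heartbeat message include the first node and those of the other neighbor nodes that did not receive it.
Optionally, the determining module 14 is further configured to re-determine the neighbor nodes of the first node according to the neighbor nodes of the third node and the nodes, other than the third node, among the other neighbor nodes.
The apparatus of this embodiment can be used to carry out the technical solution of the method for detecting a node fault in a cluster system provided by any embodiment of the present invention. Its implementation principle and technical effects are similar and are not described again here.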
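FIG. 10 is a schematic structural diagram of Embodiment 1 of the system for detecting a node fault in a cluster system according to the present invention. As shown in FIG. 10, the system 20 provided by this embodiment includes a first node 21, a second node 22, and other neighbor nodes 23, where the first node 21 is a neighbor node of the second node 22, the other neighbor nodes 23 are all neighbor nodes of the second node 22 except the first node 21, and there are one or more other neighbor nodes 23.
The second node 22 is configured to send a first heartbeat message in parallel to the first node and the other neighbor nodes. The first node 21 is configured to determine whether it has received the first heartbeat message within a preset time, where the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods. If the first node has not received the first heartbeat message, the first node 21 is further configured to send a request message to each of the other neighbor nodes, where the request message is used to ask each of the other neighbor nodes whether it has received the first heartbeat message. The first node 21 is further configured to receive, from each of the other neighbor nodes, a response message carrying a reception status, where the reception status indicates whether the first heartbeat message was received. If the first node determines, from the reception status carried in the response message received from each of the other neighbor nodes, that none of the other neighbor nodes received the first heartbeat message, the first node 21 is further configured to determine that the second node is faulty.
In the system for detecting a node fault in a cluster system provided by this embodiment of the present invention, the first node determines whether it has received, within a preset time, the first heartbeat message sent by the second node, where the first heartbeat message is a heartbeat message that the second node sends in parallel to each of its neighbor nodes, the second node has two or more neighbor nodes in total, and the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods. If the first node has not received the first heartbeat message, it sends a request message to the other neighbor nodes of the second node to ask whether they received it, and if it determines that none of them did, it determines that the second node is faulty. Because the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods, fault detection with the technical solution of the present invention no longer needs multiple heartbeat periods to detect whether a node is faulty, as the prior art does. This shortens the fault detection cycle and improves the efficiency of node fault detection.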
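In the foregoing embodiment, after the first node 21 determines that the second node is faulty, the first node 21 is further configured to:
generate first voting information and receive second voting information sent by each of the other neighbor nodes, where the first voting information includes the node identifier of the node elected by the first node, and the second voting information includes the node identifier of the node elected by the neighbor node that sent the second voting information;
and count, according to the node identifier in the first voting information and the node identifier in the second voting information sent by each of the other neighbor nodes, the number of votes obtained by each elected node, and take the node with the most votes as a third node, where the third node replaces the second node and sends heartbeat messages in parallel to all neighbor nodes of the third node, and all neighbor nodes of the third node include the third node's own neighbor nodes and the neighbor nodes of the second node.
In the foregoing embodiment, if the first node determines, from the reception status carried in the response message received from each of the other neighbor nodes, that at least one of the other neighbor nodes received the first heartbeat message,
the first node 21 is further configured to determine that the link between each node that did not receive the first heartbeat message and the second node is faulty, where the nodes that did not receive the first heartbeat message include the first node and those of the other neighbor nodes that did not receive it.
In the foregoing embodiment, the first node 21 is further configured to re-determine the neighbor nodes of the first node according to the neighbor nodes of the third node and the nodes, other than the third node, among the other neighbor nodes.
The foregoing system embodiment can correspondingly be used to carry out the technical solutions of the method embodiments. Its implementation principle and technical effects are similar and are not described again here.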
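FIG. 11 is a schematic structural diagram of Embodiment 1 of a node according to the present invention. As shown in FIG. 11, the node 600 of this embodiment includes a processor 601, a user interface 603, a network interface 604, a memory 605, a transmitter 606, and a receiver 607. The memory 605 may include an operating system 6051, an application 6052, and the like. The processor 601 may be a central processing unit (CPU). The memory 605 is used to store executable instructions, and the processor 601 can execute the executable instructions stored in the memory 605. The receiver 607 is configured to receive a first heartbeat message sent by a second node. The processor 601 is configured to determine whether the receiver 607 has received, within a preset time, the first heartbeat message sent by the second node, where the first heartbeat message is a heartbeat message that the second node sends in parallel to each neighbor node of the second node, the second node has two or more neighbor nodes in total, and the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods. If the processor 601 determines that the receiver 607 has not received the first heartbeat message sent by the second node, the transmitter 606 is configured to send a request message to the other neighbor nodes, that is, all neighbor nodes of the second node except the first node, where the request message is used to ask whether the other neighbor nodes have received the first heartbeat message, and the first node is a neighbor node of the second node. The receiver 607 is further configured to receive, from the other neighbor nodes, response messages carrying a reception status, where the reception status indicates whether the first heartbeat message was received. The processor 601 is configured to determine, from the reception status carried in the response message that the receiver 607 received from each of the other neighbor nodes, whether none of the other neighbor nodes received the first heartbeat message. If the processor 601 determines that none of the other neighbor nodes received the first heartbeat message, the processor 601 is further configured to determine that the second node is faulty.
The node provided by this embodiment can be used to carry out the technical solution of the method for detecting a node fault in a cluster system provided by any embodiment of the present invention. Its implementation principle and technical effects are similar and are not described again here.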
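Optionally, the processor 601 is further configured to generate first voting information, where the first voting information includes the node identifier of the node elected by the first node;
the receiver 607 is further configured to receive second voting information sent by each of the other neighbor nodes, where the second voting information includes the node identifier of the node elected by the neighbor node that sent the second voting information; and
the processor 601 is further configured to count, according to the node identifier in the first voting information and the node identifier in the second voting information sent by each of the other neighbor nodes, the number of votes obtained by each elected node, and to take the node with the most votes as a third node, where the third node replaces the second node and sends heartbeat messages in parallel to all neighbor nodes of the third node, and all neighbor nodes of the third node include the third node's own neighbor nodes and the neighbor nodes of the second node.
Optionally, if the processor 601 determines, from the reception status carried in the response message that the receiver 607 received from each of the other neighbor nodes, that at least one of the other neighbor nodes received the first heartbeat message, the processor 601 is further configured to determine that the link between each node that did not receive the first heartbeat message and the second node is faulty, where the nodes that did not receive the first heartbeat message include the first node and those of the other neighbor nodes that did not receive it.
Optionally, the processor 601 is further configured to re-determine the neighbor nodes of the first node according to the neighbor nodes of the third node and the nodes, other than the third node, among the other neighbor nodes.
The node provided by this embodiment can be used to carry out the technical solution of the method for detecting a node fault in a cluster system provided by any embodiment of the present invention. Its implementation principle and technical effects are similar and are not described again here.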
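A person of ordinary skill in the art will understand that all or some of the steps of the foregoing method embodiments may be implemented by hardware controlled by program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs the steps of the foregoing method embodiments. The storage medium includes any medium that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.
Finally, it should be noted that the foregoing embodiments are intended only to describe the technical solutions of the present invention, not to limit them. Although the present invention is described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced, and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.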

Claims (12)

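  1. A method for detecting a node fault in a cluster system, characterized by including:
    a first node determines whether it has received, within a preset time, a first heartbeat message sent by a second node, where the first node is a neighbor node of the second node, the first heartbeat message is a heartbeat message that the second node sends in parallel to each neighbor node of the second node, the second node has two or more neighbor nodes in total, and the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods;
    if the first node has not received the first heartbeat message sent by the second node, the first node sends a request message to the other neighbor nodes, that is, all neighbor nodes of the second node except the first node, where the request message is used to ask whether the other neighbor nodes have received the first heartbeat message;
    the first node receives, from the other neighbor nodes, response messages carrying a reception status, where the reception status indicates whether the first heartbeat message was received; and
    if the first node determines, from the reception status carried in the response message received from each of the other neighbor nodes, that none of the other neighbor nodes received the first heartbeat message, the first node determines that the second node is faulty.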
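  2. The method according to claim 1, characterized in that, after the first node determines that the second node is faulty, the method further includes:
    the first node generates first voting information and receives second voting information sent by each of the other neighbor nodes, where the first voting information includes the node identifier of the node elected by the first node, and the second voting information includes the node identifier of the node elected by the neighbor node that sent the second voting information; and
    the first node counts, according to the node identifier in the first voting information and the node identifier in the second voting information sent by each of the other neighbor nodes, the number of votes obtained by each elected node, and takes the node with the most votes as a third node, where the third node replaces the second node and sends heartbeat messages in parallel to all neighbor nodes of the third node, and all neighbor nodes of the third node include the third node's own neighbor nodes and the neighbor nodes of the second node.
  3. The method according to claim 1 or 2, characterized by further including:
    if the first node determines, from the reception status carried in the response message received from each of the other neighbor nodes, that at least one of the other neighbor nodes received the first heartbeat message, the first node determines that the link between each node that did not receive the first heartbeat message and the second node is faulty, where the nodes that did not receive the first heartbeat message include the first node and those of the other neighbor nodes that did not receive it.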
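  4. A method for detecting a node fault in a cluster system, characterized in that the method includes:
    a second node sends a first heartbeat message in parallel to a first node and to other neighbor nodes, where the first node is a neighbor node of the second node, the other neighbor nodes are all neighbor nodes of the second node except the first node, and there are one or more other neighbor nodes;
    the first node determines whether it has received the first heartbeat message within a preset time, where the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods;
    if the first node has not received the first heartbeat message, the first node sends a request message to each of the other neighbor nodes, where the request message is used to ask each of the other neighbor nodes whether it has received the first heartbeat message;
    the first node receives, from each of the other neighbor nodes, a response message carrying a reception status, where the reception status indicates whether the first heartbeat message was received; and
    if the first node determines, from the reception status carried in the received response messages, that none of the other neighbor nodes received the first heartbeat message, the first node determines that the second node is faulty.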
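  5. The method according to claim 4, characterized in that, after the first node determines that the second node is faulty, the method further includes:
    the first node generates first voting information and receives second voting information sent by each of the other neighbor nodes, where the first voting information includes the node identifier of the node elected by the first node, and the second voting information includes the node identifier of the node elected by the neighbor node that sent the second voting information; and
    the first node counts, according to the node identifier in the first voting information and the node identifier in the second voting information sent by each of the other neighbor nodes, the number of votes obtained by each elected node, and takes the node with the most votes as a third node, where the third node replaces the second node and sends heartbeat messages in parallel to all neighbor nodes of the third node, and all neighbor nodes of the third node include the third node's own neighbor nodes and the neighbor nodes of the second node.
  6. The method according to claim 4 or 5, characterized by further including:
    if the first node determines, from the reception status carried in the response message received from each of the other neighbor nodes, that at least one of the other neighbor nodes received the first heartbeat message, the first node determines that the link between each node that did not receive the first heartbeat message and the second node is faulty, where the nodes that did not receive the first heartbeat message include the first node and those of the other neighbor nodes that did not receive it.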
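  7. An apparatus for detecting a node fault in a cluster system, characterized by including:
    a judging module, configured to determine whether a receiving module has received, within a preset time, a first heartbeat message sent by a second node, where the first node is a neighbor node of the second node, the first heartbeat message is a heartbeat message that the second node sends in parallel to each neighbor node of the second node, the second node has two or more neighbor nodes in total, and the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods;
    where, if the judging module determines that the receiving module has not received the first heartbeat message sent by the second node,
    a sending module is configured to send a request message to the other neighbor nodes, that is, all neighbor nodes of the second node except the first node, where the request message is used to ask whether the other neighbor nodes have received the first heartbeat message;
    the receiving module is further configured to receive, from the other neighbor nodes, response messages carrying a reception status, where the reception status indicates whether the first heartbeat message was received;
    a determining module is configured to determine, from the reception status carried in the response message that the receiving module received from each of the other neighbor nodes, whether none of the other neighbor nodes received the first heartbeat message; and
    if the determining module determines that none of the other neighbor nodes received the first heartbeat message, the determining module is further configured to determine that the second node is faulty.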
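  8. The apparatus according to claim 7, characterized in that the apparatus further includes the following for use after the determining module determines that the second node is faulty:
    a generating module, configured to generate first voting information, where the first voting information includes the node identifier of the node elected by the first node;
    the receiving module is further configured to receive second voting information sent by each of the other neighbor nodes, where the second voting information includes the node identifier of the node elected by the neighbor node that sent the second voting information; and
    the determining module is further configured to count, according to the node identifier in the first voting information and the node identifier in the second voting information sent by each of the other neighbor nodes, the number of votes obtained by each elected node, and to take the node with the most votes as a third node, where the third node replaces the second node and sends heartbeat messages in parallel to all neighbor nodes of the third node, and all neighbor nodes of the third node include the third node's own neighbor nodes and the neighbor nodes of the second node.
  9. The apparatus according to claim 7 or 8, characterized in that:
    if the determining module determines, from the reception status carried in the response message that the receiving module received from each of the other neighbor nodes, that at least one of the other neighbor nodes received the first heartbeat message,
    the determining module is further configured to determine that the link between each node that did not receive the first heartbeat message and the second node is faulty, where the nodes that did not receive the first heartbeat message include the first node and those of the other neighbor nodes that did not receive it.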
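  10. A system for detecting a node fault in a cluster system, characterized by including a first node, a second node, and other neighbor nodes, where the first node is a neighbor node of the second node, the other neighbor nodes are all neighbor nodes of the second node except the first node, and there are one or more other neighbor nodes, where:
    the second node is configured to send a first heartbeat message in parallel to the first node and the other neighbor nodes;
    the first node is configured to determine whether it has received the first heartbeat message within a preset time, where the preset time is greater than or equal to one heartbeat period and less than two heartbeat periods;
    if the first node has not received the first heartbeat message, the first node is further configured to send a request message to each of the other neighbor nodes, where the request message is used to ask each of the other neighbor nodes whether it has received the first heartbeat message; and the first node is further configured to receive, from each of the other neighbor nodes, a response message carrying a reception status, where the reception status indicates whether the first heartbeat message was received; and
    if the first node determines, from the reception status carried in the response message received from each of the other neighbor nodes, that none of the other neighbor nodes received the first heartbeat message, the first node is further configured to determine that the second node is faulty.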
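  11. The system according to claim 10, characterized in that, after the first node determines that the second node is faulty,
    the first node is further configured to:
    generate first voting information and receive second voting information sent by each of the other neighbor nodes, where the first voting information includes the node identifier of the node elected by the first node, and the second voting information includes the node identifier of the node elected by the neighbor node that sent the second voting information;
    and count, according to the node identifier in the first voting information and the node identifier in the second voting information sent by each of the other neighbor nodes, the number of votes obtained by each elected node, and take the node with the most votes as a third node, where the third node replaces the second node and sends heartbeat messages in parallel to all neighbor nodes of the third node, and all neighbor nodes of the third node include the third node's own neighbor nodes and the neighbor nodes of the second node.
  12. The system according to claim 10 or 11, characterized in that:
    if the first node determines, from the reception status carried in the response message received from each of the other neighbor nodes, that at least one of the other neighbor nodes received the first heartbeat message,
    the first node is further configured to determine that the link between each node that did not receive the first heartbeat message and the second node is faulty, where the nodes that did not receive the first heartbeat message include the first node and those of the other neighbor nodes that did not receive it.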
PCT/CN2016/073606 2015-06-05 2016-02-05 集群系统中节点的故障检测方法和装置 WO2016192408A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510306800.0A CN106301853B (zh) 2015-06-05 2015-06-05 集群系统中节点的故障检测方法和装置
CN201510306800.0 2015-06-05

Publications (1)

Publication Number Publication Date
WO2016192408A1 true WO2016192408A1 (zh) 2016-12-08

Family

ID=57440098

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/073606 WO2016192408A1 (zh) 2015-06-05 2016-02-05 集群系统中节点的故障检测方法和装置

Country Status (2)

Country Link
CN (1) CN106301853B (zh)
WO (1) WO2016192408A1 (zh)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018214106A1 (zh) * 2017-05-25 2018-11-29 深圳市伊特利网络科技有限公司 网络连接列表的更新方法及系统
WO2019000954A1 (zh) * 2017-06-30 2019-01-03 中兴通讯股份有限公司 监测节点存活状态的方法、装置及系统
CN109302445A (zh) * 2018-08-14 2019-02-01 新华三云计算技术有限公司 主机节点状态确定方法、装置、主机节点及存储介质
US10547499B2 (en) 2017-09-04 2020-01-28 International Business Machines Corporation Software defined failure detection of many nodes
CN112468372A (zh) * 2017-04-10 2021-03-09 华为技术有限公司 电力线通信网络中设备状态检测方法和装置
CN113923105A (zh) * 2021-12-13 2022-01-11 中机联科技(广东)有限公司 一种基于区块链的物联网设备故障监控方法及系统

Families Citing this family (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108337274A (zh) * 2017-01-19 2018-07-27 贵州白山云科技有限公司 一种消息分发方法和系统
CN109428740B (zh) * 2017-08-21 2020-09-08 华为技术有限公司 设备故障恢复的方法和装置
CN109525408B (zh) * 2017-09-18 2021-12-21 杭州海康威视系统技术有限公司 一种设备异常处理方法、装置及云存储系统
CN107566219B (zh) * 2017-09-27 2020-09-18 华为技术有限公司 应用于集群系统的故障诊断方法、节点设备和计算机设备
CN107967291B (zh) * 2017-10-12 2019-08-13 腾讯科技(深圳)有限公司 日志条目复制方法、装置、计算机设备及存储介质
CN109714183A (zh) * 2017-10-26 2019-05-03 阿里巴巴集团控股有限公司 一种集群中的数据处理方法及装置
CN107864486A (zh) * 2017-12-26 2018-03-30 杭州迪普科技股份有限公司 一种离线ap检测方法和装置
CN108092857A (zh) * 2018-01-15 2018-05-29 郑州云海信息技术有限公司 一种分布式系统心跳检测方法及相关装置
CN110324166B (zh) * 2018-03-31 2020-12-15 华为技术有限公司 一种在多个节点中同步目标信息的方法、装置及系统
CN108683561B (zh) * 2018-05-16 2020-10-02 杭州迪普科技股份有限公司 一种站点状态检测方法及装置
CN109218141A (zh) * 2018-11-20 2019-01-15 郑州云海信息技术有限公司 一种故障节点检测方法及相关装置
CN109873719B (zh) * 2019-02-03 2019-12-31 华为技术有限公司 一种故障检测方法及装置
US11265080B2 (en) 2019-04-29 2022-03-01 Hmn Technologies Co., Limited Submarine cable fault determining method and apparatus
CN110380934B (zh) * 2019-07-23 2021-11-02 南京航空航天大学 一种分布式余度系统心跳检测方法
CN111181763A (zh) * 2019-11-28 2020-05-19 泰康保险集团股份有限公司 一种网络报障方法和装置
CN112911520B (zh) * 2019-12-04 2022-05-31 哈尔滨海能达科技有限公司 自组网中确定主节点的方法、装置及存储介质
CN111586110B (zh) * 2020-04-22 2021-03-19 广州锦行网络科技有限公司 一种raft在出现点对点故障时的优化处理方法
CN112398905B (zh) * 2020-09-28 2022-05-31 联想(北京)有限公司 一种节点及信息同步方法
CN112988463B (zh) * 2021-02-23 2022-08-30 新华三大数据技术有限公司 一种故障节点隔离方法及装置
CN113542052A (zh) * 2021-06-07 2021-10-22 新华三信息技术有限公司 一种节点故障确定方法、装置和服务器
CN113783735A (zh) * 2021-09-24 2021-12-10 小红书科技有限公司 Redis集群中故障节点的识别方法、装置、设备和介质
CN115102886A (zh) * 2022-06-21 2022-09-23 上海驻云信息科技有限公司 一种多个采集客户端的任务调度方法及装置
CN116260705B (zh) * 2022-12-21 2023-09-15 广西壮族自治区自然资源信息中心 地理信息分布式集群故障处理方法、装置、介质及设备

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102752143A (zh) * 2012-07-05 2012-10-24 杭州华三通信技术有限公司 BFD detection method for MPLS TE bidirectional tunnels, and routing device
CN103916275A (zh) * 2014-03-31 2014-07-09 杭州华三通信技术有限公司 BFD detection apparatus and method
US20140301401A1 (en) * 2013-04-07 2014-10-09 Hangzhou H3C Technologies Co., Ltd. Providing aggregation link groups in logical network device
CN104283711A (zh) * 2014-09-29 2015-01-14 中国联合网络通信集团有限公司 Fault detection method, node, and system based on bidirectional forwarding detection (BFD)

Family Cites Families (6)

Publication number Priority date Publication date Assignee Title
US20070294596A1 (en) * 2006-05-22 2007-12-20 Gissel Thomas R Inter-tier failure detection using central aggregation point
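CN101159536B (zh) * 2007-10-30 2010-09-29 中兴通讯股份有限公司 Method for synchronizing media gateway node status in a dual-homing network
CN102204169A (zh) * 2011-05-12 2011-09-28 华为技术有限公司 Fault detection method, routing node, and system
CN103297396B (zh) * 2012-02-28 2016-05-18 国际商业机器公司 Apparatus and method for managing failover in a cluster system
CN102612110B (zh) * 2012-03-02 2014-11-05 浙江大学 Distributed self-organizing routing method in a power line carrier lighting control system
CN102821011A (zh) * 2012-08-28 2012-12-12 北京星网锐捷网络技术有限公司 Peer status detection method, apparatus, and device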

Cited By (10)

Publication number Priority date Publication date Assignee Title
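CN112468372A (zh) * 2017-04-10 2021-03-09 华为技术有限公司 Method and apparatus for detecting device status in a power line communication network
CN112468372B (zh) * 2017-04-10 2023-10-13 华为技术有限公司 Method and apparatus for detecting device status in a power line communication network
WO2018214106A1 (zh) * 2017-05-25 2018-11-29 深圳市伊特利网络科技有限公司 Method and system for updating a network connection list
WO2019000954A1 (zh) * 2017-06-30 2019-01-03 中兴通讯股份有限公司 Method, device and system for monitoring node survival state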
US11212204B2 (en) 2017-06-30 2021-12-28 Xi'an Zhongxing New Software Co., Ltd. Method, device and system for monitoring node survival state
US10547499B2 (en) 2017-09-04 2020-01-28 International Business Machines Corporation Software defined failure detection of many nodes
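CN109302445A (zh) * 2018-08-14 2019-02-01 新华三云计算技术有限公司 Host node status determination method and apparatus, host node, and storage medium
CN109302445B (zh) * 2018-08-14 2021-10-12 新华三云计算技术有限公司 Host node status determination method and apparatus, host node, and storage medium
CN113923105A (zh) * 2021-12-13 2022-01-11 中机联科技(广东)有限公司 Blockchain-based method and system for monitoring Internet of Things device faults
CN113923105B (zh) * 2021-12-13 2022-04-22 中机联科技(广东)有限公司 Blockchain-based method and system for monitoring Internet of Things device faults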

Also Published As

Publication number Publication date
CN106301853B (zh) 2019-06-18
CN106301853A (zh) 2017-01-04

Similar Documents

Publication Publication Date Title
WO2016192408A1 (zh) Fault detection method and apparatus for nodes in a cluster system
EP3092752B1 (en) Multi-master selection in a software defined network
CN108833202B (zh) Faulty link detection method, apparatus, and computer-readable storage medium
US8756453B2 (en) Communication system with diagnostic capabilities
US9647921B2 (en) Statistics and failure detection in a network on a chip (NoC) network
US10862786B2 (en) Method and device for fingerprint based status detection in a distributed processing system
US10277454B2 (en) Handling failure of stacking system
RU2610250C2 (ru) Transmission node and buffer status reporting method
US20180288152A1 (en) Storage dynamic accessibility mechanism method and apparatus
US11283907B2 (en) Determining state of virtual router instance
US11296982B2 (en) Initiator-based data-plane validation for segment routed, multiprotocol label switched (MPLS) networks
WO2017012383A1 (zh) Service registration method, usage method, and related apparatus
Fang et al. A fast and load-aware controller failover mechanism for software-defined networks
JP7458424B2 (ja) Systems and methods for providing bidirectional forwarding detection with performance routing measurements
US10735248B2 (en) Cloudified N-way routing protection at hyper scale
US11916768B2 (en) Information sharing method and apparatus in redundancy network, and computer storage medium
WO2017101016A1 (zh) Method and apparatus for a storage node to synchronize service requests
US9118546B2 (en) Data forwarding method and router
Wang et al. Churn-tolerant leader election protocols
US10320954B2 (en) Diffusing packets to identify faulty network apparatuses in multipath inter-data center networks
US9398487B2 (en) System and method for management of network links by traffic type
CN117319507A (zh) Routing connection method and apparatus, electronic device, and storage medium
Schiff et al. Renaissance: A Self-Stabilizing Distributed SDN Control Plane using In-band Communications
JP5579637B2 (ja) Communication system, communication device, and connection state detection method
CN118101516A (zh) Packet transmission method and apparatus

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 16802331; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 16802331; Country of ref document: EP; Kind code of ref document: A1)