CN113542052A

CN113542052A - Node fault determination method and device and server

Info

Publication number: CN113542052A
Application number: CN202110629101.5A
Authority: CN
Inventors: 曾军
Original assignee: New H3C Information Technologies Co Ltd
Current assignee: New H3C Information Technologies Co Ltd
Priority date: 2021-06-07
Filing date: 2021-06-07
Publication date: 2021-10-22

Abstract

This specification provides a node fault determination method, a node fault determination device, and a server, and relates to the technical field of communications. The node fault determination method is applied to a cluster node and comprises the following steps: sending detection messages through at least two ports, and receiving feedback messages that other cluster nodes in the cluster send in response to the detection messages; if no feedback message from a port of another cluster node is received within a preset period, determining that the port has failed; if all ports of another cluster node are determined to have failed, determining that this cluster node is abnormal; and if the node's own election identifier meets a preset condition, acting as the monitoring node and raising an alarm for the failed cluster nodes. The method improves the reliability of fault detection.

Description

Node fault determination method and device and server

Technical Field

The present disclosure relates to the field of communications technologies, and in particular, to a method, an apparatus, and a server for determining a node fault.

Background

Big data applications are increasingly widespread, the distributed systems that serve massive data keep growing in scale, and keeping a distributed system environment stable and reliable is becoming more and more challenging. A distributed system uses multiple servers working cooperatively to solve computation, storage, and transmission problems that a single server cannot. The cluster is the most common form of distributed system: multiple servers are deployed in one cluster as cluster nodes, and each server handles a portion of the tasks carried by the distributed system.

In a cluster, to determine the working state of each cluster node, the cluster nodes exchange heartbeat messages through a designated port, and one node in the cluster acts as the monitoring node that collects the states of the other nodes and raises alarms for them. However, a cluster node generally has multiple ports for data exchange. If the port used for heartbeat messages fails while the other ports work normally, the cluster node can no longer send heartbeat messages and is treated as a failed node. This causes false alarms and reduces the reliability of fault detection in the cluster.

Disclosure of Invention

In order to overcome the problems in the related art, the present specification provides a node failure determination method, an apparatus, and a server.

According to a first aspect of the embodiments of the present specification, the present application provides a node fault determination method, applied to a cluster node, including:

sending a detection message to the outside through at least two ports, and receiving feedback messages sent by other cluster nodes in the cluster aiming at the detection message;

if a feedback message sent by one port of other cluster nodes is not received in a preset period, determining that the port has a fault;

if all ports of one other cluster node are determined to have faults, determining that the other cluster node is abnormal;

and if the node's own election identifier meets a preset condition, acting as the monitoring node and raising an alarm for the failed cluster nodes.

Optionally, the election identifier is an IP address, and the preset condition is an IP address maximum value or an IP address minimum value; or,

the election identifier is a device identifier, and the preset condition is a device identifier maximum value or a device identifier minimum value.

Optionally, after sending the detection packet to the outside through at least two ports and receiving the feedback packet sent by other cluster nodes in the cluster for the detection packet, the method further includes:

and if none of the at least two ports receives a feedback message sent by any other cluster node, determining that the node itself has a fault.

Optionally, the sending of the detection packet to the outside through at least two ports, and receiving of the feedback packet sent by other cluster nodes in the cluster for the detection packet, include:

generating a detection message through at least two threads, wherein one thread corresponds to one port;

based on a thread, sending the generated detection message to other cluster nodes in the cluster through a port corresponding to the thread;

and receiving feedback messages sent by other cluster nodes in the cluster through a port corresponding to a thread based on the thread.

Optionally, the detection message and the feedback message are generated based on an ICMP protocol.

According to a second aspect of the embodiments of the present specification, the present application provides a node fault determination apparatus, which is applied to a cluster node and includes:

the interaction unit is used for sending the detection message to the outside through at least two ports and receiving feedback messages sent by other cluster nodes in the cluster aiming at the detection message;

the port fault detection unit is used for determining that a port has a fault if a feedback message sent by the port is not received in a preset period;

the node fault detection unit is used for determining that other cluster nodes are abnormal if all ports of the other cluster nodes are determined to have faults;

and the alarm unit is used for raising an alarm for the failed cluster nodes, with the node acting as the monitoring node, if the node's own election identifier meets the preset condition.

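Optionally, the election identifier is an Internet Protocol (IP) address, and the preset condition is an IP address maximum value or an IP address minimum value; or, the election identifier is a device identifier, and the preset condition is a device identifier maximum value or a device identifier minimum value.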

Optionally, the apparatus further includes:

and the self-checking unit is used for determining that the node itself has a fault if none of the at least two ports receives a feedback message sent by any other cluster node.

According to a third aspect of the embodiments of the present specification, the present application provides a server, applied in a cluster, including: a processor, a machine-readable storage medium, and at least two ports;

the machine-readable storage medium stores machine-executable instructions executable by the processor, and the machine-executable instructions cause the processor to implement any of the above method steps.

The technical scheme provided by the implementation mode of the specification can have the following beneficial effects:

In the embodiments of the present specification, a cluster node sends detection messages through at least two ports and receives the feedback messages that other cluster nodes send in response. If no feedback message is received from a port of another cluster node within a preset period, that port is considered to have failed. Only when all ports of another cluster node are determined to have failed is that cluster node as a whole determined to have failed, and an alarm is raised. This avoids the situation in which the monitoring node declares another cluster node failed when only one of its ports has failed, which would interrupt all services of that cluster node, and thereby improves the reliability of cluster fault detection.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the specification.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present specification and together with the description, serve to explain the principles of the specification.

FIG. 1 is a flow chart of a node fault determination method to which the present application is directed;

FIG. 2 is a schematic diagram of a cluster to which the present application relates;

FIG. 3 is a schematic diagram illustrating interaction between a detection packet and a feedback packet in a node fault determination method according to the present application;

FIG. 4 is a schematic structural diagram of a node failure determination apparatus according to the present application;

FIG. 5 is a schematic diagram of a server according to the present application.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present specification.

The application provides a node fault determination method applied to a cluster node. As shown in FIG. 1, the method includes:

S100, sending a detection message to the outside through at least two ports, and receiving feedback messages sent by other cluster nodes in the cluster aiming at the detection message.

As shown in FIG. 2, a cluster in a network includes a plurality of cluster nodes. Each cluster node is provided with at least two ports, and the ports of one node belong to different network cards, different network segments, and different network devices (e.g., routers or switches). The following description uses a cluster of 3 cluster nodes (cluster node 1, cluster node 2, and cluster node 3) as an example. Each cluster node is provided with a plurality of network cards: cluster node 1 includes network card 10 and network card 11, cluster node 2 includes network card 20 and network card 21, and cluster node 3 includes network card 30 and network card 31. Each network card may include one port: network card 10 includes port 100, network card 11 includes port 110, network card 20 includes port 200, network card 21 includes port 210, network card 30 includes port 300, and network card 31 includes port 310. It should be noted that the foregoing is merely an example and does not limit the number of cluster nodes in the cluster, the number of network cards in each cluster node, or the number of ports on each network card.

After the cluster is started, the platform software running on the cluster nodes may publish the address information of each cluster node, such as an IP (Internet Protocol) address and/or a MAC (Media Access Control) address, according to the address information deployed in the cluster. After a cluster node receives the IP addresses of the other cluster nodes, it can create a thread for each port to generate detection messages for that port's network segment, and send the detection messages out through that port. In the cluster shown in FIG. 2, cluster node 1 may generate two detection messages and send them through port 100 of network card 10 to port 200 of cluster node 2 and port 300 of cluster node 3, which are connected to network device 1. Cluster node 1 may also generate another two detection messages and send them through port 110 of network card 11 to port 210 of cluster node 2 and port 310 of cluster node 3, which are connected to network device 2.
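The probe fan-out in this example can be sketched as follows. This is only an illustration under stated assumptions: the `TOPOLOGY` table, the device labels `dev1`/`dev2`, and the function `probe_plan` are hypothetical names that mirror FIG. 2 and do not appear in the specification.

```python
# Hypothetical topology mirroring FIG. 2: node -> {port: attached network device}.
TOPOLOGY = {
    "node1": {100: "dev1", 110: "dev2"},
    "node2": {200: "dev1", 210: "dev2"},
    "node3": {300: "dev1", 310: "dev2"},
}


def probe_plan(topology, local_node):
    """Return {local_port: [(peer_node, peer_port), ...]}.

    Each local port probes only the peer ports attached to the same
    network device, as in the FIG. 2 example.
    """
    plan = {}
    for local_port, device in topology[local_node].items():
        targets = []
        for peer, ports in topology.items():
            if peer == local_node:
                continue  # a node does not probe itself
            for peer_port, peer_device in ports.items():
                if peer_device == device:
                    targets.append((peer, peer_port))
        plan[local_port] = targets
    return plan
```

For cluster node 1, `probe_plan(TOPOLOGY, "node1")` yields two targets per local port, which matches the four detection messages described in the text.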

When another cluster node in the cluster receives a detection message, it parses the detection message, constructs the corresponding feedback message, and returns the feedback message to the sending cluster node through the port on which the detection message was received. In this way, one cluster node can detect whether a given port of another cluster node has failed.

Optionally, the detection message and the feedback message are generated based on ICMP (Internet Control Message Protocol); that is, the detection message may be an ICMP Echo Request (ping) message, and the feedback message may be an ICMP Echo Reply message. Of course, the detection message and the feedback message may also be generated based on other communication protocols, for example a KeepAlive message in TCP/IP (Transmission Control Protocol/Internet Protocol), or a heartbeat message within the cluster.
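For the ICMP option, the message pair could be built roughly as shown below. This is a minimal sketch, not the applicant's implementation: the helper names `internet_checksum`, `build_icmp`, and `build_reply` are assumptions, and actually transmitting the bytes (which normally requires a raw socket and elevated privileges) is left out.

```python
import struct

ICMP_ECHO_REQUEST = 8  # ICMP type of an Echo Request ("ping")
ICMP_ECHO_REPLY = 0    # ICMP type of an Echo Reply


def internet_checksum(data: bytes) -> int:
    """One's-complement checksum over 16-bit words (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


def build_icmp(icmp_type: int, ident: int, seq: int, payload: bytes = b"") -> bytes:
    """Build an ICMP echo message (8-byte header + payload) with a valid checksum."""
    header = struct.pack("!BBHHH", icmp_type, 0, 0, ident, seq)
    checksum = internet_checksum(header + payload)
    return struct.pack("!BBHHH", icmp_type, 0, checksum, ident, seq) + payload


def build_reply(request: bytes) -> bytes:
    """Build the Echo Reply for a request: same identifier, sequence, payload."""
    _type, _code, _csum, ident, seq = struct.unpack("!BBHHH", request[:8])
    return build_icmp(ICMP_ECHO_REPLY, ident, seq, request[8:])
```

Because the reply echoes the identifier and sequence number, the sender can match each feedback message to the detection message, and therefore the remote port, that it answers.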

S101, if a feedback message sent by one port of other cluster nodes is not received in a preset period, determining that the port has a fault.

For example, cluster node 1 sends detection messages through the ports on its network cards, as shown in FIG. 3. Cluster node 1 sends the detection messages through port 100 on network card 10 and port 110 on network card 11 and starts a separate timer for each detection message sent. Assume first that the detection messages reach cluster node 2 and cluster node 3; each of them parses the detection message and constructs the corresponding feedback message. If port 210 on cluster node 2 has failed, however, the detection message addressed to it cannot be received, and cluster node 2 can return a feedback message only through port 200 on network card 20, to port 100.

Cluster node 1 receives the feedback message sent from port 200 and determines that port 200 of cluster node 2 is normal. It never receives a feedback message from port 210 within the preset period, and therefore determines that port 210 has failed.

Cluster node 3 is assumed to be powered down, so both port 300 and port 310 have failed: they cannot receive the detection messages and therefore cannot send feedback messages to cluster node 1. The timers that cluster node 1 started for port 300 and port 310 then time out, and cluster node 1 determines that both ports on cluster node 3 have failed. In FIG. 3, ports marked in white are working ports, and ports marked in black are failed ports.
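The per-port timing in step S101 might be expressed as below. This is a sketch under assumptions: the function name `classify_ports`, the three status labels, and the 3-second period used in the usage note are illustrative choices, since the specification does not fix a value for the preset period.

```python
def classify_ports(sent_at, replied_at, now, period):
    """Classify each probed remote port as 'ok', 'failed', or 'pending'.

    sent_at:    {port: time the detection message was sent}
    replied_at: {port: time the feedback message arrived}; missing = no reply
    now:        current time, in the same unit as the other values
    period:     the preset period (an assumed value supplied by the caller)
    """
    status = {}
    for port, t_sent in sent_at.items():
        t_reply = replied_at.get(port)
        if t_reply is not None and t_reply - t_sent <= period:
            status[port] = "ok"        # feedback arrived within the period
        elif now - t_sent > period:
            status[port] = "failed"    # timer expired without timely feedback
        else:
            status[port] = "pending"   # timer is still running
    return status
```

With the FIG. 3 scenario (only port 200 replies), calling `classify_ports({200: 0.0, 210: 0.0, 300: 0.0, 310: 0.0}, {200: 0.2}, now=3.5, period=3.0)` marks port 200 as working and ports 210, 300, and 310 as failed.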

S102, if all the ports of one other cluster node are determined to be in fault, determining that the other cluster node is abnormal.

At this time, the cluster node 1 may determine that the cluster node 2 is in a working state because the port 200 is working normally and the cluster node 2 has a normal port.

Cluster node 1 may determine that port 300 and port 310 on cluster node 3 are failed, and that cluster node 3 has no normal ports and is in an abnormal state.

Finally, the cluster node 1 may determine that the cluster node 2 is in a normal state and the cluster node 3 is in an abnormal state.
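The rule in step S102 reduces to an "all ports failed" test per node. The sketch below assumes port results are already available as a mapping from port number to a status string; the function name `node_states` and the string labels are hypothetical.

```python
def node_states(port_status, node_ports):
    """Return {node: 'abnormal' | 'normal'}.

    A node is abnormal only if it has at least one known port and every
    one of its ports is marked 'failed'; a single working port keeps the
    node in the normal state.
    """
    states = {}
    for node, ports in node_ports.items():
        all_failed = bool(ports) and all(
            port_status.get(port) == "failed" for port in ports
        )
        states[node] = "abnormal" if all_failed else "normal"
    return states
```

Applied to the example above, cluster node 2 (port 200 working, port 210 failed) stays normal, while cluster node 3 (ports 300 and 310 failed) becomes abnormal.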

S103, if the node's own election identifier meets the preset condition, the node serves as the monitoring node and raises an alarm for the failed cluster nodes.

When the cluster node 1 determines that the cluster node 2 is in a normal state and the cluster node 3 is in an abnormal state, the cluster node 1 may determine whether the preset condition is met according to the election identification of the cluster node 1.

Optionally, the election identifier is an IP address, and the preset condition is an IP address maximum value or an IP address minimum value; or, the election identifier is an equipment identifier, and the preset condition is an equipment identifier maximum value or an equipment identifier minimum value. The election identifier is used to select one cluster node from a plurality of cluster nodes of the cluster as a monitoring node, and the monitoring node may report the cluster node in a fault state to platform software configured in the cluster, so that the platform software can manage the cluster nodes in the cluster based on the reported state, for example, isolate the cluster nodes in an abnormal state and the like.

For example, when an IP address is used as the election identifier, the minimum value of the IP address is used as the preset condition. The IP address of cluster node 1 is 192.168.1.4, the IP address of cluster node 2 is 192.168.1.5, and the IP address of cluster node 3 is 192.168.1.6. The comparison can show that the IP address of the cluster node 1 is the minimum value of the IP addresses of the cluster nodes in the cluster, so that the cluster node 1 can be used as a monitoring node to report the fault of the cluster node 3 to the platform software, so that the cluster node 3 is isolated.
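The IP-based election check can be sketched with Python's standard `ipaddress` module. The function name `is_monitoring_node` and the `use_minimum` flag are assumptions for illustration. Addresses are compared numerically rather than as strings, because a string comparison would rank "192.168.1.10" below "192.168.1.4".

```python
import ipaddress


def is_monitoring_node(own_ip, cluster_ips, use_minimum=True):
    """Return True if own_ip meets the preset condition (minimum or maximum IP)."""
    addresses = [ipaddress.ip_address(ip) for ip in cluster_ips]
    chosen = min(addresses) if use_minimum else max(addresses)
    return ipaddress.ip_address(own_ip) == chosen
```

With the addresses from the example, `is_monitoring_node("192.168.1.4", ["192.168.1.4", "192.168.1.5", "192.168.1.6"])` is true, so cluster node 1 acts as the monitoring node. Whether abnormal nodes are excluded from the comparison is not stated in the specification; this sketch simply compares against the list it is given.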

Of course, cluster node 1 may also report the failure of port 210 on cluster node 2, to prompt maintenance staff to check that port.

In addition, the election identifier may also be another node identifier, such as a MAC address; this is not limited here.

In addition, in the foregoing process, when another cluster node is able to send a feedback message for the detection message, the method further includes: S104, if a feedback message sent by a port of another cluster node is received within the preset period, determining that the port has not failed.

Optionally, in step S100, after sending the detection packet to the outside through at least two ports and receiving the feedback packet sent by other cluster nodes in the cluster for the detection packet, the method further includes:

S105, if none of the at least two ports receives a feedback message sent by any other cluster node, determining that the node itself has a fault.

After cluster node 1 sends out the detection messages and the timers expire, if no other cluster node in the cluster has sent a feedback message to cluster node 1, cluster node 1 can be considered to have failed itself. If cluster node 1 can still interact with the platform software at this point, it may report its own fault to the platform software, so that the platform software can isolate cluster node 1.
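The self-check in step S105 can be sketched as a test over the local ports. The function name `self_check` and the per-port reply counters are assumptions; how the counters are gathered is outside this sketch.

```python
def self_check(replies_per_local_port):
    """Return True if the local node should consider itself failed.

    replies_per_local_port: {local_port: number of feedback messages
    received on that port within the preset period}. The node is treated
    as failed only if it has ports and none of them received any feedback.
    """
    if not replies_per_local_port:
        return False  # nothing was probed, so no conclusion is drawn
    return all(count == 0 for count in replies_per_local_port.values())
```

For cluster node 1, `self_check({100: 0, 110: 0})` is true, whereas one reply on either port, as in `self_check({100: 2, 110: 0})`, keeps the node from declaring itself failed.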

Optionally, step S100, sending a detection packet to the outside through at least two ports, and receiving a feedback packet sent by other cluster nodes in the cluster for the detection packet, includes:

S100A, generating detection messages through at least two threads, wherein one thread corresponds to one port.

S100B, for each thread, sending the generated detection messages to other cluster nodes in the cluster through the port corresponding to the thread.

When a cluster node detects other cluster nodes in the cluster, the services processed by the cluster node may share the same thread as the detection. If the service load of the cluster node is heavy, the processor may spend a continuous period of time on service messages, so that a detection message, although received by the cluster node, cannot be processed in time. This reduces the efficiency with which the cluster node determines faults.

To address this, cluster node 1 may create at least two threads on the processor. Each thread is created for one port and is independent of the threads used by services; that is, one thread corresponds to one port. Each thread may construct the multiple detection messages that need to be sent through its port, and these detection messages are sent to the other cluster nodes in the cluster through the same port and the same network device.

S100C, for each thread, receiving the feedback messages sent by other cluster nodes in the cluster through the port corresponding to the thread.

A cluster node that receives a detection message constructs a feedback message and sends it back. Cluster node 1 then receives, on the corresponding port, the feedback messages that the other cluster nodes return for its detection messages. Because the feedback messages are received and processed by the threads set up independently for each port, cluster node 1 avoids the detection flow being blocked by services that would otherwise share a thread with detection, which improves the efficiency of fault detection.
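Steps S100A to S100C can be sketched with one dedicated thread per local port, as below. This is an illustration under assumptions: `probe_worker`, `run_probes`, and the `send` callable are hypothetical names, and `send(local_port, target)` stands in for whatever transport sends a detection message and reports whether feedback arrived within the preset period.

```python
import queue
import threading


def probe_worker(local_port, targets, send, results):
    """Run in a dedicated thread: probe every target through one local port."""
    for target in targets:
        # send() is a hypothetical transport callable; True means feedback
        # came back within the preset period.
        results.put((local_port, target, send(local_port, target)))


def run_probes(plan, send):
    """Start one thread per local port and collect all probe results.

    plan: {local_port: [target, ...]}
    Returns a list of (local_port, target, got_feedback) tuples.
    """
    results = queue.Queue()
    threads = [
        threading.Thread(target=probe_worker, args=(port, targets, send, results))
        for port, targets in plan.items()
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()  # wait until every port's probes have finished
    collected = []
    while not results.empty():
        collected.append(results.get())
    return collected
```

Because each port has its own thread, a slow or blocked probe on one port does not delay probing on the other, and none of the probe threads is shared with service processing.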

Correspondingly, the present application provides a node failure determining apparatus, applied to a cluster node, as shown in fig. 4, including:

Optionally, the apparatus further includes:

Correspondingly, the present application provides a server, applied in a cluster, as shown in fig. 5, including: a processor, a machine-readable storage medium, and at least two ports;

It will be understood that the present description is not limited to the precise arrangements described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof.

The above description is only for the purpose of illustrating the preferred embodiments of the present disclosure and is not to be construed as limiting the present disclosure, and any modifications, equivalents, improvements, etc. made within the spirit and principle of the present disclosure are intended to be included within the scope of the present disclosure.

Claims

1. A node fault determination method is applied to cluster nodes and comprises the following steps:

sending a detection message to the outside through at least two ports, and receiving feedback messages sent by other cluster nodes in a cluster aiming at the detection message;

2. The method of claim 1, wherein the election identifier is an Internet Protocol (IP) address, and the preset condition is an IP address maximum value or an IP address minimum value; or,

the election identification is an equipment identification, and the preset condition is an equipment identification maximum value or an equipment identification minimum value.

3. The method according to claim 1, wherein after the sending out the detection packet through the at least two ports and receiving the feedback packet sent by other cluster nodes in the cluster for the detection packet, the method further comprises:

4. The method of claim 1, wherein the sending the detection packet outside through the at least two ports and receiving the feedback packet sent by other cluster nodes in the cluster for the detection packet comprises:

5. The method according to any of claims 1-4, wherein the detection message and the feedback message are generated based on ICMP protocol.

6. A node fault determination device is applied to cluster nodes and comprises the following components:

the interaction unit is used for sending a detection message to the outside through at least two ports and receiving feedback messages sent by other cluster nodes in the cluster aiming at the detection message;

7. The apparatus of claim 6, wherein the election identifier is an Internet Protocol (IP) address, and the preset condition is a maximum IP address value or a minimum IP address value; or,

8. The apparatus of claim 6, further comprising:

9. The apparatus according to claim 6, wherein the sending out the detection packet through the at least two ports and receiving the feedback packet sent by other cluster nodes in the cluster for the detection packet comprises:

10. The apparatus according to any of claims 6-9, wherein the detection message and the feedback message are generated based on an ICMP protocol.

11. A server, applied in a cluster, comprising: a processor, a machine-readable storage medium, and at least two ports;

the machine-readable storage medium stores machine-executable instructions executable by the processor, the processor being caused by the machine-executable instructions to: carrying out the method steps of any one of claims 1 to 5.