CN116684256B - Node fault monitoring method, device and system, electronic equipment and storage medium - Google Patents

Node fault monitoring method, device and system, electronic equipment and storage medium

Info

Publication number
CN116684256B
Authority
CN
China
Prior art keywords
node
communication state
current
network communication
heartbeat
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310955919.5A
Other languages
Chinese (zh)
Other versions
CN116684256A (en)
Inventor
张烨
贺计文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202310955919.5A
Publication of CN116684256A
Application granted
Publication of CN116684256B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0677 Localisation of faults
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/08 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L43/0805 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability
    • H04L43/0811 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability by checking connectivity
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/10 Active monitoring, e.g. heartbeat, ping or trace-route
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Environmental & Geological Engineering (AREA)
  • Health & Medical Sciences (AREA)
  • Cardiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention provides a node fault monitoring method, device and system, an electronic device and a storage medium, and relates to the technical field of computers. The method comprises the following steps: sending a first heartbeat message to a second node in a distributed cluster system; receiving a second heartbeat message returned by the second node, and acquiring, according to the second heartbeat message, the current heartbeat timeout times with the second node and a current network communication state table of the second node, the second heartbeat message being a response message to the first heartbeat message; and acquiring a fault monitoring result of the second node according to the current heartbeat timeout times and the current network communication state table. The invention enables faulty nodes to be identified accurately when the network is in a sub-health state and prevents normal nodes from undergoing failover and failure recovery because of misjudgment, thereby improving the stability and reliability of node detection and, in turn, the stability, security and reliability of the cluster.

Description

Node fault monitoring method, device and system, electronic equipment and storage medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method, an apparatus, a system, an electronic device, and a storage medium for monitoring node failures.
Background
The distributed cluster system is a cluster formed by a plurality of node servers, each node runs a processing program, and when the network state of one or a plurality of nodes is in a fault state, the performance of the whole distributed cluster system is affected. Therefore, how to efficiently and accurately monitor the failed node is an important issue to be solved in the industry.
In the related art, PING (Packet Internet Groper) or heartbeat monitoring is generally used to determine, point to point, whether another node sends response information to the local node within a preset duration, and thereby whether that node is abnormal. When the network is in a sub-health state, however, the network connection state is unstable. If the CTDB (Cluster Trivial Database) in a node whose own network is abnormal detects through PING or heartbeat monitoring that the response information from other nodes is lost, it mistakenly concludes that the other nodes have failed. As a result, node fault detection accuracy is low, which in turn affects the stability and reliability of the cluster system.
Disclosure of Invention
The invention provides a node fault monitoring method, a device, a system, electronic equipment and a storage medium, which are used for solving the defects that in the prior art, the node fault detection precision is low, and further the stability and reliability of a cluster are affected, and improving the node fault detection precision, thereby improving the stability and reliability of a cluster system.
The invention provides a node fault monitoring method, which is applied to a first node in a distributed cluster system and comprises the following steps:
sending a first heartbeat message to a second node in the distributed cluster system;
receiving a second heartbeat message returned by the second node, and acquiring, according to the second heartbeat message, the current heartbeat timeout times with the second node and a current network communication state table of the second node; the second heartbeat message is a response message to the first heartbeat message;
and acquiring a fault monitoring result of the second node according to the current heartbeat timeout times and the current network communication state table.
According to the node fault monitoring method provided by the invention, acquiring the fault monitoring result of the second node according to the current heartbeat timeout times and the current network communication state table comprises:
comparing the current heartbeat timeout times with a times threshold to obtain a first comparison result;
when the first comparison result shows that the current heartbeat timeout times are greater than the times threshold, judging, according to the current network communication state table, whether the network communication state between at least one third node in the distributed cluster system and the second node is a normal state;
obtaining the fault monitoring result of the second node according to the judging result;
the third node is a network node except the first node and the second node in the distributed cluster system.
According to the node fault monitoring method provided by the invention, obtaining the fault monitoring result of the second node according to the judging result comprises:
when the judging result shows that no third node in the distributed cluster system has a normal network communication state with the second node, determining that the fault monitoring result of the second node is a fault state.
According to the node fault monitoring method provided by the invention, obtaining the fault monitoring result of the second node according to the judging result comprises:
when the judging result shows that the network communication state between at least one third node in the distributed cluster system and the second node is a normal state, acquiring the number of referenceable nodes corresponding to the second node;
obtaining the fault monitoring result of the second node according to the number of referenceable nodes;
a referenceable node is a node that provides, within a preset period, a response message used for updating the current network communication state table of the second node.
According to the node fault monitoring method provided by the invention, obtaining the fault monitoring result of the second node according to the number of referenceable nodes comprises:
comparing the number of referenceable nodes with a number threshold to obtain a second comparison result;
and when the second comparison result shows that the number of referenceable nodes is greater than the number threshold, determining that the fault monitoring result of the second node is a normal state.
According to the node fault monitoring method provided by the invention, the method further comprises the following steps:
triggering an isolation action when the second comparison result shows that the number of referenceable nodes is greater than the number threshold;
the isolation action is used for isolating the first node from the other network nodes in the distributed cluster system, or for isolating the network port of the first node from the network ports of the other network nodes.
According to the node fault monitoring method provided by the invention, the method further comprises the following steps:
and determining that the fault monitoring result of the second node is a fault state under the condition that the number of the referenceable nodes is less than or equal to the number threshold according to the second comparison result.
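The decision steps above can be read as a short chain of checks. The following Python sketch is only an illustration under stated assumptions: the function name, the argument names and the dictionary layout are hypothetical, and the branch taken when the timeout times do not exceed the times threshold is not spelled out in this passage, so treating it as a normal state is an assumption.

```python
from enum import Enum


class Result(Enum):
    NORMAL = "normal"
    FAULT = "fault"


def monitor_second_node(timeout_count, times_threshold,
                        third_node_states, referenceable_count,
                        number_threshold):
    """Return (result for the second node, whether to isolate the first node).

    third_node_states maps each third node's name to True when its link to
    the second node is recorded as normal in the current state table.
    """
    # Assumption: this branch is not described in the text; treated as normal.
    if timeout_count <= times_threshold:
        return Result.NORMAL, False
    # No third node has a normal link to the second node: fault state.
    if not any(third_node_states.values()):
        return Result.FAULT, False
    # Enough referenceable nodes: the second node is normal, and the
    # isolation action is triggered for the first node.
    if referenceable_count > number_threshold:
        return Result.NORMAL, True
    # Referenceable nodes less than or equal to the threshold: fault state.
    return Result.FAULT, False
```

Returning the isolation flag together with the result keeps the two outcomes of the "greater than the number threshold" branch visible in one place; a real implementation might trigger the isolation action separately.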
According to the node fault monitoring method provided by the invention, the method further comprises the following steps:
under the condition that the fault monitoring result of the second node is determined to be in a fault state, a fourth node is obtained in the distributed cluster system; the fourth node is a network node with the same service function as the second node and the fault monitoring result is in a normal state;
migrating the task to be processed of the second node to the fourth node;
and under the condition that the fault monitoring result of the second node is switched from the fault state to the normal state, recovering the task to be processed to the second node.
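The migration and recovery steps can be sketched as follows. This is a minimal illustration, not the patented implementation: the dictionaries `nodes`, `tasks` and `migrated`, the `service` and `state` keys, and both function names are hypothetical, and handing back only the tasks still pending on the fourth node is an assumption.

```python
def pick_fourth_node(nodes, second):
    """Pick a normal node offering the same service function as `second`."""
    for name, info in nodes.items():
        if (name != second
                and info["service"] == nodes[second]["service"]
                and info["state"] == "normal"):
            return name
    return None


def on_state_change(nodes, tasks, second, new_state, migrated):
    """Migrate tasks on fault; recover them when the node turns normal again.

    `tasks` maps node name -> list of pending tasks; `migrated` maps the
    second node's name -> (fourth node, tasks moved there).
    """
    old_state = nodes[second]["state"]
    nodes[second]["state"] = new_state
    if new_state == "fault" and old_state != "fault":
        fourth = pick_fourth_node(nodes, second)
        if fourth is not None:
            moved = tasks[second]
            tasks[second] = []
            tasks[fourth] = tasks[fourth] + moved
            migrated[second] = (fourth, moved)
    elif new_state == "normal" and old_state == "fault" and second in migrated:
        fourth, moved = migrated.pop(second)
        # Hand back only the tasks that are still pending on the fourth node.
        still_pending = [t for t in moved if t in tasks[fourth]]
        tasks[fourth] = [t for t in tasks[fourth] if t not in still_pending]
        tasks[second] = tasks[second] + still_pending
```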
According to the node fault monitoring method provided by the invention, acquiring the current heartbeat timeout times with the second node and the current network communication state table of the second node according to the second heartbeat message comprises:
parsing the second heartbeat message to obtain the current network communication state table;
determining the current network communication state between the first node and the second node according to the current network communication state table;
updating the count value of a heartbeat timeout counter according to the current network communication state;
and acquiring the current heartbeat timeout times according to the updated count value.
According to the node fault monitoring method provided by the invention, updating the count value of the heartbeat timeout counter according to the current network communication state comprises:
when the current network communication state is determined to be an abnormal communication state, adding 1 to the count value of the heartbeat timeout counter.
According to the node fault monitoring method provided by the invention, updating the count value of the heartbeat timeout counter according to the current network communication state comprises:
when the current network communication state is determined to be a normal communication state, keeping the count value of the heartbeat timeout counter unchanged.
According to the node fault monitoring method provided by the invention, determining the current network communication state between the first node and the second node according to the current network communication state table comprises:
searching the current network communication state table for communication information between the first node and the second node;
and when the search result is empty, determining that the current network communication state is an abnormal communication state.
According to the node fault monitoring method provided by the invention, the method further comprises the following steps:
when the search result shows that the communication information is found, determining, according to the communication information, whether the connection with the second node is disconnected;
and when the connection with the second node is determined to be disconnected, determining that the current network communication state is an abnormal communication state.
According to the node fault monitoring method provided by the invention, the method further comprises the following steps:
and under the condition of determining normal connection with the second node, determining that the current network communication state is a normal communication state.
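The table lookup and the counter update described above can be sketched together. The table layout (a mapping from peer name to a record with a `connected` flag), the function name `link_state` and the class name are assumptions made for illustration; the text does not define the on-wire format of the state table.

```python
def link_state(state_table, first_node):
    """Classify the first node's link to the second node.

    `state_table` is the second node's current table, parsed from the second
    heartbeat message: it maps a peer name to {"connected": bool}.
    """
    entry = state_table.get(first_node)
    if entry is None:              # empty search result
        return "abnormal"
    if not entry["connected"]:     # entry found, but the link is disconnected
        return "abnormal"
    return "normal"


class HeartbeatTimeoutCounter:
    def __init__(self):
        self.count = 0

    def update(self, state):
        # Add 1 on an abnormal state; leave the count unchanged on a normal one.
        if state == "abnormal":
            self.count += 1
        return self.count
```

Note that, as in the text, a normal state leaves the count unchanged rather than resetting it.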
According to the node fault monitoring method provided by the invention, sending the first heartbeat message to the second node in the distributed cluster system comprises:
generating a target network communication state table according to the network communication states between the first node and each network node in the distributed cluster system;
generating the first heartbeat message according to the target network communication state table;
and sending the first heartbeat message to the second node under the condition that the time interval between the current time and the last sending time meets a time interval threshold value.
According to the node fault monitoring method provided by the invention, the method further comprises the following steps:
updating the target network communication state table according to the current network communication state table;
acquiring a target fault monitoring result according to the updated target network communication state table and the current heartbeat timeout times; and the target fault monitoring result is the fault monitoring result of the first node.
The invention also provides a node fault monitoring device, which is applied to a first node in the distributed cluster system and comprises:
a sending thread, configured to send a first heartbeat message to a second node in the distributed cluster system;
a receiving thread, configured to receive a second heartbeat message returned by the second node, and to acquire, according to the second heartbeat message, the current heartbeat timeout times with the second node and a current network communication state table of the second node; the second heartbeat message is a response message to the first heartbeat message;
and a detection thread, configured to acquire the fault monitoring result of the second node according to the current heartbeat timeout times and the current network communication state table.
The invention also provides a node fault monitoring system, which comprises a distributed cluster system;
the distributed cluster system comprises a first node, a plurality of second nodes and a cluster trivial database;
the cluster trivial database is used for providing network communication state detection service for the first node and the second node;
the first node is configured to perform the node fault monitoring method as described in any of the above.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the node fault monitoring method as described above when executing the program.
The invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a node fault monitoring method as described in any of the above.
The invention also provides a computer program product comprising a computer program which when executed by a processor implements a node fault monitoring method as described in any of the above.
According to the node fault monitoring method, device, system, electronic equipment and storage medium provided by the invention, the first node sends the first heartbeat message to the second node and receives the second heartbeat message returned by the second node, so that the current network communication state table between the second node and each network node is synchronized to the first node. The current heartbeat timeout times are then obtained according to the current network communication state table, and the fault monitoring result of the second node is obtained from the current network communication state table and the current heartbeat timeout times together. In this way, the CTDB accurately identifies faulty nodes when the network is in a sub-health state and avoids performing failover and failure recovery on normal nodes because of misjudgment, which improves the stability and reliability of node detection and, in turn, the stability, security and reliability of the cluster.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a first schematic flow chart of the node fault monitoring method provided by the present invention;
FIG. 2 is a second schematic flow chart of the node fault monitoring method provided by the present invention;
FIG. 3 is a third schematic flow chart of the node fault monitoring method provided by the present invention;
FIG. 4 is a fourth schematic flow chart of the node fault monitoring method provided by the present invention;
FIG. 5 is a timing diagram of interactions between network nodes provided by the present invention;
FIG. 6 is a schematic structural diagram of a node fault monitoring device provided by the present invention;
FIG. 7 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
A state in which the network operates normally and recovers quickly after an external impact is called a network health state, and a state in which the network is paralyzed and cannot operate normally is called an unhealthy state. A state in which the network can operate normally but has an extremely low capability of resisting risks, is easily paralyzed by a sudden network risk, and is then difficult to recover for a long time is called a network sub-health state. Generally, the networks of many large and medium-sized enterprises are in a sub-health state, so how to efficiently and accurately monitor network nodes in the sub-health state is an important issue to be solved in the industry.
In the related art, CTDB service is generally adopted to determine whether other nodes are abnormal nodes by determining whether other nodes send response information to the node through PING or heartbeat monitoring. However, because the CTDB lacks logic for judging the sub-health state of the network port, in the sub-health state of the network, the CTDB in the node with network abnormality monitors that the response information transmitted by other nodes is lost through PING or heartbeat, and the other nodes are mistakenly considered to have faults, so that the fault switching and recovery actions of the other nodes are triggered, and the stability and reliability of the cluster are affected. Therefore, in a state of sub-health of the network, the CTDB has a problem of misjudging a failed node, so that the node failure detection accuracy is low, thereby affecting the stability and reliability of the cluster system.
Based on the above problems, the embodiments of the present application provide a method, an apparatus, a system, an electronic device, and a storage medium for node fault monitoring by the CTDB in a distributed cluster system when the network is in a sub-health state. Heartbeat messages transmitted between nodes synchronize the network communication state records of the nodes; the network communication state table and the heartbeat timeout times are obtained by parsing the heartbeat messages; and several successive judgments are then made according to the network communication state table and the heartbeat timeout times. In this way, the CTDB accurately identifies faulty nodes when the network is in a sub-health state and avoids performing failover and failure recovery on normal nodes because of misjudgment, thereby improving the stability and reliability of node detection and, in turn, the stability, security and reliability of the cluster.
The node failure monitoring method of the present application is described below in conjunction with fig. 1-5.
Fig. 1 is a schematic flow chart of a node fault monitoring method according to an embodiment of the present application. The method may be applied to a distributed cluster system including a first node and a plurality of second nodes. The nodes in the distributed cluster system may be connected wirelessly, based on technologies such as Wi-Fi (Wireless Fidelity) or Bluetooth, or by wire, through technologies such as Universal Serial Bus (USB); the embodiment of the application does not limit the connection mode between the nodes.
Each node has the same CTDB service in operation, namely the CTDB service is distributed service, so that the network state is mutually detected among the nodes, and the specific detection method is to send a heartbeat message or a network sub-health detection message.
Each node comprises at least three threads, namely a sending thread, a receiving thread and a detecting thread; the sending thread is used for sending information, such as heartbeat messages, to the opposite terminal; the receiving thread is used for receiving information transmitted by the opposite terminal, such as a heartbeat message and the like, and the detecting thread is used for detecting faults of the local terminal node or the opposite terminal node. It should be noted that the at least three threads may run synchronously or asynchronously, which is not specifically limited in this embodiment.
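The three-thread layout can be illustrated with Python's standard `threading` and `queue` modules. This is only a sketch under assumptions: the function `run_node`, the in-memory lists standing in for the network, and the `None` sentinel are hypothetical, and here the threads run concurrently and hand messages over through a queue, which is one of the options the text leaves open.

```python
import queue
import threading


def run_node(outgoing, incoming_wire):
    """Run one send/receive/detect round and return what the detector saw."""
    sent, received, results = [], queue.Queue(), []

    def sending_thread():
        for msg in outgoing:              # send messages to the peer
            sent.append(msg)

    def receiving_thread():
        for msg in incoming_wire:         # receive messages from the peer
            received.put(msg)
        received.put(None)                # sentinel: nothing more to read

    def detecting_thread():
        while True:
            msg = received.get()          # blocks until a message arrives
            if msg is None:
                break
            results.append(("checked", msg))

    threads = [threading.Thread(target=t)
               for t in (sending_thread, receiving_thread, detecting_thread)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sent, results
```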
The node may be an electronic device, which may be mobile or non-mobile. By way of example, the mobile electronic device may be a cell phone, tablet computer, notebook computer, palmtop computer, vehicle-mounted electronic device, wearable device, ultra-mobile personal computer (UMPC), netbook or personal digital assistant (PDA), and the non-mobile electronic device may be a server, network attached storage (NAS), personal computer (PC), television (TV), teller machine or self-service machine; the present invention is not limited in this respect.
It will be appreciated that, based on the connection between the nodes, communication may be performed between the nodes, specifically, heartbeat packets may be transmitted between the nodes to perform node fault monitoring, or service data may be transmitted between the nodes to perform corresponding service processing, which is not limited in this embodiment specifically.
The first node is a network node currently executing node fault monitoring, and can be determined in the distributed cluster system randomly or according to a preset rule; the second node is one or more network nodes in the distributed cluster system, other than the first node, on which fault monitoring needs to be performed. The node fault monitoring method provided in this embodiment is described below taking a single second node as an example; when there are multiple second nodes, fault monitoring can be performed on each of the other second nodes in the same way.
The execution body of the embodiment is a first node, as shown in fig. 1, and the method specifically includes the following steps:
Step 101, sending a first heartbeat message to a second node in the distributed cluster system;
Optionally, the first node generates a first heartbeat message in real time, and sends the first heartbeat message to a second node in the distributed cluster system.
The first heartbeat message may be generated based on a node communication state test instruction, or may be generated based on a network communication state table between each network node and the first node in the distributed cluster system, and the generation mode of the first heartbeat message is not specifically limited in this embodiment.
It should be noted that the network communication state table of each node is obtained through detection by the CTDB service on that node. Specifically, the network communication state may be determined by judging whether the heartbeat response duration between the node and each other network node exceeds the response timeout duration corresponding to that other network node. Alternatively, it may be determined by further judging whether the number of times the response timeout duration is exceeded is greater than the times threshold corresponding to that other network node. This avoids errors in the acquired network communication state caused by the different response performance of different network nodes, so that the network communication state table is acquired accurately and the accuracy of node fault detection is further improved.
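A sketch of the second detection variant, with a response timeout duration and a times threshold held per peer, might look as follows. The class name, the use of `None` for a missing response, and the absence of any reset of the timeout tally are assumptions; the text does not say when, or whether, the tally is cleared.

```python
class PeerLinkDetector:
    """Track one peer's heartbeat responses against per-peer limits."""

    def __init__(self, response_timeout, times_threshold):
        self.response_timeout = response_timeout  # seconds, for this peer
        self.times_threshold = times_threshold    # allowed timeouts, for this peer
        self.timeouts = 0

    def record(self, response_duration):
        """Record one response duration (None = no response); return the state."""
        if (response_duration is None
                or response_duration > self.response_timeout):
            self.timeouts += 1
        # Abnormal only once the number of timeouts exceeds this peer's
        # own times threshold.
        return "abnormal" if self.timeouts > self.times_threshold else "normal"
```

Setting `times_threshold` to 0 reduces this to the first variant, in which a single response that exceeds the timeout duration already marks the link as abnormal.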
The first heartbeat message may be sent in real time, that is, sent as soon as it is generated, or sent periodically; the sending mode of the first heartbeat message is not specifically limited in this embodiment.
In some embodiments, sending the first heartbeat message to the second node in the distributed cluster system includes:
generating a target network communication state table according to the network communication states between the first node and each network node in the distributed cluster system;
generating the first heartbeat message according to the target network communication state table;
and sending the first heartbeat message to the second node under the condition that the time interval between the current time and the last sending time meets a time interval threshold value.
Optionally, when the first node performs node fault monitoring, it starts at least three threads, namely a sending thread, a receiving thread and a detecting thread. The sending thread collects the network communication states between the first node and each network node in the distributed cluster system to generate the target network communication state table, and writes the target network communication state table into a heartbeat message to generate the first heartbeat message.
And then judging whether the time interval between the current time and the last transmission time meets a time interval threshold, and if the time interval threshold is met, transmitting a first heartbeat message to the second node. The time interval threshold may be set according to actual requirements, for example, 1 second.
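The generation and interval check can be sketched as follows. Everything named here is hypothetical: the class `HeartbeatSender`, the JSON message layout and the injected `clock` are illustration choices, and "meets the time interval threshold" is read as "at least the threshold has elapsed since the last send", which is an assumption.

```python
import json
import time


class HeartbeatSender:
    def __init__(self, node_name, interval_threshold=1.0, clock=time.monotonic):
        self.node_name = node_name
        self.interval_threshold = interval_threshold  # seconds
        self.clock = clock
        self.last_sent = None

    def build_message(self, link_states):
        """Write the target network communication state table into a message."""
        return json.dumps({"from": self.node_name, "state_table": link_states},
                          sort_keys=True)

    def maybe_send(self, link_states, send):
        """Send only if the interval since the last send meets the threshold."""
        now = self.clock()
        too_soon = (self.last_sent is not None
                    and now - self.last_sent < self.interval_threshold)
        if too_soon:
            return False
        send(self.build_message(link_states))
        self.last_sent = now
        return True
```

Injecting the clock keeps the interval logic testable without sleeping; with the default `time.monotonic` the sender is unaffected by wall-clock adjustments.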
According to the method provided by this embodiment, the first heartbeat message sent to the second node carries the target network communication state table, which is formed from the network communication states between the first node and each network node in the distributed cluster system, so that the network communication states between nodes are shared and synchronized in real time. The second node can update its own current network communication state table in real time according to the target network communication state table, write the current network communication state table into the second heartbeat message, and return the second heartbeat message to the first node as response information. A more reliable response message is thus sent in time, which improves the accuracy of node fault monitoring and, in turn, the stability, security and reliability of the cluster system.
Step 102, receiving a second heartbeat message returned by the second node, and acquiring, according to the second heartbeat message, the current heartbeat timeout times with the second node and the current network communication state table of the second node; the second heartbeat message is a response message of the first heartbeat message;
optionally, after the first heartbeat message is sent to the second node, the second node may parse the target network connection state table of the first node from the first heartbeat message, update its own network connection state table according to the target network connection state table, generate the second heartbeat message, and return the second heartbeat message to the first node as a response.
The receiving thread of the first node can monitor each network node in the distributed cluster system in real time, so that when the second node returns the second heartbeat message in response to the first heartbeat message, the receiving thread can receive the second heartbeat message in real time. After receiving the second heartbeat message, the receiving thread can parse it to acquire the current network communication state table of the second node and transmit the current network communication state table to the detecting thread, and the detecting thread acquires the current heartbeat timeout times according to the current network communication state table.
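The hand-off from the receiving thread to the detecting thread described above might look like the following sketch. The use of in-memory queues, the `None` shutdown sentinel, the dictionary message format, and the simple "any abnormal link" check are all assumptions made for illustration; the embodiment does not specify how the threads exchange data.

```python
import queue
import threading


def receiving_thread(inbox, tables):
    """Parse each reply and pass its state table on to the detecting thread."""
    while True:
        reply = inbox.get()
        if reply is None:        # hypothetical sentinel: stop and propagate shutdown
            tables.put(None)
            return
        tables.put(reply["table"])


def detecting_thread(tables, results):
    """Record, for each table, whether it reports any abnormal link."""
    while True:
        table = tables.get()
        if table is None:
            return
        results.append("abnormal" in table.values())


inbox, tables, results = queue.Queue(), queue.Queue(), []
workers = [
    threading.Thread(target=receiving_thread, args=(inbox, tables)),
    threading.Thread(target=detecting_thread, args=(tables, results)),
]
for worker in workers:
    worker.start()
inbox.put({"table": {"node1": "normal"}})
inbox.put({"table": {"node1": "abnormal"}})
inbox.put(None)
for worker in workers:
    worker.join()
```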
The method for obtaining the current heartbeat timeout times can be that the current network connection state table and the last heartbeat timeout times between the first node and the second node are input into a pre-trained update model, and the current heartbeat timeout times are output; or the preset updating rule is adopted to analyze the current network connection state table so as to update the last heartbeat timeout times according to the analysis result to obtain the current heartbeat timeout times, which is not particularly limited in the embodiment.
And step 103, obtaining a fault monitoring result of the second node according to the current heartbeat timeout times and the current network connection state table.
The fault monitoring result includes a normal state or a fault state.
Optionally, after the detection thread acquires the current heartbeat timeout times and the current network connection state table, the detection thread may combine the current heartbeat timeout times and the current network connection state table to acquire a fault monitoring result of the second node.
Here, the manner of obtaining the fault monitoring result of the second node includes: inputting the current heartbeat timeout times and the current network communication state table into a pre-trained detection model, and outputting a fault monitoring result of the second node by the detection model; or, one or multiple fault monitoring judgment conditions are adopted to judge and analyze the current heartbeat timeout times and the current network connection state table so as to obtain a fault monitoring result of the second node, which is not particularly limited in the embodiment.
According to the node fault monitoring method provided by the embodiment of the application, the first node sends the first heartbeat message to the second node and receives the second heartbeat message sent by the second node, so that the current network communication state table between the second node and each network node is synchronized, the current heartbeat timeout times are further obtained according to the current network communication state table, and the fault monitoring result of the second node is jointly obtained according to the current network communication state table and the current heartbeat timeout times, so that the CTDB accurately analyzes the fault node in the sub-health state of the network, the actions of performing fault switching and fault recovery on the normal node caused by misjudgment are prevented, the stability and reliability of node detection are improved, and the stability, the safety and the reliability of the cluster are further improved.
In some embodiments, the obtaining the fault monitoring result of the second node according to the current heartbeat timeout number and the current network connectivity status table includes:
comparing the current heartbeat timeout times with a times threshold to obtain a first comparison result;
judging whether a network communication state between at least one third node and the second node in the distributed cluster system is a normal state according to the current network communication state table under the condition that the current heartbeat timeout times are larger than the times threshold according to the first comparison result;
obtaining a fault monitoring result of the second node according to the judging result;
the third node is a network node except the first node and the second node in the distributed cluster system.
Optionally, the step of obtaining the fault monitoring result of the second node in step 103 specifically includes:
after the detection thread acquires the current heartbeat timeout times and the current network connection state table, the current heartbeat timeout times is compared with a times threshold to determine whether the current heartbeat timeout times is larger than the times threshold. The times threshold here may be set according to the fault tolerance of the first node.
If the current heartbeat timeout times is larger than the times threshold, this indicates that communication between the second node and the first node is abnormal, and whether the second node is a fault node is further determined according to the current network communication state table.
Optionally, according to the current network connection state table, obtaining a network connection state between the second node and a network node, namely a third node, except the first node and the second node in the distributed cluster system, so as to determine whether the network connection state between the second node and at least one third node is a normal state according to the network connection state, and further determine whether the second node is a fault node according to a judging result.
According to the method provided by the embodiment, the node fault monitoring is carried out in a mode of adopting multiple repeated judgment by judging the current heartbeat timeout times and judging the current network communication state table, so that whether the second node is a fault node or not is accurately analyzed by the CTDB service in a network sub-health state, further, the actions of performing fault switching and fault recovery on the normal node caused by misjudgment are prevented, the node fault detection precision is improved, and the stability and reliability of the cluster system are improved.
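The two checks described above (the times threshold first, then the links between the second node and the third nodes) can be sketched as below. The table layout, a mapping from peer id to "normal" or "abnormal" as seen by the second node, is a hypothetical encoding chosen for the example.

```python
def exceeds_times_threshold(timeout_times, times_threshold):
    """First comparison: has the heartbeat timed out more often than tolerated?"""
    return timeout_times > times_threshold


def third_nodes_with_normal_link(table, first_id, second_id):
    """List third nodes (neither first nor second node) whose link to the second node is normal."""
    return sorted(
        peer
        for peer, state in table.items()
        if peer not in (first_id, second_id) and state == "normal"
    )


# Hypothetical table reported by node2 in its second heartbeat message.
table_of_node2 = {"node1": "abnormal", "node3": "normal", "node4": "abnormal"}
reachable = third_nodes_with_normal_link(table_of_node2, "node1", "node2")
```

An empty result list would correspond to the case in which no third node has a normal link to the second node.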
In some embodiments, the obtaining, according to the determination result, a fault monitoring result of the second node includes:
and under the condition that the network communication state between at least one third node and the second node is not normal in the distributed cluster system according to the judging result, determining that the fault monitoring result of the second node is a fault state.
Optionally, in the case that no third node in the distributed cluster system has a normal network communication state with the second node, that is, the network communication states between all third nodes and the second node are abnormal states, this indicates that communication between the second node and every network node in the distributed cluster system is abnormal, and the second node is determined to be a fault node.
According to the method provided by the embodiment, the second node is determined to be the fault node through judging the current heartbeat timeout times and judging the current network communication state table repeatedly, so that the node fault detection accuracy is higher under the condition that the second node meets the fault condition through the repeated judgment, and the stability and the reliability of the cluster system are improved.
In some embodiments, the obtaining, according to the determination result, a fault monitoring result of the second node includes:
acquiring the number of referenceable nodes corresponding to the second node under the condition that the network communication state between at least one third node and the second node in the distributed cluster system is determined to be in a normal state according to the judging result;
obtaining a fault monitoring result of the second node according to the number of the referenceable nodes;
the referenceable node is configured to provide a response message for updating the current network connection status table of the second node in a preset period.
Optionally, in the case that it is determined that the network communication state between at least one third node and the second node in the distributed cluster system is a normal state, that is, not all of the network communication states between the third nodes and the second node are abnormal states, this indicates that normal communication exists between the second node and some of the network nodes in the distributed cluster system. In order to determine more accurately whether the second node is a fault node, a further fault detection condition needs to be checked.
Optionally, a network node providing a response message for updating the current network connection state table of the second node in the distributed cluster system in a preset period is obtained, that is, a network node which is in normal network communication connection with the second node in the preset period is obtained, so as to obtain the referenceable node. The preset period may include the current period, or include the current period and a plurality of periods before the current period, and may specifically be determined according to the fault tolerance of the first node, which is not specifically limited in this embodiment.
And counting the number of the referenceable nodes, comparing the number of the referenceable nodes with a number threshold to determine whether the number of the referenceable nodes is larger than the number threshold, and determining that the fault monitoring result of the second node is in a normal state when the number of the referenceable nodes is larger than the number threshold. The number threshold may be set according to actual requirements, such as 0.
In some embodiments, the method further comprises:
and determining that the fault monitoring result of the second node is a fault state under the condition that the number of the referenceable nodes is less than or equal to the number threshold according to the second comparison result.
Optionally, in the case that the number of referenceable nodes is less than or equal to the number threshold, the second node is further characterized as being unstable in communication, and the second node is determined to be a failed node.
In the method provided by the embodiment, under the condition that the network communication state between at least one third node and the second node in the distributed cluster system is determined to be in a normal state, whether the number of the referenceable nodes which are in normal network communication connection with the second node in a preset period is larger than the number threshold is further judged, so that the communication stability of the second node is further judged, whether the second node is a fault node is accurately determined, and the stability and reliability of the cluster system are further improved.
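The referenceable-node check described above can be sketched as follows. How the preset period is represented is not specified in this embodiment; the sketch assumes, hypothetically, that the timestamp of each node's latest response is recorded and that a node is referenceable when that timestamp lies within the period.

```python
def referenceable_nodes(last_response_at, now, period_s):
    """Return ids of nodes whose latest response falls inside the preset period."""
    return sorted(
        node for node, stamp in last_response_at.items() if now - stamp <= period_s
    )


def result_from_referenceable(count, number_threshold=0):
    """Normal if more referenceable nodes than the threshold, otherwise fault."""
    return "normal" if count > number_threshold else "fault"


# Hypothetical record: node3 answered 0.5 s ago, node4 answered 8 s ago.
recent = referenceable_nodes({"node3": 9.5, "node4": 2.0}, now=10.0, period_s=3.0)
verdict = result_from_referenceable(len(recent))
```

The default of 0 for `number_threshold` follows the example value given in the text.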
In some embodiments, the method further comprises:
triggering an isolation action if it is determined that the number of referenceable nodes is greater than the number threshold based on the second comparison result;
the isolation action is used for isolating the first node from other network nodes except the first node in the distributed cluster system or isolating the network port of the first node from the network ports of other network nodes.
Optionally, in the case that the number of referenceable nodes is determined to be greater than the number threshold, it may further be determined that the first node is in an unknown state. To improve the accuracy of subsequent node fault detection, an isolation action may be triggered at this time to isolate the first node from the other network nodes in the distributed cluster system, or to isolate the network port of the first node from the network ports of the other network nodes; that is, the first node is set to a Ban (forbidden) state so that it does not participate in the fault detection logic of the other network nodes.
In some embodiments, the method further comprises:
under the condition that the fault monitoring result of the second node is determined to be in a fault state, a fourth node is obtained in the distributed cluster system; the fourth node is a network node with the same service function as the second node and the fault monitoring result is in a normal state;
migrating the task to be processed of the second node to the fourth node;
and under the condition that the fault monitoring result of the second node is switched from the fault state to the normal state, recovering the task to be processed to the second node.
The service function includes one or more of a monitoring function, a data processing function, a data storage function, and a data forwarding function, which is not specifically limited in this embodiment.
Optionally, when the second node is determined to be a fault node according to the fault monitoring result, that is, the fault state, a network node which has the same service function as the second node and is in a normal state can be obtained in the distributed cluster system as a fourth node, so that the second node is subjected to fault switching based on the fourth node, that is, a task to be processed of the second node is migrated to the fourth node, so that the fourth node performs service processing on the task to be processed of the second node, normal operation of the distributed cluster system is maintained, and stability, safety and reliability of the distributed cluster system are ensured.
And under the condition that the fault monitoring result of the second node is switched from the fault state to the normal state, the second node can be subjected to fault recovery, namely, the task to be processed is recovered to the second node so as to continue service processing.
According to the method provided by the embodiment, under the condition that the second node is the fault node, the tasks to be processed of the second node can be quickly migrated and restored, and the stability, safety and reliability of the distributed cluster system are improved.
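The failover and recovery described above can be sketched as below. The node records (`function`, `result`), the list-based task queues, and the choice to return only tasks the fourth node has not yet finished are assumptions made for illustration.

```python
def pick_fourth_node(nodes, failed_id):
    """Find a normal node with the same service function as the failed node."""
    wanted = nodes[failed_id]["function"]
    for node_id, info in nodes.items():
        if node_id != failed_id and info["function"] == wanted and info["result"] == "normal":
            return node_id
    return None


def fail_over(pending, failed_id, fourth_id):
    """Move the failed node's pending tasks to the fourth node; return the moved tasks."""
    moved = pending[failed_id]
    pending[failed_id] = []
    pending[fourth_id] = pending[fourth_id] + moved
    return moved


def recover(pending, failed_id, fourth_id, moved):
    """Give back the moved tasks that the fourth node has not finished yet."""
    unfinished = [task for task in pending[fourth_id] if task in moved]
    pending[fourth_id] = [task for task in pending[fourth_id] if task not in moved]
    pending[failed_id] = pending[failed_id] + unfinished
```

A caller would invoke `fail_over` when the fault monitoring result of the second node becomes the fault state, and `recover` when it switches back to the normal state.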
In some embodiments, the obtaining, according to the second heartbeat packet, a current heartbeat timeout number with the second node, and a current network connectivity status table of the second node includes:
analyzing the second heartbeat message to obtain the current network communication state table;
determining the current network communication state with the second node according to the current network communication state table;
updating the count value of the heartbeat timeout counter according to the current network connection state;
and acquiring the current heartbeat timeout times according to the updated count value.
FIG. 2 is a second flow chart of a node failure monitoring method according to an embodiment of the present application; as shown in fig. 2, the step of obtaining the current heartbeat timeout number and the current network connectivity status table in step 102 includes:
step 1021, analyzing the second heartbeat message to obtain a current network communication state table; acquiring a current network communication state between a first node and a second node from a current network communication state table;
the current network connectivity status may be determined and obtained according to a storage status and/or content of connectivity information between the first node and the second node included in the current network connectivity status table.
In some embodiments, the step of determining the current network connectivity status comprises:
searching the current network communication state table for the communication information with the second node;
and under the condition that the searching result is empty, determining that the current network communication state is an abnormal communication state.
Optionally, a set of communication information between the second node and each network node is obtained from the current network communication state table; then, according to the identifier of the first node and the identifier of the second node, the communication information between the first node and the second node is searched for in the set, and it is determined whether the search result is empty, that is, whether the communication information between the first node and the second node has been deleted. A mapping relation is established in advance between the communication information and either a combined identifier (obtained by directly splicing the identifier of the first node and the identifier of the second node) or a coded identifier (obtained by re-encoding the identifier of the first node and the identifier of the second node).
And under the condition that the communication information between the first node and the second node is deleted, the current network communication state between the first node and the second node is characterized as an abnormal communication state.
In some embodiments, the method further comprises:
if the connection information is found according to the search result, determining whether the connection with the second node is disconnected according to the connection information;
and under the condition that disconnection with the second node is determined, determining that the current network communication state is an abnormal communication state.
Optionally, if the search result is that the connectivity information is found, it is further determined whether the first node and the second node are disconnected according to the content of the connectivity information, and if the first node and the second node are disconnected, the current network connectivity state of the first node and the second node is determined to be an abnormal connectivity state.
In some embodiments, the method further comprises:
and under the condition of determining normal connection with the second node, determining that the current network communication state is a normal communication state.
Optionally, under the condition that the normal connection between the first node and the second node is determined, the current network communication state of the first node and the second node is represented as a normal communication state.
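The three cases above (entry deleted, entry present but disconnected, entry present and connected) can be sketched as follows. The combined key built with a `:` separator and the `disconnected` flag inside each entry are hypothetical; the embodiment only states that the identifiers are spliced or re-encoded and that the content of the communication information indicates disconnection.

```python
def link_key(first_id, second_id):
    """Combined identifier formed by splicing the two node identifiers."""
    return f"{first_id}:{second_id}"


def current_link_state(table, first_id, second_id):
    """Classify the link between the first and second node from the second node's table."""
    info = table.get(link_key(first_id, second_id))
    if info is None:                      # entry deleted: search result is empty
        return "abnormal"
    if info.get("disconnected", False):   # entry present but marked disconnected
        return "abnormal"
    return "normal"


# Hypothetical table parsed from the second heartbeat message.
table = {"node1:node2": {"disconnected": False}, "node3:node2": {"disconnected": True}}
```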
Step 1022, according to the current network connection state, the count value of the heartbeat timeout counter is updated, and the updated count value is used as the current heartbeat timeout times. The so-called heartbeat timeout counter is a counter for recording the number of heartbeat timeout times.
It should be noted that, in each case that the second node is detected as a failed node, the count value of the heartbeat timeout counter needs to be reset to zero, so as to continue to provide the failure detection parameter.
The updating manner may be to input the current network connection state and the count value of the heartbeat timeout counter into a pre-trained updating model, which outputs the current heartbeat timeout times; or to analyze the current network connection state with a preset judging rule and determine, according to the analysis result, how to update the count value of the heartbeat timeout counter, thereby obtaining the current heartbeat timeout times.
In some embodiments, the updating the count value of the heartbeat timeout counter according to the current network connectivity status includes:
and under the condition that the current network communication state is determined to be an abnormal communication state, adding 1 to the count value of the heartbeat timeout counter.
Optionally, for the case that the current network connection state is the abnormal connection state, it may be determined that the transmission duration of the heartbeat message between the first node and the second node is overtime, and at this time, the count value of the heartbeat timeout counter is added up by 1 to obtain the current heartbeat timeout times corresponding to the second node.
In some embodiments, the updating the count value of the heartbeat timeout counter according to the current network connectivity status includes:
and under the condition that the current network communication state is determined to be a normal communication state, keeping the count value of the heartbeat timeout counter unchanged.
Optionally, under the condition that the current network connection state is the normal connection state, it may be determined that the transmission duration of the heartbeat message between the first node and the second node is not overtime, and at this time, the count value of the heartbeat timeout counter is kept unchanged, so as to obtain the current heartbeat timeout times corresponding to the second node.
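The rule-based update of the heartbeat timeout counter described above can be sketched as a small class. The class and method names are hypothetical, and the `reset` method reflects the earlier note that the count value is reset to zero each time the second node is detected as a fault node.

```python
class HeartbeatTimeoutCounter:
    """Counts heartbeat timeouts observed for one monitored node."""

    def __init__(self):
        self.count = 0

    def update(self, link_state):
        """Add 1 on an abnormal state, keep the value on a normal state."""
        if link_state == "abnormal":
            self.count += 1
        return self.count

    def reset(self):
        """Reset to zero, for example after the node was detected as a fault node."""
        self.count = 0
```

Note that under this rule a normal state does not clear earlier timeouts; the value returned by `update` is used as the current heartbeat timeout times.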
According to the method provided by the embodiment, the current network communication state between the first node and the second node is judged, so that the current heartbeat timeout times of the second node can be accurately obtained in real time, and the node is judged repeatedly in combination with the heartbeat timeout times to realize fault monitoring of the node, so that the fault monitoring accuracy is improved, and the stability and reliability of the cluster system are improved.
In some embodiments, the method further comprises:
updating the target network communication state table according to the current network communication state table;
acquiring a target fault monitoring result according to the updated target network communication state table and the current heartbeat timeout times; and the target fault monitoring result is the fault monitoring result of the first node.
FIG. 3 is a third flow chart of a node failure monitoring method according to an embodiment of the present application; as shown in fig. 3, the step of performing fault monitoring on the present node (i.e., the first node) includes:
step 301, obtaining a network connection state between a first node and each network node, and obtaining a target network connection state table of the first node;
step 302, a first heartbeat message carrying a target network connection state table is sent to a second node;
step 303, receiving a second heartbeat message returned by the second node, obtaining the current network connection state table of the second node, and updating the network connection state table of the first node according to the current network connection state table to obtain an updated target network connection state table; acquiring the current heartbeat timeout times with the second node according to the current network communication state table;
and step 304, performing fault monitoring on the node (namely the first node) according to the updated target network connection state table and the current heartbeat timeout times.
Optionally, if the current heartbeat timeout times is greater than the times threshold, it is determined that heartbeat messages transmitted between the second node and the present node have timed out. At this time, whether the network communication state between at least one third node in the distributed cluster system and the first node is a normal state is judged according to the updated target network communication state table. If no such third node exists, the present node is determined to be faulty and failover of the present node is triggered. If such a third node exists, the number of referenceable nodes corresponding to the first node is further obtained and compared with the number threshold: if it is greater than the number threshold, the present node is determined to be in an unknown state, is set to the Ban state, and does not participate in the detection logic of other network nodes; if it is not greater than the number threshold, the present node is determined to be a fault node and failover of the present node is triggered.
According to the method provided by the embodiment, the updated target network communication state table of the first node and the current heartbeat timeout times are used for carrying out repeated judgment, so that real-time fault monitoring can be carried out on the node, whether the first node is a fault node or not can be accurately analyzed under the sub-health state of the network by the CTDB, the actions of performing fault switching and fault recovery on the normal node caused by misjudgment are prevented, the stability and reliability of node detection are improved, and the stability, safety and reliability of the cluster are further improved.
FIG. 4 is a flowchart illustrating a method for monitoring node failure according to an embodiment of the present application; fig. 5 is a timing diagram of interaction between network nodes according to an embodiment of the present application; as shown in fig. 4 and fig. 5, the node fault monitoring method provided by the embodiment is described in detail below, taking node 1 as the first node, node 2 as the second node, and node 3 as another node other than the first node and the second node as an example;
optionally, the sending thread, the receiving thread and the detecting thread deployed in the cluster trivial database of the first node (hereinafter also referred to as the present node or node 1) are started to perform fault monitoring on each second node (hereinafter also referred to as node 2), and the specific steps include:
Step 401, a sending thread generates a first heartbeat message according to the network connection state of the node and each network node;
step 402, a sending thread sends a first heartbeat message;
step 403, the receiving thread receives a second heartbeat message returned by the second node; the second heartbeat message is generated by the second node sending the heartbeat message to each other network node (such as node 3) when receiving the first heartbeat message, and updating the network connection state table of the second node according to the heartbeat message carrying the network connection state table returned by each other network node.
Step 404, the receiving thread analyzes the second heartbeat message, obtains the current network connection state table of the second node, and updates the target network connection state table of the node according to the current network connection state table;
step 405, the detecting thread cyclically judges the heartbeat timeout condition between the node and the second node according to the current network connection state table, which is implemented through steps 406 to 408;
step 406, the detecting thread judges whether the communication information between the node and the second node is deleted or not according to the current network communication state table, or whether the node and the second node are disconnected, if so, step 408 is executed, and if not, step 407 is executed;
Step 407, the detecting thread refreshes the network connection state between the node and the second node to be a normal state, the count value of the heartbeat timeout counter is kept unchanged, and step 409 is executed;
step 408, the detecting thread refreshes the network connection state between the node and the second node to be an abnormal state, the count value of the heartbeat timeout counter is added by 1, and step 409 is executed;
step 409, determining whether the current heartbeat timeout between the node and the second node is greater than a threshold, and if so, executing step 410; if not, determining the second node as a normal node, and exiting;
step 410, judging whether the network communication state between at least one network node and the second node is a normal state, if so, executing step 411, and if not, determining the second node as a fault node, and executing step 413;
step 411, determining the number of referenceable nodes, if the number is greater than 0, determining that the second node is a normal node, executing step 412, if not, determining that the second node is a failure node, executing step 413;
step 412, without performing node failover on the second node, updating the node to the BAN state, and exiting;
Step 413, performing node failover on the second node.
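Steps 406 to 413 above can be condensed into one decision function, sketched below. The function signature, the state strings, and the action labels `none`, `ban_self` and `failover` are hypothetical names introduced for the example; the referenceable-node threshold of 0 is the value used in step 411.

```python
def detect_second_node(link_abnormal, timeout_times, times_threshold,
                       third_node_states, referenceable_count):
    """Return (updated timeout times, result for the second node, action)."""
    # Steps 406 to 408: add 1 on an abnormal link, otherwise keep the value.
    if link_abnormal:
        timeout_times += 1
    # Step 409: at or below the threshold, the second node is treated as normal.
    if timeout_times <= times_threshold:
        return timeout_times, "normal", "none"
    # Step 410: does any other node still have a normal link to the second node?
    if not any(state == "normal" for state in third_node_states.values()):
        return timeout_times, "fault", "failover"       # step 413
    # Step 411: with referenceable nodes, keep the second node and ban this node.
    if referenceable_count > 0:
        return timeout_times, "normal", "ban_self"      # step 412
    return timeout_times, "fault", "failover"           # step 413
```

For example, with a times threshold of 3, a fourth consecutive timeout combined with one normal third-node link and one referenceable node leads to the present node being set to the BAN state rather than to failover of the second node.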
If the node is determined to be a fault node, the node may be isolated by a network port or by a node isolation operation to ensure stability of the distributed cluster system.
In addition, when the first node finds that a heartbeat timeout exists between itself and the second node, it may also notify the other network nodes in real time that a node with a network communication problem has been found, and trigger network sub-health detection between the first node and the problem node. After receiving the message that starts network sub-health detection, each other node triggers network sub-health detection between itself and the problem node. After the network sub-health detection results of the other nodes for the second node are summarized, if more than half of the network nodes detect that the second node has network sub-health, a network port isolation or node isolation action is triggered for the second node, thereby ensuring the stability of the distributed cluster system.
The node fault monitoring device provided by the invention is described below; the node fault monitoring device described below and the node fault monitoring method described above may be referred to in correspondence with each other.
Fig. 6 is a schematic structural diagram of a node fault monitoring device according to an embodiment of the present application. As shown in fig. 6, the apparatus includes:
the sending thread 601 is configured to send a first heartbeat packet to a second node in the distributed cluster system;
the receiving thread 602 is configured to receive a second heartbeat message returned by the second node, and obtain, according to the second heartbeat message, the current heartbeat timeout times with the second node and a current network connection state table of the second node; the second heartbeat message is a response message of the first heartbeat message;
the detecting thread 603 is configured to obtain a fault monitoring result of the second node according to the current heartbeat timeout number and the current network connection state table.
According to the node fault monitoring device provided by the embodiment of the application, the first node sends the first heartbeat message to the second node and receives the second heartbeat message sent by the second node, so that the current network communication state table between the second node and each network node is synchronized, the current heartbeat timeout times are further obtained according to the current network communication state table, and the fault monitoring result of the second node is jointly obtained according to the current network communication state table and the current heartbeat timeout times, so that the CTDB accurately analyzes the fault node in the sub-health state of the network, the actions of performing fault switching and fault recovery on the normal node caused by misjudgment are prevented, the stability and the reliability of node detection are improved, and the stability, the safety and the reliability of the cluster are further improved.
The embodiment of the application also provides a node fault monitoring system, which comprises: a distributed cluster system; the distributed cluster system comprises a first node, a plurality of second nodes and a cluster trivial database;
the cluster trivial database is used for providing network communication state detection service for the first node and the second node;
the first node is used for executing a node fault monitoring method, the method comprising: sending a first heartbeat message to a second node in the distributed cluster system; receiving a second heartbeat message returned by the second node, and obtaining, according to the second heartbeat message, the current number of heartbeat timeouts with the second node and the current network communication state table of the second node, the second heartbeat message being a response message to the first heartbeat message; and obtaining a fault monitoring result of the second node according to the current number of heartbeat timeouts and the current network communication state table.
According to the node fault monitoring system provided by the embodiment of the present application, the first node sends the first heartbeat message to the second node and receives the second heartbeat message returned by the second node, thereby synchronizing the current network communication state table between the second node and each network node. The current number of heartbeat timeouts is then obtained according to the current network communication state table, and the fault monitoring result of the second node is obtained from the current network communication state table and the current number of heartbeat timeouts together. In this way, the cluster trivial database (CTDB) can accurately identify the faulty node when the network is in a sub-health state, failover and failure recovery actions triggered on normal nodes by misjudgment are prevented, the stability and reliability of node detection are improved, and the stability, security and reliability of the cluster are further improved.
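The final step of the method, obtaining the fault monitoring result from the number of heartbeat timeouts and the state table, follows the decision order recited in claims 1 to 5 and can be sketched as below. The names `Result` and `evaluate_second_node`, the dictionary form of the table, and passing the set of referenceable nodes in as an input are assumptions made for illustration. Returning a normal result when the timeouts do not exceed the threshold is likewise an assumption, since that branch is not spelled out in this section.

```python
from enum import Enum
from typing import Dict, Set


class Result(Enum):
    NORMAL = "normal"
    FAULT = "fault"


def evaluate_second_node(timeout_count: int, count_threshold: int,
                         state_table: Dict[str, bool], third_nodes: Set[str],
                         referenceable: Set[str],
                         number_threshold: int) -> Result:
    """Judge the second node from its state table (hypothetical layout:
    node id -> True when the link to the second node is normal)."""
    if timeout_count <= count_threshold:
        return Result.NORMAL  # assumption: not stated explicitly in the text
    # Is at least one third node in a normal state with the second node?
    if not any(state_table.get(node, False) for node in third_nodes):
        return Result.FAULT
    # Enough referenceable nodes means the second node itself is healthy.
    if len(referenceable) > number_threshold:
        return Result.NORMAL
    return Result.FAULT


if __name__ == "__main__":
    table = {"n3": True, "n4": False}
    print(evaluate_second_node(5, 3, table, {"n3", "n4"}, {"n3"}, 0))
```

When the last branch returns a normal result, claim 4 additionally triggers an isolation action on the first node; that action is outside the scope of this sketch.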
Fig. 7 is a schematic diagram of the physical structure of an electronic device. As shown in Fig. 7, the electronic device may include: a processor 701, a communication interface 702, a memory 703 and a communication bus 704, wherein the processor 701, the communication interface 702 and the memory 703 communicate with each other through the communication bus 704. The processor 701 may invoke logic instructions in the memory 703 to perform a node fault monitoring method comprising: sending a first heartbeat message to a second node in the distributed cluster system; receiving a second heartbeat message returned by the second node, and obtaining, according to the second heartbeat message, the current number of heartbeat timeouts with the second node and the current network communication state table of the second node, the second heartbeat message being a response message to the first heartbeat message; and obtaining a fault monitoring result of the second node according to the current number of heartbeat timeouts and the current network communication state table.
Further, the logic instructions in the memory 703 may be implemented in the form of software functional units and, when sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
In another aspect, the present invention also provides a computer program product comprising a computer program that can be stored on a non-transitory computer readable storage medium; when executed by a processor, the computer program performs the node fault monitoring method provided above, the method comprising: sending a first heartbeat message to a second node in the distributed cluster system; receiving a second heartbeat message returned by the second node, and obtaining, according to the second heartbeat message, the current number of heartbeat timeouts with the second node and the current network communication state table of the second node, the second heartbeat message being a response message to the first heartbeat message; and obtaining a fault monitoring result of the second node according to the current number of heartbeat timeouts and the current network communication state table.
In yet another aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the node fault monitoring method provided above, the method comprising: sending a first heartbeat message to a second node in the distributed cluster system; receiving a second heartbeat message returned by the second node, and obtaining, according to the second heartbeat message, the current number of heartbeat timeouts with the second node and the current network communication state table of the second node, the second heartbeat message being a response message to the first heartbeat message; and obtaining a fault monitoring result of the second node according to the current number of heartbeat timeouts and the current network communication state table.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the present invention without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (18)

1. A method for monitoring node failure, applied to a first node in a distributed cluster system, comprising:
sending a first heartbeat message to a second node in the distributed cluster system;
receiving a second heartbeat message returned by the second node, and obtaining, according to the second heartbeat message, the current number of heartbeat timeouts with the second node and the current network communication state table of the second node; the second heartbeat message is a response message to the first heartbeat message;
obtaining a fault monitoring result of the second node according to the current number of heartbeat timeouts and the current network communication state table;
the obtaining the fault monitoring result of the second node according to the current number of heartbeat timeouts and the current network communication state table includes:
comparing the current number of heartbeat timeouts with a count threshold to obtain a first comparison result;
in the case where it is determined according to the first comparison result that the current number of heartbeat timeouts is greater than the count threshold, judging, according to the current network communication state table, whether the network communication state between at least one third node in the distributed cluster system and the second node is a normal state;
obtaining the fault monitoring result of the second node according to the judging result;
the third node is a network node other than the first node and the second node in the distributed cluster system;
the obtaining the fault monitoring result of the second node according to the judging result includes:
in the case where it is determined according to the judging result that there is not at least one third node in the distributed cluster system whose network communication state with the second node is a normal state, determining that the fault monitoring result of the second node is a fault state.
2. The method for monitoring a node failure according to claim 1, wherein the obtaining the failure monitoring result of the second node according to the determination result includes:
acquiring the number of referenceable nodes corresponding to the second node under the condition that the network communication state between at least one third node and the second node in the distributed cluster system is determined to be in a normal state according to the judging result;
obtaining a fault monitoring result of the second node according to the number of the referenceable nodes;
the referenceable node is a node that provides, within a preset period, a response message used for updating the current network communication state table of the second node.
3. The method for monitoring node failure according to claim 2, wherein the obtaining the failure monitoring result of the second node according to the number of referenceable nodes includes:
comparing the number of the referenceable nodes with a number threshold to obtain a second comparison result;
and under the condition that the number of the referenceable nodes is larger than the number threshold according to the second comparison result, determining that the fault monitoring result of the second node is in a normal state.
4. A node failure monitoring method according to claim 3, characterized in that the method further comprises:
triggering an isolation action if it is determined that the number of referenceable nodes is greater than the number threshold based on the second comparison result;
the isolation action is used for isolating the first node from other network nodes except the first node in the distributed cluster system or isolating the network port of the first node from the network ports of other network nodes.
5. A node failure monitoring method according to claim 3, characterized in that the method further comprises:
and determining that the fault monitoring result of the second node is a fault state under the condition that the number of the referenceable nodes is less than or equal to the number threshold according to the second comparison result.
6. The node failure monitoring method of any of claims 1-5, further comprising:
under the condition that the fault monitoring result of the second node is determined to be in a fault state, a fourth node is obtained in the distributed cluster system; the fourth node is a network node with the same service function as the second node and the fault monitoring result is in a normal state;
migrating the task to be processed of the second node to the fourth node;
and under the condition that the fault monitoring result of the second node is switched from the fault state to the normal state, recovering the task to be processed to the second node.
7. The method for monitoring node failure according to any one of claims 1-5, wherein the obtaining, according to the second heartbeat message, the current number of heartbeat timeouts with the second node and the current network communication state table of the second node includes:
analyzing the second heartbeat message to obtain the current network communication state table;
determining the current network communication state with the second node according to the current network communication state table;
updating the count value of the heartbeat timeout counter according to the current network communication state;
and obtaining the current number of heartbeat timeouts according to the updated count value.
8. The method for monitoring node failure according to claim 7, wherein updating the count value of the heartbeat timeout counter according to the current network communication state includes:
and under the condition that the current network communication state is determined to be an abnormal communication state, adding 1 to the count value of the heartbeat timeout counter.
9. The method for monitoring node failure according to claim 7, wherein updating the count value of the heartbeat timeout counter according to the current network communication state includes:
and under the condition that the current network communication state is determined to be a normal communication state, keeping the count value of the heartbeat timeout counter unchanged.
10. The method of claim 7, wherein determining the current network communication state with the second node according to the current network communication state table comprises:
searching the current network communication state table for connectivity information with the second node;
and under the condition that the searching result is empty, determining that the current network communication state is an abnormal communication state.
11. The node failure monitoring method of claim 10, further comprising:
under the condition that the connectivity information is found according to the searching result, determining, according to the connectivity information, whether the connection with the second node is disconnected;
and under the condition that the connection with the second node is determined to be disconnected, determining that the current network communication state is an abnormal communication state.
12. The node failure monitoring method of claim 11, further comprising:
and under the condition of determining normal connection with the second node, determining that the current network communication state is a normal communication state.
13. The method for monitoring node failure according to any of claims 1-5, wherein the sending a first heartbeat message to a second node in the distributed cluster system includes:
generating a target network communication state table according to the network communication state between the first node and each network node in the distributed cluster system;
generating the first heartbeat message according to the target network communication state table;
and sending the first heartbeat message to the second node under the condition that the time interval between the current time and the last sending time meets a time interval threshold value.
14. The node failure monitoring method of claim 13, characterized in that the method further comprises:
updating the target network communication state table according to the current network communication state table;
acquiring a target fault monitoring result according to the updated target network communication state table and the current number of heartbeat timeouts; the target fault monitoring result is the fault monitoring result of the first node.
15. A node failure monitoring apparatus, for use with a first node in a distributed cluster system, comprising:
the sending module is used for sending a first heartbeat message to a second node in the distributed cluster system;
the receiving module is used for receiving a second heartbeat message returned by the second node, and obtaining, according to the second heartbeat message, the current number of heartbeat timeouts with the second node and the current network communication state table of the second node; the second heartbeat message is a response message to the first heartbeat message;
the detection module is used for obtaining a fault monitoring result of the second node according to the current number of heartbeat timeouts and the current network communication state table;
the detection module is specifically configured to:
comparing the current number of heartbeat timeouts with a count threshold to obtain a first comparison result;
in the case where it is determined according to the first comparison result that the current number of heartbeat timeouts is greater than the count threshold, judging, according to the current network communication state table, whether the network communication state between at least one third node in the distributed cluster system and the second node is a normal state;
obtaining the fault monitoring result of the second node according to the judging result;
the third node is a network node other than the first node and the second node in the distributed cluster system;
the detection module is further used for:
in the case where it is determined according to the judging result that there is not at least one third node in the distributed cluster system whose network communication state with the second node is a normal state, determining that the fault monitoring result of the second node is a fault state.
16. A node fault monitoring system, comprising a distributed cluster system;
the distributed cluster system comprises a first node, a plurality of second nodes and a cluster trivial database;
the cluster trivial database is used for providing network communication state detection service for the first node and the second node;
the first node is configured to perform the node failure monitoring method of any of claims 1-14.
17. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the node fault monitoring method of any of claims 1 to 14.
18. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the node fault monitoring method of any of claims 1 to 14.
CN202310955919.5A 2023-08-01 2023-08-01 Node fault monitoring method, device and system, electronic equipment and storage medium Active CN116684256B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310955919.5A CN116684256B (en) 2023-08-01 2023-08-01 Node fault monitoring method, device and system, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN116684256A CN116684256A (en) 2023-09-01
CN116684256B true CN116684256B (en) 2023-11-03

Family

ID=87791323

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310955919.5A Active CN116684256B (en) 2023-08-01 2023-08-01 Node fault monitoring method, device and system, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116684256B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117424791B (en) * 2023-12-18 2024-03-19 国网天津市电力公司信息通信公司 Large-scale power communication network fault diagnosis system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107426051A (en) * 2017-07-19 2017-12-01 北京华云网际科技有限公司 The monitoring method of the working condition of distributed cluster system interior joint, apparatus and system
CN109088794A (en) * 2018-08-20 2018-12-25 郑州云海信息技术有限公司 A kind of fault monitoring method and device of node
CN109218141A (en) * 2018-11-20 2019-01-15 郑州云海信息技术有限公司 A kind of malfunctioning node detection method and relevant apparatus
CN113542052A (en) * 2021-06-07 2021-10-22 新华三信息技术有限公司 Node fault determination method and device and server
CN115102887A (en) * 2022-07-15 2022-09-23 济南浪潮数据技术有限公司 Cluster node monitoring method and related equipment

Also Published As

Publication number Publication date
CN116684256A (en) 2023-09-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant