CN116684256B - Node fault monitoring method, device, system, electronic equipment and storage medium - Google Patents
- Publication number
- CN116684256B (application CN202310955919.5A)
- Authority
- CN
- China
- Prior art keywords
- node
- current
- heartbeat
- fault monitoring
- communication state
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0631—Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0677—Localisation of faults
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/08—Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
- H04L43/0805—Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability
- H04L43/0811—Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability by checking connectivity
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/10—Active monitoring, e.g. heartbeat, ping or trace-route
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/10—Protocols in which an application is distributed across nodes in the network
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Environmental & Geological Engineering (AREA)
- Health & Medical Sciences (AREA)
- Cardiology (AREA)
- General Health & Medical Sciences (AREA)
- Data Exchanges In Wide-Area Networks (AREA)
Abstract
The present invention provides a node fault monitoring method, device, system, electronic equipment and storage medium, and relates to the field of computer technology. The method includes: sending a first heartbeat message to a second node in a distributed cluster system; receiving a second heartbeat message returned by the second node and, according to the second heartbeat message, obtaining the current number of heartbeat timeouts with the second node and the current network connectivity status table of the second node, where the second heartbeat message is a response message to the first heartbeat message; and obtaining the fault monitoring result of the second node according to the current number of heartbeat timeouts and the current network connectivity status table. The invention allows faulty nodes to be identified accurately when the network is in a sub-healthy state and prevents misjudgments from causing failover and fault recovery actions to be performed on normal nodes. This improves the stability and reliability of node detection and, in turn, the stability, security and reliability of the cluster.
Description
Technical Field
The present invention relates to the field of computer technology, and in particular to a node fault monitoring method, device, system, electronic equipment and storage medium.
Background
A distributed cluster system is a cluster composed of multiple node servers, each of which runs a processing program. When the network of one or several nodes is in a fault state, the performance of the entire distributed cluster system is affected. How to detect faulty nodes efficiently and accurately is therefore an important problem that the industry urgently needs to solve.
In the related art, PING (Packet Internet Groper) or heartbeat monitoring is usually used to judge, point to point, whether other nodes send response information to the local node within a preset time period, and thereby whether those other network nodes are abnormal. In a sub-healthy network, however, connections are unstable. If the CTDB (Cluster Trivial Database) on a node with a network abnormality loses the response information that other nodes send via PING or heartbeat monitoring, it mistakenly concludes that those other nodes are faulty. Node fault detection accuracy is therefore low, which in turn affects the stability and reliability of the cluster system.
Summary of the Invention
The present invention provides a node fault monitoring method, device, system, electronic equipment and storage medium to overcome the low node fault detection accuracy of the prior art, which affects the stability and reliability of the cluster. The invention improves node fault detection accuracy and thereby the stability and reliability of the cluster system.
The present invention provides a node fault monitoring method, applied to a first node in a distributed cluster system, including:
sending a first heartbeat message to a second node in the distributed cluster system;
receiving a second heartbeat message returned by the second node and, according to the second heartbeat message, obtaining the current number of heartbeat timeouts with the second node and the current network connectivity status table of the second node, where the second heartbeat message is a response message to the first heartbeat message;
obtaining the fault monitoring result of the second node according to the current number of heartbeat timeouts and the current network connectivity status table.
According to a node fault monitoring method provided by the present invention, obtaining the fault monitoring result of the second node according to the current number of heartbeat timeouts and the current network connectivity status table includes:
comparing the current number of heartbeat timeouts with a count threshold to obtain a first comparison result;
when the first comparison result shows that the current number of heartbeat timeouts is greater than the count threshold, judging, according to the current network connectivity status table, whether the distributed cluster system contains at least one third node whose network connectivity state with the second node is normal;
obtaining the fault monitoring result of the second node according to the judgment result;
where a third node is a network node in the distributed cluster system other than the first node and the second node.
According to a node fault monitoring method provided by the present invention, obtaining the fault monitoring result of the second node according to the judgment result includes:
when the judgment result shows that no third node in the distributed cluster system has a normal network connectivity state with the second node, determining that the fault monitoring result of the second node is a fault state.
According to a node fault monitoring method provided by the present invention, obtaining the fault monitoring result of the second node according to the judgment result includes:
when the judgment result shows that at least one third node in the distributed cluster system has a normal network connectivity state with the second node, obtaining the number of reference nodes corresponding to the second node;
obtaining the fault monitoring result of the second node according to the number of reference nodes;
where a reference node is a node that, within a preset period, provides a response message used to update the current network connectivity status table of the second node.
According to a node fault monitoring method provided by the present invention, obtaining the fault monitoring result of the second node according to the number of reference nodes includes:
comparing the number of reference nodes with a quantity threshold to obtain a second comparison result;
when the second comparison result shows that the number of reference nodes is greater than the quantity threshold, determining that the fault monitoring result of the second node is a normal state.
According to a node fault monitoring method provided by the present invention, the method further includes:
when the second comparison result shows that the number of reference nodes is greater than the quantity threshold, triggering an isolation action;
where the isolation action isolates the first node from the other network nodes in the distributed cluster system, or isolates the network port of the first node from the network ports of those other network nodes.
According to a node fault monitoring method provided by the present invention, the method further includes:
when the second comparison result shows that the number of reference nodes is less than or equal to the quantity threshold, determining that the fault monitoring result of the second node is a fault state.
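The two-stage judgment described above (timeout count first, then third-node connectivity and the number of reference nodes) can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `judge_second_node`, `Verdict` and `State` are hypothetical, the connectivity status table is assumed to be a mapping from peer ID to a boolean (True meaning a normal link as seen by the second node), and the branch for a timeout count at or below the threshold is an assumption, since the text above does not spell it out.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Mapping, Set


class State(Enum):
    NORMAL = "normal"
    FAULT = "fault"


@dataclass(frozen=True)
class Verdict:
    second_node: State
    isolate_first_node: bool = False  # isolation action on the first node


def judge_second_node(
    timeouts: int,
    count_threshold: int,
    table: Mapping[str, bool],   # assumed shape: peer ID -> link is normal
    first: str,
    second: str,
    reference_nodes: Set[str],   # nodes that answered within the preset period
    quantity_threshold: int,
) -> Verdict:
    # First comparison: timeout count against the count threshold.
    if timeouts <= count_threshold:
        # Assumption: below the threshold the second node is treated as normal.
        return Verdict(State.NORMAL)

    # Is any third node (neither first nor second) normally connected?
    third_ok = [n for n, ok in table.items() if n not in (first, second) and ok]
    if not third_ok:
        return Verdict(State.FAULT)

    # Second comparison: number of reference nodes against the quantity threshold.
    if len(reference_nodes) > quantity_threshold:
        # The second node is reachable by enough peers, so it is judged normal
        # and the isolation action is triggered for the first node.
        return Verdict(State.NORMAL, isolate_first_node=True)
    return Verdict(State.FAULT)
```

For example, with five timeouts against a threshold of three and two reference nodes against a quantity threshold of one, the sketch reports the second node as normal and flags the first node for isolation.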
According to a node fault monitoring method provided by the present invention, the method further includes:
when the fault monitoring result of the second node is determined to be a fault state, obtaining a fourth node in the distributed cluster system, where the fourth node is a network node whose fault monitoring result is a normal state and which has the same service function as the second node;
migrating the pending tasks of the second node to the fourth node;
when the fault monitoring result of the second node switches from a fault state to a normal state, restoring the pending tasks to the second node.
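The failover and recovery steps above can be illustrated with a small in-memory model. The `Node` class, its fields, and the functions `fail_over` and `fail_back` are hypothetical names introduced only for this sketch; how tasks are actually represented and moved is not specified in the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    name: str
    service: str                 # service function offered by the node
    healthy: bool = True         # fault monitoring result (True = normal)
    pending: List[str] = field(default_factory=list)


def fail_over(nodes: List[Node], failed: Node) -> Tuple[Optional[Node], List[str]]:
    """Move the failed node's pending tasks to a healthy node with the same service."""
    for candidate in nodes:
        if candidate is not failed and candidate.healthy and candidate.service == failed.service:
            moved = list(failed.pending)
            candidate.pending.extend(moved)
            failed.pending.clear()
            return candidate, moved
    return None, []  # no suitable fourth node was found


def fail_back(origin: Node, target: Node, moved: List[str]) -> None:
    """Return tasks that are still pending on the target once the origin is normal again."""
    for task in moved:
        if task in target.pending:
            target.pending.remove(task)
            origin.pending.append(task)
```

Keeping the list of moved tasks lets the recovery step return only the tasks that came from the second node, rather than everything queued on the fourth node.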
According to a node fault monitoring method provided by the present invention, obtaining, according to the second heartbeat message, the current number of heartbeat timeouts with the second node and the current network connectivity status table of the second node includes:
parsing the second heartbeat message to obtain the current network connectivity status table;
determining the current network connectivity state with the second node according to the current network connectivity status table;
updating the count value of a heartbeat timeout counter according to the current network connectivity state;
obtaining the current number of heartbeat timeouts from the updated count value.
According to a node fault monitoring method provided by the present invention, updating the count value of the heartbeat timeout counter according to the current network connectivity state includes:
when the current network connectivity state is determined to be an abnormal connectivity state, incrementing the count value of the heartbeat timeout counter by 1.
According to a node fault monitoring method provided by the present invention, updating the count value of the heartbeat timeout counter according to the current network connectivity state includes:
when the current network connectivity state is determined to be a normal connectivity state, keeping the count value of the heartbeat timeout counter unchanged.
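The counter update rule above is small enough to show directly. The class name `HeartbeatTimeoutCounter` is hypothetical; the behaviour follows the text, in that an abnormal state adds 1 and a normal state leaves the count unchanged (the text does not describe a reset, so none is modelled here).

```python
class HeartbeatTimeoutCounter:
    """Counts heartbeat timeouts observed for one peer node."""

    def __init__(self) -> None:
        self.count = 0

    def update(self, link_is_normal: bool) -> int:
        # Abnormal connectivity: add 1. Normal connectivity: keep the value.
        if not link_is_normal:
            self.count += 1
        return self.count
```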
According to a node fault monitoring method provided by the present invention, determining the current network connectivity state with the second node according to the current network connectivity status table includes:
searching the current network connectivity status table for connectivity information with the second node;
when the search result is empty, determining that the current network connectivity state is an abnormal connectivity state.
According to a node fault monitoring method provided by the present invention, the method further includes:
when the search finds the connectivity information, determining from the connectivity information whether the connection with the second node is broken;
when the connection with the second node is determined to be broken, determining that the current network connectivity state is an abnormal connectivity state.
According to a node fault monitoring method provided by the present invention, the method further includes:
when the connection with the second node is determined to be normal, determining that the current network connectivity state is a normal connectivity state.
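The lookup rules above reduce to three cases: no entry, an entry that reports a broken connection, and an entry that reports a normal connection. The sketch below assumes the second node's table is keyed by peer ID and that the first node looks up its own ID to find the link it shares with the second node; both the table shape and the name `link_state` are assumptions made for illustration.

```python
from typing import Mapping

NORMAL = "normal"
ABNORMAL = "abnormal"


def link_state(table: Mapping[str, bool], self_id: str) -> str:
    """Classify the link between this node and the node that sent the table."""
    connected = table.get(self_id)
    if connected is None:
        # Search result is empty: no connectivity information was recorded.
        return ABNORMAL
    # An entry exists: it reports either a broken or a normal connection.
    return NORMAL if connected else ABNORMAL
```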
According to a node fault monitoring method provided by the present invention, sending the first heartbeat message to the second node in the distributed cluster system includes:
generating a target network connectivity status table according to the network connectivity state with each network node in the distributed cluster system;
generating the first heartbeat message according to the target network connectivity status table;
when the time interval between the current time and the last sending time meets a time interval threshold, sending the first heartbeat message to the second node.
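As a rough illustration of these sending steps, the sketch below serializes the sender's connectivity table into a heartbeat message and sends it only when enough time has passed since the last send. The JSON layout, the names `build_heartbeat` and `HeartbeatSender`, and the reading of "meets the threshold" as "at least the threshold" are all assumptions; the text does not define a wire format.

```python
import json
import time
from typing import Callable, Mapping, Optional


def build_heartbeat(sender_id: str, table: Mapping[str, bool]) -> bytes:
    """Encode the sender's connectivity table as a heartbeat payload (assumed JSON)."""
    return json.dumps({"sender": sender_id, "table": dict(table)}).encode("utf-8")


class HeartbeatSender:
    def __init__(
        self,
        interval_s: float,
        send: Callable[[bytes], None],          # transport callback, e.g. a socket send
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.interval_s = interval_s
        self.send = send
        self.clock = clock
        self.last_sent: Optional[float] = None

    def maybe_send(self, sender_id: str, table: Mapping[str, bool]) -> bool:
        """Send a heartbeat if the interval since the last send has elapsed."""
        now = self.clock()
        if self.last_sent is not None and now - self.last_sent < self.interval_s:
            return False
        self.send(build_heartbeat(sender_id, table))
        self.last_sent = now
        return True
```

Passing the clock in as a parameter keeps the interval check easy to exercise without real delays.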
According to a node fault monitoring method provided by the present invention, the method further includes:
updating the target network connectivity status table according to the current network connectivity status table;
obtaining a target fault monitoring result according to the updated target network connectivity status table and the current number of heartbeat timeouts, where the target fault monitoring result is the fault monitoring result of the first node.
The present invention also provides a node fault monitoring device, applied to a first node in a distributed cluster system, including:
a sending thread, used to send a first heartbeat message to a second node in the distributed cluster system;
a receiving thread, used to receive a second heartbeat message returned by the second node and, according to the second heartbeat message, to obtain the current number of heartbeat timeouts with the second node and the current network connectivity status table of the second node, where the second heartbeat message is a response message to the first heartbeat message;
a detection thread, used to obtain the fault monitoring result of the second node according to the current number of heartbeat timeouts and the current network connectivity status table.
The present invention also provides a node fault monitoring system, including a distributed cluster system;
the distributed cluster system includes a first node, a plurality of second nodes, and a cluster trivial database;
the cluster trivial database is used to provide a network connectivity state detection service for the first node and the second nodes;
the first node is used to execute the node fault monitoring method described in any one of the above.
The present invention also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the program, it implements any one of the node fault monitoring methods described above.
The present invention also provides a non-transitory computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, it implements any one of the node fault monitoring methods described above.
The present invention also provides a computer program product, which includes a computer program. When the computer program is executed by a processor, it implements any one of the node fault monitoring methods described above.
With the node fault monitoring method, device, system, electronic equipment and storage medium provided by the present invention, the first node sends a first heartbeat message to the second node and receives the second heartbeat message sent by the second node, so as to synchronize the current network connectivity status table between the second node and each network node. The current number of heartbeat timeouts is then obtained from the current network connectivity status table, and the fault monitoring result of the second node is obtained jointly from the table and the timeout count. This enables CTDB to identify faulty nodes accurately when the network is in a sub-healthy state and prevents misjudgments from causing failover and fault recovery actions to be performed on normal nodes, which improves the stability and reliability of node detection and, in turn, the stability, security and reliability of the cluster.
Brief Description of the Drawings
To explain the technical solutions of the present invention or the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Figure 1 is the first flow diagram of the node fault monitoring method provided by the present invention;
Figure 2 is the second flow diagram of the node fault monitoring method provided by the present invention;
Figure 3 is the third flow diagram of the node fault monitoring method provided by the present invention;
Figure 4 is the fourth flow diagram of the node fault monitoring method provided by the present invention;
Figure 5 is a timing diagram of the interactions between network nodes provided by the present invention;
Figure 6 is a schematic structural diagram of the node fault monitoring device provided by the present invention;
Figure 7 is a schematic structural diagram of the electronic device provided by the present invention.
Detailed Description
To make the purpose, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the present invention.
A state in which the network runs normally and recovers quickly after an external shock is called a healthy network state; a state in which the network is paralyzed and cannot run normally is called an unhealthy state. A state in which the network runs normally under ordinary conditions, but has very little ability to withstand risk, is easily paralyzed by a sudden network risk, and then takes a long time to recover, is called a sub-healthy network state. The networks of many large and medium-sized enterprises are in a sub-healthy state, so how to monitor network nodes efficiently and accurately in that state is an important problem that the industry urgently needs to solve.
In the related art, the CTDB service usually uses PING or heartbeat monitoring to judge, point to point, whether other nodes send response information to the local node, and thereby whether those other network nodes are abnormal. CTDB, however, has no logic for judging the sub-health state of a network port. In a sub-healthy network, when the CTDB on a node with a network abnormality loses the response information that other nodes send via PING or heartbeat monitoring, it mistakenly concludes that all the other nodes are faulty. This triggers failover and recovery actions on those nodes and affects the stability and reliability of the cluster. In a sub-healthy network, CTDB therefore misjudges faulty nodes, which lowers node fault detection accuracy and affects the stability and reliability of the cluster system.
To address these problems, the embodiments of this application provide a node fault monitoring method, device, system, electronic equipment and storage medium for CTDB in a distributed cluster system whose network is in a sub-healthy state. In this method, nodes exchange heartbeat messages so that their network connectivity state records stay synchronized. Each heartbeat message is parsed to obtain the network connectivity status table and the number of heartbeat timeouts, and several successive judgments are then made from the table and the timeout count. This enables CTDB to identify faulty nodes accurately in a sub-healthy network and prevents misjudgments from causing failover and fault recovery actions to be performed on normal nodes, which improves the stability and reliability of node detection and, in turn, the stability, security and reliability of the cluster.
The node fault monitoring method of this application is described below with reference to Figures 1 to 5.
Figure 1 is the first flow diagram of the node fault monitoring method provided by an embodiment of this application. The method can be applied to a distributed cluster system containing a first node and multiple second nodes. The connections between the nodes in the distributed cluster system may be wireless connections established with technologies such as Wireless Fidelity (WIFI) or Bluetooth, or wired connections established with technologies such as Universal Serial Bus. The embodiments of this application do not limit how the nodes are connected.
The same CTDB service runs on every node; that is, the CTDB service is a distributed service, so that the nodes can detect each other's network state. Specifically, detection is performed by sending heartbeat messages or network sub-health detection messages.
Each node includes at least three threads: a sending thread, a receiving thread and a detection thread. The sending thread sends information, such as heartbeat messages, to the peer. The receiving thread receives information, such as heartbeat messages, transmitted by the peer. The detection thread performs fault detection on the local node or the peer node. Note that these threads can run synchronously or asynchronously; this embodiment does not specifically limit this.
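The three-thread structure described above can be sketched with Python's standard `threading` and `queue` modules. This is only a loopback demonstration under stated assumptions: the function names, the message layout and the `STOP` sentinel are hypothetical, and a single in-process queue stands in for the network link to the peer node.

```python
import queue
import threading
from typing import List

STOP = object()  # sentinel that tells downstream threads to finish


def sender(wire: queue.Queue, beats: int) -> None:
    """Sending thread: emits heartbeat messages toward the peer."""
    for seq in range(beats):
        wire.put({"type": "heartbeat", "seq": seq})
    wire.put(STOP)


def receiver(wire: queue.Queue, handoff: queue.Queue) -> None:
    """Receiving thread: takes messages off the wire and hands them to detection."""
    while True:
        msg = wire.get()
        handoff.put(msg)
        if msg is STOP:
            break


def detector(handoff: queue.Queue, seen: List[int]) -> None:
    """Detection thread: a real node would run its fault judgment here."""
    while True:
        msg = handoff.get()
        if msg is STOP:
            break
        seen.append(msg["seq"])


def run_demo(beats: int = 3) -> List[int]:
    wire: queue.Queue = queue.Queue()     # loopback stand-in for the network
    handoff: queue.Queue = queue.Queue()  # receiver -> detector hand-off
    seen: List[int] = []
    threads = [
        threading.Thread(target=sender, args=(wire, beats)),
        threading.Thread(target=receiver, args=(wire, handoff)),
        threading.Thread(target=detector, args=(handoff, seen)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen
```

Queues decouple the three threads, so each can run at its own pace, which matches the note that the threads may run asynchronously.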
A node here may be an electronic device, which may be mobile or non-mobile. For example, a mobile electronic device may be a mobile phone, tablet computer, notebook computer, handheld computer, vehicle-mounted electronic device, wearable device, ultra-mobile personal computer (UMPC), netbook or personal digital assistant (PDA). A non-mobile electronic device may be a server, network attached storage (NAS), personal computer (PC), television (TV), teller machine or self-service machine. The present invention does not specifically limit this.
It can be understood that the nodes can communicate over the connections between them. Specifically, the nodes may exchange heartbeat packets for node fault monitoring, or exchange business data for the corresponding business processing. This embodiment does not specifically limit this.
The first node is the network node that currently performs node fault monitoring; it may be chosen in the distributed cluster system at random or according to preset rules. The second node is one or more network nodes in the distributed cluster system, other than the first node, that need to be monitored for faults. The following takes a single second node as an example to describe the node fault monitoring method provided by this embodiment; when there are multiple second nodes, the other second nodes can be monitored in the same way.
The execution subject of this embodiment is the first node. As shown in Figure 1, the method specifically includes the following steps:
Step 101: send a first heartbeat message to a second node in the distributed cluster system;
Optionally, the first node generates the first heartbeat message in real time and sends it to the second node in the distributed cluster system.
The first heartbeat message may be generated based on a node connectivity status test instruction, or based on the network connectivity status table between each network node in the distributed cluster system and the first node. This embodiment does not specifically limit how the first heartbeat message is generated.
It should be noted that each node's network connectivity status table is obtained through detection by its internal CTDB service. Specifically, the detection may determine whether the heartbeat response duration between the node and each other network node exceeds the response timeout duration corresponding to that other network node, so as to determine the network connectivity status; or it may further determine whether the number of times the response timeout duration has been exceeded is greater than the count threshold corresponding to that other network node, so as to determine the network connectivity status. This avoids erroneous network connectivity status caused by differing response performance among network nodes, allows the network connectivity status table to be obtained accurately, and thereby improves the accuracy of node fault detection.
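The per-peer check described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `PeerLimits`, `is_connected`, and `build_status_table`, and the representation of the table as a mapping from peer identifier to a boolean, are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class PeerLimits:
    """Hypothetical per-peer limits; both values are set per network node."""
    response_timeout_s: float  # response timeout duration for this peer
    max_timeouts: int          # count threshold for this peer


def is_connected(response_times_s, limits):
    """Treat the peer as connected unless too many responses were late."""
    exceeded = sum(1 for t in response_times_s if t > limits.response_timeout_s)
    return exceeded <= limits.max_timeouts


def build_status_table(samples, limits_by_peer):
    """Map each peer id to True (normal) or False (abnormal)."""
    return {
        peer: is_connected(times, limits_by_peer[peer])
        for peer, times in samples.items()
    }


samples = {"n2": [0.1, 0.5, 0.1], "n3": [2.0, 2.5, 3.0]}
limits = {"n2": PeerLimits(0.3, 1), "n3": PeerLimits(1.0, 2)}
table = build_status_table(samples, limits)
```

Keeping the limits per peer is what lets a slow but healthy node use a longer timeout than a fast one, which is the point the paragraph above makes about differing response performance.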
The first heartbeat message may be sent in real time, that is, sent as soon as it is generated, or it may be sent periodically. This embodiment does not specifically limit how the first heartbeat message is sent.
In some embodiments, sending the first heartbeat message to the second node in the distributed cluster system includes:
generating a target network connectivity status table according to the network connectivity status with each network node in the distributed cluster system;
generating the first heartbeat message according to the target network connectivity status table;
sending the first heartbeat message to the second node when the time interval between the current time and the last sending time meets the time interval threshold.
Optionally, when performing node monitoring, the first node starts at least three threads: a sending thread, a receiving thread, and a detection thread. Through the sending thread, it collects the network connectivity status between the first node and each network node in the distributed cluster system to generate the target network connectivity status table, and writes the target network connectivity status table into a heartbeat message to generate the first heartbeat message.
Next, it determines whether the time interval between the current time and the last sending time meets the time interval threshold; if so, it sends the first heartbeat message to the second node. The time interval threshold can be set according to actual needs, for example 1 second.
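The sending steps above can be sketched as follows. The class `HeartbeatSender`, the JSON message layout, and the reading of "meets the time interval threshold" as "at least that much time has elapsed" are assumptions made for illustration; the source does not specify a wire format.

```python
import json
import time


class HeartbeatSender:
    """Hypothetical sending-thread helper for the first node."""

    def __init__(self, node_id, interval_s=1.0, clock=time.monotonic):
        self.node_id = node_id
        self.interval_s = interval_s  # time interval threshold, e.g. 1 second
        self.clock = clock
        self.last_sent = None         # time of the last send, None if never sent

    def build_message(self, status_table):
        # Write the target network connectivity status table into the message.
        return json.dumps({"from": self.node_id, "table": status_table},
                          sort_keys=True)

    def maybe_send(self, status_table, send):
        """Send only if the interval since the last send has elapsed."""
        now = self.clock()
        if self.last_sent is not None and now - self.last_sent < self.interval_s:
            return False
        send(self.build_message(status_table))
        self.last_sent = now
        return True


# Usage with a fake clock so the behaviour is deterministic.
ticks = iter([0.0, 0.4, 1.2])
sent = []
sender = HeartbeatSender("n1", interval_s=1.0, clock=lambda: next(ticks))
results = [
    sender.maybe_send({"n2": True}, sent.append),
    sender.maybe_send({"n2": True}, sent.append),
    sender.maybe_send({"n2": False}, sent.append),
]
```

Injecting the clock is only a testing convenience; in a real sending thread the default monotonic clock would be used.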
The method provided by this embodiment sends to the second node the first heartbeat message, which carries the target network connectivity status table formed from the network connectivity status between the first node and each network node in the distributed cluster system, so that the network connection status is shared and synchronized among the nodes in real time. The second node can then update its own table in real time according to the target network connectivity status table to obtain its current network connectivity status table, write the current network connectivity status table into the second heartbeat message, and return the second heartbeat message to the first node as a response. A more reliable response message is thus sent in a timely manner, which improves the accuracy of node fault monitoring and further improves the stability, security, and reliability of the cluster system.
Step 102: receive a second heartbeat message returned by the second node and, according to the second heartbeat message, obtain the current number of heartbeat timeouts with the second node and the current network connectivity status table of the second node; the second heartbeat message is a response message to the first heartbeat message;
Optionally, once the first heartbeat message has been sent to the second node, the second node can parse the first node's target network connectivity status table from the first heartbeat message, update its own network connectivity status table according to the target network connectivity status table, obtain the second heartbeat message, and return the second heartbeat message to the first node as a response.
The receiving thread of the first node can listen to the network nodes in the distributed cluster system in real time, so that when the second node returns the second heartbeat message in response to the first heartbeat message, the second heartbeat message is received in real time. After receiving it, the receiving thread can parse the second heartbeat message to obtain the second node's current network connectivity status table and pass that table to the detection thread, which obtains the current number of heartbeat timeouts according to it.
The current number of heartbeat timeouts may be obtained by inputting the current network connectivity status table and the previous number of heartbeat timeouts between the first node and the second node into a pre-trained update model, which outputs the current number of heartbeat timeouts; or by analyzing the current network connectivity status table with preset update rules and updating the previous number of heartbeat timeouts according to the analysis result. This embodiment does not specifically limit this.
Step 103: obtain the fault monitoring result of the second node according to the current number of heartbeat timeouts and the current network connectivity status table.
The fault monitoring result is either a normal state or a fault state.
Optionally, after obtaining the current number of heartbeat timeouts and the current network connectivity status table, the detection thread can combine the two to obtain the fault monitoring result of the second node.
Here, the fault monitoring result of the second node may be obtained by inputting the current number of heartbeat timeouts and the current network connectivity status table into a pre-trained detection model, which outputs the fault monitoring result of the second node; or by evaluating the current number of heartbeat timeouts and the current network connectivity status table against one or more fault monitoring judgment conditions. This embodiment does not specifically limit this.
In the node fault monitoring method provided by the embodiment of the present application, the first node sends the first heartbeat message to the second node and receives the second heartbeat message sent by the second node, thereby synchronizing the current network connectivity status table between the second node and each network node. The current number of heartbeat timeouts is then obtained according to the current network connectivity status table, and the fault monitoring result of the second node is obtained from the current network connectivity status table and the current number of heartbeat timeouts together. This enables CTDB to identify faulty nodes accurately when the network is in a sub-healthy state and prevents misjudgments that would cause failover and fault recovery actions to be performed on normal nodes, which improves the stability and reliability of node detection and, in turn, the stability, security, and reliability of the cluster.
In some embodiments, obtaining the fault monitoring result of the second node according to the current number of heartbeat timeouts and the current network connectivity status table includes:
comparing the current number of heartbeat timeouts with a count threshold to obtain a first comparison result;
when it is determined, according to the first comparison result, that the current number of heartbeat timeouts is greater than the count threshold, determining, according to the current network connectivity status table, whether there is at least one third node in the distributed cluster system whose network connectivity status with the second node is a normal state;
obtaining the fault monitoring result of the second node according to the judgment result;
wherein the third node is a network node in the distributed cluster system other than the first node and the second node.
Optionally, obtaining the fault monitoring result of the second node in step 103 specifically includes:
After obtaining the current number of heartbeat timeouts and the current network connectivity status table, the detection thread compares the current number of heartbeat timeouts with the count threshold to determine whether the current number of heartbeat timeouts is greater than the count threshold. The count threshold here can be set according to the fault tolerance of the first node.
If the current number of heartbeat timeouts is greater than the count threshold, communication between the second node and the first node is abnormal, and the current network connectivity status table must be consulted to further determine whether the second node is a faulty node.
Optionally, according to the current network connectivity status table, the network connectivity status between the second node and the network nodes in the distributed cluster system other than the first node and the second node, that is, the third nodes, is obtained. It is then determined whether the network connectivity status between the second node and at least one third node is a normal state, and the judgment result is used to further determine whether the second node is a faulty node.
The method provided by this embodiment evaluates both the current number of heartbeat timeouts and the current network connectivity status table, so that node fault monitoring is performed through multiple rounds of re-judgment. This allows the CTDB service to determine accurately whether the second node is faulty when the network is in a sub-healthy state, prevents misjudgments that would cause failover and fault recovery actions to be performed on normal nodes, and improves the accuracy of node fault detection, thereby improving the stability and reliability of the cluster system.
In some embodiments, obtaining the fault monitoring result of the second node according to the judgment result includes:
when it is determined, according to the judgment result, that no third node in the distributed cluster system has a normal network connectivity status with the second node, determining that the fault monitoring result of the second node is the fault state.
Optionally, when no third node in the distributed cluster system has a normal network connectivity status with the second node, that is, when the network connectivity status between every third node and the second node is abnormal, the second node communicates abnormally with every network node in the distributed cluster system, and the second node is determined to be a faulty node.
The method provided by this embodiment applies multiple rounds of re-judgment, evaluating both the current number of heartbeat timeouts and the current network connectivity status table, and determines that the second node is faulty only when these re-judgments establish that it meets the fault condition. This makes node fault detection more accurate and thereby improves the stability and reliability of the cluster system.
In some embodiments, obtaining the fault monitoring result of the second node according to the judgment result includes:
when it is determined, according to the judgment result, that at least one third node in the distributed cluster system has a normal network connectivity status with the second node, obtaining the number of reference nodes corresponding to the second node;
obtaining the fault monitoring result of the second node according to the number of reference nodes;
wherein the reference nodes are the nodes that, within a preset period, provide response messages used to update and obtain the current network connectivity status table of the second node.
Optionally, when at least one third node in the distributed cluster system has a normal network connectivity status with the second node, that is, when not every third node has an abnormal network connectivity status with the second node, the second node communicates normally with some of the network nodes in the distributed cluster system. To determine more accurately whether the second node is a faulty node, the fault detection conditions must be re-judged further.
Optionally, the network nodes in the distributed cluster system that, within the preset period, provide response messages used to update and obtain the current network connectivity status table of the second node are obtained; these are the network nodes that maintain a normal network communication connection with the second node within the preset period, and they are taken as the reference nodes. The preset period here may include the current period, or the current period and multiple periods before it; it can be determined according to the fault tolerance of the first node, which is not specifically limited in this embodiment.
Next, the number of reference nodes is counted and compared with the quantity threshold to determine whether it is greater than the quantity threshold. If the number of reference nodes is greater than the quantity threshold, the fault monitoring result of the second node is determined to be the normal state. The quantity threshold can be set according to actual needs, for example 0.
In some embodiments, the method further includes:
when it is determined, according to the second comparison result, that the number of reference nodes is less than or equal to the quantity threshold, determining that the fault monitoring result of the second node is the fault state.
Optionally, if the number of reference nodes is less than or equal to the quantity threshold, this further indicates that the second node's communication is unstable, and the second node is determined to be a faulty node.
In the method provided by this embodiment, when at least one third node in the distributed cluster system has a normal network connectivity status with the second node, it is further determined whether the number of reference nodes that maintain a normal network communication connection with the second node within the preset period is greater than the quantity threshold. This further assesses the communication stability of the second node, so that whether the second node is a faulty node is determined precisely, which improves the stability and reliability of the cluster system.
In some embodiments, the method further includes:
when it is determined, according to the second comparison result, that the number of reference nodes is greater than the quantity threshold, triggering an isolation action;
wherein the isolation action is used to isolate the first node from the other network nodes in the distributed cluster system, or to isolate the network port of the first node from the network ports of the other network nodes.
Optionally, when the number of reference nodes is determined to be greater than the quantity threshold, it can further be determined that the first node is in an unknown state. To improve the accuracy of subsequent node fault detection, an isolation action can be triggered at this point to isolate the first node from the other network nodes in the distributed cluster system, or to isolate the network port of the first node from the network ports of the other network nodes. That is, the first node is set to the Ban (prohibited) state so that it does not participate in the fault detection logic of the other network nodes.
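The rule-based variant of the decision described across these embodiments can be sketched as one function. This is an illustrative sketch only: the function name `monitor_second_node`, the table layout (peer identifier mapped to `True` for a normal status), and the choice to report the normal state when the timeout count does not exceed the count threshold are assumptions that the source does not spell out.

```python
from enum import Enum


class Result(Enum):
    NORMAL = "normal"
    FAULT = "fault"


def monitor_second_node(timeout_count, count_threshold, second_table,
                        first_id, second_id, reference_nodes,
                        quantity_threshold=0):
    """Return (fault monitoring result, whether to isolate the first node).

    second_table: the second node's current connectivity table, assumed
    here to map peer id -> True (normal) / False (abnormal).
    reference_nodes: peers that answered the second node in the preset period.
    """
    # First judgment: compare heartbeat timeouts with the count threshold.
    if timeout_count <= count_threshold:
        return Result.NORMAL, False  # assumption: no further checks needed

    # Second judgment: does any third node still reach the second node?
    third_ok = any(ok for peer, ok in second_table.items()
                   if peer not in (first_id, second_id))
    if not third_ok:
        return Result.FAULT, False

    # Third judgment: compare the reference node count with the quantity threshold.
    if len(reference_nodes) > quantity_threshold:
        # Second node looks healthy, so the first node is the suspect one.
        return Result.NORMAL, True
    return Result.FAULT, False


healthy = monitor_second_node(2, 3, {}, "n1", "n2", set())
unreachable = monitor_second_node(5, 3, {"n1": False, "n3": False},
                                  "n1", "n2", set())
first_suspect = monitor_second_node(5, 3, {"n1": False, "n3": True},
                                    "n1", "n2", {"n3"})
unstable = monitor_second_node(5, 3, {"n1": False, "n3": True},
                               "n1", "n2", set())
```

Returning the isolation flag alongside the result keeps the two outcomes of the last branch together: the second node is reported as normal while the first node is marked for the Ban state.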
In some embodiments, the method further includes:
when the fault monitoring result of the second node is determined to be the fault state, obtaining a fourth node in the distributed cluster system; the fourth node is a network node whose fault monitoring result is the normal state and which has the same service functions as the second node;
migrating the pending tasks of the second node to the fourth node;
when the fault monitoring result of the second node switches from the fault state to the normal state, restoring the pending tasks to the second node.
The service functions include one or a combination of a monitoring function, a data processing function, a data storage function, and a data forwarding function, which is not specifically limited in this embodiment.
Optionally, when the second node is determined from the fault monitoring result to be a faulty node, that is, to be in the fault state, a network node whose fault monitoring result is the normal state and which has the same service functions as the second node can be obtained from the distributed cluster system as the fourth node. Failover of the second node is then performed based on the fourth node: the pending tasks of the second node are migrated to the fourth node so that the fourth node can process them, which keeps the distributed cluster system running normally and ensures its stability, security, and reliability.
When the fault monitoring result of the second node switches from the fault state to the normal state, fault recovery can be performed on the second node; that is, the pending tasks are restored to the second node so that service processing can continue.
With the method provided by this embodiment, when the second node is a faulty node, its pending tasks can be migrated and restored quickly, which improves the stability, security, and reliability of the distributed cluster system.
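The failover and recovery steps can be sketched with plain dictionaries. The node record layout (`state`, `services`, `pending`, `hosted`) and the helper names are hypothetical; the source does not describe how tasks or service functions are stored, and "same service functions" is read here as an equal set of services.

```python
def pick_fourth_node(nodes, failed_id):
    """Find a normal node offering the same service functions as the failed one."""
    wanted = nodes[failed_id]["services"]
    for node_id, node in nodes.items():
        if (node_id != failed_id and node["state"] == "normal"
                and node["services"] == wanted):
            return node_id
    return None


def fail_over(nodes, failed_id):
    """Migrate the failed node's pending tasks to a fourth node, if one exists."""
    host_id = pick_fourth_node(nodes, failed_id)
    if host_id is None:
        return None
    nodes[host_id]["hosted"][failed_id] = nodes[failed_id]["pending"]
    nodes[failed_id]["pending"] = []
    return host_id


def restore(nodes, recovered_id, host_id):
    """Hand the migrated tasks back once the node is normal again."""
    tasks = nodes[host_id]["hosted"].pop(recovered_id, [])
    nodes[recovered_id]["pending"] = tasks


nodes = {
    "n2": {"state": "fault", "services": {"storage"},
           "pending": ["t1", "t2"], "hosted": {}},
    "n3": {"state": "normal", "services": {"forwarding"},
           "pending": [], "hosted": {}},
    "n4": {"state": "normal", "services": {"storage"},
           "pending": [], "hosted": {}},
}
host = fail_over(nodes, "n2")
hosted_after_failover = list(nodes[host]["hosted"]["n2"])
nodes["n2"]["state"] = "normal"
restore(nodes, "n2", host)
```

Tracking migrated tasks under the original owner's identifier is one simple way to make the later restore step unambiguous; other bookkeeping schemes would work equally well.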
In some embodiments, obtaining, according to the second heartbeat message, the current number of heartbeat timeouts with the second node and the current network connectivity status table of the second node includes:
parsing the second heartbeat message to obtain the current network connectivity status table;
determining the current network connectivity status with the second node according to the current network connectivity status table;
updating the count value of a heartbeat timeout counter according to the current network connectivity status;
obtaining the current number of heartbeat timeouts according to the updated count value.
Figure 2 is the second schematic flowchart of the node fault monitoring method provided by an embodiment of the present application. As shown in Figure 2, obtaining the current number of heartbeat timeouts and the current network connectivity status table in step 102 includes:
Step 1021: parse the second heartbeat message to obtain the current network connectivity status table, and obtain the current network connectivity status between the first node and the second node from the current network connectivity status table;
The current network connectivity status here can be determined from the storage state and/or content of the connectivity information between the first node and the second node included in the current network connectivity status table.
In some embodiments, determining the current network connectivity status includes:
looking up, in the current network connectivity status table, the connectivity information with the second node;
when the lookup result is empty, determining that the current network connectivity status is an abnormal connectivity state.
Optionally, the set of connectivity information between the second node and each network node is obtained from the current network connectivity status table. Then, according to the identifier of the first node and the identifier of the second node, the connectivity information between the first node and the second node is looked up in that set, and it is determined whether the lookup result is empty, that is, whether the connectivity information between the first node and the second node has been deleted. A mapping is established in advance between the connectivity information and either a combined identifier (formed by directly concatenating the identifier of the first node and the identifier of the second node) or an encoded identifier (obtained by re-encoding the identifier of the first node and the identifier of the second node).
If the connectivity information between the first node and the second node has been deleted, the current network connectivity status between the first node and the second node is an abnormal connectivity state.
In some embodiments, the method further includes:
when the lookup result is that the connectivity information is found, determining, according to the connectivity information, whether the connection with the second node is disconnected;
when it is determined that the connection with the second node is disconnected, determining that the current network connectivity status is an abnormal connectivity state.
Optionally, when the lookup finds the connectivity information, the content of the connectivity information must be examined further to determine whether the first node and the second node are disconnected. If they are disconnected, the current network connectivity status between the first node and the second node is determined to be an abnormal connectivity state.
In some embodiments, the method further includes:
when it is determined that the connection with the second node is normal, determining that the current network connectivity status is a normal connectivity state.
Optionally, if the first node and the second node are determined to be normally connected, the current network connectivity status between the first node and the second node is a normal connectivity state.
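The three-way lookup described above can be sketched as follows. The key format produced by `pair_key` (concatenation with a separator) and the `disconnected` field inside the connectivity information are hypothetical; the source only says that the information is keyed by a combined or encoded identifier and that its content indicates whether the connection is broken.

```python
def pair_key(first_id, second_id):
    """Hypothetical combined identifier built from the two node ids."""
    return f"{first_id}-{second_id}"


def current_connectivity(table, first_id, second_id):
    """Return 'normal' or 'abnormal' for the first node / second node pair."""
    info = table.get(pair_key(first_id, second_id))
    if info is None:
        # Lookup result is empty: the entry was deleted.
        return "abnormal"
    if info.get("disconnected", False):
        # Entry exists but its content says the link is down.
        return "abnormal"
    return "normal"


table = {
    "n1-n2": {"disconnected": False},
    "n1-n3": {"disconnected": True},
}
states = [
    current_connectivity(table, "n1", "n2"),
    current_connectivity(table, "n1", "n3"),
    current_connectivity(table, "n1", "n4"),
]
```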
Step 1022: update the count value of the heartbeat timeout counter according to the current network connectivity status, and take the updated count value as the current number of heartbeat timeouts. The heartbeat timeout counter is a counter used to record the number of heartbeat timeouts.
It should be noted that each time the second node is detected to be a faulty node, the count value of the heartbeat timeout counter must be reset to zero so that it can continue to supply fault detection parameters afterwards.
The update may be performed by inputting the current network connectivity status and the count value of the heartbeat timeout counter into a pre-trained update model, which outputs the current number of heartbeat timeouts; or by evaluating the current network connectivity status with preset judgment rules and using the result to decide how to update the count value of the heartbeat timeout counter, thereby obtaining the current number of heartbeat timeouts.
In some embodiments, updating the count value of the heartbeat timeout counter according to the current network connectivity status includes:
when the current network connectivity status is determined to be an abnormal connectivity state, incrementing the count value of the heartbeat timeout counter by 1.
Optionally, when the current network connectivity status is an abnormal connectivity state, it can be determined that heartbeat message transmission between the first node and the second node has timed out. The count value of the heartbeat timeout counter is then incremented by 1 to obtain the current number of heartbeat timeouts corresponding to the second node.
In some embodiments, updating the count value of the heartbeat timeout counter according to the current network connectivity status includes:
when the current network connectivity status is determined to be a normal connectivity state, keeping the count value of the heartbeat timeout counter unchanged.
Optionally, when the current network connectivity status is a normal connectivity state, it can be determined that heartbeat message transmission between the first node and the second node has not timed out. The count value of the heartbeat timeout counter is then kept unchanged to obtain the current number of heartbeat timeouts corresponding to the second node.
本实施例提供的方法,通过对第一节点与第二节点之间的当前网络连通状态进行判断,即可实时精准地获取第二当前心跳超时次数,并联合此心跳超时次数对节点进行多重复判以实现节点的故障监测,由此提高故障监测准确性,提高集群系统的稳定性和可靠性。The method provided in this embodiment can accurately obtain the second current heartbeat timeout number in real time by judging the current network connectivity status between the first node and the second node, and combine the heartbeat timeout number to perform multiple repetitions on the node. It can realize node fault monitoring, thereby improving the accuracy of fault monitoring and improving the stability and reliability of the cluster system.
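The rule-based variant of the counter update described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `LinkState`, `classify_link`, and `HeartbeatTimeoutCounter`, and the representation of a table row as a dictionary from node ID to a boolean, are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class LinkState(Enum):
    NORMAL = "normal"
    ABNORMAL = "abnormal"


def classify_link(peer_row: dict[str, bool], local_id: str) -> LinkState:
    """Classify the link to the local node from the peer's table row (hypothetical layout).

    A missing entry (connectivity information deleted) and an explicit False
    (disconnected) are both treated as the abnormal connectivity state.
    """
    if peer_row.get(local_id) is True:
        return LinkState.NORMAL
    return LinkState.ABNORMAL


@dataclass
class HeartbeatTimeoutCounter:
    """Records the number of heartbeat timeouts observed for one second node."""

    count: int = 0

    def update(self, state: LinkState) -> int:
        # Abnormal state: add 1. Normal state: keep the count unchanged.
        if state is LinkState.ABNORMAL:
            self.count += 1
        return self.count

    def reset(self) -> None:
        # Reset to zero once the second node has been judged faulty.
        self.count = 0
```

The counter is deliberately not cleared on a normal observation, because the text only states that the value is kept unchanged in that case and reset when a fault is detected.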
In some embodiments, the method further includes:
updating the target network connectivity status table according to the current network connectivity status table;
obtaining a target fault monitoring result according to the updated target network connectivity status table and the current heartbeat timeout count, where the target fault monitoring result is the fault monitoring result of the first node.
Figure 3 is the third schematic flowchart of the node fault monitoring method provided by an embodiment of the present application. As shown in Figure 3, fault monitoring of the local node (that is, the first node) includes the following steps:
Step 301: obtain the network connectivity state between the first node and each network node to obtain the target network connectivity status table of the first node;
Step 302: send a first heartbeat message carrying the target network connectivity status table to the second node;
Step 303: receive the second heartbeat message returned by the second node and obtain the current network connectivity status table of the second node; update the network connectivity status table of the first node according to the current network connectivity status table to obtain the updated target network connectivity status table; and obtain the current heartbeat timeout count with the second node according to the current network connectivity status table;
Step 304: perform fault monitoring on the local node (that is, the first node) according to the updated target network connectivity status table and the current heartbeat timeout count.
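Steps 302 and 303 exchange heartbeat messages that carry a connectivity status table. The sketch below shows one way such a message and the table update could look. The JSON encoding, the `Heartbeat` class, and the merge rule in `merge_tables` (rows reported by the sender overwrite the matching local entries) are all assumptions for illustration; the source does not specify a wire format or a merge rule.

```python
import json
from dataclasses import dataclass, field

# Hypothetical layout: node ID -> {peer ID -> True if the link is connected}.
ConnectivityTable = dict[str, dict[str, bool]]


@dataclass
class Heartbeat:
    """A heartbeat message that carries the sender's connectivity status table."""

    sender: str
    table: ConnectivityTable = field(default_factory=dict)

    def encode(self) -> bytes:
        payload = {"sender": self.sender, "table": self.table}
        return json.dumps(payload).encode("utf-8")

    @staticmethod
    def decode(raw: bytes) -> "Heartbeat":
        data = json.loads(raw.decode("utf-8"))
        return Heartbeat(sender=data["sender"], table=data["table"])


def merge_tables(target: ConnectivityTable, current: ConnectivityTable) -> ConnectivityTable:
    """Return a copy of the target table updated with the rows in the received table."""
    merged = {node: dict(row) for node, row in target.items()}
    for node, row in current.items():
        merged.setdefault(node, {}).update(row)
    return merged
```

Returning a new table instead of mutating the target keeps the previous state available, for example for logging what changed between two heartbeats.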
Optionally, when the current heartbeat timeout count is greater than the count threshold, it is determined that the heartbeat messages transmitted between the second node and the local node have timed out. The updated target network connectivity status table is then used to determine whether the distributed cluster system contains at least one third node whose network connectivity state with the first node is normal. If no such node exists, the local node is determined to be faulty and failover of the local node is triggered. If such a node exists, the number of reference nodes available to the first node is obtained and compared with the quantity threshold. If the number is greater than the quantity threshold, the local node is determined to be in an indeterminate state; it is set to the Ban state and no longer takes part in the detection logic of other network nodes. If the number is not greater than the quantity threshold, the local node is determined to be a faulty node and failover of the local node is triggered.
With the method of this embodiment, multiple re-checks based on the updated target network connectivity status table of the first node and the current heartbeat timeout count allow the local node to be monitored for faults in real time. CTDB can therefore determine accurately, even when the network is in a sub-healthy state, whether the first node is a faulty node, which prevents misjudgments from causing failover and fault recovery to be executed on a normal node. This improves the stability and reliability of node detection and, in turn, the stability, security, and reliability of the cluster.
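The self-monitoring decision described above can be sketched as a small function. The names `SelfVerdict` and `monitor_self` and the parameter layout are hypothetical. The sketch also assumes that "third node" means any node other than the second node, and that no action is taken while the timeout count is at or below the threshold; the source only describes the case where the threshold is exceeded.

```python
from enum import Enum


class SelfVerdict(Enum):
    NO_ACTION = "threshold not exceeded"
    BAN = "indeterminate state: set local node to Ban"
    FAULTY = "local node faulty: trigger failover"


def monitor_self(
    timeouts: int,
    timeout_threshold: int,
    local_links: dict[str, bool],
    second_node: str,
    reference_nodes: int,
    reference_threshold: int,
) -> SelfVerdict:
    """Judge the local (first) node from its own links and its timeout count."""
    if timeouts <= timeout_threshold:
        return SelfVerdict.NO_ACTION
    # Is there at least one third node with a normal link to the local node?
    third_node_ok = any(
        connected for node, connected in local_links.items() if node != second_node
    )
    if not third_node_ok:
        return SelfVerdict.FAULTY
    # Enough reference nodes: the state cannot be decided, so the node is banned.
    if reference_nodes > reference_threshold:
        return SelfVerdict.BAN
    return SelfVerdict.FAULTY
```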
Figure 4 is the fourth schematic flowchart of the node fault monitoring method provided by an embodiment of the present application, and Figure 5 is a sequence diagram of the interaction between network nodes provided by an embodiment of the present application. As shown in Figures 4 and 5, the node fault monitoring method of this embodiment is described below, taking node 1 as the first node, node 2 as the second node, and node 3 as a node other than the first node and the second node.
Optionally, the sending thread, receiving thread, and detection thread deployed in the Cluster Trivial Database (CTDB) of the first node (hereinafter also called the local node or node 1) are started to perform fault monitoring on each second node (hereinafter also called node 2). The specific steps are:
Step 401: the sending thread generates the first heartbeat message according to the network connection status between the local node and each network node;
Step 402: the sending thread sends the first heartbeat message;
Step 403: the receiving thread receives the second heartbeat message returned by the second node. The second heartbeat message is generated as follows: on receiving the first heartbeat message, the second node sends heartbeat messages to each of the other network nodes (such as node 3), and then updates its own network connection status table according to the heartbeat messages, each carrying a network connection status table, that those nodes return;
Step 404: the receiving thread parses the second heartbeat message, obtains the current network connectivity status table of the second node, and updates the target network connectivity status table of the local node according to the current network connectivity status table;
Step 405: the detection thread repeatedly checks the heartbeat timeout condition between the local node and the second node according to the current network connectivity status table, as implemented by steps 406 to 408;
Step 406: the detection thread uses the current network connectivity status table to determine whether the connectivity information between the local node and the second node has been deleted, or whether the local node and the second node are disconnected. If the information has been deleted or the nodes are disconnected, go to step 408; otherwise, go to step 407;
Step 407: the detection thread refreshes the network connection status between the local node and the second node to the normal state, keeps the count value of the heartbeat timeout counter unchanged, and goes to step 409;
Step 408: the detection thread refreshes the network connection status between the local node and the second node to the abnormal state, increments the count value of the heartbeat timeout counter by 1, and goes to step 409;
Step 409: determine whether the current heartbeat timeout count between the local node and the second node is greater than the count threshold. If it is, go to step 410; otherwise, determine that the second node is a normal node and exit;
Step 410: determine whether at least one network node has a normal network connectivity state with the second node. If so, go to step 411; if not, determine that the second node is a faulty node and go to step 413;
Step 411: check the number of reference nodes. If the number is greater than 0, determine that the second node is a normal node and go to step 412; otherwise, determine that the second node is a faulty node and go to step 413;
Step 412: do not perform node failover on the second node, update the local node to the BAN state, and exit;
Step 413: perform node failover on the second node.
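The branch structure of steps 409 to 413 can be condensed into a single function. This is a sketch under stated assumptions: `Outcome`, `judge_second_node`, and the `peer_links` dictionary (the links that other network nodes report with the second node) are names and structures invented for the example.

```python
from enum import Enum


class Outcome(Enum):
    PEER_NORMAL = "second node is normal; exit"
    BAN_LOCAL = "no failover; local node set to BAN"
    FAILOVER_PEER = "perform node failover on the second node"


def judge_second_node(
    timeouts: int,
    threshold: int,
    peer_links: dict[str, bool],
    reference_nodes: int,
) -> Outcome:
    """Apply steps 409 to 413 to one second node."""
    # Step 409: below or at the threshold, the second node is treated as normal.
    if timeouts <= threshold:
        return Outcome.PEER_NORMAL
    # Step 410: no network node can reach the second node, so it is faulty.
    if not any(peer_links.values()):
        return Outcome.FAILOVER_PEER
    # Step 411: with at least one reference node, the second node is considered
    # normal and the local node is banned instead (step 412).
    if reference_nodes > 0:
        return Outcome.BAN_LOCAL
    # Step 413: otherwise fail over the second node.
    return Outcome.FAILOVER_PEER
```

For example, `judge_second_node(5, 3, {"node3": True}, 1)` yields `Outcome.BAN_LOCAL`: node 3 can still reach node 2, so the problem is attributed to the local node's own link and no failover is performed on node 2.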
Note that if the local node is determined to be a faulty node, network port isolation or node isolation may be applied to it to ensure the stability of the distributed cluster system.
In addition, when the first node finds a heartbeat timeout with the second node, it may notify the other network nodes in real time that it has found a node with which its network communication is impaired, and start network sub-health detection between itself and the problem node. On receiving the message to enable network sub-health detection, each of the other nodes also starts network sub-health detection between itself and the problem node. The detection results of the other nodes with respect to the second node are then aggregated. If more than half of the network nodes detect that the second node is in a network sub-healthy state, network port isolation or node isolation of the second node is triggered to ensure the stability of the distributed cluster system.
The node fault monitoring device provided by the present invention is described below. The device described below and the node fault monitoring method described above may be read with reference to each other.
Figure 6 is a schematic structural diagram of the node fault monitoring device provided by an embodiment of the present application. As shown in Figure 6, the device includes:
a sending thread 601, configured to send a first heartbeat message to a second node in the distributed cluster system;
a receiving thread 602, configured to receive the second heartbeat message returned by the second node and, according to the second heartbeat message, obtain the current heartbeat timeout count with the second node and the current network connectivity status table of the second node, where the second heartbeat message is the response message to the first heartbeat message;
a detection thread 603, configured to obtain the fault monitoring result of the second node according to the current heartbeat timeout count and the current network connectivity status table.
In the node fault monitoring device provided by the embodiment of the present application, the first node sends the first heartbeat message to the second node and receives the second heartbeat message sent by the second node, thereby synchronizing the current network connectivity status table between the second node and each network node. The current heartbeat timeout count is then obtained from the current network connectivity status table, and the table and the count are used together to obtain the fault monitoring result of the second node. CTDB can therefore identify faulty nodes accurately even when the network is in a sub-healthy state, which prevents misjudgments from causing failover and fault recovery to be executed on a normal node. This improves the stability and reliability of node detection and, in turn, the stability, security, and reliability of the cluster.
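The division of work among the sending, receiving, and detection threads can be illustrated with a small in-process simulation. Everything here is an assumption made for the sketch: the network is replaced by a queue and a `peer` callable that returns the second heartbeat message, the node IDs `node1` and `node2` are fixed, and the detection thread only reports the running timeout count rather than a full monitoring result.

```python
import queue
import threading
from typing import Callable

STOP = object()  # sentinel that tells a thread to finish


def run_monitor(peer: Callable[[dict], dict], local_table: dict, rounds: int) -> list[int]:
    """Run the three threads for a number of rounds; return the timeout count per round."""
    sent: queue.Queue = queue.Queue()    # sending thread -> receiving thread (simulated network)
    parsed: queue.Queue = queue.Queue()  # receiving thread -> detection thread
    counts: list[int] = []

    def sending_thread() -> None:
        for _ in range(rounds):
            sent.put({"sender": "node1", "table": local_table})  # first heartbeat message
        sent.put(STOP)

    def receiving_thread() -> None:
        while (msg := sent.get()) is not STOP:
            reply = peer(msg)           # second heartbeat message (the response)
            parsed.put(reply["table"])  # table of the second node
        parsed.put(STOP)

    def detection_thread() -> None:
        timeouts = 0
        while (table := parsed.get()) is not STOP:
            # Missing or False entry: abnormal state, so the counter is incremented.
            if table.get("node2", {}).get("node1") is not True:
                timeouts += 1
            counts.append(timeouts)

    threads = [
        threading.Thread(target=fn)
        for fn in (sending_thread, receiving_thread, detection_thread)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counts
```

Passing data through queues keeps each thread's responsibility separate, which mirrors the split into units 601, 602, and 603; a real deployment would replace the `peer` callable with actual network I/O.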
An embodiment of the present application further provides a node fault monitoring system. The system includes a distributed cluster system, and the distributed cluster system includes a first node, a plurality of second nodes, and a Cluster Trivial Database (CTDB).
The CTDB is configured to provide a network connectivity state detection service for the first node and the second nodes.
The first node is configured to execute the node fault monitoring method, which includes: sending a first heartbeat message to a second node in the distributed cluster system; receiving the second heartbeat message returned by the second node and, according to the second heartbeat message, obtaining the current heartbeat timeout count with the second node and the current network connectivity status table of the second node, where the second heartbeat message is the response message to the first heartbeat message; and obtaining the fault monitoring result of the second node according to the current heartbeat timeout count and the current network connectivity status table.
In the node fault monitoring system provided by the embodiment of the present application, the first node sends the first heartbeat message to the second node and receives the second heartbeat message sent by the second node, thereby synchronizing the current network connectivity status table between the second node and each network node. The current heartbeat timeout count is then obtained from the current network connectivity status table, and the table and the count are used together to obtain the fault monitoring result of the second node. CTDB can therefore identify faulty nodes accurately even when the network is in a sub-healthy state, which prevents misjudgments from causing failover and fault recovery to be executed on a normal node. This improves the stability and reliability of node detection and, in turn, the stability, security, and reliability of the cluster.
Figure 7 is a schematic diagram of the physical structure of an electronic device. As shown in Figure 7, the electronic device may include a processor 701, a communications interface 702, a memory 703, and a communication bus 704, where the processor 701, the communications interface 702, and the memory 703 communicate with one another over the communication bus 704. The processor 701 may call logic instructions in the memory 703 to execute the node fault monitoring method, which includes: sending a first heartbeat message to a second node in the distributed cluster system; receiving the second heartbeat message returned by the second node and, according to the second heartbeat message, obtaining the current heartbeat timeout count with the second node and the current network connectivity status table of the second node, where the second heartbeat message is the response message to the first heartbeat message; and obtaining the fault monitoring result of the second node according to the current heartbeat timeout count and the current network connectivity status table.
In addition, the logic instructions in the memory 703 may be implemented in the form of software functional units and, when sold or used as an independent product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the embodiments of the present invention. The storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
In another aspect, the present invention further provides a computer program product. The computer program product includes a computer program that can be stored on a non-transitory computer-readable storage medium. When the computer program is executed by a processor, the computer can execute the node fault monitoring method provided by the methods above, which includes: sending a first heartbeat message to a second node in the distributed cluster system; receiving the second heartbeat message returned by the second node and, according to the second heartbeat message, obtaining the current heartbeat timeout count with the second node and the current network connectivity status table of the second node, where the second heartbeat message is the response message to the first heartbeat message; and obtaining the fault monitoring result of the second node according to the current heartbeat timeout count and the current network connectivity status table.
In yet another aspect, the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements the node fault monitoring method provided by the methods above, which includes: sending a first heartbeat message to a second node in the distributed cluster system; receiving the second heartbeat message returned by the second node and, according to the second heartbeat message, obtaining the current heartbeat timeout count with the second node and the current network connectivity status table of the second node, where the second heartbeat message is the response message to the first heartbeat message; and obtaining the fault monitoring result of the second node according to the current heartbeat timeout count and the current network connectivity status table.
The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. A person of ordinary skill in the art can understand and implement this without creative effort.
From the description of the embodiments above, a person skilled in the art can clearly understand that each embodiment may be implemented by software together with a necessary general-purpose hardware platform, or by hardware. Based on this understanding, the technical solution above in essence, or the part of it that contributes to the prior art, may be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments or in certain parts of the embodiments.
Finally, it should be noted that the embodiments above are intended only to illustrate the technical solution of the present invention, not to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in those embodiments may still be modified, or some of their technical features may be replaced with equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (18)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310955919.5A CN116684256B (en) | 2023-08-01 | 2023-08-01 | Node fault monitoring method, device, system, electronic equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310955919.5A CN116684256B (en) | 2023-08-01 | 2023-08-01 | Node fault monitoring method, device, system, electronic equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116684256A CN116684256A (en) | 2023-09-01 |
CN116684256B true CN116684256B (en) | 2023-11-03 |
Family
ID=87791323
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310955919.5A Active CN116684256B (en) | 2023-08-01 | 2023-08-01 | Node fault monitoring method, device, system, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116684256B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117424791B (en) * | 2023-12-18 | 2024-03-19 | 国网天津市电力公司信息通信公司 | Large-scale power communication network fault diagnosis system |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107426051A (en) * | 2017-07-19 | 2017-12-01 | 北京华云网际科技有限公司 | The monitoring method of the working condition of distributed cluster system interior joint, apparatus and system |
CN109088794A (en) * | 2018-08-20 | 2018-12-25 | 郑州云海信息技术有限公司 | A kind of fault monitoring method and device of node |
CN109218141A (en) * | 2018-11-20 | 2019-01-15 | 郑州云海信息技术有限公司 | A kind of malfunctioning node detection method and relevant apparatus |
CN113542052A (en) * | 2021-06-07 | 2021-10-22 | 新华三信息技术有限公司 | Node fault determination method and device and server |
CN115102887A (en) * | 2022-07-15 | 2022-09-23 | 济南浪潮数据技术有限公司 | A cluster node monitoring method and related equipment |
Also Published As
Publication number | Publication date |
---|---|
CN116684256A (en) | 2023-09-01 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||