CN115426247B - Fault node processing method and device, storage medium and electronic equipment

Fault node processing method and device, storage medium and electronic equipment

Info

Publication number
CN115426247B
Authority
CN
China
Prior art keywords
target
fault
node
fault node
weight value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211007295.6A
Other languages
Chinese (zh)
Other versions
CN115426247A (en)
Inventor
张志雄
余振
吴政楠
李芳�
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Application filed by Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202211007295.6A
Publication of CN115426247A
Application granted
Publication of CN115426247B

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H04L41/0659 Management of faults, events, alarms or notifications using network fault recovery by isolating or reconfiguring faulty entities
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1097 Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The application discloses a fault node processing method and device, a storage medium, and electronic equipment, and relates to the technical field of cloud computing. The method comprises: monitoring whether a target fault node has been removed from a distributed cluster; when it is detected that the target fault node has been removed from the distributed cluster, determining a target weight value corresponding to the target fault node within a preset time period; when the target weight value is greater than a preset weight value, marking the target fault node to obtain a marked fault node, wherein the marked fault node carries a target identifier; and when it is detected that the fault of the marked fault node has been repaired and the target identifier in the marked fault node has not been cleared, preventing the marked fault node from being added to the distributed cluster. The application addresses the problem in the related art that a fault node is repeatedly isolated from or added back to the cluster within a preset time, which degrades the high availability of the cluster service.

Description

Fault node processing method and device, storage medium and electronic equipment
Technical Field
The application relates to the technical field of cloud computing, in particular to a fault node processing method and device, a storage medium and electronic equipment.
Background
Distributed storage is a data storage technology that builds a virtual storage resource pool from scattered storage resources over a network and stores data across multiple independent devices. It offers high-performance, highly concurrent reads and writes, high availability with automatic fault isolation, dynamic expansion, and automated, intelligent operation and maintenance management. However, because of its distributed deployment, an abnormality at a single point inevitably affects the service of the whole cluster. Traditional distributed storage fault detection relies mainly on node heartbeats: when a node's heartbeat is abnormal, a failover mechanism temporarily isolates the node from the cluster, and the node is automatically added back once it returns to normal. For some faults, such as a node that restarts unexpectedly because its hardware is in a sub-healthy state, the node may, without human intervention, be repeatedly isolated from and added back to the cluster for the same fault over a period of time, continuously affecting the high availability of the whole cluster service.
No effective solution has yet been proposed for the problem in the related art that a fault node is repeatedly isolated from or added back to the cluster within a preset time, resulting in poor high availability of the cluster service.
Disclosure of Invention
The application mainly aims to provide a processing method and device of a fault node, a storage medium and electronic equipment, so as to solve the problem that in the related art, the fault node is repeatedly isolated or added back to a cluster within preset time, so that the high availability of cluster service is poor.
In order to achieve the above object, according to one aspect of the present application, there is provided a method of processing a fault node. The method comprises: monitoring whether a target fault node has been removed from a distributed cluster; when it is detected that the target fault node has been removed from the distributed cluster, determining a target weight value corresponding to the target fault node within a preset time period; when the target weight value is greater than a preset weight value, marking the target fault node to obtain a marked fault node, wherein the marked fault node carries a target identifier; and when it is detected that the fault of the marked fault node has been repaired and the target identifier in the marked fault node has not been cleared, preventing the marked fault node from being added to the distributed cluster.
Further, before determining the target weight value corresponding to the target fault node in the preset time period under the condition that the target fault node is removed in the distributed cluster is monitored, the method further comprises: monitoring a plurality of fault nodes in the distributed cluster; determining a fault type corresponding to each fault node, wherein the fault type at least comprises: server memory failure, server CPU failure; configuring a weight value corresponding to each fault node according to the fault type; and synchronizing the weight value corresponding to each fault node into the management nodes of the distributed cluster.
Further, before determining the target weight value corresponding to the target fault node within the preset time period, the method further comprises: when it is detected that the target fault node has been removed from the distributed cluster, acquiring fault information associated with the target fault node, wherein the fault information at least comprises the fault type of the fault node; and matching the fault type against a plurality of weight values pre-configured in the distributed cluster to obtain an initial weight value of the target fault node.
Further, when it is detected that the target fault node has been removed from the distributed cluster, determining the target weight value corresponding to the target fault node within the preset time period comprises: judging whether the initial weight value is greater than the preset weight value; when the initial weight value is not greater than the preset weight value, counting the number of times the target fault node has been removed from the distributed cluster within the preset time period; and determining the target weight value corresponding to the target fault node within the preset time period according to the number of removals.
Further, determining the target weight value corresponding to the target fault node within the preset time period according to the number of removals comprises: multiplying the number of removals by the removal weight value to obtain a target removal weight value; and adding the initial weight value and the target removal weight value to obtain the target weight value corresponding to the target fault node within the preset time period.
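The calculation described above can be written compactly. The symbols are introduced here only for illustration and do not appear in the original text: $W_0$ is the initial weight value, $n$ is the number of removals within the preset time period, $w_r$ is the removal weight value, $W$ is the target weight value, and $W_{\text{preset}}$ is the preset weight value.

```latex
W = W_0 + n \cdot w_r, \qquad \text{mark the node if } W > W_{\text{preset}}
```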
Further, before preventing the marked fault node from being added to the distributed cluster when it is detected that the fault of the marked fault node has been repaired and the target identifier in the marked fault node has not been cleared, the method further comprises: when it is detected that the fault of the marked fault node has been repaired, detecting whether the target identifier in the marked fault node has been cleared; and if the target identifier has been cleared, obtaining the fault node whose identifier has been cleared and allowing that node to be added to the distributed cluster.
Further, after preventing the marked fault node from being added to the distributed cluster when it is detected that the fault of the marked fault node has been repaired and the target identifier in the marked fault node has not been cleared, the method further comprises: triggering warning information indicating that the target identifier has not been cleared, and sending the warning information to operation and maintenance personnel so that they can clear the target identifier in the marked fault node.
In order to achieve the above object, according to another aspect of the present application, there is provided a processing apparatus for a fault node. The apparatus comprises: a monitoring unit, configured to monitor whether a target fault node has been removed from a distributed cluster; a first determining unit, configured to determine a target weight value corresponding to the target fault node within a preset time period when it is detected that the target fault node has been removed from the distributed cluster; a marking unit, configured to mark the target fault node to obtain a marked fault node when the target weight value is greater than a preset weight value, wherein the marked fault node carries a target identifier; and a blocking unit, configured to prevent the marked fault node from being added to the distributed cluster when it is detected that the fault of the marked fault node has been repaired and the target identifier in the marked fault node has not been cleared.
The application adopts the following steps: monitoring whether a target fault node has been removed from the distributed cluster; when it is detected that the target fault node has been removed from the distributed cluster, determining a target weight value corresponding to the target fault node within a preset time period; when the target weight value is greater than the preset weight value, marking the target fault node to obtain a marked fault node, wherein the marked fault node carries a target identifier; and when it is detected that the fault of the marked fault node has been repaired and the target identifier in the marked fault node has not been cleared, preventing the marked fault node from being added to the distributed cluster. This addresses the problem in the related art that a fault node is repeatedly isolated from or added back to the cluster within a preset time, which degrades the high availability of the cluster service. By preventing a marked fault node from being added to the distributed cluster while its fault has been repaired but its target identifier has not been cleared, the application improves the high availability of the cluster service.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application. In the drawings:
FIG. 1 is a flow chart of a method of handling a failed node provided in accordance with an embodiment of the present application;
FIG. 2 is a schematic diagram I of a processing apparatus for a failed node according to an embodiment of the present application;
FIG. 3 is a schematic diagram II of a processing apparatus for a failed node according to an embodiment of the present application;
FIG. 4 is a schematic diagram of an electronic device for processing a failed node according to an embodiment of the present application.
Detailed Description
It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other. The application will be described in detail below with reference to the drawings in connection with embodiments.
In order that those skilled in the art may better understand the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art based on the embodiments of the present application without inventive effort shall fall within the scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate in order to describe the embodiments of the application herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be noted that, the user information (including, but not limited to, user equipment information, user personal information, etc.) and the data (including, but not limited to, data for presentation, analyzed data, etc.) related to the present disclosure are information and data authorized by the user or sufficiently authorized by each party.
The present application is described below with reference to preferred implementation steps. FIG. 1 is a flowchart of a method for processing a fault node according to an embodiment of the present application. As shown in FIG. 1, the method includes the following steps:
Step S101, monitoring whether a target failure node is removed from the distributed cluster.
Specifically, in the distributed cluster, if an abnormal server is detected, the fault node (i.e., the faulty server) is temporarily isolated from the cluster through a failover mechanism.
Step S102, when it is detected that the target fault node has been removed from the distributed cluster, determining a target weight value corresponding to the target fault node within a preset time period.
For example, suppose the distributed cluster includes 10 servers. If a fault is detected on one of the servers (nodes), a target weight value corresponding to that target server (i.e., the target fault node in the present application) within a preset time period is determined.
Optionally, in the method for processing a fault node provided by the embodiment of the present application, before determining a target weight value corresponding to a target fault node in a preset time period when it is monitored that the target fault node is removed in the distributed cluster, the method further includes: monitoring a plurality of fault nodes in the distributed cluster; determining a fault type corresponding to each fault node, wherein the fault type at least comprises: server memory failure, server CPU failure; configuring a weight value corresponding to each fault node according to the fault type; and synchronizing the weight value corresponding to each fault node into the management nodes of the distributed cluster.
For example, the fault types to be handled and their weights can be selected and configured freely according to the actual situation: a fault node whose fault type is 0x1001 (server memory fault) is assigned a weight of 0.8, a fault node whose fault type is 0x1002 (server CPU fault) is assigned a weight of 0.6, and so on.
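The configuration step above can be sketched as a mapping from fault type code to weight that is copied to each management node. This is a minimal sketch under stated assumptions, not the patented implementation: the codes 0x1001 and 0x1002 and the weights 0.8 and 0.6 come from the example, while `ManagementNode`, `sync_fault_weights`, and the node names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Fault type codes and weights taken from the example in the text.
FAULT_SERVER_MEMORY = 0x1001
FAULT_SERVER_CPU = 0x1002


@dataclass
class ManagementNode:
    """Hypothetical stand-in for a management node holding the synced config."""
    name: str
    fault_weights: Dict[int, float] = field(default_factory=dict)


def sync_fault_weights(config: Dict[int, float], managers: List[ManagementNode]) -> None:
    """Give every management node its own copy of the fault-type weights."""
    for manager in managers:
        manager.fault_weights = dict(config)


config = {FAULT_SERVER_MEMORY: 0.8, FAULT_SERVER_CPU: 0.6}
managers = [ManagementNode("mgr-1"), ManagementNode("mgr-2")]  # assumed names
sync_fault_weights(config, managers)
```

Each management node receives a copy rather than a shared reference, so a later change to the source configuration does not silently alter what a node has already synced.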
Optionally, in the method for processing a fault node provided by the embodiment of the present application, before determining the target weight value corresponding to the target fault node within the preset time period, the method further includes: when it is detected that the target fault node has been removed from the distributed cluster, acquiring fault information associated with the target fault node, wherein the fault information at least comprises the fault type of the fault node; and matching the fault type against a plurality of weight values pre-configured in the distributed cluster to obtain an initial weight value of the target fault node.
For example, the management node manages fault nodes adaptively according to the configured fault types and weight values. When it is detected that fault node A has been removed from (isolated in) the distributed cluster, the fault alarm information related to fault node A is collected automatically, and the fault type in that alarm information is matched against the plurality of pre-configured weight values to obtain the weight value corresponding to fault node A. This weight value is used as the initial weight value in the present application. The plurality of pre-configured weight values may include the weight value corresponding to each fault node in the distributed cluster. Matching the fault information against the pre-configured weight values allows the initial weight value of the target fault node to be obtained quickly.
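The matching step above amounts to a lookup keyed by fault type. The sketch below is illustrative only: the alarm record layout, the function name `initial_weight`, and the choice to return `None` for an unconfigured fault type are assumptions, since the text does not say how unconfigured types are handled.

```python
from typing import Dict, Optional

# Pre-configured weights, using the values from the earlier example.
PRECONFIGURED_WEIGHTS: Dict[int, float] = {0x1001: 0.8, 0x1002: 0.6}


def initial_weight(fault_info: dict, weights: Dict[int, float]) -> Optional[float]:
    """Return the weight configured for the alarm's fault type, or None if none is configured."""
    return weights.get(fault_info.get("fault_type"))


# Hypothetical alarm record collected when node A is removed from the cluster.
alarm = {"node": "A", "fault_type": 0x1002}
print(initial_weight(alarm, PRECONFIGURED_WEIGHTS))  # 0.6
```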
Step S103, marking the target fault node to obtain a marked fault node under the condition that the target weight value is larger than the preset weight value, wherein the marked fault node carries the target mark.
Specifically, before marking the target fault node, it is required to determine whether the initial weight value corresponding to the node is greater than a preset weight value.
Optionally, in the method for processing a fault node provided by the embodiment of the present application, determining, when it is detected that the target fault node has been removed from the distributed cluster, the target weight value corresponding to the target fault node within the preset time period includes: judging whether the initial weight value is greater than the preset weight value; when the initial weight value is not greater than the preset weight value, counting the number of times the target fault node has been removed from the distributed cluster within the preset time period; and determining the target weight value corresponding to the target fault node within the preset time period according to the number of removals.
For example, suppose the initial weight value is 0.6 and the preset weight value is 1. Because the initial weight value is not greater than the preset weight value, the removals are counted: if the node was removed from the distributed cluster twice within the preset time period, the target weight value corresponding to the target fault node within the preset time period is determined from that removal count.
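Counting removals within a preset time period can be sketched with a sliding window of timestamps per node. The text does not specify how the count is kept, so the class `RemovalCounter`, the window length, and the use of plain numeric timestamps are all assumptions; the sketch also assumes removals are recorded in chronological order.

```python
from collections import deque
from typing import Deque, Dict


class RemovalCounter:
    """Hypothetical per-node counter of removals inside a sliding time window."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self._events: Dict[str, Deque[float]] = {}

    def record(self, node_id: str, timestamp: float) -> None:
        """Remember that node_id was removed from the cluster at timestamp."""
        self._events.setdefault(node_id, deque()).append(timestamp)

    def count(self, node_id: str, now: float) -> int:
        """Number of removals of node_id within the last window_seconds."""
        events = self._events.get(node_id, deque())
        # Drop removals that are older than the preset time period.
        while events and now - events[0] > self.window_seconds:
            events.popleft()
        return len(events)


counter = RemovalCounter(window_seconds=3600)  # assumed one-hour window
counter.record("node-A", 0)
counter.record("node-A", 1800)
counter.record("node-A", 4000)
recent = counter.count("node-A", now=4000)  # the removal at t=0 has aged out
```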
Specifically, determining the target weight value corresponding to the target fault node within the preset time period according to the number of removals includes: multiplying the number of removals by the removal weight value to obtain a target removal weight value; and adding the initial weight value and the target removal weight value to obtain the target weight value corresponding to the target fault node within the preset time period.
For example, assume that the initial weight value of the target fault node is 0.8 and that each removal of the node from the distributed cluster within the preset time period carries a removal weight value of 0.2. If the number of removals is 2, the target removal weight value is 0.4, and adding the initial weight value to the target removal weight value gives a target weight value of 1.2. When this target weight value is greater than the preset weight value, the target fault node is marked to obtain the marked fault node; that is, the node is given a high-risk identifier according to its target weight value, so the marked fault node carries the high-risk identifier (the target identifier in the present application).
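The arithmetic in this example can be checked with a few lines of code. This is a sketch under stated assumptions: the function names are hypothetical, the preset weight value of 1.0 is borrowed from the earlier example, and returning the initial weight unchanged when it already exceeds the preset value is an assumption, since the text only describes the counting branch.

```python
def target_weight(initial: float, removals: int, removal_weight: float, preset: float) -> float:
    """Initial weight plus removals * removal_weight, per the described steps."""
    if initial > preset:
        # Assumption: an initial weight above the preset value needs no removal count.
        return initial
    return initial + removals * removal_weight


def should_mark(weight: float, preset: float) -> bool:
    """A node is marked high-risk only when its weight exceeds the preset value."""
    return weight > preset


weight = target_weight(initial=0.8, removals=2, removal_weight=0.2, preset=1.0)
print(round(weight, 2), should_mark(weight, 1.0))  # 1.2 True
```

The result is rounded only for display, because binary floating point does not represent 0.8 + 0.4 as exactly 1.2.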
Step S104, when the fault of the marked fault node is detected to be repaired and the target identification in the marked fault node is not cleared, adding the marked fault node into the distributed cluster is prevented.
Specifically, when a fault node is repaired within the preset time period, it is necessary to detect whether the high-risk identifier in the marked fault node has been cleared, and the detection result determines whether the repaired fault node is added to the distributed cluster. This effectively prevents a fault node carrying the risk identifier from being repeatedly added back to the distributed cluster within the preset time and improves the high-reliability capability of the distributed storage.
Optionally, in the method for processing a fault node provided by the embodiment of the present application, before preventing the marked fault node from being added to the distributed cluster when it is detected that the fault of the marked fault node has been repaired and the target identifier in the marked fault node has not been cleared, the method further includes: when it is detected that the fault of the marked fault node has been repaired, detecting whether the target identifier in the marked fault node has been cleared; and if the target identifier has been cleared, obtaining the fault node whose identifier has been cleared and allowing that node to be added to the distributed cluster.
Specifically, when a fault node returns to normal and the management node goes to add it back automatically, the management node checks the node's special identifier. If the node carries the high-risk identifier, it is prevented from being added back to the distributed cluster; if the node does not carry the high-risk identifier, that is, the identifier has been cleared, it is allowed to be added to the distributed cluster. This effectively prevents a fault node carrying the risk identifier from being repeatedly added back to the distributed cluster within the preset time and improves the high-reliability capability of the distributed storage.
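The re-admission check described above reduces to a two-condition gate. The sketch below is illustrative: the `Node` record and the `may_rejoin` function are hypothetical names, and modeling "repaired" and "high-risk" as booleans is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical record the management node keeps for each cluster member."""
    node_id: str
    healthy: bool = False    # True once the fault is detected as repaired
    high_risk: bool = False  # True while the high-risk identifier is set


def may_rejoin(node: Node) -> bool:
    """Allow re-admission only for a repaired node without the high-risk identifier."""
    return node.healthy and not node.high_risk


repaired_but_marked = Node("node-A", healthy=True, high_risk=True)
repaired_and_cleared = Node("node-B", healthy=True, high_risk=False)
print(may_rejoin(repaired_but_marked), may_rejoin(repaired_and_cleared))  # False True
```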
Optionally, in the case that the fault of the marked fault node is detected to be repaired and the target identifier in the marked fault node is not cleared, the method further includes, after preventing the marked fault node from being added to the distributed cluster: and triggering warning information that the target identifier is not cleared, and sending the warning information to operation and maintenance personnel so that the operation and maintenance personnel can clear the target identifier in the marked fault node.
Specifically, for a node that carries the high-risk identifier and has been refused re-admission to the cluster, an alarm is raised and the relevant operation and maintenance personnel are notified. After the operation and maintenance personnel have handled the fault, the node's high-risk identifier is cleared and the node is added back to the cluster.
In summary, the method for processing a fault node provided by the embodiment of the present application monitors whether a target fault node has been removed from the distributed cluster; when it is detected that the target fault node has been removed from the distributed cluster, determines a target weight value corresponding to the target fault node within a preset time period; when the target weight value is greater than the preset weight value, marks the target fault node to obtain a marked fault node, wherein the marked fault node carries a target identifier; and when it is detected that the fault of the marked fault node has been repaired and the target identifier in the marked fault node has not been cleared, prevents the marked fault node from being added to the distributed cluster. This addresses the problem in the related art that a fault node is repeatedly isolated from or added back to the cluster within a preset time, which degrades the high availability of the cluster service. By preventing a marked fault node from being added to the distributed cluster while its fault has been repaired but its target identifier has not been cleared, the application improves the high availability of the cluster service.
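The alarm-and-clear flow above can be sketched as follows. Everything named here is hypothetical: the `Cluster` class, its `notify` callback (standing in for whatever alerting channel reaches operation and maintenance personnel), and the alert wording are assumptions used only to show the order of events.

```python
from typing import Callable, Set


class Cluster:
    """Hypothetical management-node view of membership and high-risk identifiers."""

    def __init__(self, notify: Callable[[str], None]) -> None:
        self.members: Set[str] = set()
        self.high_risk: Set[str] = set()
        self.notify = notify  # assumed alerting hook for operations staff

    def request_rejoin(self, node_id: str) -> bool:
        """Refuse a marked node and raise an alert; admit an unmarked node."""
        if node_id in self.high_risk:
            self.notify(f"node {node_id}: repaired, but high-risk identifier not cleared")
            return False
        self.members.add(node_id)
        return True

    def clear_identifier(self, node_id: str) -> None:
        """Called after operations staff have handled the underlying fault."""
        self.high_risk.discard(node_id)


alerts = []
cluster = Cluster(notify=alerts.append)
cluster.high_risk.add("node-A")

first_attempt = cluster.request_rejoin("node-A")   # refused, one alert raised
cluster.clear_identifier("node-A")                 # operator clears the identifier
second_attempt = cluster.request_rejoin("node-A")  # admitted
```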
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer executable instructions, and that although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in an order other than that illustrated herein.
The embodiment of the application also provides a processing device of the fault node, and the processing device of the fault node can be used for executing the processing method for the fault node. The following describes a processing device of a fault node provided by an embodiment of the present application.
FIG. 2 is a schematic diagram of a processing apparatus for a fault node according to an embodiment of the present application. As shown in FIG. 2, the apparatus includes: a monitoring unit 201, a first determining unit 202, a marking unit 203, and a blocking unit 204.
Specifically, the monitoring unit 201 is configured to monitor whether a target fault node has been removed from the distributed cluster;
the first determining unit 202 is configured to determine a target weight value corresponding to the target fault node within a preset time period when it is detected that the target fault node has been removed from the distributed cluster;
the marking unit 203 is configured to mark the target fault node to obtain a marked fault node when the target weight value is greater than a preset weight value, where the marked fault node carries a target identifier;
and the blocking unit 204 is configured to prevent the marked fault node from being added to the distributed cluster when it is detected that the fault of the marked fault node has been repaired and the target identifier in the marked fault node has not been cleared.
In summary, in the processing apparatus for a fault node provided by the embodiment of the present application, the monitoring unit 201 monitors whether a target fault node has been removed from the distributed cluster; the first determining unit 202 determines a target weight value corresponding to the target fault node within a preset time period when it is detected that the target fault node has been removed from the distributed cluster; the marking unit 203 marks the target fault node to obtain a marked fault node when the target weight value is greater than a preset weight value, wherein the marked fault node carries a target identifier; and the blocking unit 204 prevents the marked fault node from being added to the distributed cluster when it is detected that the fault of the marked fault node has been repaired and the target identifier in the marked fault node has not been cleared. This addresses the problem in the related art that a fault node is repeatedly isolated from or added back to the cluster within a preset time, which degrades the high availability of the cluster service.
Optionally, in the processing device for a fault node provided by the embodiment of the present application, the device further includes: the second determining unit is used for monitoring a plurality of fault nodes in the distributed cluster before determining a target weight value corresponding to the target fault node in a preset time period under the condition that the target fault node in the distributed cluster is removed; a third determining unit, configured to determine a fault type corresponding to each fault node, where the fault type at least includes: server memory failure, server CPU failure; the configuration unit is used for configuring the weight value corresponding to each fault node according to the fault type; and the synchronization unit is used for synchronizing the weight value corresponding to each fault node into the management nodes of the distributed cluster.
Optionally, in the processing device for a fault node provided by the embodiment of the present application, the device further includes: the first obtaining unit is configured to obtain, before determining a target weight value corresponding to a target fault node in a preset time period, fault information associated with the target fault node if it is monitored that the target fault node is removed in the distributed cluster, where the fault information at least includes: the fault type of the fault node; the matching unit is used for matching the fault type with a plurality of weight values pre-configured in the distributed cluster to obtain an initial weight value of the target fault node.
Optionally, in the processing device for a faulty node provided in the embodiment of the present application, the first determining unit includes: the judging module is used for judging whether the initial weight value is larger than a preset weight value or not; the statistics module is used for counting the removal times of the target fault node in a preset time period in the distributed cluster under the condition that the initial weight value is not larger than the preset weight value; the determining module is used for determining a target weight value corresponding to the target fault node in a preset time period according to the removal times.
Optionally, in the processing device for a fault node provided by the embodiment of the present application, the determining module includes: a calculation sub-module, configured to multiply the number of removals by a removal weight value to obtain a target removal weight value; and a superposition sub-module, configured to add together (superimpose) the initial weight value and the target removal weight value to obtain the target weight value corresponding to the target fault node in the preset time period.
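The weight calculation can be sketched as follows, reading "superposing" as addition. The function names and the numbers in the usage note are hypothetical, and the branch taken when the initial weight already exceeds the preset value is an assumption, because the text only describes the case where it does not.

```python
def target_weight(initial: float, preset: float,
                  removal_count: int, removal_weight: float) -> float:
    """Compute the target weight of a removed node."""
    # Assumption: an initial weight already above the preset value is
    # used as-is, without consulting the removal history.
    if initial > preset:
        return initial
    # Target removal weight = number of removals x removal weight value,
    # which is then added to the initial weight.
    return initial + removal_count * removal_weight


def should_mark(weight: float, preset: float) -> bool:
    """A node is marked only when its target weight exceeds the preset value."""
    return weight > preset
```

For instance, with an initial weight of 40, a preset weight of 50, three removals and a removal weight of 5, the target weight is 40 + 3 × 5 = 55, which exceeds 50, so the node would be marked.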
Optionally, in the processing device for a fault node provided by the embodiment of the present application, the device further includes: a detection unit, configured to detect, when it is detected that the fault of the marked fault node has been repaired and before the marked fault node is prevented from being added to the distributed cluster because its target identifier has not been cleared, whether the target identifier in the marked fault node has been cleared; and a second obtaining unit, configured to, if it is detected that the target identifier in the marked fault node has been cleared, obtain the fault node whose identifier has been cleared and allow that fault node to be added to the distributed cluster.
Optionally, in the processing device for a fault node provided by the embodiment of the present application, the device further includes: a triggering unit, configured to, after the marked fault node is prevented from being added to the distributed cluster because its fault has been detected as repaired but its target identifier has not been cleared, trigger warning information indicating that the target identifier has not been cleared and send the warning information to operation and maintenance personnel, so that they can clear the target identifier in the marked fault node.
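The rejoin check and the warning can be sketched together. Everything named here (`Node`, `try_rejoin`, the `alert` callback, the message text) is hypothetical and only illustrates the gating logic: a repaired node rejoins only if its target identifier has been cleared, and otherwise a warning is raised.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Node:
    name: str
    repaired: bool = False
    high_risk_flag: bool = False  # plays the role of the target identifier


def try_rejoin(node: Node, cluster: set, alert: Callable[[str], None]) -> bool:
    """Add a repaired node back to the cluster unless it is still marked."""
    if not node.repaired:
        return False
    if node.high_risk_flag:
        # Block the rejoin and notify operation and maintenance personnel.
        alert(f"{node.name}: target identifier not cleared, rejoin blocked")
        return False
    cluster.add(node.name)
    return True
```

Passing the alert channel in as a callback keeps the sketch independent of any particular notification system.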
Optionally, fig. 3 is a second schematic diagram of a processing device for a fault node according to an embodiment of the present application. As shown in fig. 3, the device mainly comprises four modules: a configuration entry module 1, a fault adaptive management module 2, an abnormality recovery module 3, and a fault processing module 4. The configuration entry module stores the fault types and weights to be adaptively handled in the distributed storage cluster; the fault adaptive management module uses the entered configuration to calculate whether a node is a high-risk node and marks the node's weight; the abnormality recovery module judges whether a node is allowed to be added back to the cluster; and finally the fault processing module handles the abnormal node based on the alarm result and clears the high-risk mark, after which the distributed storage cluster service returns to normal.
Configuration entry module: allows the user to select and configure the fault types to be handled and their weights, which are stored in the management system. The fault type is a unique id defined by each system; the weight is the severity of the fault type as judged by the user according to actual management practice, and configurations differ between users.
Fault adaptive management module: created automatically. It starts working when a node fails, collects node information, calculates whether the node is high-risk, and isolates the fault node according to the original flow.
Abnormality recovery module: for a node that has returned to normal, detects, before the node is added back to the cluster, whether it carries a high-risk mark set by the fault adaptive management module; if so, the node is refused re-entry to the cluster, and if not, it is allowed back.
Fault processing module: raises an alarm for nodes that carry a high-risk mark and cannot be added back to the cluster, notifies operation and maintenance personnel to handle them, and clears the node's high-risk mark after fault handling is complete.
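To show how the four modules might fit together, here is a deliberately simplified Python sketch. All class names, method names, and numbers are hypothetical; the removal count is kept as a plain running total rather than being limited to a time period, and counting the current removal toward the total is an assumption.

```python
class ConfigEntry:
    """Configuration entry module: stores fault types and weights."""

    def __init__(self, weights, threshold, removal_weight):
        self.weights = weights              # fault-type id -> weight
        self.threshold = threshold          # preset weight value
        self.removal_weight = removal_weight


class FaultAdaptiveManager:
    """Fault adaptive management module: scores and marks removed nodes."""

    def __init__(self, config):
        self.config = config
        self.removals = {}      # node name -> number of removals so far
        self.high_risk = set()  # nodes carrying the high-risk mark

    def on_node_removed(self, node, fault_type):
        self.removals[node] = self.removals.get(node, 0) + 1
        score = self.config.weights[fault_type]
        if score <= self.config.threshold:
            score += self.removals[node] * self.config.removal_weight
        if score > self.config.threshold:
            self.high_risk.add(node)
        return score


class AbnormalityRecovery:
    """Abnormality recovery module: decides whether a node may rejoin."""

    def __init__(self, manager):
        self.manager = manager

    def may_rejoin(self, node):
        return node not in self.manager.high_risk


class FaultHandler:
    """Fault processing module: raises alarms and clears high-risk marks."""

    def __init__(self, manager):
        self.manager = manager
        self.alarms = []

    def alarm(self, node):
        self.alarms.append(node)

    def resolve(self, node):
        # Called once operation and maintenance staff have handled the fault.
        self.manager.high_risk.discard(node)


config = ConfigEntry({"cpu_fault": 40}, threshold=50, removal_weight=5)
manager = FaultAdaptiveManager(config)
recovery = AbnormalityRecovery(manager)
handler = FaultHandler(manager)

for _ in range(3):                      # the same node is removed three times
    manager.on_node_removed("node-a", "cpu_fault")
if not recovery.may_rejoin("node-a"):   # blocked while the mark is present
    handler.alarm("node-a")
    handler.resolve("node-a")           # mark cleared after handling
```

In this run the third removal pushes the score to 55, above the threshold of 50, so the node is marked and can rejoin only after the mark is cleared.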
The processing device of the fault node includes a processor and a memory, where the above-mentioned listening unit 201, the first determining unit 202, the marking unit 203, the blocking unit 204, etc. are stored as program units, and the processor executes the above-mentioned program units stored in the memory to implement the corresponding functions.
The processor includes a kernel, and the kernel fetches the corresponding program unit from the memory. One or more kernels may be provided, and the processing of the fault node is carried out by adjusting kernel parameters.
The memory may include forms of computer-readable media such as volatile memory, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
The embodiment of the invention provides a computer readable storage medium, on which a program is stored, which when executed by a processor, implements a method for processing a failed node.
The embodiment of the invention provides a processor, which is used for running a program, wherein the processing method of a fault node is executed when the program runs.
As shown in fig. 4, an embodiment of the present invention provides an electronic device, where the device includes a processor, a memory, and a program stored in the memory and executable on the processor, and the processor implements the following steps when executing the program: monitoring whether a target fault node is removed from the distributed cluster; when it is detected that the target fault node has been removed from the distributed cluster, determining a target weight value corresponding to the target fault node in a preset time period; when the target weight value is greater than a preset weight value, marking the target fault node to obtain a marked fault node, where the marked fault node carries a target identifier; and when it is detected that the fault of the marked fault node has been repaired but the target identifier in the marked fault node has not been cleared, preventing the marked fault node from being added to the distributed cluster.
The processor also implements the following steps when executing the program: monitoring a plurality of fault nodes in the distributed cluster before the target weight value corresponding to the target fault node in the preset time period is determined upon detecting that the target fault node has been removed from the distributed cluster; determining a fault type corresponding to each fault node, where the fault types at least include: server memory failure and server CPU failure; configuring a weight value corresponding to each fault node according to the fault type; and synchronizing the weight value corresponding to each fault node to the management nodes of the distributed cluster.
The processor also implements the following steps when executing the program: before the target weight value corresponding to the target fault node in the preset time period is determined, acquiring fault information associated with the target fault node when it is detected that the target fault node has been removed from the distributed cluster, where the fault information at least includes the fault type of the fault node; and matching the fault type with a plurality of weight values pre-configured in the distributed cluster to obtain an initial weight value of the target fault node.
The processor also implements the following steps when executing the program: judging whether the initial weight value is greater than the preset weight value; when the initial weight value is not greater than the preset weight value, counting the number of times the target fault node has been removed from the distributed cluster within the preset time period; and determining, according to the number of removals, the target weight value corresponding to the target fault node in the preset time period.
The processor also implements the following steps when executing the program: multiplying the number of removals by a removal weight value to obtain a target removal weight value; and adding together (superimposing) the initial weight value and the target removal weight value to obtain the target weight value corresponding to the target fault node in the preset time period.
The processor also implements the following steps when executing the program: when it is detected that the fault of the marked fault node has been repaired, and before the marked fault node is prevented from being added to the distributed cluster because its target identifier has not been cleared, detecting whether the target identifier in the marked fault node has been cleared; and if it is detected that the target identifier in the marked fault node has been cleared, acquiring the fault node whose identifier has been cleared and allowing that fault node to be added to the distributed cluster.
The processor also implements the following steps when executing the program: after the marked fault node is prevented from being added to the distributed cluster because its fault has been detected as repaired but its target identifier has not been cleared, triggering warning information indicating that the target identifier has not been cleared, and sending the warning information to operation and maintenance personnel so that they can clear the target identifier in the marked fault node.
The device herein may be a server, PC, PAD, cell phone, etc.
The application also provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the following method steps: monitoring whether a target fault node is removed from the distributed cluster; when it is detected that the target fault node has been removed from the distributed cluster, determining a target weight value corresponding to the target fault node in a preset time period; when the target weight value is greater than a preset weight value, marking the target fault node to obtain a marked fault node, where the marked fault node carries a target identifier; and when it is detected that the fault of the marked fault node has been repaired but the target identifier in the marked fault node has not been cleared, preventing the marked fault node from being added to the distributed cluster.
When executed on a data processing device, the computer program product is further adapted to execute a program initialized with the following method steps: monitoring a plurality of fault nodes in the distributed cluster before the target weight value corresponding to the target fault node in the preset time period is determined upon detecting that the target fault node has been removed from the distributed cluster; determining a fault type corresponding to each fault node, where the fault types at least include: server memory failure and server CPU failure; configuring a weight value corresponding to each fault node according to the fault type; and synchronizing the weight value corresponding to each fault node to the management nodes of the distributed cluster.
When executed on a data processing device, the computer program product is further adapted to execute a program initialized with the following method steps: before the target weight value corresponding to the target fault node in the preset time period is determined, acquiring fault information associated with the target fault node when it is detected that the target fault node has been removed from the distributed cluster, where the fault information at least includes the fault type of the fault node; and matching the fault type with a plurality of weight values pre-configured in the distributed cluster to obtain an initial weight value of the target fault node.
When executed on a data processing device, the computer program product is further adapted to execute a program initialized with the following method steps: judging whether the initial weight value is greater than the preset weight value; when the initial weight value is not greater than the preset weight value, counting the number of times the target fault node has been removed from the distributed cluster within the preset time period; and determining, according to the number of removals, the target weight value corresponding to the target fault node in the preset time period.
When executed on a data processing device, the computer program product is further adapted to execute a program initialized with the following method steps: multiplying the number of removals by a removal weight value to obtain a target removal weight value; and adding together (superimposing) the initial weight value and the target removal weight value to obtain the target weight value corresponding to the target fault node in the preset time period.
When executed on a data processing device, the computer program product is further adapted to execute a program initialized with the following method steps: when it is detected that the fault of the marked fault node has been repaired, and before the marked fault node is prevented from being added to the distributed cluster because its target identifier has not been cleared, detecting whether the target identifier in the marked fault node has been cleared; and if it is detected that the target identifier in the marked fault node has been cleared, acquiring the fault node whose identifier has been cleared and allowing that fault node to be added to the distributed cluster.
When executed on a data processing device, the computer program product is further adapted to execute a program initialized with the following method steps: after the marked fault node is prevented from being added to the distributed cluster because its fault has been detected as repaired but its target identifier has not been cleared, triggering warning information indicating that the target identifier has not been cleared, and sending the warning information to operation and maintenance personnel so that they can clear the target identifier in the marked fault node.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of computer-readable media such as volatile memory, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash RAM. Memory is an example of a computer-readable medium.
Computer-readable media include both permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. Computer-readable media, as defined herein, do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and variations of the present application will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. which come within the spirit and principles of the application are to be included in the scope of the claims of the present application.

Claims (8)

1. A method for processing a failed node, comprising:
monitoring whether a target fault node is removed from a distributed cluster;
determining, when it is detected that the target fault node has been removed from the distributed cluster, a target weight value corresponding to the target fault node in a preset time period;
marking the target fault node when the target weight value is greater than a preset weight value to obtain a marked fault node, wherein the marked fault node carries a target identifier;
preventing the marked fault node from being added to the distributed cluster when it is detected that the fault of the marked fault node has been repaired and the target identifier in the marked fault node has not been cleared;
before determining the target weight value corresponding to the target fault node in the preset time period, acquiring, when it is detected that the target fault node has been removed from the distributed cluster, fault information associated with the target fault node, wherein the fault information at least comprises: the fault type of the fault node;
matching the fault type with a plurality of weight values pre-configured in the distributed cluster to obtain an initial weight value of the target fault node;
wherein determining, when it is detected that the target fault node has been removed from the distributed cluster, the target weight value corresponding to the target fault node in the preset time period comprises:
judging whether the initial weight value is greater than the preset weight value; when the initial weight value is not greater than the preset weight value, counting the number of times the target fault node has been removed from the distributed cluster within the preset time period;
and determining, according to the number of removals, the target weight value corresponding to the target fault node in the preset time period.
2. The method of claim 1, wherein before determining, when it is detected that the target fault node has been removed from the distributed cluster, the target weight value corresponding to the target fault node in the preset time period, the method further comprises:
monitoring a plurality of fault nodes in the distributed cluster;
determining a fault type corresponding to each fault node, wherein the fault types at least comprise: server memory failure and server CPU failure;
configuring a weight value corresponding to each fault node according to the fault type;
and synchronizing the weight value corresponding to each fault node to the management node of the distributed cluster.
3. The method of claim 1, wherein determining, according to the number of removals, the target weight value corresponding to the target fault node in the preset time period comprises:
multiplying the number of removals by a removal weight value to obtain a target removal weight value;
and adding together (superimposing) the initial weight value and the target removal weight value to obtain the target weight value corresponding to the target fault node in the preset time period.
4. The method of claim 1, wherein before preventing the marked fault node from being added to the distributed cluster when it is detected that the fault of the marked fault node has been repaired and the target identifier in the marked fault node has not been cleared, the method further comprises:
detecting, when it is detected that the fault of the marked fault node has been repaired, whether the target identifier in the marked fault node has been cleared;
and if it is detected that the target identifier in the marked fault node has been cleared, acquiring the fault node whose identifier has been cleared and allowing that fault node to be added to the distributed cluster.
5. The method of claim 1, wherein after preventing the marked fault node from being added to the distributed cluster when it is detected that the fault of the marked fault node has been repaired and the target identifier in the marked fault node has not been cleared, the method further comprises:
triggering warning information indicating that the target identifier has not been cleared, and sending the warning information to operation and maintenance personnel so that the operation and maintenance personnel can clear the target identifier in the marked fault node.
6. A processing apparatus for a failed node, comprising:
a monitoring unit, configured to monitor whether a target fault node is removed from a distributed cluster;
a first determining unit, configured to determine, when it is detected that the target fault node has been removed from the distributed cluster, a target weight value corresponding to the target fault node in a preset time period;
a marking unit, configured to mark the target fault node when the target weight value is greater than a preset weight value to obtain a marked fault node, wherein the marked fault node carries a target identifier;
a blocking unit, configured to prevent the marked fault node from being added to the distributed cluster when it is detected that the fault of the marked fault node has been repaired and the target identifier in the marked fault node has not been cleared;
wherein the processing device of the fault node further comprises: a first obtaining unit, configured to, before the target weight value corresponding to the target fault node in the preset time period is determined, obtain fault information associated with the target fault node when it is detected that the target fault node has been removed from the distributed cluster, wherein the fault information at least comprises: the fault type of the fault node; and a matching unit, configured to match the fault type with a plurality of weight values pre-configured in the distributed cluster to obtain an initial weight value of the target fault node;
wherein the first determining unit comprises: a judging module, configured to judge whether the initial weight value is greater than the preset weight value; a statistics module, configured to count, when the initial weight value is not greater than the preset weight value, the number of times the target fault node has been removed from the distributed cluster within the preset time period; and a determining module, configured to determine, according to the number of removals, the target weight value corresponding to the target fault node in the preset time period.
7. A computer readable storage medium storing a program, wherein the program when executed by a processor implements the method of any one of claims 1 to 5.
8. An electronic device comprising one or more processors and memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-5.
CN202211007295.6A 2022-08-22 2022-08-22 Fault node processing method and device, storage medium and electronic equipment Active CN115426247B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211007295.6A CN115426247B (en) 2022-08-22 2022-08-22 Fault node processing method and device, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN115426247A CN115426247A (en) 2022-12-02
CN115426247B true CN115426247B (en) 2024-04-26

Family

ID=84197810

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211007295.6A Active CN115426247B (en) 2022-08-22 2022-08-22 Fault node processing method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN115426247B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110740064A (en) * 2019-10-25 2020-01-31 北京浪潮数据技术有限公司 Distributed cluster node fault processing method, device, equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8195976B2 (en) * 2005-06-29 2012-06-05 International Business Machines Corporation Fault-tolerance and fault-containment models for zoning clustered application silos into continuous availability and high availability zones in clustered systems during recovery and maintenance

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant