CN117376084A - Fault detection method, electronic equipment and medium thereof
- Publication number: CN117376084A
- Application number: CN202210776687.2A
- Authority: CN (China)
- Prior art keywords: node; fault; type; monitoring information
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Classifications
- H—ELECTRICITY; H04—ELECTRIC COMMUNICATION TECHNIQUE; H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/145—Network analysis or design involving simulating, designing, planning or modelling of a network
- H04L43/0852—Delays; H04L43/087—Jitter
- H04L43/0876—Network utilisation, e.g. volume of load or congestion level; H04L43/0894—Packet rate
- H04L43/10—Active monitoring, e.g. heartbeat, ping or trace-route
Abstract
The present disclosure relates to the field of computer security technologies, and in particular, to a fault detection method, an electronic device, and a medium thereof. In the fault detection method, historical fault data are used to train a fault detection model, so that the model can detect, from real-time fault data, the fault types that may occur at each node in a cluster network and the fault probability of each such fault type at each node. The fault type with the largest fault probability is then selected from the fault types of the nodes, according to their fault probabilities, as the fault type of the cluster network. By this method, when the cluster network fails, the fault type of the failure and the corresponding root cause node can be detected, so that developers can take targeted measures, eliminate the fault, and effectively maintain the operation of the cluster network.
Description
Technical Field
The present disclosure relates to the field of computer security technologies, and in particular, to a fault detection method, an electronic device, and a medium thereof.
Background
Cluster networks (clusters) are currently the dominant data network systems, owing to advantages such as high management and control efficiency and high flexibility. A cluster network is a network system made up of a plurality of nodes, each of which is a separate host performing specific tasks, such as computing tasks, monitoring tasks, data transmission tasks, and so forth.
In such a network system, the failure of one network device may affect other network devices, so the failed device, the failure type, the failure cause, and so on need to be located as soon as possible in order to repair the network failure quickly.
The current fault detection method determines the link connections between nodes in a cluster network from the topology of the cluster network, obtains the network delay value between nodes, i.e. the PING value, according to a certain Packet Internet Groper (PING) strategy, and then determines whether each link of the cluster network is faulty, and where the fault is, by comparing the PING value with a corresponding threshold condition.
However, this method can only detect the on-off state of each node link; it cannot detect other fault types of the devices in the cluster network.
Disclosure of Invention
In order to solve these problems, the present application provides a fault detection method, an electronic device, and a medium thereof, described below.
In a first aspect, an embodiment of the present application provides a fault detection method, applied to a plurality of nodes in a clustered network, including: acquiring real-time monitoring information of a first node in a cluster network, and selecting abnormal real-time monitoring information from the real-time monitoring information; inputting the abnormal real-time monitoring information of the first node into a fault detection model to obtain first fault information of the first node, wherein the first fault information comprises the presumed fault types of the first node and fault probabilities corresponding to the presumed fault types; and determining the real fault type of the cluster network according to the estimated fault type of the first node and the fault probability corresponding to each estimated fault type.
Wherein each node in the clustered network refers to each device of the clustered network. In some implementations, the fault detection methods of the present application may be applied to individual devices in a clustered network. The fault detection method of the present application may be executed by a device in the clustered network, or may be executed by other devices outside the clustered network to detect a fault in the clustered network, which is not limited in this application.
In some implementations, the first node may be any node in the cluster network. Acquiring the real-time monitoring information of the first node in the cluster network means acquiring the real-time monitoring information of all or some of the nodes in the cluster network. The real-time monitoring information of the first node includes all the monitoring information of the first node that can be obtained, for example, jitter delay information representing the communication state between devices, such as a PING value and a PING bandwidth value, which is not limited in this application.
The abnormal real-time monitoring information of the first node refers to the monitoring information whose index values are determined, from the real-time monitoring information, to be abnormal.
The first fault information of the first node includes the presumed fault types of the first node and the fault probability of each presumed fault type, as determined by the fault detection model. A presumed fault type is a fault type that may occur at the first node, obtained from the fault detection model and the abnormal real-time monitoring information.
In some implementations, the fault detection model is trained on the historical abnormal monitoring information (i.e., the historical fault data below) of each node in the cluster network, so that the model can analyze the abnormal real-time monitoring information of the first node and determine which fault types that information corresponds to and the probability of each.
Finally, the true fault type of the cluster network is determined according to the presumed fault types of the first node and their fault probabilities.
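As an illustration only, the following minimal Python sketch shows this flow under simple assumptions: monitoring information is a mapping from metric names to values, the first preset conditions are normal value ranges, and the trained model exposes a hypothetical predict interface returning fault-type probabilities. None of these names come from the patent.

```python
from typing import Mapping

def select_abnormal(info: Mapping[str, float],
                    normal_ranges: Mapping[str, tuple]) -> dict:
    """Keep only the monitoring items whose index values fall outside their normal range."""
    return {k: v for k, v in info.items()
            if k in normal_ranges
            and not (normal_ranges[k][0] <= v <= normal_ranges[k][1])}

def detect_cluster_fault(info: Mapping[str, float],
                         normal_ranges: Mapping[str, tuple],
                         model) -> str:
    """info: real-time monitoring information of the first node (metric name -> value)."""
    abnormal = select_abnormal(info, normal_ranges)
    # The trained fault detection model maps the abnormal monitoring information to
    # presumed fault types and their probabilities, e.g. {"link_failure": 0.7, ...}.
    fault_probs = model.predict(abnormal)  # hypothetical model interface
    # The presumed fault type with the largest fault probability is taken as the
    # true fault type of the cluster network.
    return max(fault_probs, key=fault_probs.get)
```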
By this method, when the cluster network fails, the fault type of the failure can be detected through the fault detection model, so that developers can take targeted measures, eliminate the fault, and effectively maintain the operation of the cluster network.
With reference to the first aspect, in a possible implementation manner of the first aspect, the first fault information further includes the fault types that may occur when the first node is a first type node and the fault probability of each such fault type, and the fault types that may occur when the first node is a second type node and the fault probability of each such fault type.
The method further comprises the steps of:
increasing the fault probability of a fault type when the first node is a first type node, by using the fault probability of the same fault type when a second node is a second type node, so as to obtain second fault information of the first node, wherein the second node and the first node are adjacent nodes, and the second fault information includes the adjusted fault types that may occur when the first node is a first type node and their fault probabilities, and the fault types that may occur when the first node is a second type node and their fault probabilities.
The first type nodes are root cause nodes, and the second type nodes are affected nodes. That is, in order to determine the root cause node of a fault in the cluster network, the first fault information may further include the presumed fault types, and the probability of each, when the first node fails as a root cause node. It can be understood that, correspondingly, the historical abnormal monitoring information used to train the fault detection model should likewise include, for each node, the historical abnormal monitoring information and actual fault types of the node as a root cause node, and those of the node as an affected node.
Because adjacent nodes are more likely to suffer the same type of fault, the presumed fault types and fault probabilities of a second node adjacent to the first node can be used to adjust the first node's fault probability for the same presumed fault type, yielding the adjusted presumed fault types and fault probabilities of the first node. In some implementations, adjacent nodes include nodes that are physically connected, or nodes whose processed traffic or data have a dependency relationship.
Specifically, the second fault information may be obtained by increasing the fault probability of each presumed fault type when the first node is a root cause node, using the presumed fault types and fault probabilities of the second node as an affected node.
The true fault type of the cluster network is then determined according to the second fault information.
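A minimal sketch of this neighbor-based adjustment, assuming each node carries a probability table as a root cause node and each adjacent node a table as an affected node; the additive boost rule and the renormalization are assumptions, since the patent does not fix a formula.

```python
def adjust_root_probs(root_probs: dict,
                      neighbor_affected_probs: list,
                      boost: float = 0.5) -> dict:
    """Second fault information: raise the first node's root-cause fault probability
    for a fault type using the probability of the same type at adjacent affected nodes."""
    adjusted = dict(root_probs)
    for probs in neighbor_affected_probs:
        for fault_type, p in probs.items():
            if fault_type in adjusted:
                adjusted[fault_type] += boost * p  # assumed boost rule, not fixed by the patent
    total = sum(adjusted.values()) or 1.0
    return {t: p / total for t, p in adjusted.items()}  # renormalize for comparability

# Fault types whose adjusted probability exceeds the first threshold can then be
# taken as the fault type of the cluster network:
def select_fault_types(adjusted: dict, first_threshold: float) -> list:
    return [t for t, p in adjusted.items() if p > first_threshold]
```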
With reference to the first aspect, in a possible implementation manner of the first aspect, determining the fault type of the cluster network according to the possible fault types of the first node and the fault probability of each fault type includes:
taking a fault type whose fault probability is greater than a first threshold as the fault type of the cluster network, according to the adjusted presumed fault types and their fault probabilities when the first node is a first type node.
That is, after the second fault information of the first node is determined, a presumed fault type whose fault probability is greater than the first threshold (i.e., the fault probability threshold below) may be determined to be the true fault type of the cluster network, according to the presumed fault types of the first node as a root cause node and the fault probability of each.
In some implementations, if there are multiple presumed fault types whose fault probability is greater than the first threshold, any one of them may be selected as the true fault type of the cluster network, or all of them may be regarded as fault types of the cluster network, which is not limited in this application.
With reference to the first aspect, in a possible implementation manner of the first aspect, obtaining abnormal real-time monitoring information of the first node according to the real-time monitoring information includes:
and comparing the real-time monitoring information with a first preset condition, and deleting monitoring information meeting the first preset condition in the real-time monitoring information to obtain abnormal real-time monitoring information of the first node.
That is, in some implementations, the real-time monitoring information may be compared with a first preset condition (i.e., a corresponding condition below) corresponding to the real-time monitoring information, and the monitoring information that satisfies the first preset condition may be deleted, so as to obtain abnormal real-time monitoring information of the first node.
With reference to the first aspect, in a possible implementation manner of the first aspect, the fault detection model is trained by using historical anomaly monitoring information of each node, where the historical anomaly monitoring information of each node includes a fault type and monitoring information corresponding to each fault type when each node is a first type node and a second type node respectively.
With reference to the first aspect, in a possible implementation manner of the first aspect, the historical anomaly monitoring information of each node is determined by:
the method comprises the steps of obtaining historical monitoring information of each node in a cluster network within preset duration, comparing the historical monitoring information of each node with a second preset condition, and removing monitoring information meeting the second preset condition from the historical monitoring information to obtain historical abnormal monitoring information of each node, wherein the historical monitoring information of each node comprises monitoring information corresponding to each node when the node is a first type node and a second type node respectively. The second preset condition is similar to the first preset condition, and is a corresponding condition which is required to be met when index data corresponding to each piece of monitoring information is normal.
In a second aspect, an embodiment of the present application provides a model training method, which is applied to an electronic device, and includes:
acquiring historical monitoring information of each node in a cluster network within a preset duration, and selecting historical abnormal monitoring information of each corresponding node from the historical monitoring information of each node; and training an initial fault detection model by utilizing the historical abnormal monitoring information of each node and the fault type of the historical abnormal monitoring information corresponding to each node to obtain a fault detection model. The historical abnormal monitoring information is the historical fault data.
With reference to the second aspect, in a possible implementation manner of the second aspect, the preset duration is an empirical value or an experimental value; for example, the preset duration may be 30 days, which is not limited in this application.
With reference to the second aspect, in a possible implementation manner of the second aspect, the historical abnormal monitoring information of each node includes a type of fault that occurs and monitoring information corresponding to each type of fault when each node is a first type node and a second type node respectively. That is, the history abnormality monitoring information of each node includes the failure type corresponding to each node when the node is the root cause node and the monitoring information corresponding to each failure type, and the failure type corresponding to each node when the node is the affected node and the monitoring information corresponding to each failure type.
With reference to the second aspect, in a possible implementation manner of the second aspect, the method further includes: acquiring real-time monitoring information of a first node in a cluster network, and acquiring abnormal monitoring information of the first node according to the real-time monitoring information; inputting the abnormal real-time monitoring information of the first node into a fault detection model to obtain first fault information of the first node, wherein the first fault information comprises the estimated fault type of the first node and fault probability corresponding to each estimated fault type, and
and determining the real fault type of the cluster network according to the estimated fault type of the first node and the fault probability corresponding to each fault type.
With reference to the second aspect, in a possible implementation manner of the second aspect, the initial fault detection model includes at least any one of the following: a convolutional neural network model, a fully-connected neural network model, or a feed-forward neural network model. In some implementations of the present application, the fault detection model may also be a random forest, decision tree, or the like, which is not limited in this application.
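For illustration, a minimal sketch of training such an initial fault detection model on historical abnormal monitoring information, here with a small fully connected network in PyTorch; the feature dimension, the number of fault types, and the training interface are hypothetical placeholders, not details from the patent.

```python
import torch
from torch import nn

NUM_FEATURES = 17     # hypothetical: length of a node's abnormal feature vector
NUM_FAULT_TYPES = 4   # hypothetical: e.g. link failure, packet loss, congestion, delay

# Initial fault detection model: here a small fully connected neural network,
# one of the candidate model families named above.
model = nn.Sequential(
    nn.Linear(NUM_FEATURES, 64),
    nn.ReLU(),
    nn.Linear(64, NUM_FAULT_TYPES),
)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train(history_vectors: torch.Tensor, fault_labels: torch.Tensor, epochs: int = 10):
    """history_vectors: historical abnormal monitoring information as feature vectors;
    fault_labels: the actual fault type of each training sample."""
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(model(history_vectors), fault_labels)
        loss.backward()
        optimizer.step()

def predict_probs(vector: torch.Tensor) -> torch.Tensor:
    """Softmax over the logits gives the presumed fault types' probabilities."""
    return torch.softmax(model(vector), dim=-1)
```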
In a third aspect, embodiments of the present application further provide an electronic device, including a memory storing computer program instructions and a processor coupled to the memory, wherein the computer program instructions, when executed by the processor, cause the electronic device to implement the method of any one of the first aspects above.
In a fourth aspect, embodiments of the present application further provide an electronic device, including a memory storing computer program instructions and a processor coupled to the memory, wherein the computer program instructions, when executed by the processor, cause the electronic device to implement the method of any one of the second aspects above.
In a fifth aspect, embodiments of the present application provide a computer readable storage medium storing a computer program, wherein the computer program when executed by a processor implements the method according to any one of the first aspects.
In a sixth aspect, embodiments of the present application provide a computer readable storage medium storing a computer program, wherein the computer program when executed by a processor implements the method according to any one of the second aspects above.
In a seventh aspect, embodiments of the present application provide a computer program product for, when run on an electronic device, causing the electronic device to perform the method of any one of the first aspects.
In an eighth aspect, embodiments of the present application provide a computer program product which, when run on an electronic device, causes the electronic device to perform the method of any one of the second aspects described above.
It will be appreciated that the advantages of the third to eighth aspects may be found in the related descriptions of the first and second aspects, and are not repeated here.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required for the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of an application scenario of an example fault detection method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an example fault graph provided according to an embodiment of the present application;
FIG. 3a is a schematic diagram of yet another example fault graph provided according to an embodiment of the present application;
FIG. 3b is a schematic diagram of another example fault graph provided according to an embodiment of the present application;
FIG. 4 is a flowchart of an example method provided according to an embodiment of the present application;
FIG. 5 is a schematic diagram of an example fault graph provided according to an embodiment of the present application;
FIG. 6 is a flowchart of yet another example method provided according to an embodiment of the present application;
FIG. 7 is a graph of back pressure frame variation versus time provided according to an embodiment of the present application;
FIG. 8 is a schematic diagram of the number of abnormal log information items versus time provided according to an embodiment of the present application;
FIG. 9 is a schematic diagram of PING dial testing alarm information versus time provided according to an embodiment of the present application;
FIG. 10 is a graph of back pressure frame variation versus time provided according to an embodiment of the present application;
FIG. 11 is a diagram of three characteristic values of PING dial test measurements provided according to an embodiment of the present application;
FIG. 12 is a schematic diagram of an example fault graph provided according to an embodiment of the present application;
FIG. 13 is a schematic diagram of another example fault graph provided according to an embodiment of the present application;
FIG. 14 is a flowchart of an example method provided according to an embodiment of the present application;
FIG. 15 is a schematic diagram of an example fault graph provided according to an embodiment of the present application;
FIG. 16 is a schematic diagram of another example fault graph provided according to an embodiment of the present application;
FIG. 17 is a flowchart of an example method provided according to an embodiment of the present application;
FIG. 18 is a schematic diagram of an example fault graph provided according to an embodiment of the present application;
FIG. 19 is a schematic diagram of an example pointing relationship provided according to an embodiment of the present application;
FIG. 20 is a schematic diagram of an example system architecture according to an embodiment of the present application;
Detailed Description
Various aspects of the illustrative embodiments will be described using terms commonly employed by those skilled in the art. It should be understood that the embodiments described below are only some of the possible implementations of the present application, not all of them. As a person of ordinary skill in the art will appreciate, as new application scenarios emerge, the technical solutions provided in the embodiments of the present application remain applicable to similar technical problems.
At present, when a fault occurs in a cluster network, PING values between nodes are obtained by using the PING technique based on the topology of the cluster network; the on-off state of the link between two nodes is then determined according to whether the PING value satisfies a corresponding threshold condition. If the PING value of the link between two nodes does not satisfy the threshold condition, the link is disconnected; otherwise, the link is connected. The PING technique sends a test data packet to a device, checks whether the device responds, and counts the response time, so as to test the connectivity of the network. The PING value refers to the time from when the test equipment sends data to the network server to when feedback data from the server is received, typically in milliseconds. In general, a smaller PING value indicates better network connectivity.
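As a sketch of this baseline (not part of the claimed method), the following Python fragment performs the PING-threshold link check on a Linux host; the threshold value and the output parsing are assumptions.

```python
import subprocess

PING_THRESHOLD_MS = 50.0  # hypothetical threshold condition

def ping_ms(host: str):
    """Send one ICMP echo request; return the round-trip time in ms, or None on no response."""
    result = subprocess.run(["ping", "-c", "1", "-W", "1", host],
                            capture_output=True, text=True)
    if result.returncode != 0:
        return None  # no response: link treated as disconnected
    for token in result.stdout.split():
        if token.startswith("time="):
            return float(token[5:])
    return None

def check_links(links):
    """Classify each (a, b) link as connected or disconnected from the PING value alone."""
    status = {}
    for a, b in links:
        rtt = ping_ms(b)
        ok = rtt is not None and rtt <= PING_THRESHOLD_MS
        status[(a, b)] = "connected" if ok else "disconnected"
    return status
```

This baseline only yields link on/off status, which is exactly the limitation the method of this application addresses.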
Specifically, the cluster network structure shown in fig. 1 is taken as an example. The cluster network 1 comprises a network element device 101, a data collection device 102, and a network device 103. The network element device 101 is a device capable of independently performing a certain function or task, such as a router, a switch, or a server. The data collection device 102 monitors and manages the cluster network 1 and collects relevant data of each device in the cluster network 1; for example, the data collection device 102 may collect monitoring information of the cluster network 1, including log data, index data of the network element device 101, and various alarm information in the network. In some implementations, the data collection device 102 sends the collected data to a database for storage. The network device 103 performs the fault detection method described in the implementations of the present application and is used to perform fault detection on the cluster network 1. In an implementation of the present application, the network device 103 may be the network element device 101 or the data collection device 102, or may be independent of both, as long as it can communicate with the network element device 101, the data collection device 102, and other devices inside or outside the cluster network, which is not limited in this application. Each device shown in fig. 1 may be referred to as a node device, or a node, in the cluster network topology.
When the existing scheme uses the PING technique to perform fault detection on the cluster network, it can only send PING test data packets to each device according to the links between the devices, and then determine the on-off state of the links according to the relationship between the returned PING values and the PING threshold condition. Other fault types of the cluster network, such as degradation of device quality, degradation of a device's own transceiving capability, and degradation of bandwidth performance, cannot be detected from the PING value alone.
In order to solve the technical problem, the application provides a fault detection method. Aiming at the problem that the PING technology can only identify the on-off of links in the cluster network, the fault detection method trains the fault detection model by acquiring historical fault data corresponding to various types of faults occurring at each node in the cluster network, so that the fault detection model can detect other fault types occurring in the cluster network.
It will be appreciated that the fault types mentioned in some embodiments of the present application include link on-off, degradation of device quality, excessively long network connection delays, and the like, which is not limited in this application.
It may be appreciated that the historical fault data mentioned in some embodiments of the present application may include the faults that occurred at each node in the cluster network, the fault type of each fault, and the node attribute of the node when the fault occurred. The node attribute indicates whether the node was an affected node or a root cause node when the fault occurred. An affected node is a node that fails under the influence of the failure of other nodes. A root cause node is a node whose failure is not caused by the failure of other nodes, and whose failure affects other nodes and causes them to fail. For example, if a node X has a link failure, and a node Y, affected by the link failure of node X, has a data packet loss failure, then node X is the root cause node and node Y is the affected node.
In some implementations of the present application, the fault detection model may be a neural network model such as a feedforward neural network (FNN), a fully connected neural network (FCNN), a convolutional neural network (CNN), or a recurrent neural network (RNN), which is not limited in this application. In some implementations of the present application, the fault detection model may also be a random forest, a decision tree, or the like, which is not limited in this application.
Specifically, the fault detection method trains the fault detection model by utilizing historical fault data, so that the fault detection model can detect the fault type of faults possibly occurring in each node in the cluster network and the fault probability corresponding to each type of faults occurring in each node according to the real-time fault data. The historical fault data comprise fault types and fault data which are generated when each node in the cluster network is a root cause node and fault types and fault data which are generated when each node is an affected node.
In some implementations, as noted above, the failure of one node may affect other nodes; that is, the failures of different nodes may be linked. Therefore, the faults occurring at the nodes of the cluster network can be represented by a failure propagation graph (hereinafter referred to as a fault graph).
That is, in some implementations, the fault graph may be used to represent the historical fault data of each node, and the fault detection model is then trained with the fault graphs corresponding to the historical fault data of the nodes, obtaining a trained fault detection model. It can be understood that the trained model can detect, from the input real-time monitoring information of a certain node, the possible fault types of that node and their fault probabilities. In some implementations, the real-time monitoring information of each node refers to the various types of index data of each node used to monitor the running condition of the cluster network, for example, the log alarm data of each node, the network bandwidth and delay data of each node's communication, and the running time/iteration duration of each node, which is not limited in this application.
Then, when performing fault detection on the cluster network, the real-time monitoring information of each node is likewise represented by a fault graph; the fault graph corresponding to the real-time monitoring information of each node is input into the fault detection model, which analyzes the real-time monitoring information of each node and outputs the possible fault types of each node and their fault probabilities.
In addition, it can be understood that the monitoring information of the cluster network is comprehensive, and the running condition of the cluster network can be judged from it. Thus, in some implementations, a node may be assumed to have all the monitoring information in the current cluster network, and a full fault graph of the node (hereinafter referred to as a first fault graph) is generated from all of that monitoring information. Illustratively, fig. 2 shows a first fault graph of a node, in which the numerals 1 to N in solid circles represent all the monitoring information. The manner in which the first fault graph is generated will be described below.
The first fault graph is then pruned using the historical fault data of the node to obtain a second fault graph reflecting the actual fault condition of the node. In some implementations, the monitoring information whose indexes are normal can be removed from the first fault graph according to the historical fault data of the node, yielding the pruned second fault graph. Figs. 3a-3b illustrate a second fault graph obtained by pruning a node's first fault graph according to the node's historical fault data: the numbers in dotted circles in fig. 3a represent monitoring information whose indexes are normal, and fig. 3b shows the second fault graph, representing the actual fault condition of the node, obtained after that information is removed.
Then, in the same way, a second fault graph of each node is obtained from the historical fault data of each node, and the fault detection model is trained with the second fault graphs, so that the model can detect, from the second fault graph corresponding to each node's real-time monitoring information, the possible fault types of that node and their fault probabilities. The second fault graph corresponding to a node's real-time monitoring information is obtained by pruning the node's first fault graph using the real-time monitoring information. The manner in which the first fault graph is pruned into the second fault graph will be described below.
By the method, the fault detection model is utilized to analyze and detect the monitoring information of each node in the cluster network in real time, and the possible fault types of each node and the corresponding fault probability are determined.
In addition, if node X and node Y are interconnected, or the traffic or data they process have a dependency relationship (e.g., the input data of node X depends on the output data of node Y), i.e., node X and node Y are adjacent, then the same type of fault is more likely to occur at both. Therefore, after the fault detection model detects the faults occurring in the cluster network and the possible fault types of each node and their probabilities are obtained, a certain node can be assumed to be a root cause node, and the probability of each fault type of that node can be adjusted according to the probabilities of the same fault types at its adjacent nodes. The probabilities of each node in the cluster network are adjusted in the same way; then, from the adjusted fault types and probabilities of all nodes, the node with the highest fault probability is selected as the root cause node of the cluster network fault, and the fault type of the root cause node is taken as the fault type of the cluster network.
By this method, when the cluster network fails, the fault type of the failure and the corresponding root cause node can be detected, so that developers can take targeted measures, eliminate the fault, and effectively maintain the operation of the cluster network.
In some implementations, in order to locate the fault type and the root cause node of the cluster network more quickly, a random walk algorithm is used: according to the adjusted fault types and fault probabilities of each node, the node with the largest number of arrivals is selected as the root cause node of the cluster network, and the fault type with the largest probability at that node is the fault type of the cluster network. The manner in which the root cause node is determined using the random walk algorithm will be described below.
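A minimal sketch of such a random-walk localization, assuming the walk is biased toward neighbors with higher adjusted fault probabilities; the patent defers the exact weighting, so the scheme below is an assumption.

```python
import random
from collections import Counter

def random_walk_root_cause(adjacency: dict, node_score: dict,
                           steps: int = 10000) -> str:
    """adjacency: node -> list of adjacent nodes; node_score: each node's largest
    adjusted fault probability. Walk toward higher-scoring neighbors and return
    the node arrived at most often as the root cause node."""
    current = random.choice(list(adjacency))
    visits = Counter()
    for _ in range(steps):
        neighbors = adjacency.get(current, [])
        if not neighbors:
            current = random.choice(list(adjacency))  # restart when stuck
            continue
        weights = [node_score.get(n, 1e-9) for n in neighbors]
        current = random.choices(neighbors, weights=weights, k=1)[0]
        visits[current] += 1
    return visits.most_common(1)[0][0]
```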
For convenience of description, the root cause node is referred to as a first class node, the affected node is referred to as a second class node, and the possible fault types and probability of each node of the cluster network obtained after the fault detection model detection are referred to as first fault information.
In order to facilitate understanding the implementation process of the fault detection method of the present application, the following description is provided with reference to other figures on the basis of the cluster network structure shown in fig. 1.
The method for constructing the first fault map in the fault detection method of the present application will be described first. The execution body of the method corresponding to each embodiment of the following figures may be any device in the above-mentioned clustered network or any device outside the clustered network, and the execution body is exemplified as the network device 103 for convenience of understanding. Fig. 4 shows a method for constructing a first fault diagram, as shown in fig. 4, the method includes:
and 401, collecting monitoring information. In some implementations, the monitoring information of the clustered network may be collected by the data collection device 102 described above. The monitoring information can be log alarm data of the cluster network, network bandwidth of communication between devices, time delay data, time/iteration duration of operation of any device in the cluster network and the like.
And 402, classifying the monitoring information according to preset rules. In some implementations, the acquired monitoring information may be partitioned according to respective attribute information. For example, jitter delay information representing a communication state between devices in a clustered network may be divided into a first type of monitoring information, monitoring information representing a loss of communication data between devices may be divided into a second type of monitoring information, monitoring information representing a network congestion may be divided into a third type of monitoring information, and monitoring information representing a link failure between devices may be divided into a fourth type of monitoring information.
Illustratively, the correspondence between each monitoring information and the category to which each monitoring information belongs may be as shown in table 1A below:
Table 1A Correspondence between each monitoring information and its category
It will be appreciated that the above classification is merely an example, and in other implementations, the monitoring information may be classified according to other classification rules, and the number of classified categories may be less than the four categories, or may be more than the four categories, which is not limited in this application.
403, associating the classified monitoring information with the network layer to which it belongs, to obtain a first fault graph. It will be appreciated that a node can generally be divided into an application layer, a message passing interface (MPI) layer, a remote direct memory access (RDMA) layer, an IP/ETH (Internet Protocol/Ethernet) layer, and so on, and each piece of monitoring information may reflect whether a certain service at a certain layer of the node is normal. Thus, associating each piece of monitoring information with the network layer to which it belongs can be understood as actually acquiring the monitoring information of each layer of the node.
And then, based on the classified monitoring information and the network layer to which the monitoring information belongs, obtaining a first fault diagram shown in fig. 5. Wherein numerals 1 to N in circles represent various monitoring information.
It can be appreciated that the purpose of 402 and 403 in the method of the present application is to allow the exact location of a fault to be located quickly from the fault graph, and to help developers understand the various faults occurring at each layer after a certain type of fault occurs, so that they can quickly remove the corresponding fault. In some implementations, the monitoring information may instead be left unclassified, and the first fault graph shown in fig. 5 built directly from the monitoring information rather than from the monitoring information and the layers to which it belongs. The difference between fig. 5 and figs. 2, 3a and 3b is thus that fig. 5 has explicit network layers and a classification of the monitoring information, whereas figs. 2, 3a and 3b are built directly from the monitoring information; in essence, all are fault graphs formed from monitoring information reflecting whether the node is faulty. The present application is not limited in this regard.
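A minimal sketch of steps 401 to 403, assuming a plain dictionary representation of the first fault graph; the metric names and their category/layer assignments below are hypothetical stand-ins for the correspondence of Table 1A.

```python
# Hypothetical category and network-layer mapping for a few example metrics; the
# real correspondence is the one given in Table 1A of this application.
CATEGORY = {
    "ping_jitter": "communication_state",   # first type of monitoring information
    "rx_packet_loss": "data_loss",          # second type
    "back_pressure_frames": "congestion",   # third type
    "link_down_alarm": "link_failure",      # fourth type
}
LAYER = {
    "ping_jitter": "IP/ETH",
    "rx_packet_loss": "RDMA",
    "back_pressure_frames": "RDMA",
    "link_down_alarm": "IP/ETH",
}

def build_first_fault_graph(metrics):
    """First fault graph of a node: every known monitoring item, annotated with its
    category (step 402) and the network layer it belongs to (step 403)."""
    return {m: {"category": CATEGORY.get(m, "unclassified"),
                "layer": LAYER.get(m, "unknown")}
            for m in metrics}

first_graph = build_first_fault_graph(list(CATEGORY))
```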
After the first fault graph of each node is obtained through the method of fig. 4, the first fault graph can be pruned using the actual fault data of the node to obtain a second fault graph representing the actual fault condition of the node. The process of fault propagation graph instantiation for each node is described below in connection with FIG. 6. As shown in fig. 6, the method includes:
601, acquiring monitoring information of a node. In some implementations, the monitoring information within the first preset duration may be acquired by the data acquisition device 102, so as to obtain monitoring information that can more stably represent whether the node is faulty or not. The first preset duration is an empirical value or an experimental value, and the value of the first preset duration can be 5 seconds, for example.
In some implementations, the monitoring information of the node includes key performance indicator (KPI) information, log information, and PING dial testing alarm information of the node. The KPI information of the node may include various KPI values, such as the node's PING bandwidth value and RDMA bandwidth value. The log information may likewise include various types of log information, such as link anomaly log information, packet loss log information, and packet error log information.
Illustratively, in some implementations, the amount of backpressure frame variation obtained within 5 seconds may be as shown in fig. 7, where the horizontal axis represents time and the vertical axis represents the amount of backpressure frame variation corresponding to a certain moment. In some implementations, the acquisition of the original link anomaly log information within 5 seconds may be as shown in fig. 8, where the horizontal axis represents time and the vertical axis represents the amount of the log information corresponding to a certain moment. In some implementations, the acquisition of the original PING dial testing alarm information within 5 seconds may be as shown in fig. 9, where the horizontal axis represents time and the vertical axis represents PING dial testing alarm information corresponding to a certain moment.
In some implementations, in order to facilitate comparing the monitoring information with the corresponding threshold condition, and obtaining the abnormal feature vector of the node according to the comparison result, the collected back pressure frame variation, the original link abnormal log information, the original PING value dial test alarm information and the like may be processed to obtain a more concise diagram.
For example, taking the back pressure frame variation as an example, a back pressure frame variation greater than or equal to the back pressure frame variation threshold may be denoted by "1", and one below the threshold by "0". The back pressure frame variation threshold is an empirical value or an experimental value. Then, for fig. 7, the binarized back pressure frame variation within 5 seconds can be obtained as shown in fig. 10, where the horizontal axis still represents time and the vertical axis represents the binarized value. Similar methods can be adopted to normalize other KPI values and obtain the corresponding simplified diagrams.
It will be appreciated that in other implementations, similar normalization may be performed in other ways. For example, for link anomaly log information, "1" may indicate the presence of link anomaly log information and "0" its absence. Or "0" may indicate no link anomaly log information, "1" link anomaly log information whose quantity is smaller than a certain value, and "2" link anomaly log information whose quantity is larger than that value, and so on. The present application is not limited in this regard.
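A minimal sketch of this simplification, assuming per-metric thresholds; the multi-level grading of log counts follows the example above, and the parameter names are hypothetical.

```python
def binarize(series, threshold):
    """'1' if the sample reaches the threshold (e.g. a back pressure frame variation
    greater than or equal to its threshold), '0' otherwise."""
    return [1 if v >= threshold else 0 for v in series]

def grade_log_count(count, cutoff):
    """Multi-level variant: 0 = no link anomaly logs; 1 = some, below the cutoff;
    2 = at or above the cutoff."""
    if count == 0:
        return 0
    return 1 if count < cutoff else 2
```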
602, comparing the monitoring information with the corresponding conditions, and obtaining the abnormal feature vector of the node according to the comparison result. In some implementations, since the monitoring information includes several kinds of information, namely the node's KPI information, log information, and PING dial testing alarm information, each kind needs to be compared with its own condition: for example, whether the KPI information satisfies a preset first condition, whether the log information satisfies a preset second condition, and whether the PING dial testing alarm information satisfies a preset third condition.
In addition, since the KPI information of a node often includes multiple KPI values, for example a PING bandwidth value, a back pressure frame variation, and a packet loss amount at the receiving end, each KPI value also needs to be judged separately when comparing the KPI information with the preset first condition, i.e., each KPI value is compared with its own threshold condition. For example, the PING bandwidth value of the node is compared with a PING bandwidth threshold condition, the back pressure frame variation with a back pressure frame variation threshold condition, and the packet loss amount at the receiving end with a receiving-end packet loss threshold condition.
In some implementations, when comparing a certain KPI value, the KPI feature value of that KPI value within the first preset duration may be determined according to the preset KPI change feature set corresponding to that KPI value; the KPI feature value is then compared with the preset KPI feature value. If they are consistent, the KPI value is normal; if they are inconsistent, the KPI value is abnormal. The preset KPI change feature set is obtained from the collected historical data of the various KPI values of each node, according to the change characteristics of those historical data; the preset KPI change feature set corresponding to a certain KPI value can represent the possible changes of that KPI value under any condition.
For example, taking a KPI value that is the PING bandwidth value, assume that the preset KPI change feature set of the PING bandwidth value within 5 seconds includes three variation feature values "1", "2", "3", as shown in fig. 11, and that the observed variation feature of the PING bandwidth value, normalized as in fig. 10, corresponds to the feature value "1". If the preset KPI feature value corresponding to the PING bandwidth value is "1", i.e., the observed PING bandwidth change is the normal change of the cluster network when no fault occurs, then the PING bandwidth value is normal. If the preset KPI feature value corresponding to the PING bandwidth value is "2", i.e., the observed PING bandwidth change is an abnormal change corresponding to a fault in the cluster network, then the PING bandwidth value is abnormal. The preset KPI feature value is an empirical value or an experimental value and can be set by developers according to experience. The above feature values "1", "2" and "3" are used to denote the numerical variation of a certain KPI value, for example the waveform shape corresponding to that KPI value; more specifically, the feature value "1" denotes the first waveform diagram shown in fig. 11, the feature value "2" the second, and the feature value "3" the third. In other implementations, the waveform shape corresponding to a KPI value may be represented by other user-defined values, which is not limited in this application.
Similarly, the KPI feature value corresponding to each KPI value can be obtained by the same method. In some implementations, "0" may be used where a KPI feature value is consistent with its preset KPI feature value, and "1" where it is inconsistent, so as to obtain the abnormal feature vector of the node's KPI information. For example, assume the node's KPI information includes 3 KPI values (a PING bandwidth value, an RDMA bandwidth value, and an MPI bandwidth value), that their KPI feature values are [1, 2, 3], and that their preset KPI feature values are [1, 1, 2]. The abnormal feature vector of the node's KPI information obtained by comparison is then [0, 1, 1]; that is, in the node's KPI information, the PING bandwidth value is normal, the RDMA bandwidth value is abnormal, and the MPI bandwidth value is abnormal.
And for the log information and the PING value dial testing alarm information, the abnormal feature vector corresponding to the log information and the abnormal feature vector corresponding to the PING value dial testing alarm information can be obtained in a similar way. Illustratively, it is assumed that the log information includes 2 kinds of packet loss log information and link interruption log information. Assuming that the characteristic value of the data packet loss log information in the first preset duration is 1 and the characteristic value of the link interrupt log information is also 1, the characteristic values corresponding to the two types of log information are [1,1]. And if the preset characteristic value corresponding to the two types of log information is [0,0], comparing to know that the abnormal characteristic vector corresponding to the log information is [1,1], namely, in the log information of the node, the data packet loss log information and the link interruption log information are abnormal. For example, if the characteristic value of the PING value dial testing alarm information in the first preset duration is 3, and if the preset characteristic value corresponding to the PING value dial testing alarm information is 2, the abnormal characteristic vector of the PING value dial testing alarm information is 1, that is, the PING value dial testing alarm information of the node is abnormal.
In this way, the abnormal feature vectors of the node's KPI information, log information and PING value dial-test alarm information are obtained. Continuing with the examples above, with the abnormal feature vector [0,1,1] of the KPI information, the abnormal feature vector [1,1] of the log information and the abnormal feature vector [1] of the PING value dial-test alarm information, the abnormal feature vector of the node is [0,1,1,1,1,1].
It can be understood that the length of the abnormal feature vector of a node depends on the numbers of KPI information items, log information items and PING value dial-test alarm information items corresponding to the node. For example, if the node has 7 kinds of KPI information, 6 kinds of log information and 4 kinds of PING value dial-test alarm information, the abnormal feature vector of the node may be [1,1,1,0,0,0,1,0,1,1,1,1,0,0,1,1,0]. The present application is not limited in this regard.
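For illustration only, the comparison just described can be sketched in a few lines of code; the feature values, presets and helper name below are hypothetical stand-ins for the patent's data, not part of the disclosed method.

```python
# A minimal sketch of building a node's anomaly feature vector by comparing
# observed feature values against their presets. All names and sample values
# are illustrative assumptions.
from typing import List

def anomaly_vector(observed: List[int], preset: List[int]) -> List[int]:
    """Return 0 where an observed feature value matches its preset (normal)
    and 1 where it differs (abnormal)."""
    return [0 if o == p else 1 for o, p in zip(observed, preset)]

# KPI feature values vs. presets -> [0, 1, 1], as in the example above.
kpi = anomaly_vector([1, 2, 3], [1, 1, 2])
# Log feature values vs. presets -> [1, 1].
log = anomaly_vector([1, 1], [0, 0])
# PING dial-test alarm feature value vs. preset -> [1].
ping = anomaly_vector([3], [2])

node_vector = kpi + log + ping
print(node_vector)  # [0, 1, 1, 1, 1, 1]
```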
603, pruning the first fault graph of the node according to the abnormal feature vector of the node to obtain a second fault graph of the node. That is, the first fault graph of the node is instantiated using the abnormal feature vector, yielding a second fault graph that conforms to the node's abnormal features.
For example, take the first fault graph of the node shown in fig. 5 above, in which circles 1 to 7 represent KPI information, circles 8 to 13 represent log information, and circles 14 to 17 represent PING value dial-test alarm information.
Assuming that the abnormal feature vector of the node is [1,0,1,1,0,0,1,1,0,1,1,0,0,0,0,0,1], the KPI values represented by circles 1, 3, 4 and 7 are abnormal, the log information represented by circles 8, 10 and 11 is abnormal, the PING value dial-test alarm information represented by circle 17 is abnormal, and the other information is normal. Therefore, the first fault graph shown in fig. 5 is pruned so that only the abnormal KPI information, log information and PING value dial-test alarm information is retained; that is, as shown in fig. 12, the dotted circles represent the normal KPI information, log information and PING value information, and after the normal information is pruned, the second fault graph of the node shown in fig. 13 is obtained.
The above is the process of obtaining the second fault graph of a node from its first fault graph, that is, of instantiating the first fault graph. In this way, a fault graph conforming to the node's abnormal features is obtained. Moreover, because the second fault graph is pruned from the node's first fault graph, and the first fault graph is comprehensive (it includes all known monitoring information), the second fault graph still includes all of the collected information capable of indicating whether the node is faulty, so that the fault condition of the node can be analyzed accurately later.
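As a rough sketch of this instantiation step, the following code prunes a hypothetical first fault graph down to its abnormal circles; the adjacency encoding and edge list are invented for illustration, since the text does not specify how the graphs in the figures are stored.

```python
# A minimal sketch, assuming the first fault graph is stored as an adjacency
# dict over circle ids and the anomaly vector is 1-indexed by circle id.
from typing import Dict, List, Set

def prune_fault_graph(graph: Dict[int, List[int]],
                      anomaly: List[int]) -> Dict[int, List[int]]:
    """Keep only the circles flagged abnormal (vector entry 1) and drop
    every edge that touches a pruned (normal) circle."""
    abnormal: Set[int] = {i + 1 for i, v in enumerate(anomaly) if v == 1}
    return {n: [m for m in nbrs if m in abnormal]
            for n, nbrs in graph.items() if n in abnormal}

# Hypothetical first fault graph: circles 1-7 are KPIs, 8-13 logs,
# 14-17 dial-test alarms; the edges are invented for illustration.
first_graph = {1: [8], 2: [9], 3: [10, 11], 4: [11], 5: [12], 6: [13],
               7: [17], 8: [14], 10: [15], 11: [17], 17: []}
anomaly = [1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1]
second_graph = prune_fault_graph(first_graph, anomaly)
print(second_graph)  # only abnormal circles 1, 3, 4, 7, 8, 10, 11, 17 remain
```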
The above describes the process of constructing a node's first fault graph and pruning it to obtain a second fault graph. The following describes how the second fault graphs of the nodes are used to train a fault detection model, and how that model is used to detect node faults in the cluster network.
First, a method of training a fault detection model using a second fault graph of nodes is described. As shown in fig. 14, the method includes:
1401, obtain historical fault data of each node in the cluster network. The historical fault data of each node may be obtained by the same method as the node monitoring information described above, which is not repeated here. The fault data of each node includes the KPI information, log information and PING value dial-test alarm information that were abnormal on that node.
In some implementations, the historical fault data of each node includes the fault data corresponding to each type of fault the node has experienced, together with the attribute of the node under that fault type. The attribute of a node indicates whether the node is the root cause node or an affected node, where an affected node is a node that fails because the failure of the root cause node affects it. The root cause node is called a first type node, and an affected node is called a second type node.
For example, as shown in fig. 15, the historical fault data of node A includes the fault data of node A for the first type of fault, the fault data for the second type of fault, …, and the fault data for the Nth type of fault, together with the attribute of node A under the first type of fault (e.g., node A is a first type node), the attribute of node A under the second type of fault (e.g., node A is a second type node), …, and the attribute of node A under the Nth type of fault (e.g., node A is a first type node).
1402, obtain the second fault graph of each node based on the historical fault data of each node and the first fault graph of each node. The first fault graph of each node may be constructed with reference to 401 to 403 above. After the historical fault data of each node is obtained, the abnormal feature vector of each node can be obtained through 602, and the first fault graph of each node can then be pruned using 603 to obtain the second fault graph of each node. More specifically, since the historical fault data of each node includes the fault types the node has experienced and the node's attribute under each fault type, a second fault graph of each node can be obtained for each fault type.
For example, continuing with the historical fault data of node A shown in fig. 15, a correspondence diagram between node A, the fault types and node A's attributes can be obtained via 1402, as shown in fig. 16. Specifically, as shown in fig. 16, node A as a first type node under the first type of fault corresponds to a second fault graph 1601, node A as a second type node under the second type of fault corresponds to a second fault graph 1602, …, and node A as a second type node under the Nth type of fault corresponds to a second fault graph 160N.
1403, train the fault detection model based on the second fault graphs and the historical fault data of each node. It can be understood that, because the historical fault data includes the fault types of each node and the node's attribute under each fault type, and 1402 yields a second fault graph of each node for each fault type, the fault detection model can be trained using, for each node and each fault type, the second fault graph together with the corresponding node attribute.
In some implementations, the fault detection model may be a convolutional neural network model or other neural network model that may be used for machine learning, which is not limited in this application.
Taking the fault detection model as a convolutional neural network model as an example, in some implementations the second fault graphs, the fault types corresponding to them and the attribute information of the nodes may be divided into training data, validation data and test data; the training data is then used to train the fault detection model, the validation data is used to validate it, and finally the test data is used to check whether the fault detection model meets the requirements. The way of dividing the training data, validation data and test data may be set empirically by developers; for example, 60% of the second fault graphs together with their fault types and node attribute information may be taken as training data, 20% as validation data and 20% as test data. In some implementations, other divisions may be used, for example dividing the data only into 80% training data and 20% test data. It should be understood that model training is mature prior art, and those skilled in the art know the data partitioning methods involved without creative effort, so the application is not limited thereto.
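The 60/20/20 division described above might look as follows in outline; the sample structure and the use of a plain shuffle are assumptions, since the text leaves the partitioning method open.

```python
# A minimal sketch of the 60/20/20 split, with placeholder samples standing
# in for (second fault graph, fault type, node attribute) triples.
import random

samples = [{"graph": f"g{i}", "fault_type": i % 5, "attr": i % 2}
           for i in range(100)]  # dummy corpus

random.seed(0)
random.shuffle(samples)

n = len(samples)
train = samples[:int(0.6 * n)]            # 60% training data
val = samples[int(0.6 * n):int(0.8 * n)]  # 20% validation data
test = samples[int(0.8 * n):]             # 20% test data
print(len(train), len(val), len(test))    # 60 20 20
```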
In some implementations, during the training of the fault detection model, the first output data corresponding to input data needs to be compared with the preset output data corresponding to that input data, where the first output data is the output produced by the fault detection model, and the preset output data is the reference data for that input, namely the fault type corresponding to a certain second fault graph of a certain node and the attribute of that node. If the similarity between the first output data and the preset output data is greater than a first threshold, the detection result of the fault detection model is sufficiently accurate. If the similarity is less than the first threshold, the detection result is inaccurate; the training data must then continue to be used to train the model, while parameters such as the weights and biases of each network layer are adjusted, until the similarity between the first output data and the preset output data exceeds the first threshold, that is, until the detection result of the fault detection model meets the requirement. The first threshold is an empirical or experimental value. In some implementations, the similarity between the first output data and the preset output data may be calculated by extracting the feature vectors corresponding to both (i.e., vectorizing them) and computing the Euclidean distance, cosine distance or Hamming distance between the two feature vectors. The vectorization of the first output data and the preset output data belongs to the prior art and is known to those skilled in the art without creative effort, so it is not repeated here.
For example, continuing with the correspondence diagram between node A, the fault types and node A's attributes shown in fig. 16: take the second fault graph corresponding to the first type of fault of node A in the training data as the input data, and the first type of fault together with node A's attribute under it (a first type node) as the preset output data; input the second fault graph into the fault detection model to obtain the first output data, compare the similarity between the first output data and the preset output data, and determine whether the fault detection model meets the requirement according to the relationship between that similarity and the similarity threshold.
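A minimal sketch of this similarity check, assuming cosine similarity over hypothetical output encodings and an assumed value for the first threshold, is:

```python
# Sketch only: vectorize the model output and the reference label, then
# compare cosine similarity to a threshold. The encodings and the threshold
# value are assumptions, not the patent's exact data.
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

first_output = [0.9, 0.1, 0.0]   # model's (fault type, attribute) encoding
preset_output = [1.0, 0.0, 0.0]  # reference encoding for the same input
FIRST_THRESHOLD = 0.95           # hypothetical value of the first threshold

if cosine_similarity(first_output, preset_output) > FIRST_THRESHOLD:
    print("detection accurate enough; stop training")
else:
    print("keep training and updating weights/biases")
```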
In some implementations, the failure detection model after training is completed can be expressed as the following equation 1:
wherein |X_{i,k}, X_{j,k}| represents the similarity between two fault graphs under a certain fault type for a certain node. In X_{i,k}, X denotes a node (for example, X may be A in fig. 16, denoting node A), i denotes the second fault graph corresponding to a certain type of fault of node X (for example, the second fault graph corresponding to the first type of fault of node A), and k denotes a certain kind of monitoring information of the node (for example, the first type of monitoring information of node A). T_k represents the features included in the type-k monitoring information; for example, the first type of monitoring information of node A shown in fig. 5 includes circles 1, 2, 3, 4, 7, 8 and 9, so its feature vector is [1,1,1,1,0,0,1,1,1]. b_i represents the abnormal feature vector of certain information b in the second fault graph; for example, b may represent the KPI information of node A, which, as shown in fig. 5, covers circles 1 to 7, so b_i = [1,0,1,1,0,0,1].
In other words, when the fault detection model is used for fault detection, the second fault graph of a node, obtained from the node's real-time monitoring information, is compared for similarity against each second fault graph obtained from the historical fault data of the nodes in the cluster network, yielding the similarity between the node's real-time second fault graph and each historical second fault graph.
A method for detecting node faults in the cluster network using the fault detection model described above is described below with reference to fig. 17; for content corresponding to that shown in fig. 4, 6 and 14, refer to the related description above, which is not repeated. As shown in fig. 17, the method includes:
1701, acquiring real-time monitoring information of the cluster network.
1702, obtaining a second fault diagram of each node based on the real-time monitoring information of each node and the first fault diagram of each node.
1703, obtaining the first fault information of each node by using the fault detection model, based on the second fault graph of each node. The first fault information of each node includes the possible fault types of the node, the probability corresponding to each fault type, and the node's attribute under each fault type. For example, the first fault information of node B may include the probability that node B has the first type of fault, the probability that node B has the second type of fault, …, and the probability that node B has the Nth type of fault, together with the attribute of node B under the first type of fault (e.g., node B may be a first type node under the first type of fault), the attribute of node B under the second type of fault (e.g., node B may be a second type node under the second type of fault), …, and the attribute of node B under the Nth type of fault (e.g., node B may be a first type node under the Nth type of fault).
Illustratively, for ease of understanding, assume that the second fault graphs obtained from the real-time monitoring information of nodes B, C, D and E are detected using the fault detection model, and the first fault information of nodes B, C, D and E is obtained as shown in fig. 18. After fault detection, the correspondence between the possible fault types of node B, the node attributes and the probabilities is shown in table 1 below:
Table 1 Correspondence between the possible fault types of node B, node attributes and probabilities
In some implementations, to facilitate comparison and calculation, the fault types, node attributes and probabilities corresponding to a node may be expressed in the form of a matrix {t1, t1', t2, t2', t3, t3', …, tn, tn'}, where tn represents the probability that the node is a first type node under the nth type of fault, and tn' represents the probability that the node is a second type node under the nth type of fault. Illustratively, the fault types, node attributes and probabilities corresponding to node B in table 1 above may be represented in matrix form as {0.1,0.5,0,0, …,0}.
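A small sketch of this flattening, assuming four fault types and treating the node-B numbers above as given, might be:

```python
# Sketch only: flatten a node's first fault information into the
# {t1, t1', t2, t2', ...} form described above. The per-type pairs below
# are illustrative assumptions.
def flatten(per_type):
    """per_type: list of (p_first_type, p_second_type) pairs, one pair per
    fault type; returns [t1, t1', t2, t2', ...]."""
    flat = []
    for t, t_prime in per_type:
        flat += [t, t_prime]
    return flat

node_b = flatten([(0.1, 0.5), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0)])
print(node_b)  # [0.1, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```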
After fault detection, the correspondence between the possible fault types of node C, the node attributes and the probabilities is shown in table 2 below:
Table 2 Correspondence between the possible fault types of node C, node attributes and probabilities
Also, illustratively, the fault type, node attribute, and probability corresponding to node C in table 2 above are represented in the form of a matrix, which may be: {0.2,0.4,0.75,0.3, …,0.7,0.2}.
After fault detection, the correspondence between the possible fault types of node D, the node attributes and the probabilities is shown in table 3 below:
Table 3 Correspondence between the possible fault types of node D, node attributes and probabilities
Also, illustratively, the fault type, node attribute, and probability corresponding to node D in table 3 above are represented in the form of a matrix, which may be: {0.8,0.6,0.3,0.7, …,0.01,0.03}.
After fault detection, the correspondence between the possible fault types of node E, the node attributes and the probabilities is shown in table 4 below:
Table 4 Correspondence between the possible fault types of node E, node attributes and probabilities
Also, illustratively, the fault type, node attribute, and probability corresponding to node E in table 4 above are represented in the form of a matrix, which may be: {0.1,0.01,0.8,0.5, …,0.5,0.6}.
1704, adjusting the first fault information of each node using the first fault information of its adjacent nodes, so as to obtain the second fault information of each node. It can be understood that when a node fails, its adjacent nodes are more likely to fail as well, and the fault types of adjacent nodes tend to be the same. Therefore, the first fault information of a node can be adjusted using the first fault information of its adjacent nodes to obtain the node's second fault information; the same method is then applied to adjust the first fault information of every node. In some implementations, adjacent nodes include nodes that have a physical communication relationship with each other, e.g., two nodes connected by a network cable are adjacent to each other. In other implementations, adjacent nodes also include nodes that have service or data dependencies on each other, e.g., if the input data of one node depends on the output data of another node, the two nodes are also adjacent to each other.
Illustratively, taking the first fault information of nodes B, C, D and E shown in fig. 18 as an example, the adjacent nodes of node B are node D and node E. The first fault information of node B is represented as the matrix {0.1,0.5,0,0, …,0}, the first fault information of node D as the matrix {0.8,0.6,0.3,0.7, …,0.01,0.03}, and the first fault information of node E as the matrix {0.1,0.01,0.8,0.5, …,0.5,0.6}.
Assume that the first fault information of node B is adjusted using the first fault information of node D and node E, i.e., assume that node B is a first type node and that node D and node E are second type nodes. The fault detection model detects that the possible fault types of node B include the first type of fault, and that the fault types of node D and of node E each include the first type of fault, the second type of fault, …, and the Nth type of fault. For node B the probability of the first type of fault is therefore higher, so the probability that node B is a first type node under the first type of fault is raised, for example from 0.1 by a preset amplitude to 0.5. The preset amplitude is an empirical or experimental value and can be set by developers as required. For example, the developers may tie the preset amplitude to the probabilities of the adjacent nodes being second type nodes: the initial value of the preset amplitude is 0, and for each adjacent node whose probability of being a second type node exceeds 50%, the preset amplitude increases by 10%; thus if multiple (e.g., 3) adjacent nodes each have a probability of being a second type node greater than 50%, the preset amplitude is 30%. The present application is not limited in this regard.
Then, using the same method for the other fault types, the probability that node B is a first type node is adjusted accordingly, so that the adjusted second fault information of node B is obtained. Illustratively, the adjusted second fault information of node B may be {0.5,0.5,0.5,0, …,0.25,0}.
Similarly, for the other nodes in the cluster network, the second fault information of each node can be obtained by the same adjustment. Illustratively, assume that after adjustment the second fault information of node C is {0.45,0.4,1.1,0.3, …,0.71,0.2}, that of node D is {1.2,0.6,0.55,0.7, …,0.3,0.03}, and that of node E is {0.36,0.01,1.25,0.5, …,0.52,0.6}.
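The adjustment of 1704 can be sketched as follows; the 10%-per-neighbour rule follows the example above, while the data layout and the condition for applying the boost are assumptions:

```python
# Sketch only: raise the probability that a node is a first type node for
# fault types its neighbours may be affected by. Flat lists follow the
# {t1, t1', t2, t2', ...} layout; the neighbour data is hypothetical.
def adjust(node_info, neighbour_infos):
    adjusted = list(node_info)
    for k in range(0, len(node_info), 2):  # k -> tk, k+1 -> tk'
        # +10% for each neighbour likely (>50%) to be a second type node
        boost = sum(0.10 for nb in neighbour_infos if nb[k + 1] > 0.5)
        if node_info[k] > 0:               # node may be first type here
            adjusted[k] = node_info[k] + boost
    return adjusted

node_b = [0.1, 0.5, 0.0, 0.0]
node_d = [0.8, 0.6, 0.3, 0.7]
node_e = [0.1, 0.01, 0.8, 0.5]
print(adjust(node_b, [node_d, node_e]))  # [0.2, 0.5, 0.0, 0.0]
```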
1705, determining a plurality of first type nodes corresponding to faults of the cluster network according to the second fault information of each node. After the second fault information of each node is obtained, a plurality of first type nodes can be determined according to the probabilities in the second fault information of each node. In some implementations, for each fault type, the nodes whose probability of being a first type node is smaller than the first threshold can be removed, yielding the plurality of first type nodes corresponding to each fault type. The first threshold is an empirical or experimental value; for example, it may be 0.5. For example, taking the second fault information adjusted in 1704 above, i.e., node B {0.5,0.5,0.5,0, …,0.25,0}, node C {0.45,0.4,1.1,0.3, …,0.71,0.2}, node D {1.2,0.6,0.55,0.7, …,0.3,0.03} and node E {0.36,0.01,1.25,0.5, …,0.52,0.6}, and removing, for each fault type, the nodes whose probability of being a first type node is less than 0.5, the correspondence between the fault types, the first type nodes corresponding to each fault type and the probabilities shown in table 5 below is obtained.
Table 5 Correspondence between fault types, the first type nodes corresponding to each fault type, and probabilities
That is, in the cluster network: for the first type of fault, the probability that node B is the first type node is 0.5 and the probability that node D is the first type node is 1.2; for the second type of fault, the probability that node B is the first type node is 0.5, the probability that node D is the first type node is 0.55, and the probability that node E is the first type node is 1.25; and for the Nth type of fault, the probability that node C is the first type node is 0.71 and the probability that node E is the first type node is 0.52.
1706, determining whether, among the plurality of first type nodes in the cluster network, there is a first type node whose fault probability is greater than the fault probability threshold. The fault probability threshold is an empirical or experimental value; for example, it may be 0.8. That is, the nodes most likely to actually be first type nodes are selected from the plurality of first type nodes.
1707, when there is a first type node whose fault probability is greater than the fault probability threshold among the plurality of first type nodes in the cluster network, taking the fault type corresponding to the first type node with the largest fault probability as the fault type of the cluster network. That is, if the node most likely to be the first type node can be screened out, the fault type corresponding to that node is the fault type of the cluster network.
For example, taking the fault types and their corresponding first type nodes shown in table 5 above, and removing the first type nodes whose fault probability is smaller than 0.8, the first type nodes, their corresponding fault types and probabilities shown in table 6 below are obtained:
Table 6 First type nodes, their corresponding fault types and probabilities

First type node | Fault type | Probability
Node D | First type of fault | 1.2
Node E | Second type of fault | 1.25
…… | …… | ……
That is, if the first type of fault occurs in the cluster network, the probability that its first type node is node D is 1.2; if the second type of fault occurs, the probability that its first type node is node E is 1.25; and so on. Since node E has the largest probability, the second type of fault corresponding to node E is the fault occurring in the cluster network, and the first type node of that fault is node E.
In some implementations, if there are a plurality of fault types whose fault probability is greater than the fault probability threshold, any one of them may be selected as the fault type of the cluster network, or all of them may be taken as fault types of the cluster network, which is not limited in this application.
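Steps 1705 to 1707 can be sketched together as a filter followed by an arg-max; the candidate list mirrors tables 5 and 6, and the two thresholds take the example values from the text:

```python
# Sketch only: keep first type node candidates that clear the first
# threshold, then pick the largest-probability candidate as the cluster's
# fault type. Candidate tuples are illustrative.
candidates = [  # (node, fault type, probability of being the first type node)
    ("B", "type 1", 0.5), ("D", "type 1", 1.2),
    ("B", "type 2", 0.5), ("E", "type 2", 1.25),
    ("C", "type N", 0.71), ("E", "type N", 0.52),
]
FIRST_THRESHOLD = 0.5        # example value from 1705
FAULT_PROB_THRESHOLD = 0.8   # example value from 1706

kept = [c for c in candidates if c[2] >= FIRST_THRESHOLD]
strong = [c for c in kept if c[2] > FAULT_PROB_THRESHOLD]
if strong:
    node, fault_type, p = max(strong, key=lambda c: c[2])
    print(f"cluster fault: {fault_type}, root cause node {node} (p={p})")
else:
    print("unknown fault type; hand off for manual investigation")
```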
1708, marking the fault type of the cluster network as an unknown fault type when there is no first type node whose fault probability is greater than the fault probability threshold among the plurality of first type nodes in the cluster network. It can be understood that if the node most likely to be a first type node cannot be screened out by the above method, the fault may belong to an unknown fault type. For an unknown fault type, the fault cause can be determined by manual investigation and then used as historical fault data, and the fault detection model can be retrained using the method shown in fig. 14 to facilitate detection of subsequent cluster network faults. The present application is not limited in this regard.
In other implementations, after the second fault information of each node is obtained through 1701 to 1704, a first type node and the fault type of the cluster network fault are determined from the nodes using a random walk algorithm. In a random walk, starting from any node of the graph, the walk moves to a neighbor of the current node with probability (1-a) and jumps to a random vertex of the graph with probability a, where a is the jump-forward probability. Each walk produces a probability distribution representing the probability that each vertex in the graph is visited; this distribution is used as the input of the next walk, and the process iterates. After the number of iterations of the random walk reaches the times threshold, a stable probability distribution is obtained; the node with the highest probability in this distribution can be regarded as the first type node of the cluster network, and the fault type with the highest fault probability for that node as a first type node is the fault type of the cluster network.
Specifically, the above-mentioned process includes:
(1) According to the second fault information of each node obtained through 1701 to 1704, add up the probabilities corresponding to each fault type when the node is a first type node to obtain the node's first fault probability, and add up the probabilities corresponding to each fault type when the node is a second type node to obtain the node's second fault probability.
For ease of understanding, nodes O, P and Q are taken as examples. Illustratively, assume that the second fault information corresponding to node O is as shown in table 7 below:
TABLE 7 second failure information for node O
The second fault information corresponding to the node P is shown in the following table 8:
table 8 second fault information corresponding to node P
The second fault information corresponding to node Q is shown in table 9 below:
table 9 second fault information corresponding to node Q
The first fault probability and the second fault probability of nodes O, P and Q are calculated according to tables 7 to 9 above. In some implementations, directly summing the fault probabilities of each fault type when a node is a first type node may yield a result exceeding 1, and likewise the sum for a node as a second type node may exceed 1. Therefore, to facilitate calculation, the sum of the fault probabilities when a node is a first type node and the sum when it is a second type node can be adjusted so that together they add up to 1. The specific adjustment mode is not limited in this application.
Illustratively, the first and second probabilities of failure of node O are shown in table 10 below:
Table 10 first and second failure probabilities of node O
The first failure probability and the second failure probability of the node P are shown in the following table 11:
table 11 first failure probability and second failure probability of node P
The first failure probability and the second failure probability of the node Q are shown in the following table 12:
table 12 first and second failure probabilities of node Q
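A minimal sketch of step (1), assuming the flat {t1, t1', …} layout from above and one possible normalization (scaling the two sums so they add to 1), is:

```python
# Sketch only: sum a node's first type and second type probabilities over
# all fault types, then scale the two sums to add to 1. The sample values
# are invented, since the table bodies are not reproduced in this text.
def failure_probs(flat):
    """flat = [t1, t1', t2, t2', ...]; return (first, second) fault
    probabilities scaled so that first + second == 1."""
    first = sum(flat[0::2])
    second = sum(flat[1::2])
    total = first + second
    return (first / total, second / total) if total else (0.0, 0.0)

node_o = [0.2, 0.4, 0.1, 0.5]  # hypothetical second fault information
print(failure_probs(node_o))   # approximately (0.25, 0.75)
```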
(2) Construct a relationship pointing graph among the nodes according to their second fault probabilities. Since the second fault probability of a node is the sum of the fault probabilities of each fault type when the node is a second type node (i.e., an affected node), the larger a node's second fault probability, the more likely the node is to have failed because the failures of other nodes affected it. Based on this, the relationship pointing graph between the nodes is built from the pointing relationship between nodes with larger second fault probabilities and nodes with smaller second fault probabilities (a node with a larger second fault probability points to a node with a smaller one).
Illustratively, continuing with the example of node O, node P and node Q, the relationship pointing graph between them may be as shown in fig. 19: node O (second fault probability 0.6) points to node P (second fault probability 0.5) and to node Q (second fault probability 0.2), and node P (second fault probability 0.5) points to node Q (second fault probability 0.2).
It will be appreciated that nodes O, P and Q above are merely exemplary; in implementations of the present application, the relationship pointing graph includes the pointing relationships between the nodes in the cluster network.
It can be understood that, in some implementations of the present application, the relationship pointing graph between the nodes may instead be constructed according to the first fault probabilities of the nodes, i.e., a node with a smaller first fault probability points to a node with a larger first fault probability; in that case the pointing relationships in the graph are opposite to those shown in fig. 19.
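A sketch of step (2) under these assumptions (the O/P/Q second fault probabilities from above and an invented adjacency list) might be:

```python
# Sketch only: draw an edge from the node with the larger second fault
# probability to the node with the smaller one, for each adjacent pair.
second_prob = {"O": 0.6, "P": 0.5, "Q": 0.2}
pairs = [("O", "P"), ("O", "Q"), ("P", "Q")]  # assumed adjacency

edges = []
for a, b in pairs:
    src, dst = (a, b) if second_prob[a] > second_prob[b] else (b, a)
    edges.append((src, dst))
print(edges)  # [('O', 'P'), ('O', 'Q'), ('P', 'Q')]
```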
(3) Using the random walk algorithm, start from any node and traverse the nodes, counting the nodes whose number of arrivals during the walk exceeds the times threshold.
It can be understood that, since the pointing relationships in the relationship pointing graph run from nodes with larger second fault probabilities to nodes with smaller ones, the first fault probability of each node is used during the random walk as the probability of switching to that node. The node reached or switched to most often during the random walk is thus the node with the smallest second fault probability, that is, the node least likely to have failed because of other nodes, namely the root cause node (the first type node).
For example, continuing with the relationship pointing graph formed by nodes O, P and Q shown in fig. 19, if the random walk starts from node P, the node reached next is more likely to be node Q, whose first fault probability is larger (and whose second fault probability is smaller).
It will be appreciated that the hops during the walk are random, so it is necessary to count the nodes whose number of arrivals exceeds the times threshold during the walk, and then determine from these nodes the node most likely to be the root cause node of the cluster network fault. The times threshold is an empirical or experimental value and may be determined according to the number of nodes in the cluster network, which is not limited in this application.
(4) Determine, from the nodes whose number of arrivals exceeds the times threshold, the first type node of the cluster network fault, and take the fault type with the largest fault probability for that first type node as the fault type of the cluster network.
In some implementations, a node with the highest arrival times may be selected from nodes with arrival times exceeding the threshold of times as a first type node of the cluster network where the fault occurs.
Then the fault type with the highest fault probability among the fault types corresponding to this first type node is selected as the fault type of the cluster network.
For example, assuming that the first type node of the cluster network fault determined through steps (1) to (3) above is node Q, it can be seen from table 9 above that when node Q is a first type node, the fault type with the highest fault probability is the second type of fault, so the fault type of the cluster network is the second type of fault.
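Steps (3) and (4) can be sketched as a weighted random walk; the jump probability, step count and first fault probabilities below are assumptions chosen only to make the O/P/Q example runnable:

```python
# Sketch only: walk over the relationship pointing graph, jumping with
# probability a and otherwise moving to a successor weighted by its first
# fault probability; the most visited node is taken as the root cause
# (first type) node. All numbers are illustrative.
import random
from collections import Counter

edges = {"O": ["P", "Q"], "P": ["Q"], "Q": []}
first_prob = {"O": 0.4, "P": 0.5, "Q": 0.8}  # hypothetical first fault probs
A = 0.15       # assumed jump-forward probability
STEPS = 10_000

random.seed(0)
visits = Counter()
node = "P"
for _ in range(STEPS):
    visits[node] += 1
    succ = edges[node]
    if not succ or random.random() < A:
        node = random.choice(list(edges))  # random jump to any vertex
    else:
        node = random.choices(succ, weights=[first_prob[s] for s in succ])[0]

root = visits.most_common(1)[0][0]
print(root)  # expected: "Q", the node least likely to be merely affected
```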
In the above method, the random walk algorithm performs a random walk over the relationship pointing graph formed according to the first or second fault probabilities of the nodes. Based on the pointing relationships in the graph (nodes with larger second fault probabilities point to nodes with smaller ones, or nodes with smaller first fault probabilities point to nodes with larger ones), the nodes whose arrival counts exceed the times threshold during the walk are obtained; the node with the highest arrival count among them is selected as the root cause node of the cluster network, and the fault type with the highest fault probability when that node is a first type node is taken as the fault type of the cluster network. It can be understood that the random walk algorithm obtains the fault type and root cause node of the cluster network more quickly, improving the efficiency of cluster network fault detection.
Fig. 20 shows a block diagram of an electronic device in an embodiment of the application. In one embodiment, the electronic device can include one or more processors 2004, system control logic 2008 coupled to at least one of the processors 2004, system memory 2012 coupled to the system control logic 2008, non-volatile memory (NVM)/storage 2016 coupled to the system control logic 2008, and a network interface 2020 coupled to the system control logic 2008.
In some embodiments, the processor 2004 may include one or more single-core or multi-core processors. In some embodiments, the processor 2004 may include any combination of general-purpose and special-purpose processors (e.g., graphics processor, application processor, baseband processor, etc.).
In some embodiments, system control logic 2008 may include any suitable interface controller to provide any suitable interface to at least one of processors 2004 and/or any suitable device or component in communication with system control logic 2008.
In some embodiments, system control logic 2008 may include one or more memory controllers to provide interfaces to system memory 2012. The system memory 2012 may be used for loading and storing data and/or instructions. The memory 2012 of the electronic device in some embodiments may include any suitable volatile memory, such as suitable dynamic random access memory (Dynamic Random Access Memory, DRAM).
NVM/storage 2016 may include one or more tangible, non-transitory computer-readable media for storing data and/or instructions. In some embodiments, NVM/storage 2016 may include any suitable non-volatile memory, such as flash memory, and/or any suitable non-volatile storage device, such as at least one of a hard disk drive (HDD), a compact disc (CD) drive and a digital versatile disc (Digital Versatile Disc, DVD) drive.
NVM/storage 2016 may include a portion of the storage resources on the apparatus on which the electronic device is installed, or it may be accessible by, but not necessarily part of, the device. For example, NVM/storage 2016 may be accessed over a network via the network interface 2020.
In particular, the system memory 2012 and the NVM/storage 2016 may each include a temporary copy and a permanent copy of instructions 2024. The instructions 2024 may include instructions that, when executed by at least one of the processors 2004, cause the electronic device to implement the methods described above. In some embodiments, the instructions 2024, or hardware, firmware and/or software components thereof, may additionally/alternatively be disposed in the system control logic 2008, the network interface 2020 and/or the processors 2004.
The network interface 2020 may include a transceiver to provide a radio interface for the electronic device to communicate with any other suitable device (e.g., a front-end module, an antenna, etc.) over one or more networks. In some embodiments, the network interface 2020 may be integrated with other components of the electronic device. For example, the network interface 2020 may be integrated into at least one of the processors 2004, the system memory 2012, the NVM/storage 2016, and a firmware device (not shown) having instructions which, when executed by at least one of the processors 2004, implement the methods shown in fig. 4, 6, 14 and 17 described above.
Network interface 2020 may further include any suitable hardware and/or firmware to provide a multiple-input multiple-output radio interface. For example, network interface 2020 may be a network adapter, a wireless network adapter, a telephone modem, and/or a wireless modem.
In one embodiment, at least one of the processors 2004 may be packaged together with logic for one or more controllers of the system control logic 2008 to form a system in package (SiP). In one embodiment, at least one of the processors 2004 may be integrated on the same die with logic for one or more controllers of the system control logic 2008 to form a system on chip (SoC).
The electronic device may further include an input/output (I/O) device 2032. The I/O device 2032 may include a user interface to enable a user to interact with the electronic device, and a peripheral component interface designed so that peripheral components can also interact with the electronic device. In some embodiments, the electronic device further includes a sensor for determining at least one of environmental conditions and location information associated with the electronic device.
In some embodiments, the peripheral component interface may include, but is not limited to, a non-volatile memory port, an audio jack, and a power interface.
The embodiment of the application also provides electronic equipment, which comprises: at least one processor, a memory, and a computer program stored in the memory and executable on the at least one processor, which when executed by the processor performs the steps of any of the various method embodiments described above.
Embodiments of the present application also provide a computer readable storage medium storing a computer program which, when executed by a processor, implements steps that may implement the various method embodiments described above.
Embodiments of the present application provide a computer program product which, when run on a mobile terminal, causes the mobile terminal to perform steps that may be performed in the various method embodiments described above.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the present application implements all or part of the flow in the methods of the above embodiments by instructing related hardware through a computer program; the computer program may be stored in a computer-readable storage medium, and when executed by a processor, implements the steps of each method embodiment described above. The computer program includes computer program code, which may be in source code form, object code form, an executable file, some intermediate form, etc. The computer-readable medium may include at least: any entity or device capable of carrying the computer program code to a photographing device/terminal apparatus, a recording medium, a computer memory, a read-only memory (ROM), a random access memory (RAM), electrical carrier signals, telecommunications signals, and software distribution media, such as a USB flash drive, a removable hard disk, a magnetic disk or an optical disc. In some jurisdictions, in accordance with legislation and patent practice, computer-readable media may not include electrical carrier signals and telecommunications signals.
In the foregoing embodiments, the description of each embodiment has its own emphasis; for parts not described or detailed in a certain embodiment, reference may be made to the related descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/network device and method may be implemented in other manners. For example, the apparatus/network device embodiments described above are merely illustrative, e.g., the division of the modules or units is merely a logical functional division, and there may be additional divisions in actual implementation, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection via interfaces, devices or units, which may be in electrical, mechanical or other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In the description above, for purposes of explanation and not limitation, specific details are set forth, such as particular system configurations, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It should be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in this specification and the appended claims, the term "if" may be interpreted in context as "when …" or "upon" or "in response to determining" or "in response to detecting". Similarly, the phrase "if a determination" or "if a [ described condition or event ] is detected" may be interpreted in the context of meaning "upon determination" or "in response to determination" or "upon detection of a [ described condition or event ]" or "in response to detection of a [ described condition or event ]".
In addition, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used merely to distinguish between descriptions and are not to be construed as indicating or implying relative importance.
Reference in the specification to "one embodiment" or "some embodiments" or the like means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," and the like in the specification are not necessarily all referring to the same embodiment, but mean "one or more but not all embodiments" unless expressly specified otherwise. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless expressly specified otherwise.
The above embodiments are only for illustrating the technical solution of the present application, and are not limiting; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included in the scope of the present application.
Claims (14)
1. A fault detection method applied to a plurality of nodes in a clustered network, comprising:
acquiring real-time monitoring information of a first node in a cluster network, and selecting abnormal real-time monitoring information from the real-time monitoring information;
inputting the abnormal real-time monitoring information of the first node into a fault detection model to obtain first fault information of the first node, wherein the first fault information comprises the estimated fault type of the first node and fault probabilities corresponding to the estimated fault types;
and determining the real fault type of the cluster network according to the estimated fault type of the first node and the fault probability corresponding to each estimated fault type.
2. The method of claim 1, wherein the first fault information further comprises the estimated fault types when the first node is a first type node and the fault probability corresponding to each estimated fault type, and the estimated fault types when the first node is a second type node and the fault probability corresponding to each estimated fault type,
the method further comprises the steps of:
correspondingly increasing, by using the fault probability of each estimated fault type when a second node is a second type node, the fault probability of the same estimated fault type when the first node is a first type node, to obtain second fault information of the first node, wherein the second node and the first node are adjacent nodes, and the second fault information comprises the estimated fault types when the adjusted first node is a first type node and the fault probability corresponding to each estimated fault type.
3. The method according to claim 2, wherein the determining the fault type of the cluster network according to the estimated fault types and the fault probability corresponding to each fault type comprises:
taking the fault type whose corresponding fault probability is larger than a first threshold as the fault type of the cluster network, according to the estimated fault types when the adjusted first node is a first type node and the fault probability corresponding to each estimated fault type.
4. The method according to claim 1, wherein the obtaining abnormal real-time monitoring information of the first node according to the real-time monitoring information includes:
and comparing the real-time monitoring information with a first preset condition, and deleting monitoring information meeting the first preset condition in the real-time monitoring information to obtain abnormal real-time monitoring information of the first node.
5. The method of claim 1, wherein the fault detection model is trained using historical anomaly monitoring information for each node, the historical anomaly monitoring information for each node including a type of fault that occurred when each node was a first type of node and a second type of node, respectively, and monitoring information corresponding to each type of fault.
6. The method of claim 5, wherein the historical anomaly monitoring information for each node is determined by:
Acquiring historical monitoring information of each node in a cluster network within a preset duration, comparing the historical monitoring information of each node with a second preset condition, and removing monitoring information meeting the second preset condition from the historical monitoring information to obtain historical abnormal monitoring information of each node, wherein the historical monitoring information of each node comprises monitoring information corresponding to the case that each node is a first type node and a second type node respectively.
7. A model training method applied to an electronic device, comprising:
acquiring history monitoring information of each node in a cluster network within a preset duration, and selecting history abnormal monitoring information corresponding to each node from the history monitoring information of each node;
and training an initial fault detection model by utilizing the historical abnormal monitoring information of each node and the fault type of the historical abnormal monitoring information of each corresponding node to obtain a fault detection model.
8. The method of claim 7, wherein the historical anomaly monitoring information for each node includes a type of fault that occurred and monitoring information corresponding to each type of fault when each node is a first type node and a second type node, respectively.
9. The method according to claim 7 or 8, characterized in that the method further comprises:
acquiring real-time monitoring information of a first node in a cluster network, and acquiring abnormal real-time monitoring information of the first node according to the real-time monitoring information;
inputting the abnormal real-time monitoring information of the first node into the fault detection model to obtain first fault information of the first node, wherein the first fault information comprises the estimated fault types of the first node and the fault probability corresponding to each estimated fault type, and
determining the true fault type of the cluster network according to the estimated fault types of the first node and the fault probability corresponding to each estimated fault type.
10. The method of claim 7, wherein the initial fault detection model comprises at least any one of:
a convolutional neural network model, a fully-connected neural network model, or a feed-forward neural network model.
11. An electronic device, the electronic device comprising:
a memory for storing instructions for execution by one or more processors of the electronic device, an
A processor, being one of the processors of an electronic device, for performing the method of any of claims 1 to 6.
12. An electronic device, the electronic device comprising:
a memory for storing instructions for execution by one or more processors of the electronic device, an
A processor, being one of the processors of the electronic device, for performing the method of any of claims 7 to 10.
13. A computer readable storage medium having stored thereon instructions that, when executed on an electronic device, cause the electronic device to perform the method of any of claims 1 to 6.
14. A computer readable storage medium having stored thereon instructions which, when executed on an electronic device, cause the electronic device to perform the method of any of claims 7 to 10.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210776687.2A CN117376084A (en) | 2022-06-30 | 2022-06-30 | Fault detection method, electronic equipment and medium thereof |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210776687.2A CN117376084A (en) | 2022-06-30 | 2022-06-30 | Fault detection method, electronic equipment and medium thereof |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117376084A true CN117376084A (en) | 2024-01-09 |
Family
ID=89404695
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210776687.2A Pending CN117376084A (en) | 2022-06-30 | 2022-06-30 | Fault detection method, electronic equipment and medium thereof |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117376084A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117596126A (en) * | 2024-01-19 | 2024-02-23 | 合肥先进计算中心运营管理有限公司 | Monitoring method for high-speed network abnormality in high-performance cluster |
CN117596126B (en) * | 2024-01-19 | 2024-03-26 | 合肥先进计算中心运营管理有限公司 | Monitoring method for high-speed network abnormality in high-performance cluster |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||