CN113098723B

CN113098723B - Fault root cause positioning method and device, storage medium and equipment

Info

Publication number: CN113098723B
Application number: CN202110629066.7A
Authority: CN
Inventors: 饶思哲
Original assignee: Xinhuasan Artificial Intelligence Technology Co ltd
Current assignee: Xinhuasan Artificial Intelligence Technology Co ltd
Priority date: 2021-06-07
Filing date: 2021-06-07
Publication date: 2021-09-17
Anticipated expiration: 2041-06-07
Also published as: CN113098723A

Abstract

In the disclosed method, when a fault alarm is received, a fault derivation graph is constructed from the abnormal indexes observed in a period of time before and after the fault alarm occurs. The matching degree between the fault derivation graph and each established fault uncertainty graph, which corresponds to a fault case whose root cause has already been determined, is then computed, and the root cause of the fault alarm is determined according to the matching degree. Because the fault uncertainty graph of a fault case is generated by merging a plurality of fault derivation graphs generated for the same fault alarm under different application scenarios, it covers all scenarios in which the fault alarm may occur. The influence of uncertain factors such as network delay, jitter, and device abnormality on root cause locating is thereby greatly reduced, and fault tolerance is improved.

Description

Fault root cause positioning method and device, storage medium and equipment

Technical Field

The present disclosure relates to the field of computer technologies, and in particular, to a method, an apparatus, a storage medium, and a device for locating a fault root cause.

Background

Locating the root cause of a fault is an important part of ensuring the reliability and security of a network system. When a node in the network system fails, the root cause must be located quickly so that the network system can be recovered effectively. In the related art, when a new fault occurs, the network system compares the new fault with the faults in a case library and outputs the root cause of the most similar fault, thereby completing root cause locating. In the case library, the faults analyzed by administrators, their root causes, and the related information are all deterministic. In practical applications, however, uncertain factors such as network delay, jitter, and device abnormality may cause all or part of the abnormal indexes of some devices to be falsely reported or missed. This can greatly affect the comparison result and lead to root cause locating errors.

Disclosure of Invention

To overcome the problems in the related art, the present specification provides a method, an apparatus, a storage medium, and a device for locating a fault root cause.

According to a first aspect of embodiments of the present specification, there is provided a fault root cause locating method, including:

when a fault alarm is received, acquiring an abnormal index causing the fault alarm; the abnormal indexes comprise abnormal indexes of at least one device in a first specified time period before the fault alarm occurs and abnormal indexes of at least one device in a second specified time period after the fault alarm occurs;

constructing a fault derivation graph according to the acquired abnormal indexes; the two nodes on any directed edge in the fault derivation graph represent two of the acquired abnormal indexes, the directed edge indicates the association relationship between the abnormal indexes represented by its two nodes, and the probability on the edge represents the probability that one of the abnormal indexes causes the other;

determining the fault root cause corresponding to the fault alarm according to the matching degree between the fault derivation graph and the fault uncertainty graphs corresponding to the fault cases in an established fault case library; the fault uncertainty graph of each fault case in the fault case library corresponds to a determined fault root cause, and is generated by merging a plurality of fault derivation graphs generated for the same fault alarm under different application scenarios.

In some examples, the constructing a fault derivation graph according to the acquired abnormal indexes includes:

taking the fault alarm as the initial node of the fault derivation graph;

matching, according to a pre-specified rule, the abnormal indexes of at least one device in the first specified time period before the fault alarm occurs and the abnormal indexes of at least one device in the second specified time period after the fault alarm occurs, to obtain the intermediate nodes and leaf nodes of the fault derivation graph and thereby construct the fault derivation graph; the pre-specified rule is used for determining the association relationships existing among the abnormal indexes and the probabilities of those association relationships.

In some examples, before determining the fault root cause corresponding to the fault alarm according to the matching degree between the fault derivation graph and the fault uncertainty graphs corresponding to the fault cases in the established fault case library, the method further includes: acquiring the features corresponding to the fault uncertainty graph of each fault case in the fault case library; the features corresponding to a fault uncertainty graph include an adjacency matrix of the fault uncertainty graph;

the determining the fault root cause corresponding to the fault alarm according to the matching degree between the fault derivation graph and the fault uncertainty graphs corresponding to the fault cases in the established fault case library includes:

encoding the fault derivation graph and determining the fault features corresponding to the fault derivation graph by means of information propagation;

concatenating the fault features with the acquired features corresponding to the fault uncertainty graph of each fault case in the fault case library to obtain concatenated features, and inputting the concatenated features into a trained matching degree prediction model to obtain the matching degree between the fault derivation graph and each fault uncertainty graph;

and determining the fault root cause corresponding to the fault alarm based on the determined fault root cause corresponding to the fault uncertainty graph whose matching degree satisfies a specified condition.

In some examples, the encoding the fault derivation graph and determining the fault features corresponding to the fault derivation graph by means of information propagation includes:

encoding the fault derivation graph in a preset feature encoding manner to obtain the initial feature of each node in the fault derivation graph; the preset feature encoding manner includes encoding the device position and the abnormality type of the abnormal index corresponding to the node;

and propagating the information of the initial features of the leaf nodes in the fault derivation graph step by step to the initial node in a bottom-up manner, and taking the feature of the initial node after propagation as the fault feature corresponding to the fault derivation graph.

In some examples, the features of the fault uncertainty graph are derived from a second-order adjacency matrix of the fault uncertainty graph, and the second-order adjacency matrix is obtained by calculating expected values over all possibilities contained in the fault uncertainty graph.

In some examples, the matching degree prediction model is trained on positive sample features and negative sample features. The positive sample features are obtained by concatenating the features of the fault uncertainty graph of a fault case with the features of subgraphs sampled from the same fault uncertainty graph, and the negative sample features are obtained by concatenating the features of the fault uncertainty graph of a fault case with the features of subgraphs sampled from different fault uncertainty graphs; the node set and the edge set of a subgraph are subsets of the node set and the edge set, respectively, of the fault uncertainty graph; the features of a subgraph are obtained in the same way as the fault features of the fault derivation graph.

In some examples, the fault root cause corresponding to the fault alarm is determined based on the fault root cause corresponding to the fault uncertainty graph with the highest matching degree among the fault uncertainty graphs; if the fault uncertainty graph with the highest matching degree has a plurality of fault root causes, the fault root cause corresponding to the fault alarm is determined based on the priority of each fault root cause of that fault uncertainty graph.

According to a second aspect of embodiments herein, there is provided a fault root cause locating device, including:

an index acquisition module, configured to acquire, when a fault alarm is received, the abnormal indexes causing the fault alarm; the abnormal indexes comprise abnormal indexes of at least one device in a first specified time period before the fault alarm occurs and abnormal indexes of at least one device in a second specified time period after the fault alarm occurs;

a fault derivation module, configured to construct a fault derivation graph according to the acquired abnormal indexes; the two nodes on any directed edge in the fault derivation graph represent two of the acquired abnormal indexes, the directed edge indicates the association relationship between the abnormal indexes represented by its two nodes, and the probability on the edge represents the probability that one of the abnormal indexes causes the other;

a root cause locating module, configured to determine the fault root cause corresponding to the fault alarm according to the matching degree between the fault derivation graph and the fault uncertainty graphs corresponding to the fault cases in an established fault case library; the fault uncertainty graph of each fault case in the fault case library corresponds to a determined fault root cause, and is generated by merging a plurality of fault derivation graphs generated for the same fault alarm under different application scenarios.

In some examples, the index obtaining module is specifically configured to:

take the fault alarm as the initial node of the fault derivation graph;

In some examples, the apparatus further comprises:

a case feature module, configured to acquire the features corresponding to the fault uncertainty graph of each fault case in the fault case library; the features corresponding to a fault uncertainty graph include an adjacency matrix of the fault uncertainty graph;

the root cause locating module includes:

a determining submodule, configured to encode the fault derivation graph and determine the fault features corresponding to the fault derivation graph by means of information propagation;

a concatenation submodule, configured to concatenate the fault features with the acquired features corresponding to the fault uncertainty graph of each fault case in the fault case library to obtain concatenated features, and to input the concatenated features into a trained matching degree prediction model to obtain the matching degree between the fault derivation graph and each fault uncertainty graph;

and a locating submodule, configured to determine the fault root cause corresponding to the fault alarm based on the determined fault root cause corresponding to the fault uncertainty graph whose matching degree satisfies a specified condition.

In some examples, the determining sub-module is specifically configured to:

According to a third aspect of embodiments of the present specification, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs any one of the methods of the embodiments of the specification.

According to a fourth aspect of embodiments herein, there is provided a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements any of the methods in the embodiments herein when executing the program.

The technical scheme provided by the embodiment of the specification can have the following beneficial effects:

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the specification.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present specification and together with the description, serve to explain the principles of the specification.

FIG. 1 is a flowchart of a fault root cause locating method according to an exemplary embodiment of the present specification;

FIG. 2 is a schematic diagram of the uncertainty graph of a historical fault according to an exemplary embodiment of the present specification;

FIG. 3 is a schematic diagram of the process of zero-padding an adjacency matrix with top-left alignment according to an exemplary embodiment of the present specification;

FIG. 4 is a schematic diagram of the derivation graph of a new fault according to an exemplary embodiment of the present specification;

FIG. 5 is a schematic diagram of some of the subgraphs obtained by sampling the uncertainty graph of the historical fault shown in FIG. 2 according to an exemplary embodiment of the present specification;

FIG. 6 is a hardware block diagram of a computer device in which a fault root cause locating device is located according to an exemplary embodiment of the present specification;

FIG. 7 is a block diagram of a fault root cause locating device according to an exemplary embodiment of the present specification.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present specification. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the specification, as detailed in the appended claims.

The terminology used in the description herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the description. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.

It should be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, the first information may also be referred to as second information, and similarly, the second information may also be referred to as first information, without departing from the scope of the present specification. The word "if" as used herein may be interpreted as "upon …", "when …", or "in response to determining", depending on the context.

Locating the root cause of a fault is an important part of ensuring the reliability and security of a network system. When a node in the network system fails, the root cause must be located quickly so that the network system can be recovered effectively. An existing network system generally runs a complete log system, which records the running state, operation records, alarm states, alarm recovery conditions, and the like of the network system in real time and provides administrators with data for analysis. In the related art, when a new fault occurs, the network system compares the new fault with the faults in a case library and outputs the root cause of the most similar fault, thereby completing root cause locating. In the case library, the faults analyzed by administrators, their root causes, and the related information are all deterministic. In practical applications, however, uncertain factors such as network delay, jitter, and device abnormality may cause all or part of the abnormal indexes of some devices to be falsely reported or missed. This can greatly affect the comparison result and lead to root cause locating errors.
For example, suppose one case in the case library records that abnormal indexes A, B, and C lead to fault D, and that the root cause of fault D is abnormal index A. In a practical application, because of network delay, the network system may not receive abnormal index A within the time span used for root cause locating and so receives only abnormal indexes B, C, and D. When the network system then compares the new fault, based on abnormal indexes B, C, and D, with the fault cases in the case library, the similarity to the fault case containing abnormal indexes A, B, C, and D may be lower than the similarity to a case of another fault type. For example, the most similar case identified may be the one corresponding to fault E, so that the root cause finally located is the root cause of fault E. This easily interferes with the subsequent recovery of the network system and can cause immeasurable losses. Based on this, the present specification provides a fault root cause locating scheme to solve the above problems.

The following provides a detailed description of examples of the present specification.

The fault root cause locating method in the embodiments of this specification may be applied to communication networks in various scenarios such as industry, finance, and enterprise. The communication network may be composed of a plurality of devices, and the method may be applied to a device dedicated to handling fault alarms or to any other device, which is not limited in this specification.

FIG. 1 is a flowchart illustrating a fault root cause locating method according to an exemplary embodiment. As shown in FIG. 1, the method includes:

In step 101, when a fault alarm is received, acquiring the abnormal indexes causing the fault alarm; the abnormal indexes comprise abnormal indexes of at least one device in a first specified time period before the fault alarm occurs and abnormal indexes of at least one device in a second specified time period after the fault alarm occurs;

Generally, when the number of indexes in which an abnormality occurs exceeds an alarm threshold, it is determined that a new fault has occurred, and a fault alarm is raised according to the fault condition. The alarm threshold may be set according to the complexity of the specific networking or according to the experience of operation and maintenance personnel, which is not limited in this specification. A fault may involve a plurality of abnormal indexes: the abnormal index corresponding to the fault alarm is the fault, and the abnormal index with the greatest influence on causing the fault is the root cause. In other words, in root cause locating, the fault and the root cause are both abnormal indexes, but an abnormal index is not necessarily a fault or a root cause. For example, if abnormal index A causes abnormal index B, abnormal index B causes abnormal index C, and abnormal index C is the abnormal index corresponding to the fault alarm, then abnormal index C is the fault and abnormal index A is the root cause of the fault.

Root cause locating refers to determining, among the plurality of abnormal indexes corresponding to a fault, the most fundamental cause of the fault, so as to facilitate repair. In some examples, the abnormal indexes may include TCP-related indexes reflecting the network health of a device, hardware-related indexes reflecting the health of the device itself, and the like. The TCP-related indexes include network rate, bandwidth, throughput, number of TCP connections, and the like, and the hardware-related indexes include CPU usage, memory usage, disk space usage, and the like. It should be noted that, in addition to the performance indexes detected by the device, the abnormal indexes in this embodiment may include indexes generated by the device when it determines that a new fault has occurred. For example, the device may monitor multiple CPU-related performance indexes with the alarm threshold set to 1. When it detects that both the CPU load rate and the CPU utilization rate are abnormal, the device determines that a new fault has occurred and may generate a new abnormal index "CPU abnormal" to report as fault alarm information; the device may of course also generate other abnormal indexes according to the specific conditions of the detected abnormal indexes. In addition, if the solution of this embodiment is applied to a device that monitors the performance of all devices in the communication network, the acquired abnormal indexes may include abnormal indexes generated on other devices in addition to those generated by the device itself.
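The threshold check in the CPU example can be sketched as follows. This is only an illustration under stated assumptions: the function name `raise_alarm` and the index names are hypothetical, and the specification does not prescribe any particular implementation.

```python
from typing import Optional, Sequence


def raise_alarm(abnormal: Sequence[str], alarm_threshold: int,
                alarm_name: str) -> Optional[str]:
    """Return a derived abnormal index (the fault alarm) when the number of
    abnormal indexes exceeds the alarm threshold, otherwise None."""
    if len(abnormal) > alarm_threshold:
        return alarm_name
    return None


# Two abnormal CPU indexes with a threshold of 1: a new fault is declared.
alarm = raise_alarm(["cpu_load_rate", "cpu_utilization_rate"],
                    alarm_threshold=1, alarm_name="CPU abnormal")

# A single abnormal index does not exceed the threshold: no alarm.
no_alarm = raise_alarm(["cpu_load_rate"],
                       alarm_threshold=1, alarm_name="CPU abnormal")
```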

In practical applications, reporting an abnormal index takes a certain amount of time, so there may be a time difference between when the abnormal indexes related to a new fault are acquired and when the fault alarm of the new fault is determined. For this reason, the acquired abnormal indexes associated with the new fault include the abnormal indexes of at least one device in a first specified time period before the fault alarm occurs and the abnormal indexes of at least one device in a second specified time period after the fault alarm occurs, that is, the abnormal indexes within a preset time threshold before and after the new fault occurs. The first specified time period and the second specified time period may each be 2 min, 1 min, 3 min, and so on, and may be the same as or different from each other, which is not limited in this embodiment. It should be noted that the preset time threshold should be adapted to the requirements of the actual scenario. If it is too long, many unrelated abnormal indexes are likely to be acquired, which affects the efficiency of root cause locating; if it is too short, strongly correlated abnormal indexes are likely to be missed, which affects the accuracy of root cause locating.
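A minimal sketch of this time-window selection is given below. The `AbnormalIndex` record, the `collect_window` helper, the device names, and the timestamps are all hypothetical; the 2 min defaults are just one of the values mentioned above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass(frozen=True)
class AbnormalIndex:
    device: str
    name: str
    timestamp: datetime


def collect_window(indexes: List[AbnormalIndex], alarm_time: datetime,
                   before: timedelta = timedelta(minutes=2),
                   after: timedelta = timedelta(minutes=2)) -> List[AbnormalIndex]:
    """Keep the abnormal indexes reported within the first specified period
    before the alarm and the second specified period after it."""
    start, end = alarm_time - before, alarm_time + after
    return [i for i in indexes if start <= i.timestamp <= end]


alarm_time = datetime(2021, 6, 7, 12, 0, 0)
observed = [
    AbnormalIndex("dev1", "memory_usage", alarm_time - timedelta(seconds=90)),
    AbnormalIndex("dev1", "cpu_usage", alarm_time + timedelta(seconds=30)),
    AbnormalIndex("dev2", "disk_usage", alarm_time - timedelta(minutes=10)),
]
selected = collect_window(observed, alarm_time)
```

With these values, the index reported ten minutes before the alarm falls outside the window and is dropped, while the late-arriving `cpu_usage` index is still captured by the second period.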

In step 102, a fault derivation graph is constructed according to the acquired abnormal indexes; the two nodes on any directed edge in the fault derivation graph represent two of the acquired abnormal indexes, the directed edge indicates the association relationship between the abnormal indexes represented by its two nodes, and the probability on the edge represents the probability that one of the abnormal indexes causes the other;

In this step, a fault derivation graph is used to represent the association relationships between the abnormal indexes corresponding to the new fault. A graph is a structure used to describe a set of objects; the objects correspond to the vertices of the graph, also called nodes, and each edge of the graph represents an associated pair of vertices. In this embodiment, the nodes of the fault derivation graph are the abnormal indexes corresponding to the new fault, and the edges represent the pairs of abnormal indexes that have an association relationship. Unlike an ordinary graph, each edge of the fault derivation graph carries a value that represents the probability of the association relationship between the two abnormal indexes and may also be regarded as a weight.

In some examples, the value of an edge may be determined based on preset rules in a rule base. The rule base may be established by operation and maintenance experts, who analyze and summarize a large number of faults recorded in historical logs, together with related data such as their abnormal indexes, to obtain a series of association and derivation relationships between abnormal indexes and a probability value for each relationship. The preset rules may be stored in the rule base in a structured manner. For example, a preset rule that records the relationship between a pair of abnormal indexes may record, in order, the names of the abnormalities, the condition between devices that must be satisfied for the rule to take effect, the existence probability of the rule, the maximum time span that must be satisfied for the rule to take effect, and the like, to form one record. Taking the example that a memory abnormality has a 60% probability of causing a CPU abnormality, the preset rule may be stored as "MEMORY_abnormal(index_A) -> CPU_abnormal(index_B), satisfy: index_A.device == index_B.device, probability = 60%, time_range = 60", where satisfy represents the condition between devices that the preset rule must meet and time_range represents the maximum time span (in seconds) that must be satisfied for the preset rule to take effect; that is, if the time difference between the two abnormal indexes exceeds the maximum time span, the two abnormal indexes are considered unrelated. In this way, a plurality of rules may be stored in a file in JSON or tabular format, or stored in a database in structured form, to form a rule base that is easy to access.
That is, when it is determined that a new fault has occurred, the abnormal indexes corresponding to the new fault may be matched against the preset rules in the rule base to obtain the probabilities of the association relationships among them, and the fault derivation graph is then generated. It should be noted that an abnormal index that matches no preset rule in the rule base may be ignored. For example, suppose the five abnormal indexes corresponding to a new fault are A, B, C, D, and E, where A is the abnormal index corresponding to the fault alarm. If the rule base contains no rule relating E to any of the other four abnormal indexes, E cannot be matched and is most likely unrelated to the current fault alarm, so it may be ignored and the fault derivation graph is constructed only from abnormal indexes A, B, C, and D.
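The rule record and the matching step described above might be sketched as follows. The `Rule` and `Index` classes, the `match_rules` function, and the abnormality names are hypothetical, and the "satisfy" condition is reduced here to a single same-device check as a simplifying assumption; a real rule base could express richer conditions.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Index:
    name: str     # abnormality name (hypothetical naming)
    device: str
    time: float   # report time, in seconds


@dataclass(frozen=True)
class Rule:
    cause: str
    effect: str
    same_device: bool    # simplified stand-in for the "satisfy" condition
    probability: float   # existence probability of the relationship
    time_range: float    # maximum time span, in seconds


def match_rules(indexes: List[Index],
                rules: List[Rule]) -> List[Tuple[Index, Index, float]]:
    """Return directed edges (cause, effect, probability) for every ordered
    pair of abnormal indexes that satisfies some rule."""
    edges = []
    for a in indexes:
        for b in indexes:
            if a is b:
                continue
            for r in rules:
                if (a.name, b.name) != (r.cause, r.effect):
                    continue
                if r.same_device and a.device != b.device:
                    continue
                if abs(a.time - b.time) > r.time_range:
                    continue
                edges.append((a, b, r.probability))
    return edges


rules = [Rule("MEMORY_abnormal", "CPU_abnormal", True, 0.6, 60)]
indexes = [
    Index("MEMORY_abnormal", "dev1", 0.0),
    Index("CPU_abnormal", "dev1", 40.0),
    Index("DISK_abnormal", "dev1", 10.0),
]
edges = match_rules(indexes, rules)

# Indexes that take part in no matched rule are ignored when building the graph.
matched = {i for a, b, _ in edges for i in (a, b)}
ignored = [i.name for i in indexes if i not in matched]
```

Here the memory and CPU abnormalities occur on the same device 40 seconds apart, so the rule fires and yields one edge with probability 0.6, while the disk abnormality matches no rule and is left out.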

The fault derivation graph of this embodiment is a directed graph; that is, the association relationships between abnormal indexes in this embodiment are directed and can be regarded as causal relationships. For example, when a device detects that the two indexes "total amount of free memory" and "amount of cached memory" are abnormal, it determines that a new fault has occurred and generates a new abnormal index, "memory abnormal", which is reported as the fault alarm. There is then an association relationship between the abnormal index "total amount of free memory" and the abnormal index "memory abnormal", namely that the former causes the latter; similarly, there is an association relationship between the abnormal index "amount of cached memory" and the abnormal index "memory abnormal", namely that the former causes the latter.
Moreover, different abnormal indexes correspond to different services and differ in their degree of influence, so different association relationships have their own probabilities. When the fault derivation graph of the new fault is generated, if one preset rule states that the abnormal index "total amount of free memory" causes the abnormal index "memory abnormal" with a probability of 80%, and another states that the abnormal index "amount of cached memory" causes the abnormal index "memory abnormal" with a probability of 60%, the probabilities of the two association relationships above are determined to be 80% and 60%, respectively. The values on the edges of the node pair ("total amount of free memory", "memory abnormal") and the node pair ("amount of cached memory", "memory abnormal") in the fault derivation graph are then 0.8 and 0.6, respectively. As a further example, if the edge of the node pair (A, B) in the fault derivation graph is directed from node A to node B, node A is the node the edge points away from and node B is the node the edge points to; node A is therefore an upstream node of node B, node B is a downstream node of node A, and the probability on the edge is the probability that the abnormal index represented by node A causes the abnormal index represented by node B.
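The memory example can be written down as a small directed, weighted graph. The dictionary-of-edges representation and the helper names `upstream` and `downstream` are illustrative choices, not something the specification mandates.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]  # (upstream node, downstream node)

# Each directed edge maps to the probability that its source index
# causes its target index.
edges: Dict[Edge, float] = {
    ("total amount of free memory", "memory abnormal"): 0.8,
    ("amount of cached memory", "memory abnormal"): 0.6,
}


def upstream(node: str, edges: Dict[Edge, float]) -> List[str]:
    """Nodes with an edge pointing to `node`."""
    return sorted(src for (src, dst) in edges if dst == node)


def downstream(node: str, edges: Dict[Edge, float]) -> List[str]:
    """Nodes that `node` has an edge pointing to."""
    return sorted(dst for (src, dst) in edges if src == node)


causes_of_alarm = upstream("memory abnormal", edges)
effects_of_free_memory = downstream("total amount of free memory", edges)
```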

In addition, when the fault derivation graph is constructed from the acquired abnormal indexes, the fault alarm may be taken as the initial node of the fault derivation graph, and then, according to a pre-specified rule, the abnormal indexes of at least one device in the first specified time period before the fault alarm occurs and the abnormal indexes of at least one device in the second specified time period after the fault alarm occurs are matched to obtain the intermediate nodes and leaf nodes of the fault derivation graph, thereby constructing the fault derivation graph. The initial node is a node without downstream nodes; an intermediate node is a node that has both upstream and downstream nodes; a leaf node is a node without upstream nodes. In this embodiment, the edges between the initial node and intermediate nodes, between the initial node and leaf nodes, between intermediate nodes, and between intermediate nodes and leaf nodes are all obtained based on the pre-specified rule, which may be a preset rule in the aforementioned rule base. By matching against the rule base, the association relationships among the abnormal indexes corresponding to the fault alarm and the probability of each association relationship can be determined, yielding the edges between nodes and the values of those edges.
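The three node roles follow directly from the edge directions, as the sketch below illustrates for a chain in which A causes B and B causes the alarm C. The function name `classify_nodes` and the edge probabilities 0.7 and 0.9 are made up for the example, and the sketch assumes a well-formed graph in which the fault alarm is the only node without downstream nodes.

```python
from typing import Dict, Tuple

Edge = Tuple[str, str]  # (upstream node, downstream node)


def classify_nodes(edges: Dict[Edge, float]) -> Dict[str, str]:
    """Label each node as 'initial' (no downstream nodes), 'leaf'
    (no upstream nodes), or 'intermediate' (both)."""
    nodes = {n for edge in edges for n in edge}
    has_downstream = {src for src, _ in edges}
    has_upstream = {dst for _, dst in edges}
    roles = {}
    for n in nodes:
        if n not in has_downstream:
            roles[n] = "initial"
        elif n not in has_upstream:
            roles[n] = "leaf"
        else:
            roles[n] = "intermediate"
    return roles


# Hypothetical probabilities on a chain A -> B -> C, where C is the alarm.
chain = {("A", "B"): 0.7, ("B", "C"): 0.9}
roles = classify_nodes(chain)
```

In this chain the alarm C is the initial node and A, which has no upstream node, is the leaf node.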

In step 103, the fault root cause corresponding to the fault alarm is determined according to the matching degree between the fault guidance graph and the fault uncertain graphs corresponding to the fault cases in the established fault case library; the fault uncertain graph of each fault case in the fault case library corresponds to a determined fault root cause, and is generated by merging a plurality of fault guidance graphs generated for the same fault alarm under different application scenes.

The fault case library of this embodiment may be established from the results obtained by operation and maintenance experts analyzing and summarizing information related to the handling of past faults. The same fault may correspond to different root causes in different scenes; for example, a fault of the type "CPU load high" may be caused by abnormal CPU utilization or by an abnormal proportion of CPU idle time. For a plurality of scenes in which the same fault may occur, the case base of the related art stores a plurality of cases, distinguished by their occurrence backgrounds and their root causes. In the fault case library of this embodiment, by contrast, one fault case records the plurality of scenes in which a fault type may occur, and the fault uncertain graph of each fault case is generated by merging a plurality of fault guidance graphs generated for the same fault alarm under different application scenes; that is, in this embodiment, all the scenes in which the same fault may occur are merged to generate one fault uncertain graph.

An uncertain graph is a graph with uncertainty: each edge of the uncertain graph usually carries a real number within [0, 1], called the existence probability, which indicates the probability that the edge actually exists. The fault uncertain graph in this step is a derivative graph with uncertainty, where the uncertainty arises because the fault uncertain graph covers all the scenes in which the same fault may occur, so some edges in the fault uncertain graph do not necessarily exist in a given practical situation. In the fault uncertain graph, the nodes are all the abnormal indexes corresponding to the fault case, and the value on the edge of each node pair is the probability of the association relation that may exist between the two abnormal indexes. The fault uncertain graph may be constructed by taking the fault alarm as the initial node of the graph and matching, according to the rules in the rule base, the abnormal indexes within a specified time period before and after the fault alarm occurs in a scene to generate fault guidance graphs, where each fault guidance graph corresponds to one scene, and then merging the fault guidance graphs based on the same fault alarm to form the fault uncertain graph. For example, a fault alarm of "CPU exception" may actually correspond to two application scenes, one caused by high CPU load and the other caused by high CPU utilization. In this embodiment, a fault guidance graph is generated for each of the two application scenes, and the two fault guidance graphs are then merged based on the fault alarm to generate the fault uncertain graph corresponding to the fault alarm of "CPU exception"; this fault uncertain graph shows two derivation paths, from "CPU load high" to "CPU exception" and from "CPU utilization high" to "CPU exception".
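
The merging of per-scene graphs for the "CPU exception" alarm can be sketched as an edge union. This is only an illustration: `merge_derivation_graphs` is a hypothetical helper, the probabilities are made up, and keeping the larger probability when two scenes disagree on an edge is an assumption, since the specification does not state how such conflicts are resolved.

```python
def merge_derivation_graphs(graphs):
    """Union the edges of several derivation graphs of one fault alarm.

    Each graph is a dict {(upstream, downstream): probability}.
    Assumption: if two scenes give different probabilities for the same
    edge, the larger one is kept.
    """
    merged = {}
    for edges in graphs:
        for pair, probability in edges.items():
            merged[pair] = max(probability, merged.get(pair, 0.0))
    return merged


# One derivation graph per application scene (illustrative probabilities).
scene_load = {("CPU load high", "CPU exception"): 0.8}
scene_util = {("CPU utilization high", "CPU exception"): 0.7}
uncertain_graph = merge_derivation_graphs([scene_load, scene_util])
```

The merged graph contains both derivation paths into "CPU exception", one per scene.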

Each fault case in the fault case library is analyzed and processed in advance, the fault uncertain graph of each fault case corresponds to the determined fault root cause, namely the fault root cause of the fault uncertain graph is analyzed and verified, and therefore the matching degree of the fault derivation graph of the new fault and the fault uncertain graph of each fault case can be used for determining the fault root cause of the new fault. The matching degree here may refer to a correlation degree between the failure guidance diagram of the new failure and the failure uncertainty diagram of the failure case, and may be determined based on respective node sets and edge sets of the two diagrams, for example, if the failure guidance diagram belongs to a sub-diagram of the failure uncertainty diagram of a certain failure case, that is, the node set and the edge set of the failure guidance diagram are respectively subsets of the node set and the edge set of the failure uncertainty diagram of the failure case, then the matching degree of the two diagrams is high at this time.
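
The subgraph condition mentioned above reduces to two subset tests. The sketch below assumes a graph is represented as a pair of a node set and an edge set; the helper name `is_subgraph` and the chain A, B, C, D are hypothetical.

```python
def is_subgraph(graph, uncertain_graph):
    """Return True if the node set and the edge set of `graph` are subsets
    of those of `uncertain_graph`. Each graph is (nodes, edges), where
    edges are (upstream, downstream) tuples."""
    nodes, edges = graph
    uncertain_nodes, uncertain_edges = uncertain_graph
    return nodes <= uncertain_nodes and edges <= uncertain_edges


# Hypothetical fault case: A causes B, B causes C, C causes the alarm D.
case_graph = ({"A", "B", "C", "D"}, {("A", "B"), ("B", "C"), ("C", "D")})
# New fault in which index A was not reported.
new_fault = ({"B", "C", "D"}, {("B", "C"), ("C", "D")})
unrelated = ({"X", "D"}, {("X", "D")})
```

Here `new_fault` is a subgraph of `case_graph` even though index A is missing, whereas `unrelated` is not.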

After the matching degree between the fault guidance graph of the new fault and the fault uncertain graph of each fault case is determined, a derivation path can be searched for in the fault uncertain graph with the highest matching degree according to the abnormal indexes corresponding to the new fault, and the fault root cause of the new fault is then determined based on the fault root cause located on the determined derivation path. Following the foregoing example, in a case where the abnormal index A is missed due to network delay, a derivation path can be determined in the fault uncertain graph with the highest matching degree based on the abnormal indexes B, C, and D of the new fault. This derivation path indicates that the abnormal indexes A, B, and C lead to the fault D and that the root cause of the fault D is the abnormal index A, so the root cause of the current new fault can be determined to be the abnormal index A, and a root cause positioning error caused by uncertain factors such as network delay, jitter, and device abnormality is avoided.

In the fault root cause positioning method in the embodiments of the present description, when a fault alarm is received, a fault guidance graph is constructed based on the abnormal indexes in a period of time before and after the fault alarm occurs, the matching degree between the fault guidance graph and each established fault uncertain graph corresponding to a fault case with a determined fault root cause is determined, and the fault root cause of the fault alarm is then determined according to the matching degree. Because the fault uncertain graph of a fault case is generated from the fault derivation graphs of the same fault alarm in each application scene in which it may occur, all the scenes in which the fault alarm may occur are covered, the influence of uncertain factors such as network delay, jitter, and device abnormality on root cause positioning is greatly reduced, and the fault tolerance is stronger.

Further, in an optional embodiment, in order to facilitate comparison of matching degrees between the fault derivative graph and the fault uncertainty graph of each fault case, graph features may be extracted from the graph, and then the graph features may be compared by using a trained model. The features of the uncertain fault map of the fault case may be recorded in the fault case library, so that before step 103 is executed, the features corresponding to the uncertain fault map of each fault case in the fault case library may be obtained. The corresponding feature of the fault uncertainty map of the fault case may include an adjacency matrix of the fault uncertainty map, where the adjacency matrix is a matrix representing the adjacency relationship between vertices of the map, and generally, the dimension of rows and columns of the adjacency matrix is equal to the number of nodes of the map, and if the map has N vertices, the adjacency matrix of the map is an N-th-order square matrix, and the values in the square matrix are determined based on the weights of the edges between the vertex pairs. At this time, step 103 may include:

step 1031, encoding the fault guidance diagram and determining the fault characteristics corresponding to the fault guidance diagram according to an information transmission mode;

As for the fault characteristics of the fault guidance graph: the fault guidance graph comprises a plurality of nodes and the association relations between node pairs, so its fault characteristics are formed from the characteristics of the individual nodes. After the fault guidance graph is encoded, the characteristics of each node are generated, and the fault characteristics of the fault guidance graph can then be obtained by information transmission. The encoding applied to the fault derivation graph may be a preset characteristic encoding mode. The nodes of the derivative graph are the abnormal indexes of the new fault, and each abnormal index occurs on a certain device, so the characteristics of a node are described by the device position and the abnormality type of its abnormal index. Because a single fault rarely causes a large number of indexes on a large number of devices to become abnormal at the same time, the preset characteristic encoding mode may include a device code, which encodes the device position of the node based on a first constraint number, and an index code, which encodes the abnormality type of the node based on a second constraint number.
Here, the device code and the index code may be similar to one-hot encoding. Assuming that the first constraint number is 5, that is, at most 5 devices are allowed, if node X in the derivative graph is an abnormal index occurring on device A, the device code of node X is (1, 0, 0, 0, 0), and if node Y in the derivative graph is an abnormal index occurring on device B, the device code of node Y is (0, 1, 0, 0, 0). Assuming that the second constraint number is 3, that is, the indexes are restricted to the 3 common indexes P, Q, and S, if node X is a P abnormality, the index code of node X is (1, 0, 0), and if node Y is a Q abnormality, the index code of node Y is (0, 1, 0). The device code and the index code of a node are combined to obtain the characteristic of the node, whose dimension is the sum of the first constraint number and the second constraint number. It should be noted that, in most scenes, a single fault causes abnormalities on no more than 5 devices, so the first constraint number may be set to 5; 95 common abnormal indexes can be summarized from actual data, so the second constraint number may be set to 95. Of course, the first constraint number and the second constraint number may be set according to the specific scene, for example according to networking complexity.
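
The node encoding above can be illustrated with the small constraint numbers from the example (5 devices, 3 indexes). The function names `one_hot` and `encode_node` are hypothetical, and mapping devices and indexes to integer positions is an assumption made for the sketch.

```python
def one_hot(index, size):
    """Return a list of `size` zeros with a single 1 at position `index`."""
    if not 0 <= index < size:
        raise ValueError("index exceeds the constraint number")
    vector = [0] * size
    vector[index] = 1
    return vector


def encode_node(device_index, indicator_index, max_devices=5, max_indicators=3):
    """Concatenate the device code and the index code of one node."""
    return one_hot(device_index, max_devices) + one_hot(indicator_index, max_indicators)


# Node X: index P (position 0) abnormal on device A (position 0).
node_x = encode_node(0, 0)
# Node Y: index Q (position 1) abnormal on device B (position 1).
node_y = encode_node(1, 1)
```

With the values suggested in the text (5 devices and 95 indexes), the same call would yield 100-dimensional node characteristics.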

Determining the fault characteristics corresponding to the fault guidance diagram according to the information transmission mode may refer to gradually transmitting the characteristic information of the leaf nodes in the fault guidance diagram to the initial node in a bottom-up mode, so that the transmitted characteristics of the initial node are used as the fault characteristics corresponding to the fault guidance diagram. That is, the fault characteristics of the fault guidance diagram may be represented by the characteristic information of the abnormality corresponding to the fault alarm and the characteristic information of all the sub-abnormalities included in the fault guidance diagram. The characteristics of each node obtained after encoding are regarded as initial characteristics, and the final characteristics of each node can be obtained in an information transmission mode, wherein the final characteristics of the leaf nodes are the initial characteristics of the leaf nodes, and the final characteristics of the intermediate nodes and the initial nodes can be calculated based on the following formula:

$$h_v = x_v + \sum_{u \in N(v)} p_{(u,v)} \, h_u$$

where $h_v$ is the final characteristic of the target node $v$; $x_v$ is the initial characteristic of the target node; $N(v)$ is the set of upstream nodes of the target node; $h_u$ is the final characteristic of node $u$; and $p_{(u,v)}$ is the probability of the edge corresponding to the node pair $(u, v)$.

Therefore, the final characteristic of the initial node can be calculated in a bottom-up manner, thereby obtaining the fault characteristics of the fault derivation graph. Of course, in other embodiments, the fault characteristics of the fault derivation graph may also be generated by methods such as node2vec and DeepWalk, which is not limited in this specification.

Step 1032, splicing the fault characteristics with the obtained characteristics corresponding to the fault uncertain graph of each fault case in the fault case library to obtain spliced characteristics, and inputting the spliced characteristics into a trained matching degree prediction model to obtain the matching degree between the fault guidance graph and each fault uncertain graph;

after the fault characteristics of the fault guidance diagram and the characteristics corresponding to each fault uncertain diagram are obtained, the dimensions of the two characteristics may be different, so that the fault characteristics of the fault guidance diagram and the characteristics corresponding to each fault uncertain diagram can be converted into the characteristics with the same dimensions and then added to obtain the splicing characteristics. For example, the fault feature of the fault guidance diagram may be a 100-dimensional feature, and the feature of the fault uncertainty diagram may be a 400-dimensional feature, and then the fault feature of the fault guidance diagram and the feature of the fault uncertainty diagram may be converted into 500-dimensional features and added to obtain a 500-dimensional concatenation feature, so that the concatenation feature may indicate the feature of the fault guidance diagram and the feature of the fault uncertainty diagram at the same time.
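
One way to read "converted into the same dimension and then added" is sketched below. It assumes that the conversion is zero-padding into disjoint positions, which makes the addition equivalent to concatenation; the specification does not spell out the conversion, and `splice` is a hypothetical name.

```python
def splice(fault_feature, uncertain_feature):
    """Zero-pad both features to the combined dimension, then add them.

    Assumption: the fault feature occupies the leading positions and the
    uncertain-graph feature occupies the trailing positions.
    """
    padded_fault = list(fault_feature) + [0.0] * len(uncertain_feature)
    padded_uncertain = [0.0] * len(fault_feature) + list(uncertain_feature)
    return [a + b for a, b in zip(padded_fault, padded_uncertain)]


spliced = splice([1.0, 2.0], [3.0, 4.0, 5.0])
full_size = splice([0.0] * 100, [0.0] * 400)
```

For a 100-dimensional fault characteristic and a 400-dimensional uncertain graph characteristic, the result has 500 dimensions, as in the example above.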

In order to improve the efficiency of root cause localization, a matching degree prediction model can be used to obtain the matching degree. The matching degree prediction model can be obtained based on machine learning algorithm training, and data processing can be realized more efficiently and accurately by using the model. The matching degree prediction model may be trained by using the previously-mentioned fault uncertainty maps of the established fault cases as samples, and further, considering the cases that the cases are few and the data amount is insufficient, which may cause model overfitting, in some examples, the matching degree prediction model may be obtained by training based on positive sample features and negative sample features, wherein the positive sample features are obtained by splicing the features of the fault uncertainty maps of the fault cases with the features of sub-graphs sampled from the same fault uncertainty map, and the negative sample features are obtained by splicing the features of the fault uncertainty maps of the fault cases with the features of sub-graphs sampled from different fault uncertainty maps; the node set and the edge set of the subgraph are respectively subsets of the node set and the edge set of the uncertain fault graph; the way of acquiring the features of the subgraph is the same as the way of acquiring the fault features of the fault derivative graph. 
That is, the data amount is amplified by sampling the fault uncertain graphs of the fault cases. An uncertain graph contains many possible graphs, so a plurality of subgraphs representing different possibilities can be sampled from it, and the model can then be trained based on the uncertain graphs and the sampled subgraphs. For example, if subgraph A1 and subgraph A2 are sampled from uncertain graph A, and subgraph B1 and subgraph B2 are sampled from uncertain graph B, then the characteristics corresponding to (A, A1), (A, A2), (B, B1), and (B, B2) are all positive sample characteristics, and the characteristics corresponding to (A, B1), (A, B2), (B, A1), and (B, A2) are all negative sample characteristics. The positive sample characteristics correspond to a first matching label, for example a matching degree of 1, and the negative sample characteristics correspond to a second matching label, for example a matching degree of 0, so that a trained matching degree prediction model can be obtained by training the preset initial model with the positive sample characteristics and the negative sample characteristics as training samples. Applying this data enhancement to the fault uncertain graph of each fault case can greatly improve the accuracy of the matching degree prediction model. In addition, for the specific sampling procedure, reference may be made to the manner of sampling an uncertain graph in the related art, which is not described herein again.
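
The pairing of uncertain graphs with sampled subgraphs can be sketched as follows. The sampling rule used here (keeping each edge independently with its existence probability) is an assumption standing in for the related-art sampling that the text refers to, and `sample_subgraph` and `build_training_pairs` are hypothetical names; characteristic extraction and splicing are left out.

```python
import random


def sample_subgraph(uncertain_edges, rng):
    """Keep each edge independently with its existence probability (assumed rule)."""
    return {pair: p for pair, p in uncertain_edges.items() if rng.random() < p}


def build_training_pairs(uncertain_graphs, samples_per_graph, seed=0):
    """Return (subgraphs, pairs); each pair is (graph, source, sample index, label).

    The label is 1 when the subgraph was sampled from the same uncertain
    graph (positive sample) and 0 otherwise (negative sample).
    """
    rng = random.Random(seed)
    subgraphs = {
        name: [sample_subgraph(edges, rng) for _ in range(samples_per_graph)]
        for name, edges in uncertain_graphs.items()
    }
    pairs = []
    for name in uncertain_graphs:
        for source, samples in subgraphs.items():
            label = 1 if source == name else 0
            pairs.extend((name, source, i, label) for i in range(len(samples)))
    return subgraphs, pairs


uncertain_graphs = {
    "A": {("a1", "a2"): 0.8, ("a2", "alarm_a"): 0.9},
    "B": {("b1", "alarm_b"): 0.6},
}
subgraphs, pairs = build_training_pairs(uncertain_graphs, samples_per_graph=2)
```

With two uncertain graphs and two samples each, this yields the four positive and four negative pairs listed in the example above.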

When the matching degree prediction model is trained, an XGBoost (Extreme Gradient Boosting) model can be used as the preset initial model. The XGBoost model is used here as a regression model; it originates from the Gradient Boosting framework but is more efficient, can handle tasks such as regression, classification, and ranking, and offers strong prediction performance and training speed, so it is suitable for the application scenes of the scheme of this specification. The working principle of the XGBoost model can be expressed as follows: features are split continuously to grow a tree, and each added tree learns a new function that fits the residual of the previous prediction; when training is completed and K trees are obtained, predicting the score of a sample means that, according to the characteristics of the sample, the sample falls into one leaf node in each tree, each leaf node corresponding to a score; the scores from all the trees are then accumulated to obtain the predicted value of the sample. Specifically, the XGBoost model has a plurality of adjustable hyperparameters, for example colsample_bytree (which defines the proportion of columns randomly sampled for each tree), learning_rate, max_depth (which defines the size of the tree), alpha (which defines a regularization term), n_estimators (which defines the number of trees), and the like. When the XGBoost model is trained, the hyperparameters are tuned by grid search combined with 5-fold cross validation to obtain the parameter configuration that gives the model the minimum RMSE (root mean square error), and this configuration is used as the final parameters of the XGBoost model.
Therefore, the XGBoost model with the configured parameters is trained on the aforementioned training samples to obtain the trained matching degree prediction model. Of course, in other embodiments, other machine learning regression models, such as a linear regression model, a logistic regression model, or a support vector machine regression model, may be used as the preset initial model. In addition, one or more of the following tuning modes can be adopted for the parameter optimization of the regression model: grid search, random search, and Gaussian Process. Alternatively, when the amount of data is large enough, the model may be trained based on a deep learning model, which is not limited in this specification. Because the trained model is simple, the scheme extends flexibly when new fault cases need to be inserted into the case base, and no extra time needs to be spent on retraining.
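
The tuning procedure (grid search combined with 5-fold cross validation, selecting the configuration with the minimum RMSE) can be sketched generically with the standard library. This is not XGBoost itself: `fit_scaled_mean` is a deliberately trivial, hypothetical stand-in model with a single hyperparameter `scale`, used only so that the sketch runs without third-party packages. In practice the `fit` callable would wrap the chosen regression model and the grid would hold hyperparameters such as those named above.

```python
import itertools
import math


def rmse(predicted, actual):
    """Root mean square error between two equally long sequences."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


def grid_search_cv(fit, grid, features, labels, folds=5):
    """Return (best_params, best_score) minimizing the mean validation RMSE.

    `fit(train_features, train_labels, **params)` must return a predictor
    callable. Requires at least `folds` samples.
    """
    names = sorted(grid)
    n = len(labels)
    best_params, best_score = None, None
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = []
        for k in range(folds):
            held_out = range(k * n // folds, (k + 1) * n // folds)
            train = [i for i in range(n) if i not in held_out]
            predict = fit([features[i] for i in train], [labels[i] for i in train], **params)
            scores.append(rmse([predict(features[i]) for i in held_out],
                               [labels[i] for i in held_out]))
        score = sum(scores) / folds
        if best_score is None or score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


def fit_scaled_mean(train_features, train_labels, scale):
    """Hypothetical stand-in model: always predicts scale * mean(train_labels)."""
    mean = sum(train_labels) / len(train_labels)
    return lambda feature: scale * mean


features = [[float(i)] for i in range(10)]
labels = [1.0] * 10
best_params, best_score = grid_search_cv(fit_scaled_mean, {"scale": [0.5, 1.0]}, features, labels)
```

Every grid point is scored on the same five held-out folds, so the configurations are compared on identical data.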

Step 1033, determining the fault root cause corresponding to the fault alarm based on the determined fault root cause corresponding to the fault uncertain graph whose matching degree meets the specified condition.

After the matching degree between the fault guidance graph and each fault uncertain graph is obtained, a target uncertain graph can be selected according to the specified condition, a derivation path is searched for in the target uncertain graph according to the abnormal indexes corresponding to the fault alarm, and the fault root cause corresponding to the fault alarm is determined based on the fault root cause located on the determined derivation path. The fault uncertain graph whose matching degree meets the specified condition may be the fault uncertain graph with the highest matching degree; in other embodiments, it may also be a fault uncertain graph whose matching degree exceeds a preset value.

In addition, since the failure uncertainty map is obtained by merging a plurality of scenes corresponding to the same failure alarm, there may be a plurality of determined failure roots corresponding to the failure uncertainty map, and based on this, if the failure uncertainty map with the highest matching degree has a plurality of failure roots, the failure root of the failure alarm is determined based on the priority of the determined failure root corresponding to the failure uncertainty map. That is, in the case where the failure uncertainty map has a plurality of determined root causes, the root causes of the historical failures are prioritized to output the most serious root cause as the root cause of the new failure. Here, the number of prioritized root causes is not necessarily the same as the number of all root causes of the historical failures, for example, when there are A, B, C, D, E root causes of the historical failure with the highest matching degree, but the anomaly indicators of the new failure are only A, B, C root causes matched on the derivation path corresponding to the uncertainty map of the historical failure, only the root causes A, B, C are prioritized, and the root causes D, E are not considered. 
In addition, the priority here is related to the nature of the abnormal index itself. For example, the lower the layer at which a root cause lies, the higher its priority; that is, the priority of a root cause can be positively correlated with how low-level the root cause is. For instance, the degree of influence of a server abnormality (such as a TCP-related index abnormality) is lower than that of a device abnormality (such as a CPU-related or memory-related index abnormality), so the priority of the root cause corresponding to the server abnormality is lower than that of the root cause corresponding to the device abnormality. Alternatively, the higher the severity level of a root cause, the higher its priority, where the severity level may be determined by the value of the abnormal index. For example, if root cause A and root cause B are both memory-related abnormal indexes, the ratio of the index value of root cause A to the threshold A' set for that index is 200%, and the ratio of the index value of root cause B to the threshold B' set for that index is 120%, then the severity level of root cause A is higher than that of root cause B, that is, the priority of root cause A is higher than that of root cause B. In some embodiments, the fault root cause of the fault alarm may be the first N root causes among the determined fault root causes corresponding to the fault uncertain graph after they are sorted by priority, where N is a preset number greater than or equal to 1. For example, if N is 3 and the determined fault root causes A, B, C, and D corresponding to the fault uncertain graph are ordered B, D, C, A after sorting by priority, the determined fault root causes of the fault alarm are B, D, and C. Of course, the value of N may be set according to the requirements of a specific scene, which is not limited in this specification.
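
The severity-based ordering can be sketched as follows. The helper `rank_root_causes` and the numeric values are hypothetical; only the ratios of 200% and 120% for root causes A and B come from the example above. Severity is taken as the ratio of the index value to its threshold, and only root causes matched on the derivation path are ranked.

```python
def rank_root_causes(candidates, matched, top_n=1):
    """Return the `top_n` matched root causes, most severe first.

    candidates: {root cause: (index value, threshold)}
    matched:    set of root causes found on the derivation path
    """
    scored = [
        (value / threshold, root)
        for root, (value, threshold) in candidates.items()
        if root in matched
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [root for _, root in scored[:top_n]]


candidates = {
    "A": (200.0, 100.0),  # ratio 200%
    "B": (120.0, 100.0),  # ratio 120%
    "D": (300.0, 100.0),  # severe, but not matched by the new fault
}
top_one = rank_root_causes(candidates, matched={"A", "B"}, top_n=1)
top_two = rank_root_causes(candidates, matched={"A", "B"}, top_n=2)
```

Root cause D is ignored despite its high ratio because the new fault did not match it, which mirrors the A, B, C versus D, E example above.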

In addition, as the characteristic of the fault uncertain graph of a fault case, the related art usually adopts a first-order adjacency matrix (first-order proximity matrix) as the graph characteristic. However, the first-order adjacency matrix only considers the direct neighbors of a node and does not consider the mutual influence between that node and other nodes, so in the application scene of the present specification it is easily subject to strong contingency, and the accuracy of the subsequently obtained result is low. Based on this, an alternative embodiment of the present specification proposes to use a second-order adjacency matrix (second-order proximity matrix) of the fault uncertain graph as the characteristic of the fault uncertain graph.

The second-order adjacency matrix is generally applied to a deterministic graph. For a deterministic graph (the probability on every edge is constantly 1), the value of the second-order adjacency matrix at the node pair (u, v) is:

$$S_G(u, v) = \frac{|N_G(u) \cap N_G(v)|}{|N_G(u) \cup N_G(v)|}$$

where $N_G(u)$ represents the neighbor node set of the node u in the graph G (in a directed graph, the neighbor nodes are the upstream nodes); $N_G(v)$ represents the neighbor node set of the node v in the graph G; $|N_G(u) \cap N_G(v)|$ represents the number of nodes in the intersection of the neighbor node sets of the node u and the node v; and $|N_G(u) \cup N_G(v)|$ represents the number of nodes in the union of the neighbor node sets of the node u and the node v. However, the present embodiment is directed to an uncertain graph, and since the existence probability of a node and an edge is not necessarily 1, the value of the second-order adjacency matrix at the node pair (u, v) must be calculated in an integral manner, that is, as an expected value over the possible graphs G of the uncertain graph:

$$S(u, v) = \mathbb{E}_{G}\left[\frac{|N_G(u) \cap N_G(v)|}{|N_G(u) \cup N_G(v)|}\right]$$

Furthermore, since the value of an edge in the uncertain graph is itself a statistical probability after infinite sampling, this integral basically cannot be evaluated directly for the uncertain graph. Based on this, theorem (1) can be established.

(Theorem (1) and equations (2) to (7) are not reproduced in this text.)

Then, the continuous integral is approximated as additions and subtractions of discrete values over finite samples by combining equations (2), (3), (4), (5), (6) and (7), and the final second-order adjacency matrix is calculated.
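
For intuition only, the expected value at a node pair can be computed exactly on a very small uncertain graph by enumerating every possible graph. This brute-force enumeration is a swapped-in illustration, not the approximation of equations (2) to (7); it is exponential in the number of edges. The function names are hypothetical, edges are assumed to exist independently, and the ratio is taken as 0 when both upstream sets are empty (an assumption).

```python
from itertools import product


def jaccard(upstream_u, upstream_v):
    """Intersection size over union size of two upstream node sets."""
    union = upstream_u | upstream_v
    # Assumption: the value is 0 when neither node has an upstream node.
    return len(upstream_u & upstream_v) / len(union) if union else 0.0


def expected_second_order(edges, u, v):
    """Expected second-order proximity of (u, v) over all possible graphs.

    edges: {(upstream, downstream): existence probability}, assumed independent.
    """
    pairs = list(edges)
    expected = 0.0
    for present in product([False, True], repeat=len(pairs)):
        weight = 1.0
        upstream = {u: set(), v: set()}
        for (source, target), keep in zip(pairs, present):
            probability = edges[(source, target)]
            weight *= probability if keep else 1.0 - probability
            if keep and target in upstream:
                upstream[target].add(source)
        expected += weight * jaccard(upstream[u], upstream[v])
    return expected


edges = {("a", "u"): 1.0, ("a", "v"): 0.5}
value = expected_second_order(edges, "u", "v")
```

In this example the nodes u and v share the upstream node a only in the graphs where the edge from a to v exists, which happens with probability 0.5, so the expected value is 0.5.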

That is to say, after the intersection and the union of the upstream node sets of any node pair in the uncertain graph are determined, the value of the second-order adjacency matrix at that node pair can be calculated based on the above formulas from the intersection, the union, and the probability of each association relation, and the second-order adjacency matrix of the uncertain graph is then obtained from the values at all node pairs. This calculation yields an N × N adjacency matrix, where N is the number of nodes in the uncertain graph of the historical fault. In some examples, it may be specified that the uncertain graph of each historical fault contains at most M nodes, that is, the maximum size of the adjacency matrix is M × M; if the adjacency matrix of a certain historical fault obtained by the above calculation is smaller than M × M, it is aligned and its vacant positions are filled with 0 to form an M × M adjacency matrix. The alignment and zero filling may be upper-left aligned, lower-left aligned, upper-right aligned, lower-right aligned, and so on, as long as the uncertain graph of every historical fault is processed with the same alignment and zero filling mode. In this way, by specifying the size of the adjacency matrix, the characteristics of the uncertain graphs of all historical faults are kept consistent in size, which facilitates the subsequent comparison of the matching degree with the characteristics of the derivative graph of the new fault.

In addition, when the characteristics of the uncertain graph of each historical fault are stored, the adjacency matrix can be flattened into an M × M-dimensional characteristic vector for storage. Optionally, M is 20, a value summarized from actual data; of course, in other embodiments, M may also be set according to the specific scene, such as the networking complexity and the number of abnormal indexes collected.
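
Upper-left alignment with zero filling, followed by flattening, can be sketched as below. The helper name `pad_and_flatten` is hypothetical, and row-major flattening is an assumption; the text only requires that one mode be used consistently.

```python
def pad_and_flatten(matrix, size):
    """Pad a square matrix to size x size (upper-left aligned, zero filled)
    and flatten it row by row into a single vector."""
    n = len(matrix)
    if n > size:
        raise ValueError("matrix exceeds the maximum number of nodes")
    padded = [list(row) + [0.0] * (size - n) for row in matrix]
    padded += [[0.0] * size for _ in range(size - n)]
    return [value for row in padded for value in row]


small = pad_and_flatten([[1.0, 2.0], [3.0, 4.0]], 3)
stored = pad_and_flatten([[0.5]], 20)
```

With M set to 20, every stored characteristic vector has 400 dimensions regardless of how many nodes the uncertain graph actually has.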

To facilitate a more detailed description of the fault root cause location scheme of the present specification, a specific embodiment is described as follows:

In this embodiment, the fault root cause positioning method of the present specification is applied to a communication network. The data sources are the abnormal index information detected by an analysis platform of the communication network from a plurality of device performance indexes acquired through the gRPC, NETCONF, and SNMP protocols, and the fault alarm information reported by each module according to the abnormal index information. The embodiment realizes root cause positioning analysis of a fault by processing these data. The present embodiment includes the following four steps:

firstly, constructing a case library:

in the case base of this embodiment, each record represents a historical fault, and the record includes a fault type, a fault root cause, a corresponding uncertainty map, and an adjacency matrix of the historical fault corresponding to the record. It should be noted that, the case base in the related art generally only records that two abnormal indexes are related, and no further derivation is made. The case base constructed in this embodiment is constructed based on preset rules, and the preset rules indicate the probability that one abnormal index causes another abnormal index, which is the result obtained by analyzing and summarizing a large number of faults and abnormal indexes thereof by an operation and maintenance expert. Wherein:

the fault type indicates which type the historical fault belongs to, such as high CPU load, excessive memory usage, excessive disk space usage, excessive disk IO usage, JVM OOM heap, network latency, network isolation, and the like;

the fault root is a root corresponding to the historical fault, and in this embodiment, the fault root may be obtained by manually analyzing the relevant information of the historical fault;

the uncertain graph is a derivative graph, with uncertainty, of the occurrence of the historical fault. First, the abnormal index corresponding to the fault alarm is specified as the initial node of the uncertain graph; second, the abnormal indexes within 2 minutes before and after the occurrence of the fault are matched according to the preset rules to generate a fault derivative graph; then all the scenes in which the same fault may occur are merged, that is, a plurality of fault derivative graphs are merged according to the same fault alarm, finally forming the uncertain graph corresponding to the historical fault. As shown in fig. 2, fig. 2 is a schematic diagram of an uncertain graph of a historical fault according to an exemplary embodiment, where in the historical fault, the abnormal index corresponding to the fault alarm is CPU_fault, which is the initial node in the uncertain graph. In the uncertain graph, the weight of an edge represents the existence probability of the edge; for example, the probability of CPUUtil -> CPU_anomaly is 0.8, which is obtained according to the preset rules;

the adjacency matrix is a second-order adjacency matrix generated according to the uncertainty map, and for the uncertainty map, the adjacency matrix contains a plurality of map possibilities, each possibility is a possible map thereof, and the value of the second-order adjacency matrix of the uncertainty map at each node pair may be an expected value of all possible maps contained in the uncertainty map. Therefore, the value of the second-order adjacency matrix at the node pair (u, v) is calculated based on equations (1) to (7) mentioned in this specification; moreover, it is specified that each historical fault includes at most 20 nodes, and when the size of the calculated adjacent matrix is less than 20 × 20, the alignment and 0 filling are performed on the vacant position, as shown in fig. 3, fig. 3 is a schematic diagram of forming a 3 × 3 matrix by performing the upper left alignment and 0 filling on the vacant position in the embodiment; finally, the 20 × 20 adjacency matrix is expanded into 400-dimensional feature vectors for storage.

Secondly, extracting the characteristics of the derivative graph of the new fault:

When it is determined that a new fault has occurred, the probabilities of the association relations between the abnormal indexes of the new fault are obtained according to the preset rule, and a derivative graph of the new fault is generated. Fig. 4 is a schematic diagram of a derivative graph of a new fault according to an exemplary embodiment; in the new fault, the abnormal index corresponding to the fault alarm is CPU_fault, which is the initial node of the derivative graph. The corresponding edge weights in the derivative graph are determined according to the probability that CPULoad causes CPU_anomaly, the probability that CPUUtil causes CPU_anomaly, and the probability that CPU_anomaly causes CPU_fault in the preset rule.
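One way to picture the rule matching described above is a lookup table keyed by (cause, effect) pairs. The sketch below is only an assumption about how such a preset rule could be represented: the table `RULES`, its probability values, the `MemUsage` and `Mem_anomaly` entries, and the function `build_derivative_graph` are all hypothetical and are not taken from the specification.

```python
from typing import Dict, Iterable, Tuple

Edge = Tuple[str, str]  # (cause index, resulting index)

# Hypothetical preset rule: probability that one abnormal index causes another.
RULES: Dict[Edge, float] = {
    ("CPULoad", "CPU_anomaly"): 0.8,
    ("CPUUtil", "CPU_anomaly"): 0.8,
    ("CPU_anomaly", "CPU_fault"): 0.9,
    ("MemUsage", "Mem_anomaly"): 0.7,
}


def build_derivative_graph(alarm: str, observed: Iterable[str],
                           rules: Dict[Edge, float]) -> Dict[Edge, float]:
    """Walk backwards from the alarm node and keep every rule edge whose
    cause was actually observed as an abnormal index."""
    present = set(observed) | {alarm}
    graph: Dict[Edge, float] = {}
    visited = {alarm}
    frontier = [alarm]
    while frontier:
        effect = frontier.pop()
        for (cause, result), prob in rules.items():
            if result == effect and cause in present:
                graph[(cause, result)] = prob
                if cause not in visited:
                    visited.add(cause)
                    frontier.append(cause)
    return graph
```

Starting from the alarm node keeps only indexes that are connected to the fault alarm, so an observed but unrelated index (here `MemUsage`) does not enter the graph.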

When extracting the features of the derivative graph of the new fault, the initial feature of each node in the derivative graph is determined based on a preset feature coding mode, in which feature coding is restricted to at most 5 devices and 95 common abnormal indexes. The final feature of each node is then determined from bottom to top up to the initial node, and the final feature of the initial node is taken as the graph feature of the derivative graph of the new fault. The final feature of a node is the sum of the final features of all of its upstream nodes, each multiplied by the weight on the connecting edge, plus the node's own initial feature. Following the example in fig. 4, assume that each node has only a 5-dimensional index, where the initial feature of the CPULoad node is (1, 0, 0, 0, 0), that of the CPUUtil node is (0, 1, 0, 0, 0), that of the CPU_anomaly node is (0, 0, 1, 0, 0), and that of the CPU_fault node is (0, 0, 0, 1, 0). The final feature of the CPU_anomaly node is (1, 0, 0, 0, 0) × 0.8 + (0, 1, 0, 0, 0) × 0.8 + (0, 0, 1, 0, 0) = (0.8, 0.8, 1, 0, 0), and the final feature of the CPU_fault node is (0.8, 0.8, 1, 0, 0) × 0.9 + (0, 0, 0, 1, 0) = (0.72, 0.72, 0.9, 1, 0). The graph feature of the derivative graph of the new fault may therefore be represented by (0.72, 0.72, 0.9, 1, 0).
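The bottom-up accumulation in the fig. 4 example can be reproduced with a short recursive sketch. The function name `final_feature` and the dictionary layout are assumptions made for illustration, and the graph is assumed to be acyclic; the edge weights 0.8, 0.8 and 0.9 are the ones implied by the worked example.

```python
from typing import Dict, List, Optional, Tuple

Edge = Tuple[str, str]  # (upstream node, downstream node)


def final_feature(node: str, edges: Dict[Edge, float],
                  initial: Dict[str, List[float]],
                  cache: Optional[Dict[str, List[float]]] = None) -> List[float]:
    """Final feature = own initial feature + sum over upstream nodes of
    (edge weight * upstream final feature). Assumes an acyclic graph."""
    if cache is None:
        cache = {}
    if node in cache:
        return cache[node]
    total = list(initial[node])
    for (src, dst), weight in edges.items():
        if dst == node:
            upstream = final_feature(src, edges, initial, cache)
            total = [t + weight * u for t, u in zip(total, upstream)]
    cache[node] = total
    return total


EDGES = {
    ("CPULoad", "CPU_anomaly"): 0.8,
    ("CPUUtil", "CPU_anomaly"): 0.8,
    ("CPU_anomaly", "CPU_fault"): 0.9,
}
INITIAL = {
    "CPULoad": [1, 0, 0, 0, 0],
    "CPUUtil": [0, 1, 0, 0, 0],
    "CPU_anomaly": [0, 0, 1, 0, 0],
    "CPU_fault": [0, 0, 0, 1, 0],
}
graph_feature = final_feature("CPU_fault", EDGES, INITIAL)
```

Up to floating-point rounding, `graph_feature` equals (0.72, 0.72, 0.9, 1, 0), matching the example. No training is involved in this step, which is consistent with the statement that feature extraction needs no training.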

Thirdly, training a matching degree prediction model:

A matching degree prediction model is trained based on the constructed case library. To enlarge the data volume, the uncertain graph of each historical fault is sampled to obtain multiple sub-graphs. Fig. 5 shows some of the sub-graphs (five sub-graphs, 5a, 5b, 5c, 5d, and 5e, are shown) obtained by sampling the uncertain graph of the historical fault shown in fig. 2. Performing this data augmentation on each historical fault addresses the model overfitting caused by a small number of cases and an insufficient data volume, and improves the accuracy of the prediction model.
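The specification does not state how sub-graphs are sampled. As one plausible sketch, each edge can be kept independently with a probability equal to its weight, after which only the edges that still lead to the initial node are retained. Both the sampling rule and the name `sample_subgraph` are assumptions made for illustration.

```python
import random
from typing import Dict, Tuple

Edge = Tuple[str, str]  # (upstream node, downstream node)


def sample_subgraph(edges: Dict[Edge, float], initial_node: str,
                    rng: random.Random) -> Dict[Edge, float]:
    """Keep each edge with probability equal to its weight (assumed rule),
    then drop edges that no longer connect to the initial node."""
    kept = {edge: p for edge, p in edges.items() if rng.random() < p}
    reachable = {initial_node}
    result: Dict[Edge, float] = {}
    changed = True
    while changed:
        changed = False
        for (src, dst), p in kept.items():
            if dst in reachable and (src, dst) not in result:
                result[(src, dst)] = p
                reachable.add(src)
                changed = True
    return result
```

Every sampled sub-graph has an edge set that is a subset of the uncertain graph's edge set, which agrees with the subset relation stated for sub-graphs in the claims.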

In the training process of the matching degree prediction model, the training samples comprise positive sample features and negative sample features. A positive sample feature is formed from the graph feature of the whole uncertain graph and the graph feature of a sub-graph sampled from that same uncertain graph; a negative sample feature is formed from the graph feature of the whole uncertain graph and the graph feature of a sub-graph sampled from another uncertain graph. Positive sample features correspond to a matching degree of 1, and negative sample features correspond to a matching degree of 0. The graph features of a sub-graph are obtained in the same way as the features of the derivative graph of the new fault are extracted in the second part, which is not described here again.

After the training samples are obtained, they are input into an XGBoost model for training to obtain the matching degree prediction model.
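The construction of positive and negative training samples can be sketched as below. The function `build_training_pairs` and the dictionary layout are hypothetical, and placing the uncertain-graph feature before the sub-graph feature in the concatenation is an assumption; the specification only says that the two features are spliced.

```python
from typing import Dict, List, Tuple


def build_training_pairs(
    case_features: Dict[str, List[float]],
    subgraph_features: Dict[str, List[List[float]]],
) -> Tuple[List[List[float]], List[int]]:
    """Concatenate each case's whole-graph feature with every sub-graph
    feature. Label 1 if the sub-graph was sampled from the same case,
    label 0 if it was sampled from a different case."""
    X: List[List[float]] = []
    y: List[int] = []
    for case_id, case_vec in case_features.items():
        for source_id, sub_vecs in subgraph_features.items():
            label = 1 if source_id == case_id else 0
            for sub_vec in sub_vecs:
                X.append(list(case_vec) + list(sub_vec))
                y.append(label)
    return X, y
```

The resulting `X` and `y` could then be passed to a gradient-boosting classifier, for example `xgboost.XGBClassifier().fit(X, y)`; the objective and hyperparameters are not given in the specification and would have to be chosen separately.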

Fourthly, positioning root cause of the new fault:

When it is determined that a new fault has occurred, the graph feature of the derivative graph of the new fault is obtained using the feature extraction mode of the second part. This graph feature is combined with the graph feature of the uncertain graph of each historical fault in the case library of the first part, and each combined feature is input into the trained matching degree prediction model to predict its matching degree. The case with the highest matching degree is determined, and its uncertain graph is taken as the target uncertain graph. A derivation path is then determined in the target uncertain graph according to the abnormal indexes of the new fault; if only one root cause exists on the derivation path, that root cause is the root cause of the new fault.
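The matching step described above amounts to scoring one combined feature per historical case and taking the maximum. In the sketch below, `best_matching_case` is a hypothetical helper, `predict` stands for the trained matching degree prediction model, and `toy_predict` is only a stand-in scorer used to make the example runnable; none of these names come from the specification.

```python
from typing import Callable, Dict, List, Tuple


def best_matching_case(
    new_feature: List[float],
    case_features: Dict[str, List[float]],
    predict: Callable[[List[float]], float],
) -> Tuple[str, Dict[str, float]]:
    """Score the new fault against every case and return the best case id
    together with all matching degrees."""
    scores = {
        case_id: predict(list(case_vec) + list(new_feature))
        for case_id, case_vec in case_features.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


def toy_predict(combined: List[float]) -> float:
    """Stand-in for the trained model: similarity of the two halves."""
    half = len(combined) // 2
    left, right = combined[:half], combined[half:]
    return 1.0 / (1.0 + sum((a - b) ** 2 for a, b in zip(left, right)))
```

Because the model only scores pairs of features, adding a new case to the case library means adding one more entry to `case_features`, which fits the statement that new cases can be inserted without retraining.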

Where the target uncertain graph corresponds to multiple root causes, that is, multiple root causes exist on the derivation path, the root causes on the derivation path are ranked by priority, with lower-layer root causes having higher priority, and the root cause with the highest priority is determined as the root cause of the new fault.
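The priority rule can be sketched by treating the layer of a node as its distance from the initial node, so that a lower layer means a greater distance. This reading of "lower-layer" is an assumption, as are the helper names `node_depths` and `pick_root_cause`.

```python
from collections import deque
from typing import Dict, Iterable, List, Tuple

Edge = Tuple[str, str]  # (upstream node, downstream node)


def node_depths(edges: Iterable[Edge], initial_node: str) -> Dict[str, int]:
    """Breadth-first search against the edge direction, starting from the
    initial node (depth 0)."""
    edge_list = list(edges)
    depths = {initial_node: 0}
    queue = deque([initial_node])
    while queue:
        node = queue.popleft()
        for src, dst in edge_list:
            if dst == node and src not in depths:
                depths[src] = depths[node] + 1
                queue.append(src)
    return depths


def pick_root_cause(candidates: List[str], edges: Iterable[Edge],
                    initial_node: str) -> str:
    """Return the candidate root cause on the deepest layer (assumed to be
    the highest priority)."""
    depths = node_depths(edges, initial_node)
    ranked = [c for c in candidates if c in depths]
    if not ranked:
        raise ValueError("no candidate lies on a path to the initial node")
    return max(ranked, key=lambda c: depths[c])
```

When two candidates sit on the same layer, `max` returns the first one in `candidates`; the specification does not say how such ties are broken.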

The scheme of this embodiment is suitable for various application scenarios, especially where the same fault may correspond to multiple different scenarios. As the first and second parts show, the scheme does not require a large data volume, and no training is needed anywhere in the feature extraction process; as the third part shows, the scheme can achieve a relatively accurate result with simple training on a small data volume. Because the trained model is simple, the scheme extends flexibly when new fault cases need to be inserted into the case library, and no additional time needs to be spent on retraining. In summary, the scheme can greatly reduce the influence of time delay, out-of-order abnormal indexes, and repeated abnormal types on case library matching, ensures matching accuracy, and has good fault tolerance.

Corresponding to the method embodiments, this specification also provides embodiments of a fault root cause locating device and of a terminal to which the device is applied.

The embodiment of the fault root cause positioning device in this specification can be applied to computer equipment, such as a server or terminal equipment. The device embodiments may be implemented by software, by hardware, or by a combination of hardware and software. Taking a software implementation as an example, the device, as a logical device, is formed when the processor of the computer equipment in which it is located reads the corresponding computer program instructions from the nonvolatile memory into the memory and runs them. From a hardware aspect, fig. 6 is a hardware structure diagram of the computer device in which the fault root cause positioning apparatus of this embodiment is located; in addition to the processor 610, the memory 630, the network interface 620, and the nonvolatile memory 640 shown in fig. 6, the server or electronic device in which the apparatus 631 is located may also include other hardware according to the actual function of the computer device, which is not described here again.

Accordingly, the embodiments of the present specification also provide a computer storage medium, in which a program is stored, and the program, when executed by a processor, implements the method in any of the above embodiments.

Embodiments of the present description may take the form of a computer program product embodied on one or more storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having program code embodied therein. Computer-usable storage media include permanent and non-permanent, removable and non-removable media, and information storage may be implemented by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to: phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.

As shown in fig. 7, fig. 7 is a block diagram of a fault root cause locating device shown in the present specification according to an exemplary embodiment, the device includes:

an index obtaining module 71, configured to, when a fault alarm is received, obtain an abnormal index causing the fault alarm; the abnormal indexes comprise abnormal indexes of at least one device in a first specified time period before the fault alarm occurs and abnormal indexes of at least one device in a second specified time period after the fault alarm occurs;

a fault derivation module 72, configured to construct a fault derivation graph according to the obtained abnormal indexes; the two nodes on any directed edge in the fault derivation graph represent two of the obtained abnormal indexes, the directed edge indicates the association relation between the abnormal indexes represented by its two nodes, and the probability on the edge represents the probability that one abnormal index causes the other;

a root cause positioning module 73, configured to determine the fault root cause corresponding to the fault alarm according to the matching degree between the fault derivation graph and the fault uncertainty map corresponding to each fault case in the established fault case library; the fault uncertainty map of each fault case in the fault case library corresponds to a determined fault root cause, and is generated by merging multiple fault derivation graphs generated for the same fault alarm under different application scenarios.
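The time-window selection performed by the index obtaining module 71 can be sketched as a simple filter. The event layout (timestamp, device, index name) and the function name `select_abnormal_indexes` are assumptions; the 2-minute defaults follow the embodiment described earlier, while the module itself only requires a first and a second specified time period.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Event = Tuple[datetime, str, str]  # (timestamp, device, abnormal index name)


def select_abnormal_indexes(
    events: List[Event],
    alarm_time: datetime,
    before: timedelta = timedelta(minutes=2),
    after: timedelta = timedelta(minutes=2),
) -> List[Event]:
    """Keep the abnormal indexes reported within `before` of the alarm
    or within `after` following it (bounds inclusive)."""
    start, end = alarm_time - before, alarm_time + after
    return [event for event in events if start <= event[0] <= end]
```

Including indexes reported after the alarm gives the matching step a chance to tolerate delayed or out-of-order reports, which is the kind of uncertainty the scheme is designed around.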

In an optional embodiment, the index obtaining module 71 is specifically configured to:

taking the fault alarm as an initial node of the fault derivation graph;

In an optional embodiment, the apparatus further comprises:

the root cause positioning module 73 includes:

In an optional embodiment, the determining sub-module is specifically configured to:

In an alternative embodiment, the features of the fault uncertainty map are obtained based on a second-order adjacency matrix of the fault uncertainty map; and the second-order adjacency matrix of the fault uncertainty map is obtained based on calculation of all possible expected values contained in the fault uncertainty map.

In an optional embodiment, the matching degree prediction model is obtained by training based on positive sample features and negative sample features, the positive sample features are obtained by splicing the features of the fault uncertainty maps of the fault cases with the features of sub-maps sampled from the same fault uncertainty map, and the negative sample features are obtained by splicing the features of the fault uncertainty maps of the fault cases with the features of the sub-maps sampled from different fault uncertainty maps; the node set and the edge set of the subgraph are respectively subsets of the node set and the edge set of the uncertain fault graph; the obtaining mode of the characteristics of the subgraph is the same as the obtaining mode of the fault characteristics of the fault derivation graph.

The implementation process of the functions and actions of each module in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.

For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, wherein the modules described as separate parts may or may not be physically separate, and the parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network modules. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution in the specification. One of ordinary skill in the art can understand and implement it without inventive effort.

The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.

Other embodiments of the present description will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This specification is intended to cover any variations, uses, or adaptations of the specification following, in general, the principles of the specification and including such departures from the present disclosure as come within known or customary practice within the art to which the specification pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the specification being indicated by the following claims.

It will be understood that the present description is not limited to the precise arrangements described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the present description is limited only by the appended claims.

The above description is only a preferred embodiment of the present disclosure, and should not be taken as limiting the present disclosure, and any modifications, equivalents, improvements, etc. made within the spirit and principle of the present disclosure should be included in the scope of the present disclosure.

Claims

1. A fault root cause positioning method is characterized by comprising the following steps:

constructing a fault derivation graph according to the acquired abnormal indexes; two nodes on any directed edge in the fault derivation graph represent two of the acquired abnormal indexes, the directed edge is used for indicating the association relation between the abnormal indexes represented by the two nodes on the edge, and the probability on the edge is used for representing the probability that one abnormal index causes the other abnormal index; the fault derivation graph is constructed based on a matching result of matching the abnormal indexes causing the fault alarm according to a pre-specified rule; the pre-specified rule is used for determining the association relations existing among the abnormal indexes and the probabilities of the association relations;

2. The method according to claim 1, wherein the constructing the fault derivation graph according to the acquired abnormal indexes comprises:

taking the fault alarm as an initial node of the fault derivation graph;

and matching the abnormal index of at least one device in a first specified time period before the fault alarm occurs and the abnormal index of at least one device in a second specified time period after the fault alarm occurs according to a pre-specified rule to obtain a middle node and a leaf node of the fault derivation graph so as to construct the fault derivation graph.

3. The method of claim 2, wherein before determining the fault root cause corresponding to the fault alarm according to the matching degree between the fault derivation graph and the fault uncertainty map corresponding to the fault case in the established fault case library, the method further comprises: obtaining the features corresponding to the fault uncertainty map of each fault case in the fault case library; the features corresponding to the fault uncertainty map comprise an adjacency matrix of the fault uncertainty map;

4. The method of claim 3, wherein the encoding the fault derivation graph and determining the fault features corresponding to the fault derivation graph according to an information transfer manner comprises:

and gradually transmitting the information of the initial characteristics of the leaf nodes in the fault derivative graph to the initial nodes in a bottom-up mode, and taking the transmitted characteristics of the initial nodes as the fault characteristics corresponding to the fault derivative graph, so that the fault characteristics corresponding to the fault derivative graph are represented by the characteristic information of the abnormal index corresponding to the fault alarm and the characteristic information of all sub-abnormalities contained in the abnormal index corresponding to the fault alarm.

5. The method of claim 3, wherein the features of the fault uncertainty map are derived based on a second order adjacency matrix of the fault uncertainty map; and the second-order adjacency matrix of the fault uncertainty map is obtained based on calculation of all possible expected values contained in the fault uncertainty map.

6. The method according to claim 3, wherein the matching degree prediction model is obtained by training based on positive sample features and negative sample features, the positive sample features are obtained by splicing the features of the fault uncertainty maps of the fault cases with the features of sub-graphs sampled from the same fault uncertainty maps, and the negative sample features are obtained by splicing the features of the fault uncertainty maps of the fault cases with the features of sub-graphs sampled from different fault uncertainty maps; the node set and the edge set of the subgraph are respectively subsets of the node set and the edge set of the uncertain fault graph; the obtaining mode of the characteristics of the subgraph is the same as the obtaining mode of the fault characteristics of the fault derivation graph.

7. The method according to claim 1, characterized in that the fault root cause corresponding to the fault alarm is determined based on the fault root cause corresponding to the fault uncertainty map with the highest matching degree in each fault uncertainty map; and if the fault uncertain graph with the highest matching degree has a plurality of fault root causes, determining the corresponding fault root cause of the fault alarm based on the priority of each fault root cause of the fault uncertain graph.

8. A fault root cause locating device, comprising:

the fault derivation module is used for constructing a fault derivation graph according to the acquired abnormal indexes; two nodes on any directed edge in the fault derivation graph represent two of the acquired abnormal indexes, the directed edge is used for indicating the association relation between the abnormal indexes represented by the two nodes on the edge, and the probability on the edge is used for representing the probability that one abnormal index causes the other abnormal index; the fault derivation graph is constructed based on a matching result of matching the abnormal indexes causing the fault alarm according to a pre-specified rule; the pre-specified rule is used for determining the association relations existing among the abnormal indexes and the probabilities of the association relations;

9. The apparatus of claim 8, wherein the indicator obtaining module is specifically configured to:

taking the fault alarm as an initial node of the fault derivation graph;

10. The apparatus of claim 9, further comprising:

the root cause location module includes:

11. The apparatus of claim 10, wherein the determination submodule is specifically configured to:

12. The apparatus of claim 10, wherein the features of the fault uncertainty map are derived based on a second order adjacency matrix of the fault uncertainty map; and the second-order adjacency matrix of the fault uncertainty map is obtained based on calculation of all possible expected values contained in the fault uncertainty map.

13. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program implements the method of any of claims 1 to 7.

14. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method of any one of claims 1 to 7.