CN115150255A

CN115150255A - Self-adaptive knowledge graph-based automatic root cause positioning method for application faults

Info

Publication number: CN115150255A
Application number: CN202210822528.1A
Authority: CN
Inventors: 沈国鹏; 朱品燕
Original assignee: Beijing Yunji Zhizao Technology Co ltd
Current assignee: Beijing Yunji Zhizao Technology Co ltd
Priority date: 2022-07-12
Filing date: 2022-07-12
Publication date: 2022-10-04
Anticipated expiration: 2042-07-12
Also published as: CN115150255B

Abstract

The invention discloses a self-adaptive knowledge graph-based automatic root cause positioning method for application faults, comprising an offline training part and an online root cause positioning part. The offline training part constructs a software and hardware knowledge graph from CMDB data and application calling relations, and mines an alarm knowledge graph from historical alarm data or historical fault alarm data. The online root cause positioning part takes as input the alarm data of a fault time interval together with the two knowledge graphs obtained in offline training, namely the software and hardware knowledge graph and the alarm knowledge graph, and obtains a fault root cause node and a fault propagation path through a root cause positioning algorithm. The invention uses an alarm classification algorithm and a causal mining algorithm to automatically mine abstract alarm classes, and the causal relationships among them, from historical alarm data in order to construct the alarm knowledge graph, which improves the effectiveness and interpretability of subsequent alarm noise reduction clustering and root cause positioning.

Description

Self-adaptive knowledge graph-based automatic root cause positioning method for application faults

Technical Field

The invention relates to the field of knowledge graph-based application fault analysis, and in particular to a self-adaptive knowledge graph-based automatic root cause positioning method for application faults.

Background

Alarms are data generated when expert rules are triggered; engineers set these rules on the collected monitoring data (metrics, logs, etc.) in order to guarantee IT service quality. When alarms occur, the operation and maintenance engineer checks each piece of alarm data to confirm whether a fault has occurred in the system and what kind of fault it is. Given the size and complexity of an online system, failures are inevitable. Because of the complex dependencies among system components and the redundant monitoring across different monitoring systems, the failure of a single component triggers a chain reaction that causes problems in many components, and the same problem may cause several monitoring systems to raise alarms. The result is a large number of alarms in a short time, which is called an alarm storm. In this situation, it is difficult for operation and maintenance personnel to perform fault analysis and root cause positioning efficiently.

The current solution is alarm clustering based on cross entropy: alarm information is clustered using cross entropy according to alarm scenes and rules to achieve alarm convergence, which improves the efficiency of fault analysis by operation and maintenance personnel. In addition, expert rules are established to assist root cause positioning.

The prior art has the following defects:

1. Alarm convergence rules must be manually compiled and maintained.

2. Root cause positioning rules must be manually compiled and maintained.

3. Manual rules rely on expert experience, may be inaccurate in themselves, and may no longer apply after a system update.

4. When a new alarm rule is introduced into the system, the rules must be manually updated to handle it.

5. Methods based on alarm convergence cannot directly give the root cause of a fault. The convergence achieved with cross entropy is limited: it reduces alarms by only about 30% and does not effectively solve the alarm storm problem.

6. The alarms associated with a system fault cannot all be aggregated together, so fault diagnosis is not well supported.

7. Generality is low: different systems may require new manual rules to be configured, so configuration cost is high and efficiency is low.

8. Interpretability is low.

9. An accurate root cause node and fault propagation path cannot be given automatically.

For the above reasons, a self-adaptive knowledge graph-based automatic root cause positioning method for application faults is a technical problem in urgent need of a solution.

Disclosure of Invention

The invention mainly solves the following problems:

(1) When a system fault occurs, it can cause an alarm storm, and operation and maintenance personnel have no efficient means of automatically performing noise reduction and clustering on the alarms, so the number of alarms that must be checked is excessive.

(2) When an alarm storm occurs, operation and maintenance personnel have no effective means of performing automatic root cause positioning and fault propagation chain analysis.

(3) Existing alarm noise reduction clustering and root cause positioning techniques are essentially based on expert rules, which are difficult and costly to maintain and have poor portability.

In order to solve the above technical problems, the technical scheme provided by the invention is as follows: a self-adaptive knowledge graph-based automatic root cause positioning method for application faults comprises an offline training part and an online root cause positioning part, wherein the offline training part constructs a software and hardware knowledge graph from CMDB data and application calling relations, and mines an alarm knowledge graph from historical alarm data or fault alarm data;

the input of the online root cause positioning part is the alarm data of a fault time interval and the two knowledge graphs obtained in the offline training process, namely the software and hardware knowledge graph and the alarm knowledge graph, from which a fault root cause node and a fault propagation path are obtained through a root cause positioning algorithm.

Further, the input data from which the offline training part mines the alarm knowledge graph is historical alarm data or historical fault alarm data covering a certain period of time, and this data is called training data.

Further, the following root cause locating process is carried out on each fault set:

(1) Constructing an initial fault subgraph: through the association between the alarms and the nodes of the software and hardware knowledge graph, the graph nodes and edges associated with the alarms in the fault set are obtained as the initial fault subgraph; the initial fault subgraph is a subgraph of the software and hardware knowledge graph; the alarms are then mapped as node attributes onto the corresponding fault subgraph nodes, and the initial root cause scores and edge weights of the fault subgraph are calculated from the alarm knowledge graph, the edge relations of the fault subgraph, and the alarm information on the nodes, including alarm time information; specifically, the earlier the alarms associated with a node occur within the fault set, the higher the root cause score of the node; for two nodes connected by an edge, for example A and B with the edge pointing from A to B, if the alarm type on node A has a causal relationship with the alarm type on node B in the alarm knowledge graph, the root cause score of node A is increased, the root cause score of node B is decreased, and the weight of the edge is increased; finally, the initial root cause scores of all the nodes and the edge weights are normalized, which completes the construction of the initial fault subgraph;

(2) Root cause score inference: the input is the initial fault subgraph; the final root cause score of each node of the fault subgraph is calculated using a graph-based root cause scoring algorithm, which yields the fault subgraph, and the node with the highest root cause score is taken as the root cause node; graph-based root cause scoring algorithms include, but are not limited to, the PageRank algorithm, the Personalized PageRank algorithm and the like, provided that the input is a fault subgraph and the output is a root cause score for each node of the fault subgraph; the root cause score is a quantitative index of how likely the node is to be the root cause node; a PageRank-based root cause scoring algorithm takes the PR value of each node output by the PageRank algorithm as the root cause score of the node;

(3) Root cause link mining: the input is the fault subgraph and the root cause node; candidate links are mined first, with the root cause node as the starting node, by using a graph algorithm similar to depth-first search to mine all links starting from the root cause node as candidate root cause links; a ranking index is then calculated for each candidate link, the candidate root cause links are ranked by this index, and the top N are retained as the finally output root cause links.

Further, the ranking index in step (3) is computed for a link as the sum of the root cause scores of the nodes on the link plus the sum of the weights of its edges, divided by the number of edges and nodes on the link.
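The ranking index can be written out as a formula. The following is one reading of the sentence above, assuming that "the number of edges and nodes" means their sum; the symbols $V_L$, $E_L$, $s$ and $w$ are introduced here for illustration and do not appear in the original text.

```latex
% Ranking index of a candidate link L with node set V_L and edge set E_L,
% where s(v) is the root cause score of node v and w(e) is the weight of edge e.
\[
  \operatorname{index}(L) \;=\;
  \frac{\sum_{v \in V_L} s(v) \;+\; \sum_{e \in E_L} w(e)}{|V_L| + |E_L|}
\]
% Worked example with made-up values: three nodes scored 0.5, 0.3, 0.1
% and two edges weighted 0.5, 0.3 give
\[
  \operatorname{index}(L) = \frac{(0.5 + 0.3 + 0.1) + (0.5 + 0.3)}{3 + 2}
  = \frac{1.7}{5} = 0.34
\]
```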

Further, the alarm data is first subjected to alarm noise reduction clustering using the software and hardware knowledge graph, the alarm knowledge graph, and the alarm classification model obtained by the offline training part.

Compared with the prior art, the invention has the advantages that:

(1) An alarm classification algorithm and a causal mining algorithm are used to automatically mine abstract alarm classes, and the causal relationships among them, from historical alarm data in order to construct the alarm knowledge graph, which improves the effectiveness and interpretability of subsequent alarm noise reduction clustering and root cause positioning.

(2) CMDB data and application calling relations are used in the alarm noise reduction clustering and root cause positioning process, so the results of alarm noise reduction clustering and root cause positioning are effective and highly interpretable.

(3) The offline training process and the online root cause positioning process depend only on the given data (alarms, CMDB data and application calling relations) and not on expert knowledge, so maintenance cost is low and portability is strong.

(4) In the online root cause positioning process, only the alarm data of the fault time interval is needed to automatically infer the root cause of the fault and the fault propagation link.

(5) The time operation and maintenance personnel spend on troubleshooting and repair can be greatly reduced, and the reliability of the product is improved.

Drawings

FIG. 1 is a diagram of an adaptive knowledge-graph-based application failure automatic root cause location method according to the present invention.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings.

In a specific implementation, the invention provides a self-adaptive knowledge graph-based automatic root cause positioning method for application faults. The method comprises an offline training part and an online root cause positioning part, wherein the offline training part constructs a software and hardware knowledge graph from CMDB data and application calling relations, and mines an alarm knowledge graph from historical alarm data or fault alarm data.

The specific implementation principle and process are as follows:

The method consists of an offline training part (graph mining) and an online root cause positioning part. The offline training part constructs a software and hardware knowledge graph from CMDB data and application calling relations, and mines an alarm knowledge graph from historical alarm data or fault alarm data. The input of the online root cause positioning part is the alarm data of a fault time interval and the two knowledge graphs obtained in the offline training process, namely the software and hardware knowledge graph and the alarm knowledge graph, from which a fault root cause node and a fault propagation path are obtained through a root cause positioning algorithm.

Offline training part, software and hardware knowledge graph construction. The input of this block is CMDB data and application calling relations (which may themselves be stored in the CMDB). The CMDB data describes the resource deployment relationships of an application system and comprises nodes and the edge relations among them. The node types include systems, modules, components, DUs (deployment units), groups (host instance groups), software, virtual machines, physical machines, access switches, core switches, aggregation switches, routers and the like; the edge relations include coherence (composition), call (call), local (logical connection), cluster (aggregation), ship (bearer), host (host), connect (physical connection), and so on. An application calling relation is a calling relation between application-class nodes (e.g. systems, modules, components, microservices, and the like) in the CMDB. A script then inserts the nodes and the edge relations between them (application calling relations are also edge relations) into a graph database, which completes the construction of the software and hardware knowledge graph.
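The construction step can be sketched in a few lines of Python. This is a minimal illustration only: it keeps the graph in in-memory dictionaries instead of a graph database, and the node identifiers, the record layout and the `build_graph` helper are hypothetical, not taken from the original text.

```python
from collections import defaultdict

# Hypothetical CMDB export: node records and typed, directed edge records.
nodes = [
    {"id": "order-system", "type": "system"},
    {"id": "order-api", "type": "module"},
    {"id": "pay-api", "type": "module"},
    {"id": "vm-01", "type": "virtual machine"},
    {"id": "host-01", "type": "physical machine"},
]
edges = [
    ("order-system", "order-api", "composition"),
    ("order-api", "pay-api", "call"),  # application calling relation
    ("order-api", "vm-01", "host"),
    ("vm-01", "host-01", "host"),
]


def build_graph(nodes, edges):
    """Return (node_index, adjacency) for a directed graph with typed edges."""
    node_index = {n["id"]: n for n in nodes}
    adjacency = defaultdict(list)
    for src, dst, relation in edges:
        # Reject edges whose endpoints are missing from the node list.
        if src not in node_index or dst not in node_index:
            raise KeyError(f"edge references unknown node: {src} -> {dst}")
        adjacency[src].append((dst, relation))
    return node_index, dict(adjacency)


node_index, adjacency = build_graph(nodes, edges)
print(adjacency["order-api"])
```

In a real deployment the same loop would issue insert statements against the chosen graph database; the dictionary form is used here only so that the example is self-contained.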

Offline training part, alarm knowledge graph mining. The input data of this block is historical alarm data (or historical fault alarm data) covering a certain period of time, referred to below as training data. First, an alarm classification algorithm is trained on the training data to obtain an alarm classification model, which maps an alarm instance to an abstract alarm type (for example, all CPU alarm instances are mapped to the abstract alarm type 'CPU alarm'). An abstract alarm type must meet two conditions. First, it has a clear meaning, so that an operation and maintenance engineer can understand what the alarm type means from its description. Second, the number of types is limited: there should not be too many abstract alarm types, otherwise the abstraction loses its purpose. The specific alarm classification algorithm is not limited, provided it achieves the above purpose. For example, a rule-based alarm classification algorithm can be used: if the alarm data includes the alarm rule configured by the operation and maintenance engineer, the alarm rule can be used directly as the abstract alarm type. Alternatively, a clustering-based unsupervised alarm classification algorithm can be used: the message content of the alarms is clustered with a text clustering algorithm, and each cluster center is taken as an abstract alarm class. Alternatively, a supervised alarm classification algorithm can be used: an operation and maintenance engineer first labels the training data with alarm classes, and a supervised classification model (including but not limited to a neural network model) is then trained.
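
To make the mapping from alarm instance to abstract alarm type concrete, the sketch below uses simple keyword matching on the alarm message. This is a stand-in chosen for brevity, not one of the three algorithms named above; the `RULES` table, the type names and the `classify_alarm` function are all hypothetical.

```python
import re

# Hypothetical keyword rules: (compiled pattern, abstract alarm type).
# The first matching rule wins.
RULES = [
    (re.compile(r"\bcpu\b", re.IGNORECASE), "CPU alarm"),
    (re.compile(r"\b(memory|oom)\b", re.IGNORECASE), "Memory alarm"),
    (re.compile(r"\b(timeout|timed out)\b", re.IGNORECASE), "Timeout alarm"),
]


def classify_alarm(message, default="Other alarm"):
    """Map one alarm message to an abstract alarm type."""
    for pattern, alarm_type in RULES:
        if pattern.search(message):
            return alarm_type
    return default


print(classify_alarm("CPU usage above 95% on vm-01"))
print(classify_alarm("Request to pay-api timed out"))
```

Whatever classifier is used, the downstream steps only need this interface: an alarm instance goes in and one of a small, fixed set of readable type names comes out.
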
After the alarm classification model has been trained, a causal discovery algorithm is used to mine the causal relationships among the abstract alarm classes. First, a causal discovery sample M is constructed from the training data and the alarm classification model. The causal discovery sample is a two-dimensional matrix in which each column represents an alarm class and each row represents a time window, and M[i, j] is the number of alarm instances of the j-th abstract alarm type in the i-th time window. Because alarm data is time series data, the causal discovery sample can be constructed directly from the training data and the alarm classification model. A causal mining algorithm is then applied to mine the causal relationships among the alarm types. The output is a set of edge triples with the data structure (A, B, F), where A and B are abstract alarm types, the direction of the edge is from A to B, and F is the weight of the edge, a value between 0 and 1 that represents the strength of the edge (that is, the strength of the causal relationship). The specific causal mining algorithm is not limited, provided it can mine causal relationships effectively; representative algorithms include the PC algorithm, the PCMCI algorithm, clustering algorithms and the like, as well as ensembles of these algorithms. Finally, all alarm classes produced by the alarm classification model and all causal relationships obtained by causal mining are inserted into the graph database to obtain the alarm knowledge graph.
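The causal discovery sample M described above can be built with a single pass over the classified alarms. The sketch below assumes fixed-width, non-overlapping time windows and alarms that have already been mapped to abstract types; the function name, the window length and the sample data are illustrative assumptions. The causal mining step itself (PC, PCMCI, etc.) is not shown.

```python
from datetime import datetime, timedelta


def build_causal_sample(alarms, alarm_types, start, window, n_windows):
    """Count alarms per (time window, abstract alarm type).

    alarms: iterable of (timestamp, abstract_type) pairs.
    Returns M as a list of rows, where M[i][j] is the number of alarms of
    alarm_types[j] that fall into the i-th window after `start`.
    """
    column = {t: j for j, t in enumerate(alarm_types)}
    M = [[0] * len(alarm_types) for _ in range(n_windows)]
    for ts, alarm_type in alarms:
        i = (ts - start) // window  # floor division of timedeltas gives an int
        if 0 <= i < n_windows and alarm_type in column:
            M[i][column[alarm_type]] += 1
    return M


# Made-up training data: four alarms over fifteen minutes.
start = datetime(2022, 7, 12, 0, 0)
window = timedelta(minutes=5)
alarm_types = ["CPU alarm", "Timeout alarm"]
alarms = [
    (start + timedelta(minutes=1), "CPU alarm"),
    (start + timedelta(minutes=2), "CPU alarm"),
    (start + timedelta(minutes=6), "Timeout alarm"),
    (start + timedelta(minutes=11), "CPU alarm"),
]
M = build_causal_sample(alarms, alarm_types, start, window, n_windows=3)
print(M)  # [[2, 0], [0, 1], [1, 0]]
```

Each column of M is a count time series for one abstract alarm type, which is the form that time-series causal discovery algorithms typically take as input.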

Online root cause positioning part. The root cause positioning process can be triggered automatically when a system fault occurs; the specific triggering mechanism is outside the scope of this description. The input of this part is the alarm data of the fault period. First, the alarms are subjected to alarm noise reduction clustering using the software and hardware knowledge graph, the alarm knowledge graph, and the alarm classification model obtained by the offline training part. The main principle is to calculate a weighted distance between alarms over the time dimension, the topology dimension (calculated from the software and hardware knowledge graph), and the description text dimension (calculated from the alarm knowledge graph); the smaller the distance, the more likely it is that two alarms belong to the same fault. A clustering algorithm (e.g. the DBSCAN algorithm) then clusters the alarms by this weighted distance, so that alarms caused by the same fault are aggregated into the same cluster and noise alarms unrelated to the fault are removed. Each cluster is called a fault set and consists of a plurality of alarms.
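The following sketch shows one way the weighted distance and the clustering could fit together. The original text does not give the distance formulas, so every detail here is an assumption: the three component distances, the weights, the scaling constants, the precomputed `hops` table and the `causal` table are all hypothetical. A small pure-Python DBSCAN is included so that the example runs without third-party libraries.

```python
def weighted_distance(a, b, hops, causal, w=(0.4, 0.4, 0.2),
                      time_scale=300.0, max_hops=4):
    """Combine time, topology and alarm-type distances, each scaled to [0, 1]."""
    d_time = min(abs(a["time"] - b["time"]) / time_scale, 1.0)
    if a["node"] == b["node"]:
        h = 0
    else:  # hop count in the software/hardware graph, capped when unknown
        h = hops.get(frozenset((a["node"], b["node"])), max_hops)
    d_topo = min(h / max_hops, 1.0)
    if a["type"] == b["type"]:
        d_type = 0.0
    else:  # causally related alarm types are treated as close
        strength = max(causal.get((a["type"], b["type"]), 0.0),
                       causal.get((b["type"], a["type"]), 0.0))
        d_type = 1.0 - strength
    return w[0] * d_time + w[1] * d_topo + w[2] * d_type


def dbscan(points, dist, eps, min_pts):
    """Minimal DBSCAN; returns one label per point, -1 meaning noise."""
    n = len(points)
    neighbors = [[j for j in range(n) if dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # provisional noise, may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])
    return labels


# Made-up inputs: alarm times are in seconds from the start of the fault period.
hops = {frozenset(("order-api", "vm-01")): 1}
causal = {("CPU alarm", "Timeout alarm"): 0.8}
alarms = [
    {"time": 0, "node": "vm-01", "type": "CPU alarm"},
    {"time": 30, "node": "order-api", "type": "Timeout alarm"},
    {"time": 45, "node": "order-api", "type": "Timeout alarm"},
    {"time": 4000, "node": "db-09", "type": "Disk alarm"},
]
labels = dbscan(alarms, lambda a, b: weighted_distance(a, b, hops, causal),
                eps=0.3, min_pts=2)
print(labels)  # [0, 0, 0, -1]
```

The first three alarms form one fault set, while the late, unrelated disk alarm is labelled as noise and dropped.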

Then, performing a root cause positioning process on each fault set:

(1) Constructing an initial fault subgraph. Through the association between the alarms and the nodes of the software and hardware knowledge graph, the graph nodes and edges associated with the alarms in the fault set are obtained as the initial fault subgraph. The initial fault subgraph is a subgraph of the software and hardware knowledge graph. The alarms are then mapped as node attributes onto the corresponding fault subgraph nodes. Next, the initial root cause scores and the edge weights of the fault subgraph are calculated from the alarm knowledge graph, the edge relations of the fault subgraph, and the alarm information on the nodes (including the alarm time information). Specifically, the earlier the alarms associated with a node occur within the fault set, the higher the root cause score of the node. For two nodes connected by an edge, for example A and B with the edge pointing from A to B, if the alarm type on node A has a causal relationship with the alarm type on node B in the alarm knowledge graph, the root cause score of node A is increased, the root cause score of node B is decreased, and the weight of the edge is increased. Finally, the initial root cause scores of all the nodes and the edge weights are normalized, which completes the construction of the initial fault subgraph.
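The scoring rules in step (1) are stated only qualitatively, so the sketch below fills in concrete numbers purely for illustration. The linear time scoring, the `boost` factor, the score floor and the normalization by sum are all assumptions, as are the function and variable names.

```python
def build_initial_fault_subgraph(node_alarms, edges, causal, boost=0.5):
    """Compute normalized initial root cause scores and edge weights.

    node_alarms: {node: [(alarm_time, abstract_type), ...]}
    edges:       [(A, B), ...] directed edges of the fault subgraph
    causal:      {(type_A, type_B): strength in [0, 1]} from the alarm graph
    """
    first = {n: min(t for t, _ in alarms) for n, alarms in node_alarms.items()}
    t_min, t_max = min(first.values()), max(first.values())
    span = (t_max - t_min) or 1.0
    # Earlier first alarm -> higher initial score, between 0.1 and 1.0.
    score = {n: 1.0 - 0.9 * (t - t_min) / span for n, t in first.items()}
    weight = {}
    for a, b in edges:
        if a not in node_alarms or b not in node_alarms:
            continue
        # Strongest causal link from any alarm type on A to any type on B.
        strength = max((causal.get((ta, tb), 0.0)
                        for _, ta in node_alarms[a]
                        for _, tb in node_alarms[b]), default=0.0)
        weight[(a, b)] = 1.0 + strength          # raise the edge weight
        score[a] += boost * strength             # raise the cause side
        score[b] = max(score[b] - boost * strength, 0.01)  # lower the effect side
    s_total = sum(score.values())
    w_total = sum(weight.values()) or 1.0
    score = {n: s / s_total for n, s in score.items()}
    weight = {e: w / w_total for e, w in weight.items()}
    return score, weight


# Made-up fault set: three nodes, alarm times in seconds.
node_alarms = {
    "vm-01": [(0, "CPU alarm")],
    "order-api": [(30, "Timeout alarm")],
    "pay-api": [(60, "Timeout alarm")],
}
edges = [("vm-01", "order-api"), ("order-api", "pay-api")]
causal = {("CPU alarm", "Timeout alarm"): 0.8}
score, weight = build_initial_fault_subgraph(node_alarms, edges, causal)
print(max(score, key=score.get))  # vm-01
```

With these inputs the node that alarmed first and whose alarm type causally precedes its neighbour's ends up with the highest initial score, and the edge backed by a causal relationship carries the larger weight.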

(2) Root cause score inference. The input is the initial fault subgraph. The final root cause score of each node of the fault subgraph is calculated using a graph-based root cause scoring algorithm, which yields the fault subgraph, and the node with the highest root cause score is taken as the root cause node. Graph-based root cause scoring algorithms include, but are not limited to, the PageRank algorithm, the Personalized PageRank algorithm and the like, provided that the input is a fault subgraph and the output is a root cause score for each node of the fault subgraph, where the root cause score is a quantitative index of how likely the node is to be the root cause node. A PageRank-based root cause scoring algorithm takes the PR value of each node output by the PageRank algorithm as the root cause score of the node.
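A minimal Personalized PageRank by power iteration is sketched below. Two choices are assumptions that the original text does not specify: the initial root cause scores are used as the personalization (restart) vector, and the walk runs along reversed edges so that score flows from affected nodes back toward their causes. The function name and the sample values are hypothetical.

```python
def personalized_pagerank(nodes, weights, personalization,
                          damping=0.85, iters=100, tol=1e-10):
    """Power-iteration Personalized PageRank on a weighted directed graph.

    weights:         {(src, dst): w}, the walker moves from src to dst
    personalization: {node: p}, non-negative and summing to 1
    """
    out = {n: [] for n in nodes}
    for (src, dst), w in weights.items():
        out[src].append((dst, w))
    rank = dict(personalization)
    for _ in range(iters):
        new = {n: (1.0 - damping) * personalization[n] for n in nodes}
        for n in nodes:
            total = sum(w for _, w in out[n])
            if total == 0.0:
                # Dangling node: spread its mass by the personalization vector.
                for m in nodes:
                    new[m] += damping * rank[n] * personalization[m]
            else:
                for m, w in out[n]:
                    new[m] += damping * rank[n] * w / total
        delta = sum(abs(new[n] - rank[n]) for n in nodes)
        rank = new
        if delta < tol:
            break
    return rank


# Made-up fault subgraph with edges pointing from cause to effect.
nodes = ["vm-01", "order-api", "pay-api"]
initial_score = {"vm-01": 0.8, "order-api": 0.1, "pay-api": 0.1}
fault_edges = {("vm-01", "order-api"): 0.64, ("order-api", "pay-api"): 0.36}
# Assumption: walk against the edge direction, from effect back to cause.
reversed_edges = {(dst, src): w for (src, dst), w in fault_edges.items()}
rank = personalized_pagerank(nodes, reversed_edges, initial_score)
root = max(rank, key=rank.get)
print(root)  # vm-01
```

The node with the highest PR value is then reported as the root cause node; any other graph scoring algorithm with the same input and output shape could be substituted.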

(3) Root cause link mining. The input is the fault subgraph and the root cause node. Candidate links are mined first: with the root cause node as the starting node, a graph algorithm similar to depth-first search mines all links starting from the root cause node as candidate root cause links. A ranking index is then calculated for each candidate link (the sum of the root cause scores of the nodes on the link plus the sum of the weights of its edges, divided by the number of edges and nodes on the link), the candidate root cause links are ranked by this index, and the top N are retained as the finally output root cause links.
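Step (3) can be sketched as a depth-first enumeration followed by a sort. The sketch assumes that a candidate link is a maximal simple path starting at the root cause node and that "the number of edges and nodes" means their sum; both are readings of the text, not confirmed details. The function name and the sample graph are hypothetical.

```python
def mine_root_cause_links(adjacency, root, score, weight, top_n=3):
    """Enumerate maximal simple paths from `root` and keep the top-N by index."""
    links = []

    def dfs(path):
        extended = False
        for nxt in adjacency.get(path[-1], []):
            if nxt not in path:  # simple paths only, so cycles terminate
                extended = True
                dfs(path + [nxt])
        if not extended and len(path) > 1:
            links.append(path)  # path cannot be extended any further

    def ranking_index(path):
        path_edges = list(zip(path, path[1:]))
        total = (sum(score[n] for n in path)
                 + sum(weight[e] for e in path_edges))
        return total / (len(path) + len(path_edges))

    dfs([root])
    return sorted(links, key=ranking_index, reverse=True)[:top_n]


# Made-up fault subgraph with final root cause scores and edge weights.
adjacency = {"vm-01": ["order-api", "cache"], "order-api": ["pay-api"]}
score = {"vm-01": 0.5, "order-api": 0.3, "pay-api": 0.1, "cache": 0.1}
weight = {
    ("vm-01", "order-api"): 0.5,
    ("order-api", "pay-api"): 0.3,
    ("vm-01", "cache"): 0.2,
}
links = mine_root_cause_links(adjacency, "vm-01", score, weight, top_n=2)
print(links[0])  # ['vm-01', 'order-api', 'pay-api']
```

Here the three-node link has index (0.9 + 0.8) / 5 = 0.34 and the two-node link has index (0.6 + 0.2) / 3, about 0.27, so the longer link is ranked first.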

The output of the entire root cause positioning part is the fault subgraph, the root cause node, and the root cause links.

The present invention and its embodiments have been described above, and the description is not intended to be limiting, and the drawings are only one embodiment of the present invention, and the actual structure is not limited thereto. In summary, those skilled in the art should appreciate that they can readily use the disclosed conception and specific embodiments as a basis for designing or modifying other structures for carrying out the same purposes of the present invention without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A self-adaptive knowledge graph-based automatic root cause positioning method of application faults is characterized by comprising the following steps: the off-line training part uses CMDB data and application calling relation to construct a software and hardware knowledge graph; mining an alarm knowledge graph by using historical alarm or fault alarm data;

2. The adaptive knowledge-graph-based application failure automatic root cause locating method according to claim 1, characterized in that: the input data mined by the off-line training part in the alarm knowledge graph is historical alarm data or historical fault alarm data for a certain time, and the historical alarm data or the historical fault alarm data is called training data.

3. The adaptive knowledge-graph-based application failure automatic root cause locating method according to claim 1, characterized in that: and performing a root cause positioning process on each fault set, wherein the root cause positioning process comprises the following steps:

(1) Constructing an initial fault subgraph: through the association between the alarms and the nodes of the software and hardware knowledge graph, the graph nodes and edges associated with the alarms in the fault set are obtained as the initial fault subgraph; the initial fault subgraph is a subgraph of the software and hardware knowledge graph; the alarms are then mapped as node attributes onto the corresponding fault subgraph nodes, and the initial root cause scores and edge weights of the fault subgraph are calculated from the alarm knowledge graph, the edge relations of the fault subgraph, and the alarm information on the nodes, including alarm time information; specifically, the earlier the alarms associated with a node occur within the fault set, the higher the root cause score of the node; for two nodes connected by an edge, for example A and B with the edge pointing from A to B, if the alarm type on node A has a causal relationship with the alarm type on node B in the alarm knowledge graph, the root cause score of node A is increased, the root cause score of node B is decreased, and the weight of the edge is increased; finally, the initial root cause scores of all the nodes and the edge weights are normalized, which completes the construction of the initial fault subgraph;

4. The adaptive knowledge-graph-based application failure automatic root cause locating method according to claim 3, wherein: the ranking index in the step (3) is the sum of root cause scores of nodes on one link and the weight of the edges divided by the number of the edges and the nodes and serves as the ranking index of the link.

5. The adaptive knowledge-graph-based application failure automatic root cause locating method according to claim 1, characterized in that: the alarm data firstly uses the software and hardware knowledge graph and the alarm knowledge graph obtained by the off-line training part and the alarm classification model to perform alarm noise reduction clustering on the alarms.