CN115514627A

CN115514627A - Fault root cause positioning method and device, electronic equipment and readable storage medium

Info

Publication number: CN115514627A
Application number: CN202211151739.3A
Authority: CN
Inventors: 王雄; 郜振锋
Original assignee: Sangfor Technologies Co Ltd
Current assignee: Sangfor Technologies Co Ltd
Priority date: 2022-09-21
Filing date: 2022-09-21
Publication date: 2022-12-23

Abstract

The invention discloses a fault root cause positioning method, device, and system, and a computer-readable storage medium, applied to the technical field of computers. The method comprises: acquiring a service layer performance index of an abnormal node; establishing a fault propagation graph according to configuration information and abnormal alarm information of the abnormal node, the fault propagation graph comprising fault propagation relationships between components; and carrying out root cause positioning based on an index causal graph, the service layer performance index, and the fault propagation graph to obtain root cause information. The index causal graph is established based on service layer performance indexes of nodes in the cluster and comprises causal relationships among the indexes. By combining the causal relationships among service layer performance indexes, the method helps to accurately locate the root cause of index-level abnormal events and improves the efficiency and accuracy of root cause positioning.

Description

Fault root cause positioning method and device, electronic equipment and readable storage medium

Technical Field

The present invention relates to the field of computer technologies, and in particular, to a method and an apparatus for locating a fault root cause, an electronic device, and a computer-readable storage medium.

Background

Most IT (Information Technology) facilities already collect and monitor software and hardware index data and log data in real time, and push alarms once an abnormality is found. As an IT system grows in size, the relationships among its components become more complex. In actual production operation, once a certain component becomes abnormal, the associated components also fail, and the abnormality propagates rapidly through the system, triggering alarm storms.

Alarm information is delivered to each operation and maintenance group in the form of messages. A large IT system usually involves several different operation and maintenance groups covering services, networks, basic equipment, and the like. The alarm information received by each group is usually limited to individual alarm objects, which makes the fault difficult to locate accurately, and manual fault diagnosis is time-consuming and error-prone.

Therefore, how to provide a fault root cause positioning method, apparatus, electronic device and computer readable storage medium for improving positioning accuracy becomes a problem to be solved by those skilled in the art.

Disclosure of Invention

The embodiment of the invention aims to provide a fault root cause positioning method, a fault root cause positioning device, electronic equipment and a computer readable storage medium, which are beneficial to accurately positioning the root cause of an index level abnormal event in the using process and improve the efficiency and the accuracy of root cause positioning.

In order to solve the above technical problem, an embodiment of the present invention provides a method for locating a fault root cause, including:

determining an abnormal performance index based on the service layer performance index of the abnormal node;

establishing a fault propagation diagram according to the configuration information and the abnormal alarm information of the abnormal node; the fault propagation graph comprises fault propagation relationships between components;

carrying out root cause positioning based on the index cause and effect diagram, the abnormal performance index and the fault propagation diagram to obtain root cause information; the index causal graph is established based on service layer performance indexes of nodes in the cluster, and comprises causal relationships among the indexes.

Optionally, the index causal graph is established based on service layer performance indexes of nodes in a cluster, and includes:

acquiring service layer performance indexes of a plurality of nodes in a cluster;

classifying the service layer performance indexes to obtain a multi-class index set aiming at each node;

aiming at each type of index set, constructing a corresponding sub-causal graph according to the index set by adopting a pre-established causal analysis model;

combining the sub-causal graphs of the nodes belonging to the same index class to obtain a causal graph corresponding to each index class;

and constructing an index causal graph according to each causal graph.

Optionally, the causal analysis model is established based on multiple linear non-gaussian causal Lingam algorithms;

the method for constructing the corresponding sub-causal graph according to the index set by adopting the pre-established causal analysis model comprises the following steps:

constructing a corresponding sub-causal graph based on the index set by adopting each Lingam algorithm;

and combining the sub-causal graphs corresponding to each Lingam algorithm to obtain the final sub-causal graph.

Optionally, the merging the sub-causal graphs corresponding to each Lingam algorithm to obtain a final sub-causal graph includes:

and combining the sub-causal graphs corresponding to each Lingam algorithm, and pruning the causal chain of each sub-causal graph in the combination process to obtain the final sub-causal graph.

Optionally, the pruning the causal chain of each of the sub-causal graphs includes:

pruning the causal chain of each sub-causal graph by adopting a frequent item mining method and maximum entropy thresholding.

Optionally, the establishing a fault propagation map according to the configuration information and the abnormal alarm information of the abnormal node includes:

establishing a resource configuration diagram according to the configuration information of the abnormal node;

constructing a network service call graph according to the request and call relation among services and the resource configuration graph;

and constructing a fault propagation diagram based on the abnormal alarm information and the network service call diagram.

Optionally, the constructing a fault propagation diagram based on the abnormal alarm information and the network service call diagram includes:

performing alarm suppression on the abnormal alarm information;

checking the abnormal alarm information after alarm inhibition by adopting the work order abnormal event corresponding to the abnormal node to determine an effective abnormal alarm;

and constructing a fault propagation diagram based on the effective abnormal alarm and the network service call diagram.

Optionally, the root cause information includes: root cause indicators and root cause components;

the method for carrying out root cause positioning based on the pre-established index cause and effect diagram, the service layer performance index and the fault propagation diagram to obtain root cause information comprises the following steps:

carrying out fault propagation analysis on the fault propagation diagram to determine a root cause component;

screening out the performance indexes corresponding to the root cause component from the service layer performance indexes;

carrying out abnormality detection on the performance indexes to obtain abnormal indexes;

and carrying out abnormality propagation analysis based on the abnormal indexes and the index causal graph to determine a root cause index.

Optionally, the method further includes:

and obtaining a root cause link based on the root cause index and the root cause component.

The embodiment of the present invention further provides a fault root cause positioning apparatus, including:

the identification module is used for determining abnormal performance indexes based on the service layer performance indexes of the abnormal nodes;

the establishing module is used for establishing a fault propagation diagram according to the configuration information and the abnormal alarm information of the abnormal node; the fault propagation graph comprises fault propagation relationships between components;

the positioning module is used for carrying out root cause positioning on the basis of the index cause and effect diagram, the abnormal performance index and the fault propagation diagram to obtain root cause information; the index causal graph is established based on service layer performance indexes of nodes in the cluster, and comprises causal relationships among the indexes.

The embodiment of the present invention further provides a system for locating a fault root cause, including:

a memory for storing a computer program;

and the processor is used for realizing the steps of the fault root cause positioning method when the computer program is executed.

An embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the fault root cause location method are implemented as described above.

The embodiment of the invention provides a fault root cause positioning method, a device and a system and a computer readable storage medium, wherein the method comprises the following steps: acquiring a service layer performance index of an abnormal node; establishing a fault propagation diagram according to configuration information and abnormal alarm information of the abnormal node; the fault propagation graph comprises fault propagation relationships between components; carrying out root cause positioning based on the index cause and effect diagram, the service layer performance index and the fault propagation diagram to obtain root cause information; the index causal graph is established based on service layer performance indexes of nodes in the cluster, and comprises causal relations among the indexes.

It can be seen that, in the embodiments of the present invention, an index causal graph that includes causal relationships between indexes is established in advance based on service layer performance indexes of nodes in a cluster. When an abnormal node exists in the cluster, the service layer performance index of the abnormal node is obtained, and a fault propagation graph of the abnormal node, which includes fault propagation relationships between components, is established according to the configuration information and abnormal alarm information of the abnormal node. Root cause positioning is then performed based on the index causal graph, the service layer performance index of the abnormal node, and the fault propagation graph to obtain root cause information.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the prior art and the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without creative efforts.

Fig. 1 is a schematic view of an application scenario of a fault root cause positioning method according to an embodiment of the present invention;

Fig. 2 is a schematic flow chart of a fault root cause positioning method according to an embodiment of the present invention;

Fig. 3 is a schematic diagram illustrating a process for establishing an index causal graph according to an embodiment of the present invention;

Fig. 4 is a schematic diagram of a root cause positioning process according to an embodiment of the present invention;

Fig. 5 is a schematic diagram of a root cause link according to an embodiment of the present invention;

Fig. 6 is a schematic structural diagram of a fault root cause positioning device according to an embodiment of the present invention;

Fig. 7 is a schematic structural diagram of a fault root cause positioning system according to an embodiment of the present invention.

Detailed Description

The embodiment of the invention provides a fault root cause positioning method and device, electronic equipment and a computer readable storage medium, which are beneficial to accurately positioning the root cause of an index level abnormal event in the using process and improving the root cause positioning efficiency and accuracy.

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that, in the prior art, the software and hardware index data and operation log data of IT facilities are already collected and monitored in real time, and an alarm is pushed once an abnormality is found. Because an IT system is large and has many components, when one component becomes abnormal, other components related to it may also become abnormal. At present, alarm information is pushed to different operation and maintenance groups, and each group performs fault diagnosis based only on the alarm information it receives, so the root cause of a fault is difficult to locate accurately. In view of this, the present invention provides a method, an apparatus, a system, and a computer-readable storage medium for locating the root cause of a fault. For a given cluster, an index causal graph may be established in advance based on service layer performance indexes of nodes in the cluster, where the index causal graph includes causal relationships between indexes. When the IT system is abnormal, an abnormal node is determined, the service layer performance index of the abnormal node is obtained, and an abnormal performance index is determined by monitoring the service layer performance index. Configuration information and abnormal alarm information of the abnormal node are also obtained, and a fault propagation graph, which includes fault propagation relationships between components, is established based on them. Root cause positioning is then performed based on the fault propagation graph, the pre-established index causal graph, and the abnormal performance index of the abnormal node to obtain root cause information. For a specific application scenario, refer to Fig. 1.

A method for locating a root cause of a fault provided in an embodiment of the present application is described in detail below. Referring to fig. 2, fig. 2 is a schematic flowchart illustrating a method for locating a fault root cause according to an embodiment of the present invention. The method includes S110 to S130.

It should be noted that, in practical applications, for a cluster corresponding to a certain IT system, an index causal graph may be established based on service layer performance indexes of one or more nodes in the cluster, and the established index causal graph includes causal relationships between indexes. Here, a service layer performance index refers to a historical service layer performance index of a node: it may be the service layer performance index of the node at a certain historical time, or the average of the node's service layer performance indexes over the times in a certain historical period. Which way is used to obtain the service layer performance indexes for establishing the index causal graph may be determined according to actual needs.

S110: acquiring a service layer performance index of an abnormal node;

Specifically, when an abnormality of the cluster is detected, the abnormal node may be further determined, and root cause positioning is then performed for that node. To this end, the service layer performance index of the abnormal node is obtained, for example the service layer performance index of the abnormal node within a preset time before the moment at which the cluster abnormality was detected.

S120: establishing a fault propagation diagram according to configuration information and abnormal alarm information of the abnormal node; the fault propagation graph comprises fault propagation relationships between components;

It should be noted that alarm data is collected when the cluster is abnormal. After the abnormal node is determined, its configuration information and abnormal alarm information can therefore be obtained, and a fault propagation graph corresponding to the abnormal node, that is, a topological link graph between devices and components, can be established from them. This forms a closed fault propagation loop, and the fault propagation graph clearly shows the fault propagation relationships among the components.

S130: carrying out root cause positioning based on a pre-established index cause and effect diagram, service layer performance indexes and a fault propagation diagram to obtain root cause information; the index causal graph is established based on service layer performance indexes of nodes in the cluster, and comprises causal relationships among the indexes.

It can be understood that the abnormal performance indexes can be determined from the service layer performance indexes of the abnormal node. By performing root cause positioning analysis on the fault propagation graph, a component with a high degree of abnormality can be determined and taken as the root cause component. The abnormal indexes corresponding to the root cause component are then selected from the abnormal performance indexes, and the degree of abnormality of each is determined. Next, by performing root cause positioning analysis on the index causal graph and these abnormal indexes, the abnormal index with a large degree of influence on the root cause component can be determined and taken as the root cause index. The root cause information obtained in this way may therefore include the root cause component and the root cause index.
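The two-stage positioning described above can be sketched as follows. This is only an illustration under stated assumptions: the source does not specify the ranking rule, so the sketch assumes the root cause component is the one with the highest abnormality score, and that a root cause index is an abnormal index with no abnormal cause in the index causal graph. The function names, component names, and index names are hypothetical.

```python
def locate_root_component(component_scores):
    """Pick the component with the highest abnormality score (assumed rule)."""
    return max(component_scores, key=component_scores.get)


def locate_root_indexes(causal_edges, anomaly_scores):
    """Return abnormal indexes that have no abnormal cause, most abnormal first.

    causal_edges: {(cause, effect): weight} taken from the index causal graph.
    anomaly_scores: {index: score} for the abnormal indexes of the root cause component.
    """
    abnormal = set(anomaly_scores)
    # An abnormal index is "explained" if one of its causes is also abnormal.
    explained = {
        effect
        for (cause, effect) in causal_edges
        if cause in abnormal and effect in abnormal
    }
    roots = [index for index in abnormal if index not in explained]
    return sorted(roots, key=lambda index: (-anomaly_scores[index], index))


# Hypothetical inputs for illustration only.
component_scores = {"host-1": 0.9, "vm-3": 0.4}
causal_edges = {("disk_io", "cpu_wait"): 0.7, ("cpu_wait", "latency"): 0.6}
anomaly_scores = {"disk_io": 0.8, "cpu_wait": 0.7, "latency": 0.9}

root_component = locate_root_component(component_scores)
root_indexes = locate_root_indexes(causal_edges, anomaly_scores)
```

In this toy input, `latency` has the highest abnormality score but is explained by `cpu_wait`, which in turn is explained by `disk_io`, so only `disk_io` remains as a candidate root cause index.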

According to the technical scheme, the index causal graph is established in advance based on the service layer performance indexes of the nodes in the cluster, the index causal graph comprises causal relationships among indexes, the service layer performance indexes of the abnormal nodes are obtained under the condition that the abnormal nodes exist in the cluster, then the fault propagation graph of the abnormal nodes is established according to the configuration information and the abnormal alarm information of the abnormal nodes, the fault propagation graph comprises the fault propagation relationships among components, and then root cause positioning is further carried out based on the index causal graph, the service layer performance indexes of the abnormal nodes and the fault propagation graph, and root cause information is obtained.

With respect to the above embodiments, the embodiments of the present invention further describe and optimize the technical solutions, which are specifically as follows:

referring to fig. 3, in the foregoing embodiment, the process of establishing an index causal graph in advance based on service layer performance indexes of nodes in a cluster may specifically include:

S210: acquiring service layer performance indexes of a plurality of nodes in a cluster;

It should be noted that, in the embodiment of the present invention, the plurality of nodes may be all nodes in the cluster or only some of them, and the embodiment of the present invention is not particularly limited. In practical applications, the service layer performance indexes of the plurality of nodes at a certain historical time may be used. Alternatively, the service layer performance indexes of each node at the various times in a certain historical period may be obtained, and the average value of each service layer performance index of the node is calculated over those times, so as to obtain the respective service layer performance indexes of the plurality of nodes.
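The averaging option mentioned above can be sketched as below. The sample layout `(node, index name, value)` and the index names are assumptions made for illustration; the source does not define a data format.

```python
from collections import defaultdict
from statistics import fmean


def average_indexes(samples):
    """Average each index of each node over a historical period.

    samples: iterable of (node, index_name, value) tuples (hypothetical layout).
    Returns {node: {index_name: mean value}}.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for node, index_name, value in samples:
        grouped[node][index_name].append(value)
    return {
        node: {name: fmean(values) for name, values in indexes.items()}
        for node, indexes in grouped.items()
    }


# Hypothetical history for two nodes.
history = [
    ("node1", "host_cpu_usage", 0.2),
    ("node1", "host_cpu_usage", 0.4),
    ("node1", "host_mem_usage", 0.5),
    ("node2", "host_cpu_usage", 0.9),
]
averages = average_indexes(history)
```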

S220: classifying the service layer performance indexes to obtain a multi-class index set aiming at each node;

Specifically, taking one node as an example: after the service layer performance indexes of the node are obtained, they may be classified according to expert prior knowledge to obtain multiple classes of index sets, for example index sets for the Central Processing Unit (CPU), memory, network, and disk. The CPU index set may include, but is not limited to, host CPU, virtual machine cache, and host global data; the memory index set includes, but is not limited to, host memory data and virtual machine memory data; the network index set includes, but is not limited to, the host network; and the disk index set includes, but is not limited to, the host disk. Of course, in practical applications, the classes into which the service layer indexes are divided may be determined according to the actual situation, and the embodiment of the present invention is not particularly limited.
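A minimal sketch of such a classification is shown below. The source relies on expert prior knowledge; here that knowledge is stood in for by a hypothetical table of name prefixes, and the index names are invented for illustration.

```python
# Hypothetical prefix table standing in for expert prior knowledge.
CATEGORY_PREFIXES = {
    "cpu": ("host_cpu", "vm_cache", "host_global"),
    "memory": ("host_mem", "vm_mem"),
    "network": ("host_net",),
    "disk": ("host_disk",),
}


def classify_indexes(index_names):
    """Group index names into per-class index sets by name prefix."""
    index_sets = {category: [] for category in CATEGORY_PREFIXES}
    index_sets["other"] = []
    for name in index_names:
        for category, prefixes in CATEGORY_PREFIXES.items():
            if name.startswith(prefixes):
                index_sets[category].append(name)
                break
        else:
            # No prefix matched: keep the index in a catch-all set.
            index_sets["other"].append(name)
    return index_sets


index_sets = classify_indexes(
    ["host_cpu_usage", "vm_mem_used", "host_net_rx", "host_disk_io", "uptime"]
)
```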

S230: aiming at each type of index set, constructing a corresponding sub-causal graph according to the index set by adopting a pre-established causal analysis model;

Specifically, the construction of a sub-causal graph is described in detail by taking one class of index set as an example. In practical applications, a causal analysis model is established in advance; the index set is then used as the input of the causal analysis model, which analyzes the index set, constructs the corresponding sub-causal graph, and outputs it. In the constructed sub-causal graph, the nodes are indexes and the edges are causal chains: a causal relationship exists between the nodes connected by a causal chain, each causal chain has a corresponding edge weight, and each node has a corresponding node weight.
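One possible in-memory representation of such a sub-causal graph is sketched below. The class name, method names, and index names are hypothetical; the source only states that nodes and causal chains carry weights.

```python
from dataclasses import dataclass, field


@dataclass
class SubCausalGraph:
    """Indexes as nodes, causal chains as weighted directed edges."""

    node_weights: dict = field(default_factory=dict)  # index -> node weight
    edge_weights: dict = field(default_factory=dict)  # (cause, effect) -> edge weight

    def add_chain(self, cause, effect, weight):
        """Add a causal chain cause -> effect with the given edge weight."""
        self.node_weights.setdefault(cause, 0.0)
        self.node_weights.setdefault(effect, 0.0)
        self.edge_weights[(cause, effect)] = weight

    def causes_of(self, index):
        """Return the direct causes of an index, sorted by name."""
        return sorted(c for (c, e) in self.edge_weights if e == index)


graph = SubCausalGraph()
graph.add_chain("host_cpu_usage", "vm_cpu_ready", 0.8)
graph.add_chain("host_cpu_steal", "vm_cpu_ready", 0.3)
graph.node_weights["host_cpu_usage"] = 1.0
```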

S240: combining the sub-causal graphs of each node belonging to the same index type to obtain a causal graph corresponding to each index type;

Specifically, after the sub-causal graphs of the various index classes of each node are obtained, for each index class the sub-causal graphs of that class in all nodes are merged to obtain the causal graph corresponding to that class. For example, for index class M, the sub-causal graph of class M in node 1 is M1, in node 2 is M2, in node 3 is M3, and in node n is Mn. All sub-causal graphs corresponding to class M, that is, M1, M2, M3, ..., Mn, are merged to obtain the causal graph corresponding to class M. The causal graph corresponding to each index class can be obtained in the same way.
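The source does not state how edge weights are combined when the per-node sub-causal graphs of one index class are merged. The sketch below assumes a simple rule, the union of all causal chains with the mean weight per chain; both the rule and the names are illustrative assumptions.

```python
from collections import defaultdict


def merge_subgraphs(subgraphs):
    """Merge per-node sub-causal graphs of one index class.

    subgraphs: list of {(cause, effect): weight} dictionaries, one per node.
    Returns the union of all causal chains with the mean weight per chain
    (an assumed merge rule, not taken from the source).
    """
    collected = defaultdict(list)
    for graph in subgraphs:
        for chain, weight in graph.items():
            collected[chain].append(weight)
    return {chain: sum(ws) / len(ws) for chain, ws in collected.items()}


# Hypothetical sub-causal graphs M1 and M2 for one index class.
m1 = {("cpu", "latency"): 0.8, ("mem", "latency"): 0.2}
m2 = {("cpu", "latency"): 0.6}
merged = merge_subgraphs([m1, m2])
```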

S250: and constructing an index causal graph according to each causal graph.

Specifically, the final index causal graph can be obtained by integrating the causal graphs corresponding to each type of index obtained above.

Further, the causal analysis model in S230 is established based on a plurality of linear non-gaussian causal Lingam algorithms;

It should be noted that the basic form of the linear non-Gaussian causal Lingam model is as follows:

$$x_i = \sum_{k(j) < k(i)} b_{ij} x_j + e_i$$

where $x_i$ represents the observed performance data of the $i$-th index; $b_{ij}$ represents the connection strength between the variables $x_j$ and $x_i$ in the directed acyclic graph; $x_j$ represents the performance data of the $j$-th index; $e_i$ represents the noise variable, which follows a non-Gaussian distribution with non-zero variance; $k(i)$ represents the sequence number of the effect variable; and $k(j)$ represents the sequence number of the cause variable.
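To make the model concrete, the sketch below generates data from a linear non-Gaussian acyclic model in which each variable is a weighted sum of the variables earlier in the causal order plus independent non-Gaussian noise. It illustrates the data-generating model only, not the estimation algorithms; the uniform noise, the coefficient matrix, and the function name are assumptions.

```python
import random


def simulate_lingam(b, order, n_samples, seed=0):
    """Draw samples from x_i = sum_{k(j) < k(i)} b[i][j] * x_j + e_i.

    b: square coefficient matrix, b[i][j] is the strength of x_j acting on x_i.
    order: variable indices listed in causal order (causes first).
    Noise is uniform on [-1, 1], i.e. non-Gaussian with non-zero variance.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n_samples):
        x = [0.0] * len(order)
        for position, i in enumerate(order):
            noise = rng.uniform(-1.0, 1.0)
            # Only variables earlier in the causal order may act as causes.
            x[i] = sum(b[i][j] * x[j] for j in order[:position]) + noise
        data.append(x)
    return data


# Hypothetical three-index chain: x0 -> x1 -> x2.
b = [
    [0.0, 0.0, 0.0],
    [0.8, 0.0, 0.0],
    [0.0, -0.5, 0.0],
]
data = simulate_lingam(b, order=[0, 1, 2], n_samples=200)
```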

Specifically, in practical applications, the causal analysis model can be established based on the ICALingam algorithm, the DirectLingam algorithm, and the NotearsMLP algorithm. The ICALingam algorithm learns the causal order by Independent Component Analysis (ICA) and thereby obtains a causal network. However, the observed index data may have hidden variables, that is, a root cause may exist that is not in the index set, so approximate exogenous variables need to be found. The basic strategy of the DirectLingam algorithm in constructing a causal network is to select exogenous variables layer by layer and then construct the entire network, which finds exogenous variables effectively. However, a causal analysis model established only on the DirectLingam algorithm can select exogenous variables well only below 25 dimensions, whereas the index data involved in the embodiment of the present invention is high-dimensional, usually above 1000 dimensions. The DirectLingam algorithm would then make errors when selecting the exogenous variables of the first layer, and these errors cascade, so that the estimation error of the entire network grows with the number of layers. Therefore, to further improve the accuracy of the constructed sub-causal graph, the causal analysis model is also built with the NotearsMLP algorithm, which reduces the propagation of errors through differentiable, network-based modeling and addresses the inference of high-dimensional causal graphs.

Then, the process of constructing the corresponding sub-causal graph according to the index set by using the pre-established causal analysis model may specifically include:

Specifically, for the index set, each Lingam algorithm can be used to construct a sub-causal graph based on the index set. Each sub-causal graph comprises a plurality of causal chains, each causal chain has a corresponding edge weight, and each node has a corresponding node weight. The sub-causal graphs are then merged to obtain the final sub-causal graph, which improves the accuracy of the constructed sub-causal graph and facilitates accurate positioning of the root cause in the subsequent process. For example, for index set A, the ICALingam algorithm is used to construct sub-causal graph A1, the DirectLingam algorithm is used to construct sub-causal graph A2, and the NotearsMLP algorithm is used to construct sub-causal graph A3; sub-causal graphs A1, A2, and A3 are then merged to obtain the final sub-causal graph A corresponding to index set A.

As can be seen from the above, each type of index set of each node corresponds to one final sub-causal graph, and each final sub-causal graph is combined by the sub-causal graphs corresponding to the different algorithms in the above method.

For example, the service layer performance indexes of node 1 are classified to obtain multiple classes of index sets, namely index set 1A, index set 1B, index set 1C, and index set 1D. The above method is applied to index set 1A to obtain the final sub-causal graph 1A, to index set 1B to obtain the final sub-causal graph 1B, to index set 1C to obtain the final sub-causal graph 1C, and to index set 1D to obtain the final sub-causal graph 1D. Likewise, for index sets nA, nB, nC, and nD of node n, the final sub-causal graphs constructed by the method are sub-causal graphs nA, nB, nC, and nD respectively.

Further, the process of combining the sub-cause-and-effect graphs corresponding to each Lingam algorithm to obtain a final sub-cause-and-effect graph may specifically include:

and combining the sub-causal graphs corresponding to each Lingam algorithm, and pruning the causal chain of each sub-causal graph in the combining process to obtain the final sub-causal graph.

It should be noted that in practical applications, the logic rationality of each cause and effect chain in each sub-cause and effect diagram can be confirmed by expert experience, and since some invalid redundant cause and effect chains exist in the sub-cause and effect diagrams, some invalid redundant cause and effect chains need to be pruned in the process of merging the sub-cause and effect diagrams corresponding to a certain index type, so as to improve the quality of the merged final sub-cause and effect diagram.

Further, the pruning process for the causal chain of each sub-causal graph may specifically include:

and pruning the causal chain of each sub-causal graph by adopting a frequent item mining method and maximum entropy thresholding.

Specifically, in the process of merging the sub-causal graphs, a frequent item mining method and maximum entropy thresholding can be adopted to prune the causal chains of each sub-causal graph. The frequent item mining method analyzes the causal chains in each sub-causal graph based on the FP-growth algorithm and selects the most common causal chain combinations; by constructing an information compression tree, it ensures the frequent pattern order and the stability of the results for the causal chains. The maximum entropy thresholding is a secondary pruning that thresholds the connection strength $b_{ij}$ of the causal matrix. Because the indexes still differ in dimension, OTSU is adopted for threshold segmentation: an optimal threshold is generated automatically from the distribution of connection strengths so that the inter-class variance is maximized, and the average value of a plurality of causal chain thresholds is taken as the final causal chain weight.
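The two pruning passes can be sketched as follows. This is a simplification under stated assumptions: instead of FP-growth, the first pass merely counts in how many sub-causal graphs each causal chain occurs and keeps chains that reach a minimum support; the second pass applies an Otsu-style threshold, chosen to maximize the inter-class variance, to the mean connection strengths. Taking the mean strength as the final weight is one reading of the text, and all names and values are hypothetical.

```python
from collections import Counter


def frequent_chains(graphs, min_support):
    """Keep chains occurring in at least min_support graphs (not FP-growth)."""
    counts = Counter(chain for graph in graphs for chain in graph)
    return {chain for chain, count in counts.items() if count >= min_support}


def otsu_threshold(values):
    """Return the split point that maximizes the inter-class variance."""
    vals = sorted(values)
    n, total = len(vals), sum(vals)
    best_t, best_var, cumulative = vals[0], -1.0, 0.0
    for k in range(1, n):
        cumulative += vals[k - 1]
        if vals[k] == vals[k - 1]:
            continue  # no split point between equal values
        w0, w1 = k / n, (n - k) / n
        mu0, mu1 = cumulative / k, (total - cumulative) / (n - k)
        variance = w0 * w1 * (mu0 - mu1) ** 2
        if variance > best_var:
            best_var, best_t = variance, (vals[k - 1] + vals[k]) / 2
    return best_t


def prune(graphs, min_support):
    """Apply both passes; return {chain: mean strength} for surviving chains."""
    kept = frequent_chains(graphs, min_support)
    strength = {}
    for chain in kept:
        weights = [abs(g[chain]) for g in graphs if chain in g]
        strength[chain] = sum(weights) / len(weights)
    threshold = otsu_threshold(list(strength.values()))
    return {c: s for c, s in strength.items() if s > threshold}


# Hypothetical sub-causal graphs produced by three algorithms.
g1 = {("cpu", "latency"): 0.9, ("mem", "latency"): 0.05, ("disk", "cpu"): 0.7}
g2 = {("cpu", "latency"): 0.8, ("mem", "latency"): 0.03, ("net", "cpu"): 0.6}
g3 = {("cpu", "latency"): 0.85, ("disk", "cpu"): 0.75, ("mem", "latency"): 0.04}
pruned = prune([g1, g2, g3], min_support=2)
```

Here the chain `("net", "cpu")` is dropped by the support pass because only one algorithm reports it, and `("mem", "latency")` is dropped by the threshold pass because its connection strength is weak.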

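Further, the process of establishing the fault propagation graph according to the configuration information and the abnormal alarm information of the abnormal node in S120 may specifically include:

establishing a resource configuration graph according to the configuration information of the abnormal node;

constructing a network service call graph according to the request and call relations among the services and the resource configuration graph;

and constructing a fault propagation graph based on the abnormal alarm information and the network service call graph.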

It should be noted that the fault propagation graph is constructed based on the abnormal alarms and the physical topology, and is mainly used to capture fault propagation among the cluster, the hosts, and the virtual machines. When a fault propagation graph is constructed in the embodiment of the present invention, a resource configuration graph is first constructed according to the configuration information of the abnormal node. The configuration information is obtained from a Configuration Management Database (CMDB), which stores resource configuration information for all key components in the system (including hardware, software, and the services provided by the system) and records the update history, event history, relationship information, and the like of the components. By analyzing the CMDB data, the organization views of the different configuration items in the system and their interrelations can be determined, and the resource configuration graph is established.

In addition, to supplement the structure of the resource configuration graph and further refine the associations between its components, the request and call relationships between services can also be obtained. The service call logic can be constructed with various technologies (such as PrecisTrace technology) that track the request and call relationships between system application services; requests can be traced without decoding the source code, and accurate capture and quick response are achieved by probing the operating system kernel. Service calls can be captured from the traffic between virtual machines, traffic anomalies can be monitored, and the network service call graph is constructed by combining this information with the resource configuration graph. The fault propagation graph is then constructed based on the abnormal alarm information and the network service call graph.

Further, the process of constructing the fault propagation graph based on the abnormal alarm information and the network service call graph may specifically include:

performing alarm suppression on the abnormal alarm information;

It should be noted that abnormal alarm information often contains false alarms and missed alarms, and its data volume is usually large. To further improve the quality of the constructed fault propagation graph, in the embodiment of the present invention alarm suppression may be performed on the abnormal alarm information: alarms of the same type occurring in the same time period are merged through alarm classification, and the alarm degree is scored, where densely merged alarms indicate a more serious degree of abnormality than sparse alarms. After alarm suppression, merged abnormal alarm information of various types is obtained, each type having a corresponding alarm degree score. To further determine the effective alarms, the suppressed abnormal alarm information of each type is checked against the work order abnormal events corresponding to the abnormal node; the alarm degree and corresponding score of each effective abnormal alarm are retained, and the score is used as the abnormal score of the corresponding fault propagation node.
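As a rough illustration of the alarm suppression described above, the following sketch merges alarms of the same type on the same node within a fixed time window and scores each group by its density. The record layout, the 300-second window, the count-based score, and the function names are assumptions made for the example; this specification does not prescribe a particular scoring formula.

```python
from collections import defaultdict

def suppress_alarms(alarms, window_seconds=300):
    """Merge same-type alarms per time window and score them by density.

    alarms: iterable of (timestamp_seconds, node, alarm_type) tuples.
    Returns {(node, alarm_type): score}, where the score is the largest
    number of alarms merged into a single window (denser means more severe).
    """
    merged = defaultdict(int)
    for timestamp, node, alarm_type in alarms:
        bucket = int(timestamp // window_seconds)
        merged[(node, alarm_type, bucket)] += 1

    scores = defaultdict(float)
    for (node, alarm_type, _bucket), count in merged.items():
        key = (node, alarm_type)
        scores[key] = max(scores[key], float(count))
    return dict(scores)

def keep_confirmed(scores, work_order_nodes):
    """Keep only alarms on nodes that also appear in work-order abnormal events."""
    return {key: score for key, score in scores.items() if key[0] in work_order_nodes}
```

The retained scores can then be attached to the corresponding nodes of the fault propagation graph as their abnormal scores.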

It should be further noted that, when the fault propagation graph is constructed, each system component (e.g., host, virtual machine, application, port) corresponds to a heterogeneous node in the graph. The deployment configuration relationships between the basic devices are added to the graph as undirected connections, and each application program call or traffic ingress and egress is represented as a directed connection, for example a connection from A to B indicating that A calls B.

When the fault propagation in the fault propagation graph is determined, because abnormal nodes exhibit correlation, the correlation coefficients between the index data of adjacent nodes in the fault propagation graph can be calculated, with the index data of historical fault nodes serving as thresholds, so that a triple is formed for each pair of adjacent nodes to measure the fault propagation rate between them. If a node belongs to the application program type, time series data on service calls, traffic, and storage are selected as features for the similarity calculation; if a node belongs to the host or virtual machine type, indexes such as resource utilization can be selected to calculate the node similarity. Specifically, the similarity calculation mainly considers the Pearson correlation coefficient between index data within the abnormal time period: the abnormal time window is given by the abnormal alarms, and for different indexes such as CPU (Central Processing Unit) usage, memory consumption, and network throughput, the average of the correlation coefficients of the indexes is taken as the final similarity between the two nodes.
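The similarity calculation above can be sketched as follows, assuming each node's indexes are stored as equally sampled series keyed by index name and that the abnormal time window is given as a slice. The function name, the data layout, and the decision to skip constant series (for which the Pearson coefficient is undefined) are assumptions of this example.

```python
import numpy as np

def node_similarity(metrics_a, metrics_b, window):
    """Average Pearson correlation over the indexes shared by two nodes.

    metrics_a, metrics_b: dict mapping index name -> sequence of samples.
    window: slice selecting the abnormal time window given by the alarms.
    """
    coefficients = []
    for name in sorted(set(metrics_a) & set(metrics_b)):
        x = np.asarray(metrics_a[name], dtype=float)[window]
        y = np.asarray(metrics_b[name], dtype=float)[window]
        if x.std() == 0 or y.std() == 0:
            continue  # Pearson correlation is undefined for a constant series
        coefficients.append(np.corrcoef(x, y)[0, 1])
    return float(np.mean(coefficients)) if coefficients else 0.0
```

For example, `node_similarity(host_metrics, vm_metrics, slice(120, 180))` would compare two nodes over samples 120 to 179, where `host_metrics` and `vm_metrics` are hypothetical inputs.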

Specifically, the higher the similarity, the higher the probability that an abnormality propagates between two adjacent nodes. However, the two nodes may already be correlated in the normal state, and correlation does not necessarily imply a causal relationship in the occurrence of a fault. High similarity is therefore not a sufficient condition for inferring the fault root cause: relying on similarity alone leads to false reports of the root cause, so a fault root cause node must also have a strong abnormality propagation capability.

Further, in practical applications, the root cause information may include: root cause indicators and root cause components;

then, the process of performing root cause location based on the pre-established indicator cause-and-effect map, the service layer performance indicator and the fault propagation map in S140 to obtain root cause information may specifically include:

screening out performance indexes corresponding to the root factor components from the performance indexes of the service layer;

and analyzing the abnormal propagation based on the abnormal indexes and the index causal graph to determine the root cause index.

Specifically, the DeepWalk algorithm may be used to analyze the fault propagation of the component devices in the fault propagation graph and thereby determine the root cause components. One or more root cause components may be determined: when one is determined, the component with the largest influence degree is taken as the root cause component; when several are determined, the components are ranked by influence degree and the top preset number of them are taken as the root cause components. For each root cause component, the performance indexes corresponding to that component are screened out from the service layer performance indexes, and anomaly detection is performed on them to obtain the abnormal indexes. The degree of abnormality of each abnormal index may then be scored to obtain its abnormal score, which represents how far the abnormal index deviates from its normal values. The abnormal scores and the index causal graph are then analyzed with the PageRank algorithm to determine one or more root cause indexes, and the root cause components together with their root cause indexes form the root cause candidate set, which serves as the final root cause information.

For example, if the located component with the largest influence degree is the host, the host is taken as the root cause component and its indexes are then located. If the top two abnormal indexes ranked by influence degree are the host CPU and the host memory, the finally determined root cause information is the root cause candidate set formed by host-host CPU and host-host memory.
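The PageRank-based ranking of abnormal indexes can be sketched as a personalized PageRank over the index causal graph. The sketch below is only one possible reading: it assumes the walk moves from an effect index back to its cause, uses the abnormal scores as the restart distribution, and requires at least one positive score. The function name, damping factor, and edge format are hypothetical.

```python
import numpy as np

def rank_root_indexes(causal_edges, anomaly_scores, damping=0.85, iterations=100):
    """Rank indexes by personalized PageRank on the reversed causal graph.

    causal_edges: iterable of (cause, effect, weight) tuples.
    anomaly_scores: dict mapping index name -> non-negative abnormal score.
    """
    names = sorted(anomaly_scores)
    position = {name: i for i, name in enumerate(names)}
    n = len(names)

    # W[to, from]: the walker standing on an effect moves to one of its causes.
    W = np.zeros((n, n))
    for cause, effect, weight in causal_edges:
        W[position[cause], position[effect]] += weight

    restart = np.array([anomaly_scores[name] for name in names], dtype=float)
    restart = restart / restart.sum()

    # Normalize columns; indexes without a known cause restart from the scores.
    column_sums = W.sum(axis=0)
    has_cause = column_sums > 0
    M = np.where(has_cause, W / np.where(has_cause, column_sums, 1.0), restart[:, None])

    rank = restart.copy()
    for _ in range(iterations):
        rank = damping * (M @ rank) + (1.0 - damping) * restart
    order = np.argsort(-rank)
    return [(names[i], float(rank[i])) for i in order]
```

With this convention, rank mass accumulates on indexes that explain many abnormal downstream indexes, and the top-ranked entries become the candidate root cause indexes for the component.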

In practical applications, when there are multiple root cause components and/or root cause indexes, a root cause link can be obtained based on the root cause indexes and the root cause components, so that the causal relationships among the root causes are displayed clearly and concisely for the operation and maintenance personnel. This helps to shorten the subsequent fault repair time and improves the overall operation and maintenance efficiency. For example, in the root cause link shown in Fig. 5, the index mem.bytes.used.percentage affects the index cpu.loadg.1, the index cpu.loadg.1 affects the index cpu.softirq, the index cpu.softirq affects the index cpu.idle, and the index cpu.idle affects the index cpu.sys.

It should be further noted that, as described above, the DeepWalk random walk algorithm may be used in the embodiment of the present invention to analyze the fault propagation of the component devices in the fault propagation graph. DeepWalk is a method that generates links by random walks on a graph; the random walk process is used to simulate and track the abnormal propagation process of a fault node, and the walk mainly includes three actions: forward transfer, reverse transfer, and origin dwell.

For forward transfer, node A may propagate an exception to node B during the failure period; in the random walk, the higher the similarity, the higher the probability of transferring from A to B, i.e., the walker tends to move to a two-hop node with high similarity. For reverse transfer, walking to a child node that has low similarity to the previous node would produce a false alarm, so the walker is allowed, with a certain probability, to return to the previous node and reselect a two-hop node. For origin dwell, the nodes at which the walker tends to stay are often the root causes of the fault: the higher the statistical stay probability of a node, the higher the probability that it is the root cause, and the better its abnormality explains the abnormalities of all other nodes. In practical applications, stable fault node output can be obtained through frequent item mining of the walk sequences together with a voting mechanism that relies on expert experience.
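The three walk actions can be illustrated with a simplified random walk that counts how often each node is visited. This is a sketch under several assumptions that are not stated in this specification: edges are given in the direction the walker moves forward, the forward weight equals the similarity, the reverse weight is a hypothetical `back_factor` times (1 - similarity) so that low-similarity moves are more likely to be undone, and the dwell weight of each node is supplied by the caller (for example, its abnormal score).

```python
import random
from collections import Counter

def random_walk_root_cause(edges, stay_weight, start, steps=10000,
                           back_factor=0.3, seed=0):
    """Count node visits of a walk with forward, reverse, and dwell moves.

    edges: dict mapping (current, next) -> similarity in [0, 1].
    stay_weight: dict mapping node -> non-negative weight of staying in place.
    Returns a list of (node, visit_count) sorted by descending count.
    """
    rng = random.Random(seed)
    forward, backward = {}, {}
    for (src, dst), similarity in edges.items():
        forward.setdefault(src, []).append((dst, similarity))
        backward.setdefault(dst, []).append((src, back_factor * (1.0 - similarity)))

    visits = Counter()
    node = start
    for _ in range(steps):
        visits[node] += 1
        moves = (forward.get(node, []) + backward.get(node, [])
                 + [(node, stay_weight.get(node, 0.0))])
        targets = [target for target, _ in moves]
        weights = [weight for _, weight in moves]
        if sum(weights) <= 0:
            continue  # no usable move: remain on the current node
        node = rng.choices(targets, weights=weights, k=1)[0]
    return visits.most_common()
```

The node with the highest visit count is the one at which the walker dwells most, which corresponds to the root cause candidate in the description above; repeating the walk with different seeds and voting over the results is one way to stabilize the output.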

In addition, before S110, the method further includes screening the abnormal nodes from the nodes of the cluster when an abnormality of the cluster is detected. Specifically, when the cluster is detected to be abnormal, the configuration information and the abnormal alarm information of the cluster are obtained, and the abnormal nodes are determined based on them: a fault propagation graph of the cluster may be established from the configuration information and the abnormal alarm information, and fault propagation analysis is then performed on this graph, for example with a random walk algorithm, to locate the abnormal nodes. Root cause positioning is then performed on each abnormal node using the method described above.

On the basis of the above embodiment, an embodiment of the present invention further provides a fault root cause positioning apparatus, referring to fig. 6, the apparatus includes:

an obtaining module 11, configured to obtain a service layer performance index of an abnormal node;

the establishing module 12 is used for establishing a fault propagation diagram according to the configuration information and the abnormal alarm information of the abnormal node; the fault propagation graph comprises fault propagation relationships between the components;

the positioning module 13 is configured to perform root cause positioning based on the indicator cause-and-effect graph, the service layer performance indicator, and the fault propagation graph, so as to obtain root cause information; the index causal graph is established based on service layer performance indexes of nodes in the cluster, and comprises causal relations among the indexes.

It should be noted that the fault root cause positioning apparatus provided in the embodiment of the present invention has the same beneficial effects as the fault root cause positioning method provided in the above embodiment, and for the specific description of the fault root cause positioning method related in the embodiment of the present invention, reference is made to the above embodiment, and the description of the present invention is omitted here.

On the basis of the above embodiments, the embodiment of the present invention further provides a fault root cause positioning system, specifically referring to fig. 7. The system comprises:

a memory 20 for storing a computer program;

the processor 21 is configured to implement the steps of the fault root cause locating method when executing the computer program.

It should be noted that the fault root cause positioning system provided in this embodiment may be disposed on an electronic device, and the electronic device may include, but is not limited to, a smart phone, a tablet computer, a notebook computer, or a desktop computer.

The processor 21 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like. The processor 21 may be implemented in at least one hardware form of a DSP (Digital Signal Processor), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 21 may also include a main processor and a coprocessor, where the main processor is a processor for processing data in the awake state, also called a Central Processing Unit (CPU), and the coprocessor is a low-power processor for processing data in the standby state. In some embodiments, the processor 21 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 21 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.

The memory 20 may include one or more computer-readable storage media, which may be non-transitory. The memory 20 may also include high-speed random access memory as well as non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In this embodiment, the memory 20 is at least used for storing the following computer program 201, which, after being loaded and executed by the processor 21, can implement the relevant steps of the fault root cause locating method disclosed in any one of the foregoing embodiments. In addition, the resources stored in the memory 20 may also include an operating system 202, data 203, and the like, and the storage may be transient or permanent. The operating system 202 may include Windows, Unix, Linux, and the like. The data 203 may include, but is not limited to, a set offset, etc.

In some embodiments, the fault root cause location system may further include a display screen 22, an input/output interface 23, a communication interface 24, a power supply 25, and a communication bus 26.

Those skilled in the art will appreciate that the configuration shown in FIG. 7 does not constitute a limitation of a fault root cause location system and may include more or fewer components than those shown.

It is to be understood that, if the fault root cause locating method in the above embodiments is implemented in the form of a software functional unit and sold or used as a stand-alone product, it may be stored in a computer readable storage medium. Based on such understanding, the technical solutions of the present application may be substantially or partially implemented in the form of a software product, which is stored in a storage medium and executes all or part of the steps of the methods of the embodiments of the present application, or all or part of the technical solutions. And the aforementioned storage medium includes: a U disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrically erasable programmable ROM, a register, a hard disk, a removable magnetic disk, a CD-ROM, a magnetic or optical disk, and other various media capable of storing program codes.

Based on this, the embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the steps of the fault root cause locating method are implemented.

In the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed in the embodiment corresponds to the method disclosed in the embodiment, so that the description is simple, and the relevant points can be referred to the description of the method part.

It is further noted that, in the present specification, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional like elements in a process, method, article, or apparatus that comprises the element.

Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the components and steps of the various examples have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A fault root cause positioning method is characterized by comprising the following steps:

acquiring a service layer performance index of an abnormal node;

carrying out root cause positioning based on a pre-established index cause and effect diagram, the service layer performance index and the fault propagation diagram to obtain root cause information; the index causal graph is established based on service layer performance indexes of nodes in the cluster, and comprises causal relations among the indexes.

2. The method of claim 1, wherein the indicator cause-effect graph is established based on service layer performance indicators of nodes within a cluster, and comprises:

merging the sub-causal graphs of the nodes belonging to the same index type to obtain a causal graph corresponding to each index type;

and constructing an index causal graph according to each causal graph.

3. The fault root cause positioning method according to claim 2, wherein the causal analysis model is established based on a plurality of linear non-gaussian causal Lingam algorithms;

4. The method for locating the fault root cause according to claim 3, wherein the merging the sub-causal graphs corresponding to each Lingam algorithm to obtain a final sub-causal graph comprises:

5. The method of claim 4, wherein the pruning the causal chain of each of the sub-causal graphs comprises:

6. The method according to claim 1, wherein the establishing a fault propagation map according to the configuration information and the abnormal alarm information of the abnormal node comprises:

7. The method according to claim 6, wherein the constructing the fault propagation map based on the abnormal alarm information and the network service call map comprises:

performing alarm suppression on the abnormal alarm information;

8. The method according to any one of claims 1 to 7, wherein the root cause information includes: root cause indicators and root cause components;

9. A fault root cause locating device, comprising:

the acquisition module is used for acquiring the service layer performance index of the abnormal node;

the positioning module is used for carrying out root cause positioning on the basis of the index cause and effect diagram, the service layer performance index and the fault propagation diagram to obtain root cause information; the index causal graph is established based on service layer performance indexes of nodes in the cluster, and comprises causal relationships among the indexes.

10. A fault root cause location system, comprising:

a memory for storing a computer program;

a processor for implementing the steps of the method for fault root cause localization according to any of claims 1 to 8 when executing the computer program.

11. A computer-readable storage medium, having stored thereon a computer program which, when being executed by a processor, carries out the steps of the method for fault root cause localization according to any of claims 1 to 8.