CN114661504A - Operable and interpretable root cause positioning method for repeated occurrence type faults - Google Patents

Info

Publication number
CN114661504A
Authority
CN
China
Prior art keywords
fault
root cause
trained
instance
faults
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210168856.4A
Other languages
Chinese (zh)
Inventor
裴丹
李则言
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN202210168856.4A
Publication of CN114661504A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/07: Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703: Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079: Root cause analysis, i.e. error or fault diagnosis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application discloses an operable and interpretable root cause locating method for recurring type faults. The method comprises the following steps: monitoring faults with a monitoring system, and triggering a trained root cause positioning model when the monitoring system detects a fault, wherein the trained root cause positioning model is globally interpreted through a trained decision tree; and obtaining a root cause fault instance from the fault dependency graph and the index data corresponding to the fault instances when the fault occurs, and locally interpreting the positioning result of the root cause fault instance, wherein the fault dependency graph is constructed from the fault instances. The method achieves operable and interpretable root cause positioning: the repair operation to perform can be derived directly from the positioning result, and the result is accurate.

Description

Operable and interpretable root cause positioning method for repeated occurrence type faults
Technical Field
The present invention relates to the field of fault diagnosis and root cause location technology, and more particularly, to an operable and interpretable root cause location method and apparatus for recurring type faults.
Background
Online service systems, such as internet banking and search engines, have become an integral part of our lives. An online service consists of a large number of components, such as load balancers, Web servers, containers, databases, and the like. The components work together and call each other to provide services, and complex dependency relationships exist among them. Because online service systems are large and highly complex, failures are inevitable, and they can cause huge economic losses and reduce user satisfaction. Therefore, root cause location is critical for reducing the negative impact of failures.
In an actual online service system, most failures are of the repetitive type. For example, over 94% of failures are caused by a small number of repeated types of root causes. Examples of such repeat type failures include high service response times due to lack of indexing of corresponding fields in the database, reduced service success rates due to unavailability of third party services, and so forth. Therefore, it is very important to locate the repetitive type of failure.
In order to perform fault diagnosis and root cause location on an online service system, various types of monitoring data are continuously collected. Among them, performance index data are obtained by continuously monitoring the key performance indicators of each component and are stored as time series. Compared with log data, index data have a single type, are simple to analyze, are small in volume, and are sufficient to reflect the abnormal state of a component. Compared with call chain data, index data can also reflect underlying performance indicators (e.g., CPU, memory, etc.).
In existing practice, experienced operation and maintenance personnel locate the root causes of recurring faults by relying on intuition accumulated over years of troubleshooting. They can guess what a fault may be from the symptoms of the online service system, such as the patterns of the indicators, and then confirm the guess. Because of the complex dependencies between components, one root cause may cause problems in a series of components. Based on their experience, operation and maintenance personnel can analyze the relationships among such correlated anomalies to find the true root cause. In addition, after finding the root cause of a fault, they can refer to the repair experience of similar historical faults to find a more accurate repair scheme and execute the repair operation more quickly.
At present, one approach models the problem of locating the root cause service of a fault as a multi-class classification problem (each class being a possible root cause service) and solves it with machine learning. The data used include call chains, the global configuration of the system, service indicators, and so on. A multi-class classification model is trained on historical fault data, with random forests, multilayer perceptrons, k-nearest neighbors or the like as the specific machine learning algorithm. When a fault occurs, the trained model indicates which service is more likely to be the root cause service. The granularity of the root cause located by MEPFL is the service, and no explanation of the classification result is given.
Existing methods that locate root causes indirectly by searching for similar historical faults only recommend the similar historical faults, whose root causes may be the root cause of the current fault. Each fault is modeled as a fault graph, in which the nodes are the components of the system, the node features are the key indicators of the components, and the edges are the deployment relations among the components.
The existing methods lack operability: the granularity of the root cause they locate is either too coarse (e.g., a service) or too fine (e.g., a single indicator), so operation and maintenance personnel cannot directly decide from the positioning result which repair operation to take. The existing root cause positioning methods also lack interpretability: operation and maintenance personnel cannot understand how the algorithm arrives at the root cause. Interpretability here includes two aspects: local interpretation, i.e., how the positioning result of a single fault is obtained; and global interpretation, i.e., an explanation of the logic of the entire positioning method or model.
Summary of the Application
The present application is directed to solving, at least to some extent, one of the technical problems in the related art.
Therefore, the invention aims to provide an operable and interpretable root cause positioning method for repeated faults, which can realize operable and interpretable root cause positioning, wherein the operability enables operation and maintenance personnel to directly obtain how to perform repair operation from the positioning result of the application, and the interpretability enables the operation and maintenance personnel to understand and trust the positioning result given by the automatic method of the application.
It is another object of the present application to provide an operatively interpretable root cause locating device for recurring type faults.
To achieve the above object, the present application provides, in one aspect, an operably interpretable root cause locating method for recurring type faults, comprising the steps of:
monitoring faults with a monitoring system, and triggering a trained root cause positioning model when the monitoring system detects a fault, wherein the trained root cause positioning model is globally interpreted through a trained decision tree; and obtaining a root cause fault instance from the fault dependency graph and the index data corresponding to the fault instances when the fault occurs, and locally interpreting the positioning result of the root cause fault instance, wherein the fault dependency graph is constructed from the fault instances.
The method of the embodiment of the application can realize operable and interpretable root cause positioning, the operability enables operation and maintenance personnel to directly obtain how to perform repair operation from the positioning result of the application, and the interpretability enables the operation and maintenance personnel to understand and trust the positioning result given by the automation method of the application.
In addition, the operationally interpretable root cause locating method for recurring type faults according to the above-described embodiments of the present application may also have the following additional technical features:
Further, in an embodiment of the application, the root cause positioning model is trained on historical fault data by minimizing a preset loss function with stochastic gradient descent, to obtain the trained root cause positioning model.
Further, in one embodiment of the present application, the root cause localization model includes three components, namely a feature extractor, a feature aggregator and a classifier: the feature extractor represents each fault instance as a fixed-length vector, which is the instance-level feature; the feature aggregator aggregates, based on the structure of the fault dependency graph, the instance-level features of the fault instances related to each fault instance to obtain aggregated features; and the classifier scores each fault instance based on its aggregated feature. The root cause localization model is built from these three components.
Further, in an embodiment of the present application, the preset loss function minimized by stochastic gradient descent is obtained as follows:
measuring, with binary cross entropy (BCE), the difference between the output score $s_T(v)$ of fault instance $v$ of fault $T$ and its true label $r_T(v)$:
$\mathrm{BCE}(r_T(v), s_T(v)) = r_T(v)\cdot\log(s_T(v)) + (1 - r_T(v))\cdot\log(1 - s_T(v))$
and computing a weighted average of the BCEs of different fault instances with additional weights to obtain the preset loss function, denoted $L_s$:
Figure BDA0003517671250000031
wherein
Figure BDA0003517671250000032
is the training set, $T_i$ represents a fault in the training set, $N_H$ is the size of the training set, $V$ represents all fault instances,
Figure BDA0003517671250000033
Further, in an embodiment of the present application, locally interpreting the positioning result of the root cause fault instance includes: given two faults, a first fault and a second fault, wherein the first fault is the current fault to be interpreted and the second fault is a historical fault; for each fault instance in the first fault, computing the distances from its aggregated feature to the aggregated features of the fault instances in the second fault that belong to the same fault category, and taking the smallest as the distance between that fault instance and the second fault; and computing a weighted average of the distances from all fault instances in the first fault to the second fault as the distance between the two faults, which is used to interpret the positioning result of the root cause fault instance.
Further, in one embodiment of the present application, the distance between the first fault $T_1$ and the second fault $T_2$ is formulated as:
Figure BDA0003517671250000034
wherein $N_c(v)$ denotes all fault instances that belong to the same fault category as $v$,
Figure BDA0003517671250000035
denotes the aggregated feature of fault instance $v$ in fault $T_1$,
Figure BDA0003517671250000041
denotes the aggregated feature of fault instance $v'$ in fault $T_2$, and $\|\cdot\|_1$ denotes the L1 norm.
Further, in an embodiment of the present application, globally interpreting the trained root cause localization model through the trained decision tree includes: extracting first time-series features from the index data using a plurality of time-series curve feature extraction methods; selecting from the first time-series features, using feature selection based on index reconstruction, to obtain second time-series features; and training a decision tree on the second time-series features to obtain the trained decision tree, whose root cause determinations on fault instances mimic those of the trained root cause localization model; wherein the rules in the trained decision tree serve as the global interpretation of the trained root cause localization model.
Further, in an embodiment of the present application, the feature selection based on index reconstruction includes: taking the feature extractor and the feature aggregator of the trained root cause positioning model as the encoder part of an auto-encoder, and freezing the parameters of the encoder part; constructing a decoder part consisting of a deconvolution layer and a fully connected layer; training the auto-encoder to minimize its reconstruction error; and comparing the value of each first time-series feature on the original index data with its value on the index data reconstructed by the auto-encoder, and filtering out the feature if the difference is larger than a preset value.
Further, in an embodiment of the present application, the plurality of time-series curve features includes several of: mean, standard deviation, range count, maximum, minimum, autocorrelation coefficient, and peak count.
To achieve the above object, the present application provides, in another aspect, an operatively interpretable root cause locating apparatus for recurring type faults, including:
the fault triggering module is used for monitoring faults with a monitoring system and triggering a trained root cause positioning model when the monitoring system detects a fault, wherein the trained root cause positioning model is globally interpreted through a trained decision tree; and the root cause positioning module is used for obtaining a root cause fault instance from the fault dependency graph and the index data corresponding to the fault instances when the fault occurs, and locally interpreting the positioning result of the root cause fault instance, wherein the fault dependency graph is constructed from the fault instances.
The operable and interpretable root cause positioning device for recurring type faults can achieve operable and interpretable root cause positioning: operability enables operation and maintenance personnel to derive the repair operation directly from the positioning result of the application, and interpretability enables them to understand and trust the positioning result given by the automated method of the application.
The beneficial effects of the present application are:
1) based on the root cause positioning framework, operable and interpretable root cause positioning can be achieved, operation and maintenance personnel can directly obtain how to perform repair operation from positioning results through operability, and the interpretability enables the operation and maintenance personnel to understand and trust the positioning results given by the automation method.
2) The model of the application provides a uniform encoding for fault instances of any fault category, so that they can be further analyzed in a uniform way; it takes the dependency relationships between fault instances into account, so that the root cause can still be located accurately when faults propagate; and it can generalize to unseen fault instances.
3) The interpretation method provides both a local interpretation and a global interpretation of the fault positioning model. The former explains how the positioning result of each fault is obtained, and the latter explains what the whole model has learned from historical data, so that operation and maintenance personnel can understand and trust the positioning results of the application.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a flow chart of an operatively interpretable root cause locating method for recurring type faults according to an embodiment of the present application;
FIG. 2 is a diagram of a root cause positioning architecture according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a root cause location model according to an embodiment of the present application;
FIG. 4 is a schematic diagram of an operably interpretable root cause locator for recurring type faults according to an embodiment of the present application.
Detailed Description
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
An operatively interpretable root cause locating method and apparatus for a recurring type fault, which are proposed according to embodiments of the present application, will be described below with reference to the accompanying drawings, and first, an operatively interpretable root cause locating method for a recurring type fault, which is proposed according to embodiments of the present application, will be described with reference to the accompanying drawings.
FIG. 1 is a flow diagram of an operatively interpretable root cause locating method for recurring type faults according to one embodiment of the present application.
As shown in fig. 1, the method includes, but is not limited to, the steps of:
step S1, utilizing the monitoring system to monitor the fault, and triggering the trained root cause positioning model based on the fault monitored by the monitoring system; wherein, the trained root cause positioning model carries out global explanation through the trained decision tree;
step S2, obtaining a root cause fault instance according to the index data corresponding to the fault instance when the fault occurs and the fault dependency graph, and locally explaining the positioning result of the root cause fault instance; wherein the fault dependency graph is constructed according to the fault instance.
The embodiments of the present application are further explained with reference to the drawings.
FIG. 2 is a block diagram of the root cause positioning architecture of the present application, with which operable and interpretable root cause positioning can be achieved. Operability enables operation and maintenance personnel to derive the repair operation directly from the positioning result of the application, and interpretability enables them to understand and trust the positioning result given by the automated method of the application.
First, the positioning granularity of the present application is explained. To achieve operability, the positioning granularity of the present application is the fault instance. A fault instance is a set of indicators on a component. For example, the golden indicators (transaction volume, success rate, response time) on Service1 form one fault instance; the CPU-related indicators (e.g., CPU utilization, wio, etc.) on Container1 form another fault instance. A group of fault instances formed by similar indicators on similar components is called a fault category. For example, the CPU-related indicators on containers constitute a fault category, and those indicators on any particular container are a fault instance of this fault category.
When operation and maintenance personnel determine that a set of indicators on a component is the root cause, that is sufficient for them to determine what problem has occurred on the component and what repair actions should be taken next. Such combinations of components and indicators are therefore defined as fault instances and fault categories. When operation and maintenance personnel locate a fault instance thus defined as the root cause of a fault, they can determine what repair action should be taken. Therefore, by setting the positioning granularity to the fault instance, the positioning framework of the present application has operability.
Unlike in prior methods, the granularity of a fault instance is neither that of a component nor that of a single indicator: the present application does not locate a component or an indicator alone, but determines that their combination constitutes the root cause. Different fault categories differ in the number and types of their indicators and in their indicator patterns.
In order to model complex dependencies in an online service system, all fault instances in a system are organized into a fault dependency graph. The nodes on the fault dependency graph are fault instances, the edges represent correlation relations among the fault instances, and the fault dependency graph is an undirected graph.
There is an edge between two fault instances on the fault dependency graph when there is a calling relationship or a deployment relationship between their components, or a causal relationship between their indicators. For example, if service A calls service B, there is an edge between the fault instance formed by the golden indicators of service A and the fault instance formed by the golden indicators of service B. If service A is deployed on container C, there is an edge between the fault instance formed by the golden indicators of service A and each fault instance formed by the indicators of container C.
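The edge rules above can be illustrated with a minimal Python sketch. The function name `build_fault_dependency_graph`, the adjacency-map representation and the instance names such as `serviceA/golden` are hypothetical choices made for illustration, not part of the application. The sketch connects every fault instance of one component with every fault instance of the related component, which coincides with the examples above when each service has a single golden-indicator instance; causal relationships between indicators are not modeled here.

```python
from collections import defaultdict
from itertools import product


def build_fault_dependency_graph(instances_by_component, calls, deployments):
    """Build an undirected fault dependency graph as an adjacency map.

    instances_by_component: component name -> list of fault-instance names
    calls: (caller, callee) component pairs
    deployments: (service, host) component pairs
    All names are hypothetical; the application does not prescribe a format.
    """
    graph = defaultdict(set)
    # Register every fault instance so isolated nodes still appear.
    for instances in instances_by_component.values():
        for instance in instances:
            graph[instance]
    # A call or deployment relation between two components links their instances.
    for a, b in list(calls) + list(deployments):
        pairs = product(instances_by_component.get(a, []),
                        instances_by_component.get(b, []))
        for u, v in pairs:
            if u != v:
                graph[u].add(v)
                graph[v].add(u)  # undirected: store both directions
    return {node: sorted(neighbours) for node, neighbours in graph.items()}


instances = {
    "serviceA": ["serviceA/golden"],
    "serviceB": ["serviceB/golden"],
    "containerC": ["containerC/cpu", "containerC/memory"],
}
graph = build_fault_dependency_graph(
    instances,
    calls=[("serviceA", "serviceB")],
    deployments=[("serviceA", "containerC")],
)
```

Here `graph["serviceA/golden"]` lists both container instances and the golden-indicator instance of service B, mirroring the two examples in the text.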
The positioning model is a deep learning model. Its input is the fault dependency graph together with the indicators corresponding to all fault instances when a fault occurs. Its output is a score for each fault instance. The score takes values in [0,1], and the larger the score, the more likely the fault instance is the root cause. FIG. 3 shows the structure of the root cause localization model of the present application.
It will be appreciated that for any fault instance $v$, its input $A^{(v)}$ consists of the values of the $M_v$ indicators on that fault instance at the $w$ most recent time points when the fault occurs. That is, $A^{(v)}$ is a matrix of shape $w \times M_v$. In practice, $w$ is generally 20. The input of the whole model is the matrices $A^{(v)}$ of all fault instances together with the fault dependency graph.
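As an illustration of the input layout, the following Python sketch assembles the $w \times M_v$ matrix for one fault instance from per-indicator time series. The function name `build_instance_input`, the dictionary input format and the indicator names are assumptions made for this example; only the shape and the default $w = 20$ come from the text.

```python
def build_instance_input(metric_series, w=20):
    """Return (indicator names, matrix) where the matrix has w rows and M_v columns.

    metric_series: indicator name -> list of values, oldest first, all of the
    same length (a hypothetical input format chosen for this sketch).
    """
    names = sorted(metric_series)
    lengths = {len(metric_series[name]) for name in names}
    if len(lengths) != 1:
        raise ValueError("all indicators must cover the same time points")
    (length,) = lengths
    if length < w:
        raise ValueError("need at least w time points per indicator")
    # Row t holds the values of all M_v indicators at one of the last w time points.
    matrix = [[metric_series[name][t] for name in names]
              for t in range(length - w, length)]
    return names, matrix


series = {
    "cpu_util": [float(i) for i in range(25)],
    "wio": [0.5 * i for i in range(25)],
}
names, A_v = build_instance_input(series, w=20)
```

With 25 collected points and two indicators, `A_v` has 20 rows and 2 columns, and its last row holds the most recent values.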
The localization model of the present application contains three components, a feature extractor, a feature aggregator, and a classifier. The feature extractor is responsible for representing each fault instance as a fixed-length vector, called instance-level feature. The feature aggregator is responsible for aggregating the instance-level features of the associated fault instances of each fault instance together to form an aggregated feature based on the structure of the fault dependency graph. Finally, the classifier scores based on the aggregated characteristics of each fault instance. Because the aggregated features already contain the features of the fault dependency graph and the related instances, the classifier can also consider the propagation relationship of the fault between the instances based on the aggregated features of only a single instance.
As an example, the feature extractor is a neural network composed of three layers: a GRU (gated recurrent unit) layer, a convolutional layer and a fully connected layer. The GRU is a recurrent neural network, through which the temporal information of the indicators is extracted from $A^{(v)}$. The convolutional layer and the fully connected layer then model the correlations between different time points and different indicators.
It will be appreciated that, because the number of indicators contained in a fault instance differs between fault categories, a separate feature extractor module is used for each fault category. The input dimensions (corresponding to the number of indicators) of these feature extractors differ, but all other configurations, including the output dimension, are the same. This ensures that the feature extractors produce instance-level features of the same dimension for any fault instance. The instance-level feature of fault instance $v$ is denoted $f^{(v)}$.
As an example, the feature aggregator employs a multi-head, multi-layer graph attention network (GAT). Each GAT first maps the instance-level feature of each fault instance to a new space through a neural network, then computes a weight for each pair of fault instances connected on the fault dependency graph, and finally takes the weighted average of the mapped features of each fault instance's neighbors on the fault dependency graph to obtain the aggregated feature of that fault instance. The aggregated feature of fault instance $v$ is denoted
Figure BDA0003517671250000071
To improve the modeling capacity of the feature aggregator, multiple GATs are stacked in parallel (multi-head GAT), and their outputs are concatenated as the aggregated feature of each fault instance. In addition, multiple multi-head GAT layers are stacked serially, i.e., the output of the multi-head GAT of one layer is taken as the input of the next layer, to model dependencies beyond one hop.
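The aggregation described above can be sketched in plain Python as a simplified, untrained stand-in for a graph attention layer. Everything here is an assumption for illustration: the helper names (`matvec`, `gat_head`, `multi_head`), the LeakyReLU-plus-softmax attention form borrowed from the standard GAT formulation, the inclusion of each node itself among its neighbors, and the fixed matrices used in place of learned parameters.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def leaky_relu(z, slope=0.2):
    return z if z > 0 else slope * z


def gat_head(features, graph, W, a):
    """One attention head: project, weight connected pairs, average neighbors.

    features: node -> instance-level feature vector
    graph: node -> list of neighbors on the fault dependency graph
    W: projection matrix; a: attention vector of length 2 * len(W)
    """
    h = {v: matvec(W, f) for v, f in features.items()}
    out = {}
    for v in features:
        neigh = [v] + list(graph.get(v, []))  # self-loop is an assumption
        logits = [leaky_relu(sum(ak * xk for ak, xk in zip(a, h[v] + h[u])))
                  for u in neigh]
        m = max(logits)  # subtract the max for a numerically stable softmax
        exps = [math.exp(l - m) for l in logits]
        total = sum(exps)
        alphas = [e / total for e in exps]
        out[v] = [sum(al * h[u][k] for al, u in zip(alphas, neigh))
                  for k in range(len(W))]
    return out


def multi_head(features, graph, heads):
    """Run several heads in parallel and concatenate their outputs per node."""
    outs = [gat_head(features, graph, W, a) for W, a in heads]
    return {v: [x for o in outs for x in o[v]] for v in features}


features = {"A": [1.0, 0.0], "B": [0.0, 1.0]}
graph = {"A": ["B"], "B": ["A"]}
identity = [[1.0, 0.0], [0.0, 1.0]]
double = [[2.0, 0.0], [0.0, 2.0]]
zero_attention = [0.0, 0.0, 0.0, 0.0]  # yields uniform attention weights
aggregated = multi_head(features, graph,
                        [(identity, zero_attention), (double, zero_attention)])
```

Serial stacking would correspond to calling `multi_head` again with `aggregated` as the new `features` (and projection matrices sized for the concatenated dimension), so that information from nodes two hops away can reach each fault instance.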
As an example, the classifier employs a two-layer fully connected network. The final output layer uses a sigmoid function to ensure that the output score lies in [0,1]. For a fault instance $v$, its classifier output is denoted $s(v) \in [0,1]$.
The training and positioning procedures of the model are described in detail below. The application can be divided into an offline training part and an online positioning part.
In the offline training part, the proposed root cause positioning model is trained periodically (for example, weekly or monthly) on historical fault data, and the trained positioning model is interpreted through a global interpretation method to show operation and maintenance personnel what the trained positioning model has learned.
In the online positioning part, the trained positioning model is triggered when a fault occurs. After triggering, the trained root cause positioning model gives the root cause fault instance based on the current index data and the fault dependency graph, and at the same time gives an explanation of the positioning result through a local interpretation method.
As an example, the model of the application is trained by minimizing a loss function with stochastic gradient descent, the method commonly used for training neural networks. To apply stochastic gradient descent, one key point is how to design the corresponding loss function.
The core idea of the loss function of the application is to measure, for each fault instance $v$ of each fault (denoted $T$), the difference between the output score $s_T(v)$ and the true label $r_T(v)$. If $v$ is the root cause fault instance of $T$, then $r_T(v) = 1$; otherwise $r_T(v) = 0$.
The difference between each pair is first measured using binary cross entropy (BCE):
$\mathrm{BCE}(r_T(v), s_T(v)) = r_T(v)\cdot\log(s_T(v)) + (1 - r_T(v))\cdot\log(1 - s_T(v))$
Then, since the numbers of root cause fault instances and non-root-cause fault instances of each fault differ greatly, the BCEs of different fault instances are weighted-averaged with additional weights for balance, giving the final loss function.
Denoting the loss function as $L_s$, then
Figure BDA0003517671250000081
wherein
Figure BDA0003517671250000082
is the training set, $T_i$ represents a fault in the training set, $N_H$ is the size of the training set, $V$ represents all fault instances,
Figure BDA0003517671250000083
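Since the formula images for $L_s$ and its weights are not reproduced in the text, the following Python sketch only illustrates the general idea of a class-balanced weighted BCE. Two points are assumptions: the weighting scheme (root cause and non-root-cause instances each receive half of the total weight within a fault) and the leading minus sign, which follows the conventional BCE definition so that a smaller value means a better fit. The function names are hypothetical.

```python
import math


def bce(r, s, eps=1e-12):
    """Conventional binary cross entropy; the leading minus sign is an assumption."""
    s = min(max(s, eps), 1.0 - eps)  # keep log() finite
    return -(r * math.log(s) + (1 - r) * math.log(1 - s))


def fault_loss(labels, scores):
    """Weighted average of BCEs over the fault instances of one fault.

    labels: instance -> r_T(v) in {0, 1}; scores: instance -> s_T(v) in [0, 1].
    Assumed weighting: each class shares half of the total weight.
    """
    n_pos = sum(1 for r in labels.values() if r == 1)
    n_neg = len(labels) - n_pos
    w_pos = 0.5 / n_pos if n_pos else 0.0
    w_neg = 0.5 / n_neg if n_neg else 0.0
    return sum((w_pos if r == 1 else w_neg) * bce(r, scores[v])
               for v, r in labels.items())


def training_loss(faults):
    """Average the per-fault losses over the training set of (labels, scores) pairs."""
    return sum(fault_loss(labels, scores) for labels, scores in faults) / len(faults)


labels = {"c1/cpu": 1, "c2/cpu": 0, "s1/golden": 0}
uninformed = {"c1/cpu": 0.5, "c2/cpu": 0.5, "s1/golden": 0.5}
confident = {"c1/cpu": 0.9, "c2/cpu": 0.1, "s1/golden": 0.1}
loss_uninformed = fault_loss(labels, uninformed)
loss_confident = fault_loss(labels, confident)
```

With this weighting, the single root cause instance counts as much as the two non-root-cause instances together, so a model cannot lower the loss simply by scoring every instance close to 0.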
As an example, the present application explains how the model localizes the current fault by finding similar historical faults. For this purpose, a method for computing the distance between two faults is proposed.
Given twoA fault T1And T2Wherein T is1Is the current fault to be explained, T2Is a historical failure. For T1Comparing its aggregate characteristics with T for each fault instance in (1)2Of each fault instance belonging to the same fault class, and the smallest one as the fault instance and T2The distance of (c). Then to T1To T of all fault instances2The distance of (a) is weighted and averaged, and the weight is the model score corresponding to the fault instance. The above distance calculation method can be formalized as:
Figure BDA0003517671250000084
where N_c(v) denotes the set of all fault instances belonging to the same fault class as v,
Figure BDA0003517671250000085
denotes the aggregated feature of fault instance v in fault T_1,
Figure BDA0003517671250000086
denotes the aggregated feature of fault instance v' in fault T_2, and ||·||_1 denotes the L1 norm.
Thus, for a current fault T_1 to be explained, the distance between T_1 and each historical fault is computed with the function d, and the historical fault with the smallest distance is taken as the similar historical fault.
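The nearest-historical-fault search can be sketched as follows. This is a minimal illustration under stated assumptions: the data layout (dictionaries keyed by fault instance name), the function names, and the normalization by the sum of the scores are all hypothetical, since the exact formula is only available as an image. The sketch also assumes that both faults are observed on the same set of fault instances, so every instance has at least one same-class counterpart.

```python
def l1(a, b):
    """L1 distance between two equal-length feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def fault_distance(feat1, feat2, scores1, classes):
    """Score-weighted average of nearest same-class distances from T_1 to T_2.

    feat1, feat2: instance name -> aggregated feature vector for T_1 and T_2.
    scores1: instance name -> model score in T_1.
    classes: instance name -> fault class.
    """
    num = 0.0
    den = 0.0
    for v, f1 in feat1.items():
        same_class = [u for u in feat2 if classes[u] == classes[v]]
        nearest = min(l1(f1, feat2[u]) for u in same_class)
        num += scores1[v] * nearest
        den += scores1[v]
    return num / den if den > 0 else 0.0


def most_similar(feat_cur, scores_cur, history, classes):
    """Return the name of the historical fault closest to the current one."""
    return min(
        history,
        key=lambda name: fault_distance(feat_cur, history[name], scores_cur, classes),
    )


classes = {"db1": "db", "db2": "db", "web1": "web"}
current = {"db1": [1.0, 0.0], "db2": [0.0, 0.0], "web1": [0.0, 0.0]}
scores = {"db1": 0.8, "db2": 0.1, "web1": 0.1}
history = {
    "a": {"db1": [0.0, 0.0], "db2": [1.0, 0.0], "web1": [0.0, 0.0]},
    "b": {"db1": [0.0, 0.0], "db2": [0.0, 0.0], "web1": [1.0, 0.0]},
}
print(most_similar(current, scores, history, classes))
```

Because the minimum is taken over all instances of the same class, a historical fault that hit a different database instance (fault "a") still counts as similar to a current fault on `db1`, while a fault on a web instance (fault "b") does not.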
Further, the trained model is globally interpreted by training decision trees that can mimic the neural network output. A decision tree is a rule-based interpretable machine learning model through which it can be seen what root cause positioning rules a trained model may learn.
In order to train the decision tree, time-series features first need to be extracted from the indexes. Several classical time-series curve feature extraction methods are selected, including mean, standard deviation, range count, maximum, minimum, autocorrelation, peak count, etc.
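As an illustration of this step, the sketch below computes a handful of such features for one index curve using only the Python standard library. The precise definitions are assumptions made for the example: the autocorrelation is taken at lag 1, a peak is a point strictly greater than both neighbours, and the range is the maximum minus the minimum. The function name `extract_features` is hypothetical.

```python
def extract_features(series):
    """Compute simple time-series features for one index curve."""
    n = len(series)
    mean = sum(series) / n
    sq_dev = sum((x - mean) ** 2 for x in series)
    std = (sq_dev / n) ** 0.5
    # Lag-1 autocorrelation (assumed lag; 0.0 for a constant series).
    if sq_dev > 0:
        autocorr = sum(
            (series[i] - mean) * (series[i + 1] - mean) for i in range(n - 1)
        ) / sq_dev
    else:
        autocorr = 0.0
    # A peak is a point strictly greater than both of its neighbours.
    peaks = sum(
        1
        for i in range(1, n - 1)
        if series[i] > series[i - 1] and series[i] > series[i + 1]
    )
    return {
        "mean": mean,
        "std": std,
        "max": max(series),
        "min": min(series),
        "range": max(series) - min(series),
        "autocorr_lag1": autocorr,
        "peak_count": peaks,
    }


features = extract_features([1, 3, 1, 3, 1])
print(features)
```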
The number of time-series features extracted from the indexes is very large, which seriously affects the efficiency and accuracy of decision tree training. It is therefore desirable to select the features that the trained positioning model is likely to attend to. For this purpose, a feature selection method based on index reconstruction is designed. First, the feature extractor and feature aggregator of the trained model are taken out as the encoder part of an autoencoder, and their parameters are frozen. Then, a decoder part consisting of deconvolution layers and fully connected layers is constructed. The autoencoder is trained to minimize its reconstruction error. Finally, the value of each feature on the original indexes is compared with its value on the indexes reconstructed by the autoencoder; if the difference is large, the frozen encoder evidently does not preserve that feature, and the feature is filtered out.
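The final comparison step can be sketched as below. The sketch does not train an autoencoder; it assumes the reconstructed curves are already available and only shows the filtering rule. The relative-difference measure, the default threshold of 0.1 and the names `select_features` and `mean` are hypothetical choices for illustration.

```python
def mean(series):
    return sum(series) / len(series)


def select_features(originals, reconstructions, extractors, threshold=0.1):
    """Keep features whose values survive reconstruction.

    originals, reconstructions: parallel lists of index curves.
    extractors: feature name -> function mapping a curve to a number.
    A feature is kept when its mean relative difference between original
    and reconstructed curves does not exceed the (assumed) threshold.
    """
    kept = []
    for name, fn in extractors.items():
        diffs = []
        for orig, recon in zip(originals, reconstructions):
            a, b = fn(orig), fn(recon)
            diffs.append(abs(a - b) / max(abs(a), 1e-9))
        if sum(diffs) / len(diffs) <= threshold:
            kept.append(name)
    return kept


# Toy data: the reconstruction preserves the level of each curve but not its peaks.
originals = [[1, 5, 1, 5], [2, 6, 2, 6]]
reconstructions = [[3, 3, 3, 3], [4, 4, 4, 4]]
extractors = {"mean": mean, "max": max}
print(select_features(originals, reconstructions, extractors))
```

In this toy case the mean is reproduced exactly while the maximum is not, so only the mean would be passed on to decision tree training.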
The decision tree is then trained on the selected features. The input to the decision tree is the selected time-series features of the indexes of one fault instance. The training target is the trained positioning model's judgment of whether that fault instance is a root cause. The rules in the trained decision tree serve as a global explanation of the trained positioning model.
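To make the idea of a tree that imitates the model concrete, the following is a small self-contained sketch of a decision tree fitted to a model's judgments and then printed as rules. It is not the implementation of the present application: the Gini-based splitting, the depth limit, the feature names `cpu_mean` and `mem_max`, and the hard-coded model judgments are all assumptions chosen for illustration.

```python
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Return (impurity, feature index, threshold) of the best split, or None."""
    best = None
    for j in range(len(rows[0])):
        # Skip the largest value: it would leave the right branch empty.
        for t in sorted({row[j] for row in rows})[:-1]:
            left = [y for row, y in zip(rows, labels) if row[j] <= t]
            right = [y for row, y in zip(rows, labels) if row[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, j, t)
    return best


def build_tree(rows, labels, depth=0, max_depth=3):
    """A leaf is a label; an inner node is (feature, threshold, left, right)."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth == max_depth or len(set(labels)) == 1:
        return majority
    split = best_split(rows, labels)
    if split is None:
        return majority
    _, j, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[j] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[j] > t]
    return (
        j,
        t,
        build_tree([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
        build_tree([r for r, _ in right], [y for _, y in right], depth + 1, max_depth),
    )


def predict(tree, row):
    while isinstance(tree, tuple):
        j, t, left, right = tree
        tree = left if row[j] <= t else right
    return tree


def rules(tree, names, path=()):
    """Flatten the tree into human-readable rules."""
    if not isinstance(tree, tuple):
        return [f"if {' and '.join(path) or 'always'}: root_cause={tree}"]
    j, t, left, right = tree
    return rules(left, names, path + (f"{names[j]} <= {t}",)) + rules(
        right, names, path + (f"{names[j]} > {t}",)
    )


names = ["cpu_mean", "mem_max"]  # hypothetical selected features
rows = [[0.1, 5.0], [0.2, 9.0], [0.9, 5.0], [0.8, 9.0]]
model_says = [False, False, True, True]  # stand-in for the trained model's judgments
tree = build_tree(rows, model_says)
for rule in rules(tree, names):
    print(rule)
```

The printed rules are what an operator would read as the global explanation: they show which features, and which thresholds, separate the instances the model treats as root causes from the rest.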
With the operable and interpretable root cause positioning method for recurring type faults, operable and interpretable root cause positioning is achieved: how to carry out the repair operation can be obtained directly from the positioning result, and the root cause can be positioned accurately.
In order to implement the above-described embodiment, as shown in fig. 4, this embodiment also provides an operable and interpretable root cause positioning apparatus 10 for recurring type faults, the apparatus 10 including: a fault triggering module 100 and a root cause positioning module 200.
The fault triggering module 100 is configured to perform fault monitoring by using a monitoring system, and to trigger a trained root cause positioning model based on a fault monitored by the monitoring system; wherein the trained root cause positioning model is globally explained by a trained decision tree;
the root cause positioning module 200 is configured to obtain a root cause fault instance according to the index data corresponding to the fault instances when the fault occurs and the fault dependency graph, and to locally explain the positioning result of the root cause fault instance; wherein the fault dependency graph is constructed from the fault instances.
The operable and interpretable root cause positioning apparatus for recurring type faults achieves operable and interpretable root cause positioning: how to carry out the repair operation can be obtained directly from the positioning result, and the root cause can be positioned accurately.
It should be noted that the explanation of the foregoing embodiment of the method for locating an operably interpretable root cause of a recurring type fault is also applicable to the apparatus for locating an operably interpretable root cause of a recurring type fault of this embodiment, and is not repeated herein.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
In the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined by one skilled in the art without contradiction.
While embodiments of the present application have been shown and described above, it will be understood that the above embodiments are exemplary and should not be construed as limiting the present application and that changes, modifications, substitutions and alterations in the above embodiments may be made by those of ordinary skill in the art within the scope of the present application.

Claims (10)

1. A method of locating an operably interpretable root cause of a recurring type fault, comprising the steps of:
utilizing a monitoring system to monitor faults, and triggering a trained root cause positioning model based on the faults monitored by the monitoring system; wherein the trained root cause positioning model carries out global interpretation through a trained decision tree;
obtaining a root cause fault instance according to the index data corresponding to the fault instances when the fault occurs and the fault dependency graph, and locally explaining the positioning result of the root cause fault instance; wherein the fault dependency graph is constructed from the fault instances.
2. The method according to claim 1, wherein a root cause positioning model is trained on historical fault data by minimizing a preset loss function with a stochastic gradient descent method, to obtain the trained root cause positioning model.
3. The method of claim 1, wherein the root cause localization model comprises three components, a feature extractor, a feature aggregator, and a classifier,
representing each fault instance as a fixed-length vector with the feature extractor; wherein the vector is an instance-level feature;
aggregating, with the feature aggregator, the instance-level features of the associated fault instances of each fault instance based on the structure of the fault dependency graph to obtain aggregated features;
and scoring based on the aggregated features by using the classifier to design the root cause localization model according to the three components.
4. The method according to claim 2, wherein the loss function preset by using a stochastic gradient descent method comprises:
measuring the difference between the output score s_T(v) of a fault instance v of a fault T and the true label r_T(v) by using binary cross entropy BCE:
BCE(r_T(v), s_T(v)) = r_T(v)·log(s_T(v)) + (1 - r_T(v))·log(1 - s_T(v))
carrying out a weighted average of the BCEs of different fault instances by using additional weights to obtain the preset loss function, wherein the preset loss function is denoted L_s, and:
Figure FDA0003517671240000011
wherein
Figure FDA0003517671240000021
is the training set, T_i denotes a fault in the training set, N_H is the size of the training set, V denotes the set of all fault instances,
Figure FDA0003517671240000022
5. The method of claim 1, wherein interpreting the location of the root cause fault instance using local interpretation comprises:
presetting two faults, namely a first fault and a second fault; wherein the first fault is a current fault to be interpreted and the second fault is a historical fault;
for each fault instance in the first fault, computing the distance between its aggregated feature and the aggregated feature of each fault instance in the second fault belonging to the same fault category, and taking the smallest one as the distance between the fault instance and the second fault;
and carrying out a weighted average of the distances from all fault instances in the first fault to the second fault to obtain the distance between the two faults, which is used to explain the positioning result of the root cause fault instance.
6. The method according to claim 5, wherein the distance between the preset first fault T_1 and second fault T_2 is expressed as:
Figure FDA0003517671240000023
wherein N_c(v) denotes the set of all fault instances belonging to the same fault category as v,
Figure FDA0003517671240000024
denotes the aggregated feature of fault instance v in fault T_1,
Figure FDA0003517671240000025
denotes the aggregated feature of fault instance v' in fault T_2, and ||·||_1 denotes the L1 norm.
7. The method of claim 1, wherein the trained root cause localization model is globally interpreted by a trained decision tree, comprising:
extracting a first time sequence characteristic from the index data by utilizing a plurality of time sequence curve characteristic extractions;
selecting the first time sequence feature by using feature extraction based on index reconstruction to obtain a second time sequence feature;
training a decision tree based on the second time sequence feature, with the trained root cause positioning model's judgment of whether a fault instance is a root cause as the training target, to obtain the trained decision tree; wherein the rules in the trained decision tree are used as a global interpretation of the trained root cause positioning model.
8. The method of claim 7, wherein the index reconstruction feature extraction comprises:
taking the feature extractor and the feature aggregator of the trained root cause positioning model as an encoder part of an autoencoder, and freezing parameters of the encoder part; and constructing a decoder part composed of a deconvolution layer and a fully connected layer;
training the autoencoder to minimize its reconstruction error;
and comparing the values of the first time sequence feature on the original index data and on the index data reconstructed by the autoencoder, and filtering out the feature if the difference is larger than a preset value.
9. The method of claim 7, wherein the plurality of time sequence curve feature extractions comprises a plurality of: mean, standard deviation, range count, maximum, minimum, autocorrelation coefficient, and peak count.
10. An operationally interpretable root cause locator for recurring type faults, comprising:
the fault triggering module is used for monitoring faults by using a monitoring system and triggering a trained root cause positioning model based on the faults monitored by the monitoring system; wherein the trained root cause positioning model carries out global interpretation through a trained decision tree;
the root cause positioning module is used for obtaining a root cause fault instance according to the index data corresponding to the fault instances when the fault occurs and the fault dependency graph, and locally explaining the positioning result of the root cause fault instance; wherein the fault dependency graph is constructed from the fault instances.
CN202210168856.4A 2022-02-23 2022-02-23 Operable and interpretable root cause positioning method for repeated occurrence type faults Pending CN114661504A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210168856.4A CN114661504A (en) 2022-02-23 2022-02-23 Operable and interpretable root cause positioning method for repeated occurrence type faults

Publications (1)

Publication Number Publication Date
CN114661504A true CN114661504A (en) 2022-06-24

Family

ID=82028308

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210168856.4A Pending CN114661504A (en) 2022-02-23 2022-02-23 Operable and interpretable root cause positioning method for repeated occurrence type faults

Country Status (1)

Country Link
CN (1) CN114661504A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110231704A1 (en) * 2010-03-19 2011-09-22 Zihui Ge Methods, apparatus and articles of manufacture to perform root cause analysis for network events
US20130051248A1 (en) * 2011-08-30 2013-02-28 Dan Pei Hierarchical anomaly localization and prioritization
CN109218114A (en) * 2018-11-12 2019-01-15 西安微电子技术研究所 A kind of server failure automatic checkout system and detection method based on decision tree
CN112801316A (en) * 2021-01-28 2021-05-14 中国人寿保险股份有限公司上海数据中心 Fault positioning method, system equipment and storage medium based on multi-index data
CN113541980A (en) * 2020-04-14 2021-10-22 中国移动通信集团浙江有限公司 Network slice fault root cause positioning method and device, computing equipment and storage medium
CN113747480A (en) * 2020-05-28 2021-12-03 中国移动通信集团浙江有限公司 Processing method and device for 5G slice fault and computing equipment

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116226702A (en) * 2022-09-09 2023-06-06 武汉中数医疗科技有限公司 Thyroid sampling data identification method based on bioelectrical impedance
CN116226702B (en) * 2022-09-09 2024-04-26 武汉中数医疗科技有限公司 Thyroid sampling data identification method based on bioelectrical impedance

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination