CN116450399B - Fault diagnosis and root cause positioning method for micro service system - Google Patents

Publication number
CN116450399B
Authority
CN
China
Prior art keywords
fault
fault diagnosis
data
training
root cause
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310697266.5A
Other languages
Chinese (zh)
Other versions
CN116450399A (en
Inventor
陈鹏
宋雨佳
温序铭
辛茹月
赵志明
陈娟
熊玲
李曦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xihua University
Original Assignee
Xihua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xihua University filed Critical Xihua University
Priority to CN202310697266.5A priority Critical patent/CN116450399B/en
Publication of CN116450399A publication Critical patent/CN116450399A/en
Application granted granted Critical
Publication of CN116450399B publication Critical patent/CN116450399B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079 Root cause analysis, i.e. error or fault diagnosis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Test And Diagnosis Of Digital Computers (AREA)

Abstract

The invention discloses a fault diagnosis and root cause positioning method for a micro-service system, relating to the technical field of computers and comprising the following steps: S1, constructing X anomaly detection models; S2, acquiring monitoring data in the micro-service system as a training data set; S3, training and optimizing the X anomaly detection models; S4, constructing a fault diagnosis model according to the training results; S5, performing causal relationship learning on the fault nodes and constructing an anomaly propagation graph; S6, acquiring monitoring data in real time; S7, analyzing the monitoring data with the fault diagnosis model to obtain a fault diagnosis result; S8, locating the cause of the fault according to the anomaly propagation graph and the fault diagnosis result. Through a pre-training mechanism, the method automatically selects, from the X anomaly detection models, the x models best suited to detecting CPU occupation, memory leakage, and network delay, and cascades these x models to diagnose faults in the observed data. The faulty service is located by capturing the anomaly pattern of the fault data and combining it with the root cause positioning method.

Description

Fault diagnosis and root cause positioning method for micro service system
Technical Field
The invention relates to the technical field of computers, in particular to a fault diagnosis and root cause positioning method for a micro-service system.
Background
Micro-services are an architectural and organizational approach to software development in which software is composed of small, independent services that communicate through well-defined APIs. Each service is owned by a small, independent team. The micro-service architecture makes applications easier to scale and faster to develop, which speeds up innovation and shortens the time to market for new features. Because of its high availability, rapid evolution, and ease of extension, the micro-service architecture has become very popular in web application development. It breaks an application into many small services, which enables faster development and maintenance and provides greater flexibility. At the same time, because the architecture spreads the whole application across many services, fault diagnosis becomes very difficult, and faults are unavoidable in production scenarios with heavy traffic. Quickly identifying the fault type and locating the faulty service is therefore essential to guaranteeing the quality and user experience of micro-services.
A micro-service system is typically monitored using multivariate time series, which reflect whether the system is functioning properly by recording micro-service metrics at each timestamp. System fault diagnosis identifies faults from the real-time series and, while reporting the faulty behavior of a micro-service, diagnoses the cause of the anomaly, such as CPU occupation or memory leakage. System fault diagnosis is therefore of great importance for improving the reliability of the micro-service system. In addition, because of the complex dependencies between services in the micro-service architecture, a failure in one micro-service may propagate along those dependencies to multiple micro-services. When a fault occurs, an operator therefore needs to quickly find the root cause of the fault in the whole system, which is known as root cause positioning.
At present, various solutions have been proposed to detect faults automatically and to determine their possible root causes. Existing fault detection solutions rely on identifying anomalies in service behavior that may be symptoms of a failure. However, most current methods only detect whether an anomaly has occurred and do not perform fault diagnosis, that is, they do not give the specific cause of the fault. As a result, operators cannot quickly find the cause after a fault occurs, which slows fault removal and increases losses. In addition, most existing fault diagnosis methods are supervised and require labeled training data. In production micro-service systems with heavy traffic, manually labeling the data requires a great deal of manpower, material, and financial resources, which limits the applicability of these methods for most companies. Furthermore, existing solutions for fault diagnosis and root cause analysis are scattered across different bodies of literature and often focus on only anomaly detection or only root cause analysis, which hinders the work of application operators.
In addition, because of the huge data volume in the micro-service architecture and the complex dependencies among services, existing fault diagnosis and root cause positioning methods still have the following defects: supervised fault diagnosis methods consume a great deal of manpower and money to label samples, while unsupervised clustering-based fault diagnosis methods do not extract the features of the monitoring data sufficiently, so their diagnosis accuracy is unsatisfactory.
Disclosure of Invention
The invention aims to solve the above problems by providing a fault diagnosis and root cause positioning method for a micro-service system.
The invention realizes the above purpose through the following technical scheme:
The fault diagnosis and root cause positioning method for the micro-service system comprises the following steps:
S1, constructing X anomaly detection models;
S2, acquiring monitoring data in the micro-service system as a training data set;
S3, importing the training data set into each of the X anomaly detection models, and training and optimizing them to obtain X optimized anomaly detection models;
S4, analyzing the training results of the X optimized anomaly detection models, and constructing a fault diagnosis model according to the training results;
S5, performing causal relationship learning on the fault nodes, and constructing a causal graph between the nodes as an anomaly propagation graph;
S6, acquiring monitoring data in real time;
S7, analyzing the monitoring data with the fault diagnosis model to obtain a fault diagnosis result;
S8, locating the cause of the fault according to the anomaly propagation graph and the fault diagnosis result.
The beneficial effects of the invention are as follows: through a pre-training mechanism, the method automatically selects, from the X anomaly detection models, the x models best suited to detecting CPU occupation, memory leakage, and network delay, and cascades these x models to diagnose faults in the observed data. In addition, the faulty service is located by capturing the anomaly pattern of the fault data and combining it with the root cause positioning method. A comparison with existing methods on 5 real micro-service data sets shows that the method achieves higher fault diagnosis and root cause positioning accuracy. The method thus diagnoses system faults and locates their root cause despite the huge data volume and the complex service dependencies of a micro-service architecture.
Drawings
FIG. 1 is a block diagram of a method for fault diagnosis and root cause localization of a microservice system according to the present invention;
FIG. 2 is a flow chart of a training mechanism of the method for fault diagnosis and root cause location of the micro-service system of the present invention;
FIG. 3 is a flow chart of the fault diagnosis step of the fault diagnosis and root cause positioning method of the micro-service system of the present invention;
FIG. 4 is a comprehensive ranking of the present invention on macro-precision with all baseline methods;
FIG. 5 is a comprehensive ranking of the present invention and all baseline methods on macro-recall;
FIG. 6 is a comprehensive ranking of the present invention on macro-F1 with all baseline methods.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more clear, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. It will be apparent that the described embodiments are some, but not all, embodiments of the invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the invention, as presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.
In the description of the present invention, it should be understood that the directions or positional relationships indicated by the terms "upper", "lower", "inner", "outer", "left", "right", etc. are based on the directions or positional relationships shown in the drawings, or the directions or positional relationships conventionally put in place when the inventive product is used, or the directions or positional relationships conventionally understood by those skilled in the art are merely for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the apparatus or elements referred to must have a specific direction, be configured and operated in a specific direction, and therefore should not be construed as limiting the present invention.
Furthermore, the terms "first," "second," and the like, are used merely to distinguish between descriptions and should not be construed as indicating or implying relative importance.
In the description of the present invention, it should also be noted that, unless explicitly specified and limited otherwise, terms such as "disposed," "connected," and the like are to be construed broadly, and for example, "connected" may be either fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communication between two elements. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art according to the specific circumstances.
The following describes specific embodiments of the present invention in detail with reference to the drawings.
The fault diagnosis and root cause positioning method for the micro service system is characterized by comprising the following steps:
S1, constructing X anomaly detection models.
S2, acquiring monitoring data in the micro-service system as a training data set.
S3, respectively importing the training data sets into X anomaly detection models, and training and optimizing the training data sets to obtain X optimized anomaly detection models.
S4, analyzing training results of the X optimized anomaly detection models; constructing a fault diagnosis model according to the training result;
Analyzing the training results includes: extracting data containing a plurality of preset features, feeding it into each of the X optimized anomaly detection models, and obtaining the anomaly detection performance of each optimized model on the preset features as the training result.
Constructing the fault diagnosis model includes: selecting x anomaly detection models from the X optimized anomaly detection models according to the training results, and cascading them in order of the quality of their training results to construct the fault diagnosis model, wherein x is smaller than X.
S5, performing causal relation learning on the fault nodes by using a PC algorithm, and constructing a causal relation graph among the nodes as an abnormal propagation graph.
S6, acquiring monitoring data in real time.
S7, analyzing fault diagnosis results of the monitoring data by using the fault diagnosis model.
S8, positioning a fault reason according to the abnormal propagation diagram and the fault diagnosis result; the method specifically comprises the following steps:
S81, performing root cause analysis on the anomaly propagation graph using the PageRank algorithm, and outputting the initial causal score v of every fault node, expressed as $v = \alpha P^{\top} v + \frac{1-\alpha}{\mu}\mathbf{1}$, wherein μ represents the number of services in the micro-service architecture, α represents the transition probability of the probability matrix, P is the transition probability matrix, and the transition probability $P_{ij}$ from node i to node j is expressed as $P_{ij} = \frac{w_{ij}}{\sum_{j} w_{ij}}$, wherein $w_{ij}$ represents the weight between i and j obtained through the PC algorithm;
S82, calculating the abnormality degree η of the fault data using the fault diagnosis result, wherein the abnormality degree is computed from the monitoring data and the anomaly threshold, γ represents the fault type, i represents the i-th time step in the sliding window, k represents the feature quantity, n represents the length of time of the input fault data, and $thred_{\gamma}$ is the anomaly threshold returned when the diagnosed fault type is γ;
S83, calculating the final causal score of each node for fault type γ by weighted integration of the initial causal score and the abnormality degree, wherein β represents the contribution of the root cause inference score to the final causal score, and the initial causal score used is that of the service node when the fault type is γ.
S84, sorting the fault importance degree of the fault nodes according to the final causal score to locate the fault reason.
Through a pre-training mechanism, the method automatically selects, from 8 anomaly detection models, the 3 models best suited to detecting CPU occupation, memory leakage, and network delay, and cascades these 3 models to diagnose faults in the observed data. In addition, the faulty service is located by capturing the anomaly pattern of the fault data and combining it with the root cause positioning method. A comparison with existing methods on 5 real micro-service data sets shows that the method achieves higher fault diagnosis and root cause positioning accuracy. The method thus diagnoses system faults and locates their root cause despite the huge data volume and the complex service dependencies of a micro-service architecture.
The working principle of the fault diagnosis and root cause positioning method of the micro service system is as follows:
In this description of the working principle, X = 8 and x = 3.
The 8 anomaly detection models are trained and optimized on the training data set. By sampling the data, the 3 models with the best anomaly detection performance are automatically selected from the 8 optimized models and used as the cascaded models for fault diagnosis. The training and optimization thus yield, for each of the three faults, the model that detects it best. The data generated during normal operation for CPU, latency, and memory are extracted from the training data and fed to the corresponding models, which are then trained so that each of the three models fully learns the characteristics of CPU, latency, or memory during normal operation. The diagnosis order of the faults is determined by the F1 values obtained during model selection: the faults are diagnosed in descending order of the F1 value of their models.
In the test stage, the three trained models are cascaded and fault diagnosis is performed on the input data; the fault diagnosis flow is shown in FIG. 3. The specific steps are as follows: when the input data passes through anomaly detection model 1, only fault 1 is detected. Data detected as anomalous is output as fault 1; otherwise the data is either normal or belongs to another fault type. The data that model 1 does not output as fault 1 is then fed into the second anomaly detection model to diagnose fault 2, and so on. Finally, the x anomaly detection models output x faults, and data that matches none of the x faults is output as normal data.
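The cascade described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the detectors are hypothetical threshold rules standing in for the trained anomaly detection models, and the sample layout `(cpu, memory, latency)` is invented for the example.

```python
from typing import Callable, List, Sequence, Tuple

# A stage pairs a fault label with a detector that returns True when the
# sample is anomalous for that fault. The detectors are hypothetical
# stand-ins for the trained anomaly detection models.
Stage = Tuple[str, Callable[[Sequence[float]], bool]]


def diagnose(sample: Sequence[float], cascade: List[Stage]) -> str:
    """Pass the sample through the stages in order; the first detector that
    fires decides the fault type, and a sample that passes every stage is
    reported as normal."""
    for fault, is_anomalous in cascade:
        if is_anomalous(sample):
            return fault
    return "normal"


# Toy detectors over a (cpu, memory, latency) sample (assumed layout).
cascade: List[Stage] = [
    ("cpu", lambda s: s[0] > 0.9),
    ("memory", lambda s: s[1] > 0.8),
    ("latency", lambda s: s[2] > 0.7),
]

samples = [
    (0.95, 0.20, 0.10),
    (0.30, 0.85, 0.10),
    (0.30, 0.20, 0.90),
    (0.30, 0.20, 0.10),
]
labels = [diagnose(s, cascade) for s in samples]
```

Because each stage only sees data that earlier stages judged normal, the order of the stages matters: a sample that is anomalous for several faults is attributed to the earliest matching stage.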
In the root cause positioning part, the goal of causal structure learning is to construct a causal graph of the monitoring metrics. The causal graph can be seen as the anomaly propagation paths between metrics. A popular method for constructing a causal graph from observed data is the PC algorithm, which uses statistical tests to perform conditional independence analysis and thereby learn the causal relationships between random variables. The monitoring data in the micro-service system is indexed by i, the i-th time step in the sliding window, and by k, the feature quantity. Taking each time series as a variable and the data at each time point as a sample, the PC algorithm outputs a directed acyclic graph (DAG) with k nodes. The procedure is as follows: the k nodes are connected into a fully connected undirected graph G, and the conditional independence of each pair of adjacent nodes in G is tested. If conditional independence holds, the edge between the two nodes is deleted. The d-separation principle is then used to determine the direction of the remaining edges, extending the skeleton to a DAG. Typically, an anomaly in a key performance indicator indicates that the service has failed; when network delay occurs in one micro-service, the entire service fails. When an anomaly occurs, the latency features in the anomalous data are extracted and input into root cause analysis, and the anomaly propagation graph is constructed using the PC algorithm. To locate the faulty service from the anomaly propagation graph, the PageRank algorithm is used to perform root cause analysis on the graph. The transition probability $P_{ij}$ from node i to node j is defined as follows:

$$P_{ij} = \frac{w_{ij}}{\sum_{j} w_{ij}}$$

where $w_{ij}$ represents the weight between i and j obtained through the PC algorithm. After the probability matrix is obtained, the root cause score v of each node is calculated as follows:

$$v = \alpha P^{\top} v + \frac{1-\alpha}{\mu}\mathbf{1}$$

where μ represents the number of services in the micro-service architecture and α represents the transition probability of the probability matrix, which is set to 0.85 in this description of the principle.
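The PageRank scoring described above can be illustrated with a small power iteration. This is a sketch under assumptions rather than the patent's implementation: the weight matrix `w` is a made-up example instead of PC algorithm output, rows with no outgoing weight are spread uniformly (a common convention that the source does not specify), and the iteration count is arbitrary.

```python
from typing import List


def transition_matrix(w: List[List[float]]) -> List[List[float]]:
    """Row-normalise edge weights: P[i][j] = w[i][j] / sum_j w[i][j]."""
    n = len(w)
    p = []
    for row in w:
        total = sum(row)
        if total > 0:
            p.append([x / total for x in row])
        else:
            # Assumption: a node without outgoing edges jumps uniformly.
            p.append([1.0 / n] * n)
    return p


def pagerank(w: List[List[float]], alpha: float = 0.85,
             iters: int = 200) -> List[float]:
    """Iterate v_j = alpha * sum_i P[i][j] * v_i + (1 - alpha) / n."""
    n = len(w)
    p = transition_matrix(w)
    v = [1.0 / n] * n
    for _ in range(iters):
        v = [(1.0 - alpha) / n
             + alpha * sum(p[i][j] * v[i] for i in range(n))
             for j in range(n)]
    return v


# Hypothetical 3-node graph: nodes 0 and 1 both point to node 2.
w = [
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0],
]
scores = pagerank(w)
```

In this toy graph node 2 receives score from both other nodes and therefore ends up with the highest value, while the scores still sum to 1.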
Besides topology, the anomaly pattern of the system during a particular failure also affects the final root cause result. To capture the anomaly pattern of the fault data, the anomaly thresholds used when diagnosing CPU occupation, memory leakage, and network delay are returned during fault diagnosis. These three anomaly detection thresholds are used to calculate the abnormality degree η of the data when each of the three faults occurs, giving the abnormality degrees of the input data for a CPU anomaly, a memory anomaly, and a latency anomaly, respectively.
Causal integration: the abnormality degree and the initial causal score are combined by weighted integration to obtain the final causal score, where β represents the contribution of the root cause inference score to the final causal score; β is set to 0.5 in this description of the principle.
After the final causal scores are obtained, the services are ranked in descending order of final causal score; the higher a service ranks, the more likely it is the root cause of the failure.
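The weighting and ranking step can be sketched as follows. The exact integration formula is not recoverable from this text, so the convex combination `beta * initial + (1 - beta) * anomaly` used here is an assumption, as are the per-service abnormality degrees and all numeric values, which are invented for illustration.

```python
from typing import Dict, List


def final_scores(initial: Dict[str, float], anomaly: Dict[str, float],
                 beta: float = 0.5) -> Dict[str, float]:
    """Blend the PageRank-style score with the abnormality degree.
    Assumption: a convex combination weighted by beta."""
    return {
        service: beta * initial[service]
        + (1.0 - beta) * anomaly.get(service, 0.0)
        for service in initial
    }


def rank_services(scores: Dict[str, float]) -> List[str]:
    """Sort services from most to least likely root cause."""
    return sorted(scores, key=scores.get, reverse=True)


# Illustrative values only; not measurements from the patent.
initial = {"carts": 0.30, "orders": 0.50, "payment": 0.20}
anomaly = {"carts": 0.90, "orders": 0.10, "payment": 0.20}

ranking = rank_services(final_scores(initial, anomaly, beta=0.5))
topology_only = rank_services(final_scores(initial, anomaly, beta=1.0))
```

With β = 0.5 the strongly anomalous `carts` service overtakes `orders`, whereas β = 1.0 reduces the ranking to the topology score alone, which shows how β shifts the balance between the two signals.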
Test procedure
1. Data set
In the experiment, the sock-shop e-commerce website is deployed; it serves as a benchmark for micro-services and cloud-native technologies. The website comprises 13 services, including functional services such as front-end, catalogue, shopping cart, user, order, payment, and shipping, as well as communication services that facilitate communication between the other services.
The sock-shop application is deployed on multiple virtual machines (VMs) in the cloud using Kubernetes. The Kubernetes cluster includes 1 master node and 3 worker nodes, each configured with Ubuntu 18.04, 4 vCPUs, 16 GB of RAM, and an 80 GB disk. The system is monitored, and service-level and resource-level data are collected, on the master node using the open-source monitoring and visualization tools Prometheus and Grafana. In addition, the workload generation tool Locust is used on the master node to simulate the workload of the micro-service application. The 13 sock-shop services are deployed on the worker nodes and automatically assigned to the different virtual machines by Kubernetes.
To simulate an application running in practice, three common anomalies are injected: CPU occupation, memory leakage, and network delay. The Pumba tool is used to simulate network faults and to stress the resources of Docker containers, thereby injecting the anomalies. For CPU occupation (CPU hog), CPU resources are consumed in each service; for memory leakage, memory is allocated continuously in each service; for network delay, network messages are delayed through traffic control. Each anomaly lasts 1 to 5 minutes, after which the application runs normally for 10 to 30 minutes; this process is repeated at least 5 times for each anomaly. According to the Prometheus configuration, data is collected in real time every 5 seconds at both the service level and the resource level. At the service level, the latency of each service is collected. At the resource level, container resource metrics are collected, including CPU usage, memory usage, disk reads and writes, and the number of network bytes received and transmitted.
The following comparative experiments were used to compare the performance of the present method with that of the existing methods.
2. Data preprocessing
In order to improve the precision of the model, a training set and a testing set are processed through data standardization, and data of different specifications are converted into unified specifications, so that the influence of scale, characteristics and distribution differences on the model is reduced. The min-max normalization was used.
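As an illustration of min-max normalization, the sketch below rescales each feature column to the range spanned by the training data. Fitting the minimum and maximum on the training set and reusing them on the test set is an assumption made here; the source only states that both sets are normalized. The column values are invented.

```python
from typing import List, Tuple


def fit_min_max(rows: List[List[float]]) -> Tuple[List[float], List[float]]:
    """Return the per-column minimum and maximum of the given rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def transform(rows: List[List[float]], lo: List[float],
              hi: List[float]) -> List[List[float]]:
    """Scale each value to (x - min) / (max - min); constant columns map to 0."""
    return [
        [(x - l) / (h - l) if h > l else 0.0 for x, l, h in zip(row, lo, hi)]
        for row in rows
    ]


# Two hypothetical features on very different scales.
train = [[10.0, 200.0], [20.0, 400.0], [30.0, 300.0]]
test = [[25.0, 500.0]]

lo, hi = fit_min_max(train)
scaled_train = transform(train, lo, hi)
scaled_test = transform(test, lo, hi)
```

Values in the test set that fall outside the training range map outside [0, 1], as the second test feature does here; whether to clip such values is a design choice the source does not address.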
3. Model training process
In FIG. 1, (1) is the pre-training module of the model. Its main purpose is to automatically and adaptively select, from the candidate models, the model best suited to detecting each kind of anomaly (CPU, memory, and latency). The specific flow is shown in FIG. 2, in which the training data is randomly sampled. The specific steps are as follows: a random number is generated over the preprocessed training data, the training data is sampled according to the random number, and a data set of size 500 is extracted as the training subset. Then a piece of CPU-occupation data is randomly extracted from the training data and a data set of size 100 is generated from it; data containing memory leakage and network delay are sampled in the same way, and the three data sets are concatenated into a test subset of size 300. The CPU features of the randomly sampled training and test subsets are extracted and fed into each candidate model for anomaly detection, and the F1 value is output. When CPU anomaly detection is performed, other anomalies in the data are treated as normal data. After 5 rounds of sampling, the average F1 value of anomaly detection over the 5 rounds is output. The model with the largest average F1 value is then found and its name is output. The models best suited to detecting network delay and memory leakage are obtained in the same way.
As shown in (2) of FIG. 1, the data pre-training module yields the optimal model for detecting each of the three kinds of faults. The monitoring data for CPU, latency, and memory during normal operation is extracted from the training data and fed to the corresponding models, which are then trained so that each of the three models fully learns the characteristics of CPU, latency, or memory during normal operation. The diagnosis order of the faults is determined by the F1 values of the models selected in the pre-training stage: the faults are diagnosed in descending order of the F1 value of their models.
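The selection rule (highest average F1 over the sampling rounds) and the ordering rule (descending F1) can be sketched as below. The model names `model_a`, `model_b`, and `model_c` and every F1 value are hypothetical placeholders; in the described method the values would come from running each candidate model on the sampled subsets.

```python
from statistics import mean
from typing import Dict, List, Tuple

# f1_runs[fault][model] holds one F1 value per sampling round.
F1Runs = Dict[str, Dict[str, List[float]]]


def select_models(f1_runs: F1Runs) -> Dict[str, Tuple[str, float]]:
    """For each fault, keep the candidate with the largest average F1."""
    chosen = {}
    for fault, per_model in f1_runs.items():
        averages = {model: mean(vals) for model, vals in per_model.items()}
        best = max(averages, key=averages.get)
        chosen[fault] = (best, averages[best])
    return chosen


def diagnosis_order(chosen: Dict[str, Tuple[str, float]]) -> List[str]:
    """Order the faults by the F1 of their selected model, best first."""
    return sorted(chosen, key=lambda fault: chosen[fault][1], reverse=True)


# Illustrative F1 values for 5 sampling rounds (not results from the patent).
f1_runs: F1Runs = {
    "cpu": {
        "model_a": [0.90, 0.92, 0.88, 0.91, 0.89],
        "model_b": [0.80, 0.85, 0.82, 0.81, 0.83],
    },
    "memory": {
        "model_a": [0.70, 0.72, 0.71, 0.69, 0.73],
        "model_b": [0.95, 0.96, 0.94, 0.97, 0.93],
    },
    "latency": {
        "model_a": [0.60, 0.62, 0.61, 0.63, 0.59],
        "model_c": [0.84, 0.86, 0.85, 0.83, 0.87],
    },
}

chosen = select_models(f1_runs)
order = diagnosis_order(chosen)
```

Placing the most reliable detector first limits the number of samples that an early, weaker stage could mislabel before later stages see them.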
After model training is completed, as shown in (3) of FIG. 1, the three trained models are cascaded and fault diagnosis is performed on the collected data. The specific steps of fault diagnosis are shown in FIG. 3. First, data is input into the first model for anomaly detection. For anomaly detection, an anomaly threshold is set, and a point is judged to be anomalous when the error between the predicted data and the real data exceeds the threshold. To improve the anomaly detection accuracy, the optimal threshold is searched for using the best-F1 method, and this threshold is returned. When model 1 performs anomaly detection on the input data, it outputs anomalous data, which indicates that fault 1 has occurred, and normal data, which includes data of other faults as well as data in which no fault has occurred. Next, the data output as normal is fed into the next model to detect fault 2; as with the first model, fault 2 and normal data are output, and the data this model detects as normal is fed into model 3, which outputs fault 3 and normal data. The three fault diagnoses yield three anomaly thresholds, which are used to capture the anomaly pattern of the entity's metric data so as to improve the accuracy of root cause positioning.
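A best-F1 threshold search can be sketched as a scan over candidate thresholds. The choice of candidates (each observed error value) and the error and label values are assumptions for illustration; the source only says that the threshold maximizing F1 is returned.

```python
from typing import List, Tuple


def f1_score(labels: List[int], preds: List[bool]) -> float:
    """F1 for the anomalous (positive) class; 0.0 when nothing is hit."""
    tp = sum(1 for y, p in zip(labels, preds) if y and p)
    fp = sum(1 for y, p in zip(labels, preds) if not y and p)
    fn = sum(1 for y, p in zip(labels, preds) if y and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_f1_threshold(errors: List[float],
                      labels: List[int]) -> Tuple[float, float]:
    """Try every observed error as a threshold and keep the one whose
    'error exceeds threshold' rule gives the highest F1."""
    best_thr, best_f1 = 0.0, -1.0
    for thr in sorted(set(errors)):
        preds = [e > thr for e in errors]
        score = f1_score(labels, preds)
        if score > best_f1:
            best_thr, best_f1 = thr, score
    return best_thr, best_f1


# Hypothetical prediction errors with ground-truth anomaly flags.
errors = [0.05, 0.10, 0.08, 0.90, 0.75, 0.12, 0.60]
labels = [0, 0, 0, 1, 1, 0, 1]

threshold, score = best_f1_threshold(errors, labels)
```

Here the largest normal error, 0.12, separates the two groups perfectly, so it is returned with an F1 of 1.0; on real data the classes overlap and the returned threshold is a compromise between precision and recall.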
As shown in (4) of FIG. 1, the anomalous data is passed to root cause positioning based on causal inference. In this part, causal relationship learning is performed on the fault nodes through the PC algorithm, and a causal graph between the nodes is constructed and used as the anomaly propagation graph. PageRank is then used to perform root cause analysis on the anomaly propagation graph and to output the causal score of each fault node.
To further improve the positioning accuracy, the anomaly thresholds returned in the fault diagnosis stage are used to calculate the abnormality degree of the fault data for CPU occupation, memory leakage, and network delay respectively, where thred is the anomaly threshold returned when diagnosing the corresponding fault.
As shown in (5) of FIG. 1, the causal score obtained by the causal-inference-based root cause positioning method and the abnormality degree of the data are weighted and integrated, with β adjusting the contributions of the root cause positioning score and the abnormality degree to the final causal score. After the final causal score is obtained, the fault nodes are ranked by fault importance according to this score. The higher a node ranks, the more likely it is the root cause of the fault propagation.
Model performance metrics
Fault diagnosis
Model performance is compared using several key metrics based on the classification confusion matrix: macro precision, macro recall, and macro F1-score.
Precision is the proportion of samples predicted as positive by the model that are actually positive. With TP, FP, and FN denoting the numbers of true positives, false positives, and false negatives, it is calculated as:
$$\text{Precision} = \frac{TP}{TP + FP}$$
Recall is the proportion of actually positive samples that the model predicts as positive. It is calculated as:
$$\text{Recall} = \frac{TP}{TP + FN}$$
The F1 score is the harmonic mean of precision and recall, calculated as:
$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
Macro precision, macro recall, and macro F1 are the arithmetic means of the corresponding per-class metric values over all classes.
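As a concrete illustration of macro averaging, the sketch below computes per-class precision, recall, and F1 from label lists and then averages them over classes. It assumes integer or otherwise sortable class labels and defines a metric as 0 when its denominator is 0; both choices are assumptions made for the example.

```python
from typing import Sequence, Tuple


def per_class_prf(y_true: Sequence, y_pred: Sequence, cls) -> Tuple[float, float, float]:
    """Precision, recall and F1 for one class treated as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def macro_prf(y_true: Sequence, y_pred: Sequence) -> Tuple[float, float, float]:
    """Arithmetic mean of the per-class precision, recall and F1."""
    classes = sorted(set(y_true) | set(y_pred))
    rows = [per_class_prf(y_true, y_pred, c) for c in classes]
    n = len(rows)
    return tuple(sum(r[k] for r in rows) / n for k in range(3))
```

Note that macro F1 here is the mean of the per-class F1 values, not the F1 of macro precision and macro recall; the two generally differ.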
The robustness of the model was also verified using the F1 Average Rank, which is the average rank of each model's macro F1-score across the five datasets.
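A minimal sketch of the average-rank computation follows. It assumes every model has a macro F1 value on every dataset and that rank 1 is the best score; ties are broken by dictionary order, which is a simplification. The model and dataset names in the usage note are placeholders, not results from Table 1.

```python
from typing import Dict


def average_rank(f1_by_dataset: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Average, over datasets, of each model's rank by macro F1 (1 = best)."""
    totals: Dict[str, float] = {}
    for scores in f1_by_dataset.values():
        ordered = sorted(scores, key=scores.get, reverse=True)
        for rank, model in enumerate(ordered, start=1):
            totals[model] = totals.get(model, 0.0) + rank
    n = len(f1_by_dataset)
    return {model: total / n for model, total in totals.items()}
```

For instance, with placeholder scores `{"d1": {"A": 0.9, "B": 0.8, "C": 0.7}, "d2": {"A": 0.6, "B": 0.95, "C": 0.5}}`, models A and B each rank first once and second once, so both have an average rank of 1.5, while C has 3.0. A lower average rank indicates more consistent performance across datasets.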
Root cause localization
In root cause localization, two widely used metrics, PR@k and Avg@k, are used to evaluate the performance of the model. PR@k is the probability that the top k results predicted by the root cause localization algorithm contain the true root cause. For small k, a higher PR@k indicates that the algorithm identifies the root cause more accurately. The specific formula is as follows:
where A represents the set of faults present in the system, a represents one fault in A, $V_a$ represents the true root cause of fault a, $R_a$ represents the root causes predicted by the root cause localization algorithm for fault a, and i indexes the i-th predicted root cause in $R_a$.
Avg@k evaluates the model's performance over the top k predicted causes as a whole, assessing the overall performance of the algorithm by averaging PR@k. The specific formula is as follows:
where j represents the summation index.
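The two metrics can be sketched as follows, assuming the definitions commonly used in the root cause localization literature, which match the symbols above: PR@k averages, over faults a in A, the number of true root causes among the first k entries of $R_a$ divided by $\min(k, |V_a|)$, and Avg@k is the mean of PR@j for j from 1 to k. Whether the patent's formulas take exactly this form is an assumption. The function and variable names are hypothetical.

```python
from typing import Dict, List, Set


def pr_at_k(predicted: Dict[str, List[str]], truth: Dict[str, Set[str]], k: int) -> float:
    """Mean over faults of (true causes found in the top k) / min(k, #true causes)."""
    total = 0.0
    for fault, true_causes in truth.items():
        top_k = predicted[fault][:k]  # ranked list, assumed free of duplicates
        hits = sum(1 for cause in top_k if cause in true_causes)
        total += hits / min(k, len(true_causes))
    return total / len(truth)


def avg_at_k(predicted: Dict[str, List[str]], truth: Dict[str, Set[str]], k: int) -> float:
    """Mean of PR@j for j = 1..k."""
    return sum(pr_at_k(predicted, truth, j) for j in range(1, k + 1)) / k
```

With two illustrative faults whose true causes are `{"cpu"}` and `{"mem"}` and predicted rankings `["cpu", "net", "mem"]` and `["net", "mem", "cpu"]`, PR@1 is 0.5 because only the first fault has its true cause ranked first, and PR@2 is 1.0 because both true causes appear within the top two.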
Model comparison results
Figs. 4, 5, and 6 and Tables 1 and 2 show the experimental results of the present model on real datasets in comparison with existing models, as follows:
As can be seen from Table 1, the present model outperforms the other 11 models on the shopping, user login and registration (user), and shopping cart (carts) datasets. The 11 models are: K-means (KMeans); the Gaussian mixture model (GaussianMixture); the hierarchical clustering algorithm Birch; WPS, a fault diagnosis model based on a Wasserstein-distance generative adversarial network; CGNN-MHSA-AR, a fault diagnosis model based on ensemble learning with parallel graph attention networks; MAD_GAN, a generative adversarial network fault diagnosis model; USAD, an unsupervised multivariate time series fault diagnosis model; DAGMM, a fault diagnosis model based on a deep autoencoding Gaussian mixture model; MTAD, a fault diagnosis model based on a graph attention network; OmniAnomaly, a fault diagnosis model based on a stochastic recurrent neural network; and CAE_M, a fault diagnosis model based on a deep convolutional autoencoding memory network. On average, the present model achieves a macro precision of 78.5%, a macro recall of 95.7%, and a macro F1 score of 82.4%, the highest among all models. On the two remaining datasets, order (orders) and commodity catalogue, the macro F1 score of the present model is slightly lower than that of the best model. Specifically, on the commodity catalogue dataset, the graph-attention-network-based model MTAD reaches the best macro F1 value of 0.930, which is 3% higher than the present model; on the orders dataset, the stochastic-recurrent-neural-network-based model OmniAnomaly reaches the best macro F1 value of 0.979, which is 0.9% higher than the present model.
As can be seen from Table 2, for CPU faults the method achieves a PR@1 of 0.8, meaning that there is an 80% probability of finding the root cause at the top-ranked metric. Similarly, for memory leak and network delay faults, the probability of finding the root cause at the top-ranked metric is 60% and 40%, respectively. Overall, the average localization accuracy Avg@5 for CPU faults across the 5 datasets is 0.88. Compared with the best-performing greedy-search-based root cause localization method (GES-based), the method improves root cause localization accuracy by 32%. Compared with the best-performing root cause localization method based on causal relationship prediction (PC-based), it improves localization accuracy by 24% for memory leaks and by 28% for network delay. The method also outperforms the root cause localization method based on the linear non-Gaussian acyclic model (LiNGAM-based) in localization accuracy for CPU faults, memory leaks, and network delay.
Figs. 4, 5, and 6 show the macro precision, macro recall, and macro F1 of the present model and the 11 comparison models, respectively. As the figures show, the present model performs better on all three evaluation metrics, which indicates that it can better select the models suited to diagnosing the three faults and cascade them to achieve higher fault diagnosis performance.
Table 1. Comparison of the detection performance of the present technique with 11 fault diagnosis models on 5 datasets
Table 2. Comparison of the average localization performance of the present technique with 3 root cause localization methods on 5 datasets
The technical solution of the invention is not limited to the specific embodiments described above; all technical variations made in accordance with the technical solution of the invention fall within the scope of protection of the invention.

Claims (5)

1. A fault diagnosis and root cause localization method for a micro-service system, characterized by comprising the following steps:
S1, constructing X anomaly detection models;
S2, acquiring monitoring data in the micro-service system as a training data set;
S3, importing the training data set into each of the X anomaly detection models, and training and optimizing them to obtain X optimized anomaly detection models;
S4, analyzing the training results of the X optimized anomaly detection models, and constructing a fault diagnosis model according to the training results; analyzing the training results comprises: extracting data containing a plurality of preset features, importing the data into each of the X optimized anomaly detection models, and obtaining the anomaly detection effect of each optimized anomaly detection model on the preset features as a training result; selecting x anomaly detection models from the X optimized anomaly detection models according to the training results, and cascading them in order of the quality of the training results to construct the fault diagnosis model, wherein x is smaller than X;
S5, performing causal relationship learning on the fault nodes, and constructing a causal graph among the nodes as an anomaly propagation graph;
S6, acquiring monitoring data in real time;
S7, analyzing the monitoring data with the fault diagnosis model to obtain a fault diagnosis result;
S8, locating the fault cause according to the anomaly propagation graph and the fault diagnosis result, specifically comprising the following steps:
S81, performing root cause analysis on the anomaly propagation graph using the PageRank algorithm, and outputting an initial causal score for each fault node;
S82, calculating the degree of anomaly of the fault data using the respective fault diagnosis results;
S83, combining the initial causal score and the degree of anomaly by weighting to calculate the final causal score of each node;
S84, ranking the fault nodes by fault importance according to the final causal score to locate the fault cause.
2. The method for fault diagnosis and root cause localization of a micro-service system according to claim 1, wherein in S5, causal relationship learning is performed on the fault nodes using a PC algorithm, and a causal graph among the nodes is constructed as the anomaly propagation graph.
3. The method for fault diagnosis and root cause localization of a micro-service system according to claim 1, wherein in S81, the initial causal score v is expressed as [formula], wherein μ represents the number of services in the micro-service architecture, α represents the transition probability of the probability matrix, P is the transition probability, and the transition probability $P_{ij}$ from node i to node j is expressed as [formula], wherein $w_{ij}$ represents the weight between i and j obtained through the PC algorithm.
4. The method for fault diagnosis and root cause localization of a micro-service system according to claim 1, wherein in S82, the calculated degree of anomaly η of the fault data is expressed as [formula], wherein [symbol] represents the monitoring data, γ represents the fault type, i represents the i-th time step in the sliding window, k represents the number of features, n represents the length of time of the input fault data, and the threshold term with subscript γ represents the anomaly threshold returned when a fault of type γ is diagnosed.
5. The method for fault diagnosis and root cause localization of a micro-service system according to claim 1, wherein in S83, the final causal score $score_\gamma$ is expressed as $score_\gamma = \beta \times v_\gamma + (1-\beta) \times \eta_\gamma$, wherein β represents the contribution of the root cause inference score to the final causal score, and $v_\gamma$ represents the initial causal score of the service node when a fault of type γ occurs.
CN202310697266.5A 2023-06-13 2023-06-13 Fault diagnosis and root cause positioning method for micro service system Active CN116450399B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310697266.5A CN116450399B (en) 2023-06-13 2023-06-13 Fault diagnosis and root cause positioning method for micro service system


Publications (2)

Publication Number Publication Date
CN116450399A CN116450399A (en) 2023-07-18
CN116450399B true CN116450399B (en) 2023-08-22

Family

ID=87120528

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310697266.5A Active CN116450399B (en) 2023-06-13 2023-06-13 Fault diagnosis and root cause positioning method for micro service system

Country Status (1)

Country Link
CN (1) CN116450399B (en)


Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111683108A (en) * 2020-08-17 2020-09-18 鹏城实验室 Method for generating network flow anomaly detection model and computer equipment
CN112698975A (en) * 2020-12-14 2021-04-23 北京大学 Fault root cause positioning method and system of micro-service architecture information system
CN113391943A (en) * 2021-06-18 2021-09-14 广东工业大学 Micro-service fault root cause positioning method and device based on cause and effect inference
CN113900845A (en) * 2021-09-28 2022-01-07 大唐互联科技(武汉)有限公司 Method and storage medium for micro-service fault diagnosis based on neural network
CN114003466A (en) * 2021-11-04 2022-02-01 南京大学 Fault root cause positioning method for micro-service application program
CN114282434A (en) * 2021-12-16 2022-04-05 成都航天科工大数据研究院有限公司 Industrial equipment health management system and method
CN114666204A (en) * 2022-04-22 2022-06-24 广东工业大学 Fault root cause positioning method and system based on cause and effect reinforcement learning
CN115237717A (en) * 2022-07-28 2022-10-25 西南科技大学 Micro-service abnormity detection method and system
CN115309575A (en) * 2022-06-27 2022-11-08 南开大学 Micro-service fault diagnosis method, device and equipment based on graph convolution neural network
CN115333921A (en) * 2022-08-20 2022-11-11 海南大学 Micro-service abnormal root cause positioning method and device
CN115470854A (en) * 2022-09-15 2022-12-13 国网辽宁省电力有限公司信息通信分公司 Information system fault classification method and classification system
CN115526847A (en) * 2022-09-19 2022-12-27 江南大学 Mainboard surface defect detection method based on semi-supervised learning
CN115640159A (en) * 2022-11-03 2023-01-24 香港中文大学深圳研究院 Micro-service fault diagnosis method and system
CN115730262A (en) * 2022-11-25 2023-03-03 西华大学 Abnormity diagnosis method and device of data-driven cloud platform system
CN115756929A (en) * 2022-11-23 2023-03-07 北京大学 Abnormal root cause positioning method and system based on dynamic service dependency graph
CN116108371A (en) * 2023-04-13 2023-05-12 西华大学 Cloud service abnormity diagnosis method and system based on cascade abnormity generation network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230007023A1 (en) * 2021-06-30 2023-01-05 Dropbox, Inc. Detecting anomalous digital actions utilizing an anomalous-detection model


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Cascading failure modeling of complex software; Wang Jian; Liu Yanheng; Liu Xuelian; Chinese Journal of Computers (计算机学报); Vol. 34, No. 06; 1137-1147 *

Also Published As

Publication number Publication date
CN116450399A (en) 2023-07-18


Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant