CN116450399A

CN116450399A - Fault diagnosis and root cause positioning method for micro service system

Info

Publication number: CN116450399A
Application number: CN202310697266.5A
Authority: CN
Inventors: 陈鹏; 宋雨佳; 温序铭; 辛茹月; 赵志明; 陈娟; 熊玲; 李曦
Original assignee: Xihua University
Current assignee: Xihua University
Priority date: 2023-06-13
Filing date: 2023-06-13
Publication date: 2023-07-18
Anticipated expiration: 2043-06-13
Also published as: CN116450399B

Abstract

The invention discloses a microservice system fault diagnosis and root cause location method in the field of computer technology. The method comprises: S1, constructing X anomaly detection models; S2, acquiring monitoring data from the microservice system as a training data set; S3, training and optimizing the X anomaly detection models; S4, building a fault diagnosis model based on the training results; S5, learning the causal relationships among faulty nodes and building an anomaly propagation graph; S6, acquiring monitoring data in real time; S7, using the fault diagnosis model to obtain the fault diagnosis result for the monitoring data; and S8, locating the cause of the fault according to the anomaly propagation graph and the fault diagnosis result. Through a pre-training mechanism, the method automatically selects, from the X anomaly detection models, the x models best suited to detecting CPU hog, memory leak, and network delay, and cascades these x models to diagnose faults in observed data. By capturing the anomaly pattern of the fault data and combining it with a root cause location method, the faulty service is located.

Description

Microservice system fault diagnosis and root cause location method

Technical field

The invention relates to the field of computer technology, and in particular to a microservice system fault diagnosis and root cause location method.

Background

Microservices are an architectural and organizational approach to software development in which software is composed of small independent services that communicate through well-defined APIs. Each service is owned by a small independent team. A microservice architecture makes applications easier to scale and faster to develop, which accelerates innovation and shortens time-to-market for new features. Because of its high availability, fast evolution, and easy scalability, the architecture has become very popular in web application development. Decomposing an application into multiple small services makes development and maintenance faster and provides greater flexibility. At the same time, precisely because the application is spread across many services, fault diagnosis becomes very difficult, and in production scenarios with heavy traffic, faults are inevitable. Being able to quickly identify the type of a fault and locate the faulty service is therefore crucial to ensuring the quality and user experience of microservices.

Microservice systems are usually monitored with multivariate time series, which reflect whether a system is running normally by collecting microservice information at each timestamp. System fault diagnosis identifies faults from the real-time series, diagnoses the cause of the anomaly, and reports the occurrence of faulty microservice behavior. In a microservice system, fault diagnosis is thus used to report the cause of a fault, such as CPU hog or memory leak, and is of great significance for improving system reliability. In addition, because of the complex dependencies between services in a microservice architecture, a fault in one microservice may propagate along those dependencies to many others. When a fault occurs, operators therefore need to quickly find the root cause of the failure of the whole system, that is, perform root cause location.

Various solutions have been proposed to automate fault detection and automatically determine the possible root causes. Existing fault detection solutions rely on identifying anomalies in service behavior that may be symptoms of a fault. However, most current methods can only detect whether an anomaly has occurred; they cannot perform fault diagnosis, that is, give the specific cause of the fault. As a result, operators cannot quickly find the cause after a fault occurs, which slows troubleshooting and increases losses. In addition, most existing fault diagnosis methods are supervised and require labeled training data. In production scenarios with the heavy traffic of a microservice system, manually labeling data requires large amounts of manpower, material, and financial resources, which limits the applicability of such methods for most companies. Furthermore, existing solutions for fault diagnosis and root cause analysis are scattered across different bodies of literature and often address only anomaly detection or only root cause analysis, which hinders the work of application operators.

Moreover, because of the huge data volume in a microservice architecture and the complex dependencies between services, existing fault diagnosis and root cause location methods still have the following shortcomings: labeling and classifying samples for fault diagnosis consumes a great deal of manpower and money, while unsupervised clustering methods for fault diagnosis are no longer sufficient to extract the features of monitoring data, and their diagnostic accuracy is unsatisfactory.

Summary of the invention

The object of the present invention is to provide a microservice system fault diagnosis and root cause location method that solves the above problems.

The present invention achieves this object through the following technical solution:

A microservice system fault diagnosis and root cause location method, comprising:

S1. Constructing X anomaly detection models;

S2. Acquiring monitoring data from the microservice system as a training data set;

S3. Feeding the training data set into each of the X anomaly detection models and training and optimizing them to obtain X optimized anomaly detection models;

S4. Analyzing the training results of the X optimized anomaly detection models, and building a fault diagnosis model according to the training results;

S5. Learning the causal relationships among the faulty nodes, and constructing the causal graph between nodes as the anomaly propagation graph;

S6. Acquiring monitoring data in real time;

S7. Using the fault diagnosis model to obtain the fault diagnosis result for the monitoring data;

S8. Locating the cause of the fault according to the anomaly propagation graph and the fault diagnosis result.

The beneficial effects of the present invention are as follows. Through a pre-training mechanism, the method automatically selects, from the X anomaly detection models, the x models best suited to detecting CPU hog, memory leak, and network delay, and cascades these x models to diagnose faults in observed data. In addition, by capturing the anomaly pattern of the fault data and combining it with a root cause location method, the faulty service is located. A comparison of fault diagnosis and root cause location performance against existing methods on five real microservice data sets shows that the method achieves strong results on both tasks. Despite the huge data volume and the complex dependencies between services in a microservice architecture, it effectively diagnoses system faults and locates their root cause.

Brief description of the drawings

Fig. 1 is an architecture diagram of the microservice system fault diagnosis and root cause location method of the present invention;

Fig. 2 is a flowchart of the training mechanism of the microservice system fault diagnosis and root cause location method of the present invention;

Fig. 3 is a flowchart of fault diagnosis in the microservice system fault diagnosis and root cause location method of the present invention;

Fig. 4 is the overall ranking of the present invention and all baseline methods on macro-precision;

Fig. 5 is the overall ranking of the present invention and all baseline methods on macro-recall;

Fig. 6 is the overall ranking of the present invention and all baseline methods on macro-F1.

Detailed description

To make the purpose, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are some, but not all, of the embodiments of the present invention. The components of the embodiments generally described and illustrated in the drawings may be arranged and designed in a variety of configurations.

Accordingly, the following detailed description of the embodiments provided in the drawings is not intended to limit the scope of the claimed invention, but merely represents selected embodiments. All other embodiments obtained by persons of ordinary skill in the art based on these embodiments without creative effort fall within the scope of protection of the present invention.

Note that similar reference numerals and letters denote similar items in the following drawings; once an item is defined in one drawing, it need not be defined or explained again in subsequent drawings.

In the description of the present invention, orientations or positional relationships indicated by terms such as "upper", "lower", "inner", "outer", "left", and "right" are based on those shown in the drawings, on the way the product of the invention is customarily placed in use, or on the customary understanding of those skilled in the art. They are used only to facilitate and simplify the description, and do not indicate or imply that the referenced device or element must have a particular orientation or be constructed and operated in a particular orientation. They therefore should not be construed as limiting the invention.

In addition, the terms "first", "second", and the like are used only to distinguish descriptions and should not be construed as indicating or implying relative importance.

It should also be noted that, unless otherwise expressly specified and limited, terms such as "arranged" and "connected" are to be understood broadly. For example, a connection may be fixed, detachable, or integral; it may be mechanical or electrical; it may be direct or indirect through an intermediary; or it may be an internal communication between two elements. Those of ordinary skill in the art can understand the specific meaning of these terms in the present invention according to the circumstances.

Specific embodiments of the present invention are described in detail below with reference to the drawings.

The microservice system fault diagnosis and root cause location method comprises:

S1. Construct X anomaly detection models.

S2. Acquire monitoring data from the microservice system as a training data set.

S3. Feed the training data set into each of the X anomaly detection models and train and optimize them to obtain X optimized anomaly detection models.

S4. Analyze the training results of the X optimized anomaly detection models and build a fault diagnosis model from them. Analyzing the training results comprises: extracting data containing multiple preset features, feeding it into each of the X optimized anomaly detection models, and taking each model's anomaly detection performance on the preset features as the training result.

Building the fault diagnosis model comprises: selecting x of the X optimized anomaly detection models according to the training results, and cascading them in order from best to worst training result, where x is less than X.

S5. Use the PC algorithm to learn the causal relationships among the faulty nodes, and construct the causal graph between nodes as the anomaly propagation graph.

S6. Acquire monitoring data in real time.

S7. Use the fault diagnosis model to obtain the fault diagnosis result for the monitoring data.

S8. Locate the cause of the fault according to the anomaly propagation graph and the fault diagnosis result, specifically:

S81. Use the PageRank algorithm to perform root cause analysis on the anomaly propagation graph and output the initial causal score v of every faulty node (the formula is not reproduced in this text), where μ is the number of services in the microservice architecture, α is described as the transition probability of the probability matrix, P is the transition probability, P_ij is the transition probability from node i to node j, and w_ij is the weight between i and j obtained by the PC algorithm;

S82. Use the fault diagnosis result to compute the anomaly degree of the fault data for each fault type (the formula is not reproduced in this text), where γ is the fault type, i is the i-th time step in the sliding window, k is the number of features, n is the time length of the input fault data, and thred_γ is the anomaly threshold returned when fault type γ is diagnosed;

S83. Compute the final causal score score_γ of each node as a weighted combination of the initial causal score and the anomaly degree (the formula is not reproduced in this text), where β is the contribution of the root cause inference score to the final causal score, and the initial causal score is that of the service node when a fault of type γ occurs.

S84. Rank the faulty nodes by fault importance according to the final causal score to locate the cause of the fault.

Through the pre-training mechanism, this method automatically selects from 8 anomaly detection models the 3 best suited to detecting CPU hog, memory leak, and network delay, and cascades these 3 models to diagnose faults in observed data. In addition, by capturing the anomaly pattern of the fault data and combining it with a root cause location method, the faulty service is located. A comparison of fault diagnosis and root cause location performance against existing methods on five real microservice data sets shows that the method achieves strong results on both tasks. Despite the huge data volume and the complex dependencies between services in a microservice architecture, it effectively diagnoses system faults and locates their root cause.

The working principle of the microservice system fault diagnosis and root cause location method of the present invention is as follows:

In this description of the working principle, X = 8 and x = 3.

The training data set is used to train and optimize the 8 anomaly detection models. Using sampled data, the 3 models with the best anomaly detection performance are selected automatically from the 8 optimized models to form the cascade for fault diagnosis. Training and optimization thus yield the three models that best detect each of the three fault types. The data produced by CPU, latency, and memory during normal operation is extracted from the training data and fed to the corresponding model for training, so that the three models fully learn the normal-operation characteristics of CPU, latency, and memory. The order of fault diagnosis is determined by the F1 values obtained when the models were selected: the models are applied in descending order of F1 value.

In the test phase, the three trained models are cascaded to diagnose faults in the input data; the process is shown in Fig. 3. When the input data passes through anomaly detection model 1, only fault 1 is detected: data detected as anomalous is output as fault 1, and the rest is either normal or another type of fault. The data not classified as fault 1 is then fed to the second anomaly detection model, which diagnoses fault 2, and so on. Finally, the x anomaly detection models output x fault types, and data matching none of the x faults is output as normal.
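The cascade described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the invention: the `Detector` type, the `cascade_diagnose` function, and the toy threshold rules are hypothetical stand-ins for trained anomaly detection models, and each detector is assumed to return `True` when it flags its own fault type.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical type: a detector maps one sample (feature dict) to True if anomalous.
Detector = Callable[[Dict[str, float]], bool]


def cascade_diagnose(
    samples: Sequence[Dict[str, float]],
    stages: Sequence[Tuple[str, Detector]],
) -> List[str]:
    """Label each sample with the first fault whose detector fires, else 'normal'."""
    labels = []
    for sample in samples:
        label = "normal"
        for fault_name, detector in stages:
            if detector(sample):
                label = fault_name
                break  # later stages only see data that earlier stages passed as normal
        labels.append(label)
    return labels


# Toy stand-ins for trained models, listed in the (assumed) descending order of F1.
stages = [
    ("cpu_hog", lambda s: s["cpu"] > 0.9),
    ("memory_leak", lambda s: s["memory"] > 0.8),
    ("network_delay", lambda s: s["latency"] > 0.5),
]

samples = [
    {"cpu": 0.95, "memory": 0.85, "latency": 0.1},
    {"cpu": 0.20, "memory": 0.90, "latency": 0.1},
    {"cpu": 0.20, "memory": 0.30, "latency": 0.7},
    {"cpu": 0.20, "memory": 0.30, "latency": 0.1},
]
labels = cascade_diagnose(samples, stages)
print(labels)
```

Because a sample stops at the first stage that flags it, the first sample is labeled by the CPU stage even though its memory value is also high, which is why the stage order matters.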

In the root cause location part, the goal of causal structure learning is to build a causal graph of the monitored metrics. The causal graph can be regarded as the anomaly propagation path between metrics. A popular method for constructing a causal graph from observational data is the PC algorithm, which uses statistical tests for conditional independence to learn the causal relationships between random variables. The monitoring data in the microservice system is indexed by i, the i-th time step in the sliding window, and k, the number of features (the defining expression is not reproduced in this text). Treating each time series as a variable and the data at each time point as a sample, the PC algorithm outputs a directed acyclic graph (DAG) with k nodes, as follows: the k nodes are connected into a fully connected undirected graph G, and the conditional independence of each pair of adjacent nodes in G is tested. If conditional independence holds, the edge between the two nodes is removed. The d-separation principle (written as "v-separation" in the original) is then used to determine the direction of the remaining edges, extending the skeleton to a DAG. In general, an anomaly in a key performance indicator means that the service has failed; when a microservice experiences network delay, the whole service is considered faulty. When an anomaly occurs, the latency features of the anomalous data are extracted and passed to root cause analysis, and the PC algorithm is used to build the anomaly propagation graph.

To locate the faulty server from the anomaly propagation graph, the PageRank algorithm is used to perform root cause analysis on the graph. P_ij is defined as the transition probability from node i to node j and is computed from w_ij, the weight between i and j obtained by the PC algorithm (the formula is not reproduced in this text).

Once the probability matrix is obtained, the root cause score v of each node is computed (the formula is not reproduced in this text).

Here α, described as the transition probability of the probability matrix, is set to 0.85 in this description.
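Since the formulas for P_ij and v are not reproduced in this text, the sketch below falls back on the textbook form of PageRank. It is an assumption, not the formula of the invention: the weights are row-normalized as P_ij = w_ij / Σ_k w_ik, and the score is iterated as v = α·Pᵀ·v + (1 − α)/μ, with μ the number of services. The function names and the toy weight matrix are hypothetical.

```python
from typing import List


def transition_matrix(weights: List[List[float]]) -> List[List[float]]:
    """Row-normalize edge weights w_ij into transition probabilities P_ij (assumed form)."""
    n = len(weights)
    matrix = []
    for row in weights:
        total = sum(row)
        if total == 0:
            # Node with no outgoing edges: assume a uniform jump to every node.
            matrix.append([1.0 / n] * n)
        else:
            matrix.append([w / total for w in row])
    return matrix


def pagerank(P: List[List[float]], alpha: float = 0.85, iters: int = 100) -> List[float]:
    """Power iteration for v = alpha * P^T v + (1 - alpha) / mu."""
    mu = len(P)
    v = [1.0 / mu] * mu
    for _ in range(iters):
        v = [
            alpha * sum(P[i][j] * v[i] for i in range(mu)) + (1.0 - alpha) / mu
            for j in range(mu)
        ]
    return v


# Toy graph of three services; weights[i][j] is the edge weight from i to j.
weights = [
    [0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0],
]
scores = pagerank(transition_matrix(weights))
print([round(s, 3) for s in scores])
```

In this toy graph the third service receives the most incoming weight and therefore the highest score. Whether edges should point toward or away from the suspected root cause depends on how the anomaly propagation graph is oriented, which this sketch does not settle.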

Besides topology, the anomaly pattern of the system during a fault also affects the final root cause result. To capture the anomaly pattern of the fault data, the anomaly thresholds used when diagnosing CPU hog, memory leak, and network delay are each returned during fault diagnosis. These three thresholds are used to compute the anomaly degree η of the data for each of the three fault types (the formula is not reproduced in this text).

The three resulting anomaly degrees describe the input data when a CPU anomaly, a memory anomaly, and a latency anomaly occur, respectively.

Causal integration: the anomaly degree and the initial causal score are combined to obtain the final causal score (the formula is not reproduced in this text).

Here β is the contribution of the root cause inference score to the final causal score and is set to 0.5 in this description.

After the final causal scores are obtained, they are sorted in descending order. The higher a service ranks, the more likely it is to be the root cause of the failure.
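The integration and ranking step can be illustrated with a small sketch. The integration formula is not reproduced in this text, so the convex combination used here, score = β·v + (1 − β)·η, is only an assumption consistent with the description of β as the contribution of the root cause inference score. The function name, the service names, and all numeric values are hypothetical, and an anomaly degree is assumed to be available per service.

```python
from typing import Dict, List, Tuple


def final_scores(
    initial: Dict[str, float],
    anomaly_degree: Dict[str, float],
    beta: float = 0.5,
) -> List[Tuple[str, float]]:
    """Blend the two per-service scores and rank services, highest score first."""
    combined = {
        service: beta * initial[service] + (1.0 - beta) * anomaly_degree[service]
        for service in initial
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


# Hypothetical values for three services.
initial = {"orders": 0.5, "carts": 0.25, "payment": 0.25}
anomaly_degree = {"orders": 0.25, "carts": 1.0, "payment": 0.0}
ranking = final_scores(initial, anomaly_degree, beta=0.5)
print(ranking)
```

With β = 0.5 both terms count equally, so a service with a modest graph score but a strong anomaly pattern can outrank one that is only topologically central.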

Experimental procedure

1. Dataset

In the experiment, the sock-shop e-commerce website, a benchmark for microservice and cloud-native technologies, was deployed. The site consists of 13 services, including functional services such as front-end, catalogue, cart, user, orders, payment, and shipping, as well as communication services that connect the different services.

sock-shop was deployed with Kubernetes on multiple virtual machines (VMs) in the cloud. The Kubernetes cluster comprised 1 master node and 3 worker nodes, each worker configured with Ubuntu 18.04, 4 vCPUs, 16 GB of RAM, and an 80 GB disk. The open-source monitoring and visualization tools Prometheus and Grafana ran on the master node to monitor the system and collect service-level and resource-level data. The load generation tool Locust also ran on the master node to simulate the workload of the microservice application. The 13 sock-shop services were deployed on the worker nodes and assigned to the VMs automatically by Kubernetes.

To simulate a real running application, three common anomalies were injected: CPU hog, memory leak, and network delay. The Pumba tool was used to simulate network faults and to stress-test the resources of Docker containers in order to inject the anomalies. For CPU hog, the CPU resources of each service were consumed; for memory leak, memory was allocated continuously for each service; for network delay, network packets were delayed through traffic control. Each anomaly lasted 1 to 5 minutes, the application ran normally for 10 to 30 minutes, and this process was repeated at least 5 times for each anomaly. Data was collected in real time every 5 seconds, as configured in Prometheus, at both the service level and the resource level. At the service level, the latency of each service was collected. At the resource level, container resource metrics were collected, including CPU usage, memory usage, disk reads and writes, and network bytes received and transmitted.

The comparative experiments that follow compare the performance of this method with that of existing methods.

2. Data preprocessing

To improve model accuracy, the training and test sets are normalized so that data on different scales is converted to a common scale, reducing the influence of differences in scale, features, and distribution on the model. Min-max normalization is used.
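Min-max normalization maps each value x of a feature to (x − min) / (max − min), so the feature falls in the range [0, 1]. A minimal sketch follows; the function name and sample values are illustrative only, and the handling of a constant feature is an assumption.

```python
from typing import List


def min_max_scale(values: List[float]) -> List[float]:
    """Scale one feature to [0, 1] using its own minimum and maximum."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant feature: assume every value maps to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


cpu_usage = [20.0, 35.0, 50.0, 80.0]  # hypothetical CPU usage readings
scaled = min_max_scale(cpu_usage)
print(scaled)
```

In practice the minimum and maximum are usually taken from the training set and reused for the test set, so that both are on the same scale; the text does not say which convention was used here.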

3. Model training process

Module ① in Fig. 1 is the pre-training module of the model. Its main purpose is to select automatically and adaptively, from the candidate models, the one best suited to detecting each type of anomaly: CPU, memory, and latency. The process is shown in Fig. 2 and relies on random sampling of the training data. Specifically, a random number is generated over the preprocessed training data, the training data is sampled according to that number, and a segment of size 500 is extracted as the training subset. Next, a data point at which CPU hog occurred is drawn at random from the training data, and from it a segment of size 100 is generated. Segments containing memory leak and network delay are sampled in the same way, and the three segments are concatenated into a test subset of size 300. The CPU features of the sampled training and test subsets are then extracted and fed to each candidate model for anomaly detection, and the F1 value is output. During CPU anomaly detection, the other anomalies in the data are treated as normal data. After 5 rounds of sampling, the average F1 value of anomaly detection over the 5 rounds is output. The model with the highest average F1 value is then identified and its name is output. The models best suited to detecting network delay and memory leak are obtained in the same way.
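The selection loop for one fault type can be sketched as below. This is a simplified illustration, not the pre-training module itself: the `Candidate` interface, the two toy candidates, and the synthetic `draw_sample` function are hypothetical, and the sampler only mimics the subset sizes (500 for training, 300 for testing) rather than the segment-based sampling described above.

```python
import random
from statistics import mean
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical interface: given training values and test values, return 0/1 predictions.
Candidate = Callable[[Sequence[float], Sequence[float]], List[int]]
Sample = Tuple[List[float], List[float], List[int]]


def f1_score(truth: Sequence[int], pred: Sequence[int]) -> float:
    """Binary F1 for the anomalous class (label 1)."""
    tp = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def select_best_model(
    candidates: Dict[str, Candidate],
    draw_sample: Callable[[], Sample],
    rounds: int = 5,
) -> Tuple[str, Dict[str, float]]:
    """Average each candidate's F1 over several random samples and pick the best."""
    history: Dict[str, List[float]] = {name: [] for name in candidates}
    for _ in range(rounds):
        train, test, truth = draw_sample()
        for name, model in candidates.items():
            history[name].append(f1_score(truth, model(train, test)))
    averages = {name: mean(scores) for name, scores in history.items()}
    return max(averages, key=averages.get), averages


rng = random.Random(0)


def draw_sample() -> Sample:
    """Synthetic stand-in: normal values in [0.1, 0.3], anomalies fixed at 0.9."""
    train = [rng.uniform(0.1, 0.3) for _ in range(500)]
    truth = [1 if rng.random() < 0.2 else 0 for _ in range(300)]
    test = [0.9 if t else rng.uniform(0.1, 0.3) for t in truth]
    return train, test, truth


candidates: Dict[str, Candidate] = {
    "max_plus_margin": lambda train, test: [int(x > max(train) + 0.25) for x in test],
    "always_normal": lambda train, test: [0 for _ in test],
}
best, averages = select_best_model(candidates, draw_sample)
print(best, averages)
```

Running the same loop once per fault type, with the corresponding feature extracted, would give one selected model each for CPU, memory, and latency.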

As shown at ② in Fig. 1, the pre-training module yields the best model for detecting each of the three faults. The monitoring data for CPU, latency, and memory during normal operation is extracted from the training data and fed to the corresponding model for training, so that the three models fully learn the normal-operation characteristics of CPU, latency, and memory. The order of fault diagnosis is determined by the F1 values of the models selected in the pre-training stage: the models are applied in descending order of F1 value.

After training, as shown at ③ in Fig. 1, the three trained models are cascaded to diagnose faults in the collected data. The steps are shown in Fig. 3. First, the data is fed to the first model for anomaly detection. An anomaly threshold is set, and a point is considered anomalous when the error between the predicted and actual data exceeds the threshold. To improve detection accuracy, the best-F1 method is used to search for the optimal threshold, which is then returned. After model 1 has processed the input, it outputs anomalous data and normal data: the anomalous data indicates that fault 1 occurred, and the normal data comprises both data with other faults and data with no fault. The data output as normal is then fed to the next model to detect fault 2; as with the first model, fault 2 and normal data are output, and the data this model judges normal is fed to model 3, which outputs fault 3 and normal data. The three rounds of fault diagnosis yield three anomaly thresholds, which are used to capture the anomaly pattern of the entity metric data and so improve precision in root cause location.
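A best-F1 threshold search can be sketched as follows. The text does not specify the search grid, so this example assumes that each observed error value is tried as a candidate threshold and that labeled data is available for scoring; the function names and the sample errors are hypothetical.

```python
from typing import List, Sequence, Tuple


def f1_score(truth: Sequence[int], pred: Sequence[int]) -> float:
    """Binary F1 for the anomalous class (label 1)."""
    tp = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_f1_threshold(errors: Sequence[float], truth: Sequence[int]) -> Tuple[float, float]:
    """Try each observed error as a threshold and keep the one with the highest F1."""
    best_threshold, best_f1 = 0.0, -1.0
    for threshold in sorted(set(errors)):
        # A point is anomalous when its error exceeds the threshold.
        pred: List[int] = [1 if e > threshold else 0 for e in errors]
        score = f1_score(truth, pred)
        if score > best_f1:
            best_threshold, best_f1 = threshold, score
    return best_threshold, best_f1


# Hypothetical prediction errors and ground-truth labels (1 = anomalous).
errors = [0.1, 0.2, 0.15, 0.9, 0.8, 0.3]
truth = [0, 0, 0, 1, 1, 0]
threshold, f1 = best_f1_threshold(errors, truth)
print(threshold, f1)
```

The returned threshold is the value that would be handed on to the anomaly degree calculation in root cause location.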

The anomalous data are passed to root cause localization based on causal inference, as shown at ④ in Figure 1. In this part, the PC algorithm learns the causal relationships among the faulty nodes and constructs a causal graph between the nodes, which serves as the anomaly propagation graph. PageRank is then used to perform root cause analysis on the anomaly propagation graph and output a causal score for each faulty node.
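The PageRank step can be illustrated with a short sketch. It assumes the PC algorithm has already produced a non-negative weight matrix and that edges point from an affected node toward its suspected cause, so that score accumulates on likely root causes. The function name `pagerank_scores` and the damping value are hypothetical, and this is textbook PageRank, not necessarily the exact variant in the claims.

```python
import numpy as np

def pagerank_scores(weights, alpha=0.85, tol=1e-10, max_iter=1000):
    """Power-iteration PageRank on a weighted directed graph.

    weights[i, j] >= 0 is the weight of the edge from node i to node j.
    Returns one score per node; the scores sum to 1.
    """
    w = np.asarray(weights, dtype=float)
    n = w.shape[0]
    row_sums = w.sum(axis=1, keepdims=True)
    # Row-normalise into transition probabilities; a node with no outgoing
    # edges (dangling node) jumps to every node with equal probability.
    safe = np.where(row_sums > 0, row_sums, 1.0)
    p = np.where(row_sums > 0, w / safe, 1.0 / n)
    v = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        nxt = alpha * (p.T @ v) + (1 - alpha) / n
        if np.abs(nxt - v).sum() < tol:
            return nxt
        v = nxt
    return v

# Usage: services 0 and 1 both point at service 2 as their suspected cause.
w = np.array([[0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 0.0]])
scores = pagerank_scores(w)
```

In this toy graph the third service receives the highest score because both other services point to it.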

To further improve localization accuracy, the anomaly thresholds returned in the fault diagnosis stage are used to calculate the anomaly degree of the fault data when CPU usage faults, memory leaks, and network delays occur, respectively. The specific formula is as follows:

where thred is the anomaly threshold returned when the corresponding fault is diagnosed.

As shown at ⑤ in Figure 1, the causal score obtained by the causal-inference-based root cause localization method and the anomaly score of the data are combined by weighting, with β adjusting the respective contributions of the causal-inference-based root cause localization score and the anomaly degree to the final causal score. Once the final causal scores are obtained, the faulty nodes are ranked by fault importance according to these scores. The higher a node ranks, the greater the probability that it is the root cause of the fault propagation.
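A minimal sketch of the weighted integration and ranking is given below. The text states only that β balances the two contributions; the convex combination β·causal + (1 − β)·anomaly used here is an assumption, and the function name `rank_root_causes` and the service names are hypothetical.

```python
def rank_root_causes(causal, anomaly, beta=0.5):
    """Combine per-node causal scores and anomaly degrees, then rank nodes.

    Assumed form: final = beta * causal + (1 - beta) * anomaly.
    Returns (node, final_score) pairs, highest score first.
    """
    final = {
        node: beta * causal[node] + (1 - beta) * anomaly.get(node, 0.0)
        for node in causal
    }
    return sorted(final.items(), key=lambda kv: kv[1], reverse=True)

# Usage with made-up scores for three services.
causal = {"carts": 0.5, "orders": 0.3, "user": 0.2}
anomaly = {"carts": 0.1, "orders": 0.9, "user": 0.2}
ranking = rank_root_causes(causal, anomaly, beta=0.5)
```

With β = 1 the ranking reduces to the PageRank order; lowering β lets a strongly anomalous node overtake a node that is merely well connected in the propagation graph.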

Model performance metrics

Fault diagnosis

Model performance is compared using several main confusion-matrix-based classification metrics: macro precision, macro recall, and macro F1-score.

Precision is the proportion of samples predicted positive by the model that are actually positive. It is calculated as:

Recall is the proportion of actually positive samples that are predicted positive. It is calculated as:

The F1 score is the harmonic mean of precision and recall, calculated as:

Macro precision, macro recall, and macro F1 are the arithmetic means of the corresponding per-class metric values over all classes.
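The formula images for these metrics are not reproduced in this text. The standard confusion-matrix definitions, which match the verbal descriptions above, are given below. The notation is assumed here rather than taken from the source: TP, FP, and FN are the true positive, false positive, and false negative counts for a class, C is the number of classes, and the subscript c marks a per-class value.

```latex
\begin{aligned}
\mathrm{Precision} &= \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \\[4pt]
F_1 &= \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \\[4pt]
\mathrm{Macro\text{-}Precision} &= \frac{1}{C} \sum_{c=1}^{C} \mathrm{Precision}_c, \qquad
\mathrm{Macro\text{-}Recall} = \frac{1}{C} \sum_{c=1}^{C} \mathrm{Recall}_c, \qquad
\mathrm{Macro\text{-}}F_1 = \frac{1}{C} \sum_{c=1}^{C} F_{1,c}.
\end{aligned}
```

Because the macro average weights every class equally, a rarely occurring fault type counts as much as a frequent one.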

In addition, the F1 Average Rank is used to verify the robustness of the model. The F1 Average Rank is the average rank of each model's macro F1-score across the five datasets.

Root cause localization

In root cause localization, two widely used metrics, PR@k and Avg@k, are used to evaluate model performance. PR@k is the probability that the top k results predicted by the root cause localization algorithm contain the true root cause. For small k, a higher PR@k indicates that the algorithm identifies the root cause more accurately. The specific formula is as follows:

where A is the set of faults present in the system, a is one fault in A, V_a is the true root cause of fault a, R_a is the ranked list of root causes predicted for fault a by the root cause localization algorithm, and i indexes the i-th root cause in R_a.

Avg@k evaluates the model's performance over the top k predicted causes from an overall perspective, assessing the algorithm's overall performance by averaging PR@k. The specific formula is:

where j is the summation index.
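The formula images for PR@k and Avg@k are not reproduced in this text. The sketch below follows the form these metrics commonly take in the root cause localization literature, which is consistent with the symbols defined above: for each fault a, count how many of the first k entries of R_a appear in V_a, divide by min(k, |V_a|), and average over A; Avg@k is then the mean of PR@j for j = 1..k. This form is an assumption, and the function names are hypothetical.

```python
def pr_at_k(ranked, truth, k):
    """PR@k averaged over all faults.

    ranked: dict mapping fault -> ordered list of predicted root causes (R_a)
    truth:  dict mapping fault -> set of true root causes (V_a)
    """
    total = 0.0
    for fault, true_causes in truth.items():
        hits = sum(1 for cause in ranked[fault][:k] if cause in true_causes)
        total += hits / min(k, len(true_causes))
    return total / len(truth)

def avg_at_k(ranked, truth, k):
    """Mean of PR@1 .. PR@k."""
    return sum(pr_at_k(ranked, truth, j) for j in range(1, k + 1)) / k

# Usage with two made-up faults: the first is ranked correctly at position 1,
# the second only at position 2.
ranked = {"f1": ["a", "b", "c"], "f2": ["x", "y", "z"]}
truth = {"f1": {"a"}, "f2": {"y"}}
print(pr_at_k(ranked, truth, 1))   # 0.5
print(avg_at_k(ranked, truth, 2))  # 0.75
```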

Model comparison results

As Figures 4, 5, and 6 and Tables 1 and 2 show, the experimental results of this model on real datasets, compared with existing models, are as follows:

Table 1 shows that on the shipping, user (login and registration), and carts (shopping cart) datasets, this model outperforms the 11 other models: K-means (KMeans), the Gaussian mixture model (GaussianMixture), the hierarchical clustering algorithm Birch, the Wasserstein-distance-based generative adversarial network fault diagnosis model WPS, the fault diagnosis model CGNN-MHSA-AR based on ensemble learning with parallel graph attention networks, the generative-adversarial-network-based fault diagnosis model MAD_GAN, the unsupervised multivariate time series fault diagnosis model USAD, the fault diagnosis model DAGMM based on a deep autoencoding Gaussian mixture model, the graph-attention-network-based fault diagnosis model MTAD, the stochastic-recurrent-neural-network-based fault diagnosis model OmniAnomaly, and the fault diagnosis model CAE_M based on a deep convolutional autoencoding memory network. On average, this model achieves a macro precision of 78.5%, a macro recall of 95.7%, and a macro F1 score of 82.4%, the highest among all models. On the orders and catalogue datasets, the macro F1 score of this model is slightly lower than that of the best model: on the catalogue dataset, MTAD achieves the best macro F1 of 0.930, 3% higher than this model; on the orders dataset, OmniAnomaly achieves the best macro F1 of 0.979, 0.9% higher than this model.

Table 2 shows that for CPU faults, this method achieves a PR@1 of 0.8, meaning there is an 80% chance of finding the root cause in the top-ranked metric. Likewise, for memory leak and network latency faults, the method has a 60% and 40% chance, respectively, of finding the root cause at rank 1. Overall, across the five datasets, the method's average Avg@5 for CPU faults is 0.88, a 32% improvement in root cause localization accuracy over the best-performing baseline, the greedy-search-based method (GES-based). For memory leaks, the method improves on the best-performing baseline, the causality-prediction-based method (PC-based), by 24%; for network latency, it improves on the best-performing PC-based method by 28%. For CPU faults, memory leaks, and network latency alike, the method outperforms the root cause localization method based on the linear non-Gaussian acyclic model (LiNGAM-based).

Figures 4, 5, and 6 show macro precision, macro recall, and macro F1, respectively, for this model and the 11 other models. The figures show that this model performs well on all three evaluation metrics, indicating that it can better select the models suited to diagnosing the three fault types and cascade them to achieve higher fault diagnosis performance.

Table 1. Detection performance of this technique compared with 11 fault diagnosis models on 5 datasets

Table 2. Average localization performance of this technique compared with 3 root cause localization methods on 5 datasets

The technical solution of the present invention is not limited to the specific embodiments described above; all technical variations made in accordance with the technical solution of the present invention fall within its scope of protection.

Claims

1. A fault diagnosis and root cause localization method for a microservice system, characterized by comprising the following steps:

S1, constructing X anomaly detection models;

S2, acquiring monitoring data in the microservice system as a training data set;

S3, importing the training data set into each of the X anomaly detection models, and training and optimizing the models to obtain X optimized anomaly detection models;

S4, analyzing the training results of the X optimized anomaly detection models, and constructing a fault diagnosis model according to the training results;

S5, performing causal relationship learning on the faulty nodes, and constructing a causal graph between the nodes as an anomaly propagation graph;

S6, acquiring monitoring data in real time;

S7, obtaining fault diagnosis results for the monitoring data by using the fault diagnosis model;

S8, locating the fault cause according to the anomaly propagation graph and the fault diagnosis results.

2. The method for fault diagnosis and root cause localization of a microservice system according to claim 1, wherein analyzing the training results comprises: extracting data containing a plurality of preset features, importing the data into each of the X optimized anomaly detection models, and taking the anomaly detection performance of each optimized anomaly detection model on the preset features as the training result.

3. The method for fault diagnosis and root cause localization of a microservice system according to claim 1 or 2, wherein x anomaly detection models are selected from the X optimized anomaly detection models according to the training results and cascaded to construct the fault diagnosis model, wherein x is smaller than X.

4. The method for fault diagnosis and root cause localization of a microservice system according to claim 3, wherein in S4, the selected x anomaly detection models are cascaded in the order of their training results to construct the fault diagnosis model.

5. The method for fault diagnosis and root cause localization of a microservice system according to claim 1, wherein in S5, a PC algorithm is used to perform causal relationship learning on the faulty nodes, and a causal graph between the nodes is constructed as the anomaly propagation graph.

6. The method for fault diagnosis and root cause localization of a micro service system according to claim 1, wherein S8 comprises:

S81, performing root cause analysis on the anomaly propagation graph by using a PageRank algorithm, and outputting the initial causal score of each faulty node;

S82, calculating the anomaly degree of the fault data for each fault by using the fault diagnosis results;

S83, weighting and integrating the initial causal score and the anomaly degree to calculate the final causal score of each node;

S84, ranking the faulty nodes by fault importance according to the final causal scores to locate the fault cause.

7. The method for fault diagnosis and root cause localization of a microservice system according to claim 6, wherein in S81, the initial causal score v is expressed as [formula], wherein μ represents the number of services in the microservice architecture, α represents the transition probability of the probability matrix, P is the transition probability, and the transition probability P_ij from node i to node j is expressed as [formula], wherein w_ij represents the weight between i and j obtained through the PC algorithm.

8. The method for fault diagnosis and root cause localization of a microservice system according to claim 7, wherein in S82, the anomaly degree η of the fault data is expressed as [formula], wherein [symbol] represents the monitoring data, γ represents the fault type, i represents the i-th time step in the sliding window, k represents the number of features, n represents the time length of the input fault data, and thred_γ represents the anomaly threshold returned when diagnosing fault type γ.

9. The method for fault diagnosis and root cause localization of a microservice system according to claim 8, wherein in S83, the final causal score for fault type γ is expressed as [formula], wherein β represents the contribution of the root cause inference score to the final causal score, and [symbol] represents the initial causal score of the service node when a fault of type [symbol] occurs.