CN115756929A - Abnormal root cause positioning method and system based on dynamic service dependency graph - Google Patents

Info

Publication number
CN115756929A
Authority
CN
China
Prior art keywords
abnormal
service
root cause
graph
propagation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211470197.6A
Other languages
Chinese (zh)
Other versions
CN115756929B (en)
Inventor
张齐勋
刘洪毅
杨勇
贾统
李影
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University
Priority to CN202211470197.6A
Publication of CN115756929A
Application granted
Publication of CN115756929B
Legal status: Active
Anticipated expiration

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Debugging And Monitoring (AREA)

Abstract

The invention provides a method and system for locating the root cause of anomalies based on a dynamic service dependency graph, and belongs to the field of intelligent operation and maintenance. The method comprises the following steps: extracting deployment-time service dependencies from service configuration information, discovering runtime service dependencies from the correlation of key performance indicators between services and from log frequency information, and dynamically constructing a service dependency graph; automatically constructing an anomaly propagation graph from the input anomalous services and the dynamic service dependency graph; and constructing anomaly propagation paths by depth-first search, calculating root cause scores, locating the root cause service, and reporting the anomaly propagation paths. The method better captures runtime changes in service dependencies, helps describe anomaly propagation more accurately, and improves root cause localization; it applies to root cause localization for many types of anomalies and so generalizes well; it locates the anomalous service directly and provides the propagation paths, which makes it practical and interpretable. No manual participation is needed, which saves labor cost.

Description

A Method and System for Locating Anomaly Root Causes Based on a Dynamic Service Dependency Graph

Technical field

The invention belongs to the field of intelligent operation and maintenance, and in particular relates to a method for locating anomaly root causes based on a dynamic service dependency graph.

Background

A system built on a microservice architecture (a microservice system, for short) consists of a large number of fine-grained microservices that cooperate to implement system functions. Such systems are widely used in communications, transportation, logistics, finance and other fields, and modern Internet applications often use a microservice system as their underlying implementation. The core design idea of the microservice architecture is to split an application into highly cohesive, loosely coupled, single-purpose microservices that can be developed, deployed and updated independently, enabling agile development and continuous delivery. The microservice architecture makes application development and iteration more convenient, but it also brings a new challenge to operation and maintenance: the complexity of a microservice system makes applications prone to anomalies whose root causes are hard to diagnose.

The complexity of a microservice system shows in three ways: the large number of microservices, the dynamically changing dependencies between them, and their heterogeneous implementations. Because each microservice has a single function, a microservice system may need a very large number of microservices to deliver the system's functions. For example, Netflix's microservice system runs more than 6,000 microservice instances and handles more than 2 billion service requests per day; Uber's microservice system runs more than 4,000 microservice instances. In addition, to keep up with changing user needs and application growth, a microservice system can deploy and register microservice instances dynamically, and new microservices are added continually, so the dependencies between microservices change dynamically and become intricate. Finally, because microservices are encapsulated behind service interfaces, different microservices in the same system can use different programming languages, service frameworks and communication mechanisms to suit different application scenarios. This heterogeneity in design and implementation further increases the complexity of the microservice system.

This complexity makes a microservice system prone to anomalies and makes their root causes hard to diagnose. An anomaly means that the running state of the system has deviated from its designed expectation. Because microservices are loosely coupled and heterogeneously implemented, they usually cooperate through remote calls over the network, which readily produces anomalies caused by network instability, version incompatibility, configuration errors, code defects and similar problems. The large number of microservices makes such anomalies more likely still. Moreover, because dependencies between services are complex and change dynamically, even a minor anomaly can trigger a chain reaction in which many anomalies occur at the same time, of which only one or a few are the root cause of the others. This happens because anomalies spread along the dependencies between services and form anomaly propagation chains; the microservice anomalies produced along these dependencies are causally related, and the source of a propagation chain is usually the root cause.

To keep the system reliable, when an anomaly is detected the operation and maintenance personnel need to locate its root cause as soon as possible and respond accordingly, for example by rolling back a change. Otherwise the anomaly may cause service interruption or degraded service quality and lead to serious losses. International Data Corporation has reported that one hour of service outage costs about 100,000 US dollars on average. However, although a microservice system provides rich operational data such as logs, key performance indicators and call chains, the differing data formats, the sparsity of valuable information and the dynamically changing service dependencies make it hard for operators to locate the root cause of an anomaly in a complex microservice system. In severe cases the anomaly cannot be repaired in time, which causes even greater economic and reputational damage to the enterprise.

Existing root cause localization methods for microservice systems still have shortcomings: they often ignore the dynamic changes of service dependencies and the interpretability of the localization result. Most existing methods rely on a service dependency graph. A service dependency graph describes the dependencies between services; the same dependencies can describe how anomalies propagate between services, which helps operators locate root causes. Root cause localization based on a service dependency graph first builds the graph from metrics, system logs or tracing data. When an anomaly occurs, it starts from the anomalous node in the graph, obtains a set of candidate root causes through algorithms such as graph search or random walk, and then ranks the candidates by anomaly score, correlation with the anomalous node, or number of visits. However, existing methods often rely on a static service dependency graph and ignore the dynamic changes of service dependencies, so the constructed dependencies differ from the real ones, which limits localization accuracy.

In addition, existing methods often report only the root cause service or the root cause metric and lack interpretability, so operators cannot quickly understand the reported result or judge whether the root cause is correct.

Summary of the invention

To reduce the extra analysis and rollbacks caused by inaccurate root cause localization, lower operation and maintenance costs, improve the interpretability of anomaly localization, and safeguard the reliability of microservice systems, the invention provides a method for locating anomaly root causes based on a dynamic service dependency graph. The method extracts deployment-time service dependencies from service configuration information, discovers runtime service dependencies from the correlation of key performance indicators between services and from log frequency information, and dynamically builds a service dependency graph; it automatically builds an anomaly propagation graph from the input anomalous services and the service dependency graph; and it builds anomaly propagation paths by depth-first search, calculates root cause scores, locates the root cause service, and reports the anomaly propagation paths.

In the invention, the construction of the dynamic service dependency graph, the construction of the anomaly propagation graph, and the root cause localization are all performed automatically, without manual participation, which saves labor cost. Because the localization is based on a dynamic service dependency graph, it better captures runtime changes in service dependencies, helps describe anomaly propagation more accurately, and improves root cause localization. It applies to root cause localization for many types of anomalies and so generalizes well. It locates the anomalous service directly and provides the propagation path of the anomaly, which makes it practical and interpretable.

The technical solution provided by the invention is as follows:

A method for locating anomaly root causes based on a dynamic service dependency graph, characterized in that it comprises dynamic service dependency graph construction, anomaly propagation graph construction, and anomaly root cause localization. The specific steps are:

1) Dynamic service dependency graph construction, performed as follows:

11) Extract deployment-time service dependencies: The principle is that two microservices deployed at the same location (for example, a host) or depending on the same resource (for example, a database) have a deployment-time dependency. The configuration information of each service, including deployment location information and associated resource information, is extracted from the configuration management database (CMDB). Define the set of configuration attributes to be collected as $K = (k_1, k_2, \ldots, k_n)$, so that every collected configuration item $c_i = (k_i, v_i)$ satisfies $k_i \in K$, where $k_i$ is the attribute (metadata) of the configuration item and $v_i$ is its value. Denote the configuration set of a service $s_i$ as $C_i = (c_1, c_2, \ldots, c_n)$. If the configuration sets of services $s_i$ and $s_j$ intersect, that is, $C_i \cap C_j \neq \varnothing$, then a deployment-time dependency exists between $s_i$ and $s_j$.
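
The set-intersection rule in step 11) can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name, the attribute names (`host`, `database`) and the sample services are hypothetical, and a real CMDB extract would need its own parsing.

```python
from itertools import combinations


def deployment_dependencies(configs):
    """Return unordered service pairs whose configuration sets intersect.

    configs maps a service name to a set of (attribute, value) items,
    mirroring the configuration set C_i of step 11).
    """
    pairs = set()
    for a, b in combinations(sorted(configs), 2):
        if configs[a] & configs[b]:  # C_i ∩ C_j is non-empty
            pairs.add((a, b))
    return pairs


# Hypothetical CMDB extract: attribute names and values are made up.
configs = {
    "cart": {("host", "node-1"), ("database", "orders-db")},
    "checkout": {("host", "node-2"), ("database", "orders-db")},
    "search": {("host", "node-3"), ("database", "catalog-db")},
}
deps = deployment_dependencies(configs)
```

Here `cart` and `checkout` share the same database item, so they form the only deployment-time dependency; `search` shares nothing with either.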

12) Extract runtime service dependencies: The principle is that if, as the system load changes, the log volumes or key performance indicators (KPIs) produced by two services are causally related, then the two services have a runtime dependency, and the strength of the causal relationship indicates the strength of the runtime dependency. The causal inference algorithm used here is the PC algorithm, because, given a reliable conditional independence test, the PC algorithm can handle many types of data distributions and causal relationships, and compared with other causal inference methods it has low complexity and good results. The PC algorithm determines whether variables are causally related through conditional independence tests and then uses d-separation to orient the causal relationships. The conditional independence test used here is the $G^2$ conditional cross-entropy measure, shown in Formula (1), which follows a $\chi^2$ distribution with $D$ degrees of freedom, where $D$ is given by Formula (2).

[Formula (1): image BDA0003958229260000031]

$D = (N_X - 1)(N_Y - 1)\prod_{Z' \in Z} N_{Z'}$    Formula (2)

where [image BDA0003958229260000032], $Z$ is the set of the $Z'$, and $m$ is the number of samples.
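
Because Formula (1) survives only as an image reference, the sketch below uses the textbook form of the $G^2$ statistic, $G^2 = 2\sum O \ln(O/E)$ over the observed cells of the contingency table, which is an assumption about what the formula contains. The degrees of freedom follow Formula (2). The function name is hypothetical, and the inputs are assumed to be already discretized (continuous KPIs would need binning first).

```python
import math
from collections import Counter


def g_square(x, y, z_cols=()):
    """Textbook G^2 statistic for testing X independent of Y given Z.

    x, y: equal-length sequences of discrete values.
    z_cols: zero or more equal-length sequences forming the conditioning set Z.
    Returns (g2, dof), with dof computed as in Formula (2).
    """
    m = len(x)  # number of samples
    z = list(zip(*z_cols)) if z_cols else [()] * m
    n_xyz = Counter(zip(x, y, z))
    n_xz = Counter(zip(x, z))
    n_yz = Counter(zip(y, z))
    n_z = Counter(z)
    g2 = 0.0
    for (xv, yv, zv), observed in n_xyz.items():
        # Expected count under conditional independence of X and Y given Z.
        expected = n_xz[(xv, zv)] * n_yz[(yv, zv)] / n_z[zv]
        g2 += 2.0 * observed * math.log(observed / expected)
    dof = (len(set(x)) - 1) * (len(set(y)) - 1)
    for col in z_cols:
        dof *= len(set(col))
    return g2, dof


# Two perfectly dependent binary series, no conditioning set.
g2, dof = g_square([0, 0, 1, 1], [0, 0, 1, 1])
```

A p-value would then be read from the $\chi^2$ distribution with `dof` degrees of freedom; that lookup is left out here to keep the sketch free of third-party dependencies.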

13) Build the dynamic service dependency graph: Each time a root cause localization task is executed, steps 11) and 12) are performed to obtain the deployment-time and runtime service dependencies dynamically. From these dependencies, a service dependency graph $G = \langle V, E \rangle$ is built, where $G$ is a directed graph (DG), $V$ is the set of nodes of $G$ and represents the services, and $E$ is the set of edges of $G$ and represents the dependencies. The construction rules are: if a runtime dependency exists between services $s_i$ and $s_j$, add a directed edge between $s_i$ and $s_j$ whose weight is the $G^2$ value between $s_i$ and $s_j$; if a deployment-time dependency exists between $s_i$ and $s_j$, add directed edges from $s_i$ to $s_j$ and from $s_j$ to $s_i$, each weighted by the mean of the runtime dependency edge weights.
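
The two construction rules of step 13) can be sketched with a nested dictionary as the graph. The function and service names are hypothetical. The text does not say what happens when a pair has both a runtime and a deployment-time dependency; the sketch assumes the runtime edge keeps its $G^2$ weight.

```python
def build_dependency_graph(runtime_deps, deployment_pairs):
    """Build a weighted directed graph as {src: {dst: weight}}.

    runtime_deps: {(src, dst): g2_value} for oriented runtime dependencies.
    deployment_pairs: iterable of unordered (a, b) deployment-time pairs.
    """
    graph = {}
    for (src, dst), g2 in runtime_deps.items():
        graph.setdefault(src, {})[dst] = g2
        graph.setdefault(dst, {})
    # Deployment-time edges get the mean of the runtime edge weights.
    mean_w = sum(runtime_deps.values()) / len(runtime_deps) if runtime_deps else 0.0
    for a, b in deployment_pairs:
        for src, dst in ((a, b), (b, a)):
            # Assumption: an existing runtime edge keeps its G^2 weight.
            graph.setdefault(src, {}).setdefault(dst, mean_w)
    return graph


# Hypothetical inputs: G^2 values and service names are made up.
runtime = {("frontend", "cart"): 12.0, ("cart", "db-proxy"): 8.0}
graph = build_dependency_graph(runtime, [("cart", "checkout")])
```

With these inputs the mean runtime weight is 10.0, so `cart` and `checkout` are linked in both directions with weight 10.0.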

2) Anomaly propagation graph construction, performed as follows:

21) Build the anomaly propagation graph: Mark the input anomaly set $A = \{a_1, a_2, \ldots, a_n\}$ on the service dependency graph $G$, and extract the service nodes marked as anomalous, together with the edges between them, to form a subgraph $G'$ of $G$. The value of each node is the reciprocal of the time at which its anomaly occurred. Here $a_i = (s_i, t_i)$ denotes the $i$-th anomaly, $s_i$ the service on which $a_i$ occurred, and $t_i$ the time at which it occurred. To simplify the calculation, the occurrence times are normalized: the time of the earliest anomaly is set to 1, and the remaining anomalies are assigned increasing times, starting from 1, in order of occurrence.
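
Step 21) amounts to taking an induced subgraph and ranking the anomaly times. The sketch below assumes one anomaly per service and gives simultaneous anomalies the same rank; neither case is specified in the text. All names and timestamps are hypothetical.

```python
def build_propagation_graph(graph, anomalies):
    """Return (subgraph, node_value) for the anomalous services.

    graph: {src: {dst: weight}} service dependency graph.
    anomalies: list of (service, timestamp), one entry per service.
    """
    # Normalize times: earliest anomaly -> 1, then 2, 3, ... in order.
    distinct_times = sorted({t for _, t in anomalies})
    rank = {t: i + 1 for i, t in enumerate(distinct_times)}
    # Node value is the reciprocal of the normalized occurrence time.
    node_value = {s: 1.0 / rank[t] for s, t in anomalies}
    # Keep only anomalous nodes and the edges between them.
    sub = {
        s: {d: w for d, w in graph.get(s, {}).items() if d in node_value}
        for s in node_value
    }
    return sub, node_value


graph = {
    "frontend": {"cart": 12.0},
    "cart": {"db": 8.0, "search": 3.0},
    "db": {},
    "search": {},
}
anomalies = [("frontend", 1700000300), ("cart", 1700000200), ("db", 1700000100)]
sub, value = build_propagation_graph(graph, anomalies)
```

In this example `db` failed first and receives the value 1.0, `cart` receives 0.5, and the healthy `search` service is dropped from the subgraph.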

3) Anomaly root cause localization, performed as follows:

31) Build anomaly propagation paths: On the anomaly propagation graph $G'$, start from a randomly selected anomalous service $s'_0$ and use depth-first search to find the candidate root cause nodes and their anomaly propagation paths.
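
A depth-first enumeration of propagation paths might look like the sketch below. The text does not define exactly which nodes count as candidates, so the sketch assumes every node reachable from the start node along edge direction is a candidate, and it records every simple (cycle-free) path to it. The function name and sample graph are hypothetical.

```python
def propagation_paths(sub, start):
    """Enumerate all simple paths from start by depth-first search.

    sub: {src: {dst: weight}} anomaly propagation graph.
    Returns {end_node: [path, ...]}, each path a list of nodes from start.
    """
    paths = {}

    def dfs(node, path):
        for nxt in sub.get(node, {}):
            if nxt in path:
                # Skip cycles; bidirectional deployment edges would loop otherwise.
                continue
            new_path = path + [nxt]
            paths.setdefault(nxt, []).append(new_path)
            dfs(nxt, new_path)

    dfs(start, [start])
    return paths


sub = {"frontend": {"cart": 12.0, "db": 2.0}, "cart": {"db": 8.0}, "db": {}}
paths = propagation_paths(sub, "frontend")
```

Here `db` is reached by two paths (directly and through `cart`), which matters for scoring because the number of paths to a candidate is one of the scoring inputs.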

32) Calculate root cause scores: For $s'_0$, use the anomaly propagation graph $G'$ and the anomaly propagation paths $Path(s'_0, s'_t)$ to calculate the root cause score of every $s'_t \in R$, according to Formula (3).

[Formula (3): image BDA0003958229260000041]

where $score(s'_t)$ is the root cause score of the anomalous service $s'_t$, $M$ is the number of anomaly propagation paths from $s'_0$ to $s'_t$, $N$ is the number of hops in propagation path $o_k$, $w_{k,i}$ is the dependency weight between node $i$ and node $i-1$ on the $k$-th propagation path, and $s_i$ is the value of node $i$, that is, the reciprocal of the time at which its anomaly occurred. The scoring formula is designed on the principle that an anomaly that occurred earlier is more likely to be the root cause, and so is an anomaly with more propagation paths.

33) Report root causes and propagation paths: Sort the root causes in descending order of score and report the corresponding anomaly propagation paths.
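
Formula (3) itself is not recoverable from this text, so the following is only an illustrative stand-in built from the variables the text defines. It assumes an additive form, $score(s'_t) = \sum_{k=1}^{M} \sum_{i=1}^{N} w_{k,i}\, s_i$, which respects the two stated principles (more paths and earlier anomalies raise the score) but may differ from the patented formula. The function names and sample data are hypothetical.

```python
def root_cause_scores(sub, node_value, paths):
    """Score each candidate under an ASSUMED additive form of Formula (3).

    sub: {src: {dst: weight}}; node_value: {node: 1 / normalized_time};
    paths: {candidate: [path, ...]} as produced by a path search.
    """
    scores = {}
    for target, target_paths in paths.items():
        total = 0.0
        for path in target_paths:                   # k = 1..M paths
            for prev, node in zip(path, path[1:]):  # i = 1..N hops
                total += sub[prev][node] * node_value[node]
        scores[target] = total
    return scores


def report(scores, paths):
    """Rank candidates by descending score and attach their paths."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [(s, scores[s], paths[s]) for s in ranked]


sub = {"frontend": {"cart": 12.0, "db": 2.0}, "cart": {"db": 8.0}, "db": {}}
node_value = {"frontend": 1.0 / 3.0, "cart": 0.5, "db": 1.0}
paths = {
    "cart": [["frontend", "cart"]],
    "db": [["frontend", "cart", "db"], ["frontend", "db"]],
}
scores = root_cause_scores(sub, node_value, paths)
ranking = report(scores, paths)
```

Under this assumed form, `db` ranks first because it failed earliest and is reached by two paths.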

The invention further provides a system for locating anomaly root causes based on a dynamic service dependency graph, characterized in that it comprises a dynamic service dependency graph construction module, an anomaly propagation graph construction module, and an anomaly root cause localization module.

The dynamic service dependency graph construction module comprises a deployment-time service dependency discoverer, a runtime service dependency discoverer, a dynamic service dependency graph builder, and a service dependency graph store. The deployment-time service dependency discoverer uses service configuration information to discover deployment-time dependencies between microservices; the runtime service dependency discoverer uses service running logs and key performance indicators to discover runtime dependencies between microservices; the dynamic service dependency graph builder uses the deployment-time and runtime service dependencies to build the service dependency graph dynamically; the service dependency graph store stores the service dependency graph.

The anomaly propagation graph construction module comprises an anomaly propagation graph builder and an anomaly propagation graph store. The builder uses the service dependency graph and the input anomaly set to build the anomaly propagation graph; the store stores the anomaly propagation graph.

The anomaly root cause localization module comprises an anomaly propagation path builder, a root cause calculator, and a root cause reporter. The path builder uses the anomaly propagation graph to find all possible root cause nodes and build the anomaly propagation paths; the root cause calculator uses the propagation paths to calculate root cause scores from the number of paths, the weights of the edges on the paths, and the times at which the anomalies occurred; the root cause reporter sorts the anomalies in descending order of score and reports the corresponding propagation paths.

Compared with the prior art, the beneficial effects of the invention are as follows:

The invention provides a method and system for locating anomaly root causes based on a dynamic service dependency graph. The system dynamically builds a service dependency graph by reading service configuration information, service running logs and key performance indicators, and stores the graph for later generation of the anomaly propagation graph. When an anomaly occurs in the system, an anomaly propagation graph is built from the service dependency graph and the anomaly set. Depth-first search is then used to traverse all possible root cause anomalies and their propagation paths; root cause scores are calculated from the number of propagation paths, the weights of the edges on the paths, and the times at which the anomalies occurred; the anomalies are sorted in descending order of score and reported with their propagation paths. The invention thus automatically builds the dynamic service dependency graph, generates the anomaly propagation graph, builds the propagation paths, and calculates root cause scores to locate the root cause and report the corresponding propagation paths. The invention has the following main features:

(1) The system and method build the dynamic service dependency graph automatically from service configuration information and runtime information (key performance indicators and logs).

(2) The system and method build the anomaly propagation graph automatically from the input anomalies and the service dependency graph.

(3) The system and method locate the root cause service and at the same time provide the anomaly propagation paths, which makes them practical and interpretable.

With the technical solution of the invention, the dynamic service dependency graph and the anomaly propagation graph are built automatically, the root cause is located, and the propagation paths of the anomaly are provided; the solution is suitable for microservice systems.

Description of drawings

Fig. 1 shows the anomaly root cause localization method based on a dynamic service dependency graph provided by the invention;

Fig. 2 shows the anomaly root cause localization system based on a dynamic service dependency graph provided by the invention.

Detailed description

The invention is further described below through embodiments with reference to the accompanying drawings, without limiting the scope of the invention in any way.

Fig. 1 is a flow block diagram of the anomaly root cause localization method based on a dynamic service dependency graph provided by the invention. The method comprises dynamic service dependency graph construction, anomaly propagation graph construction, and anomaly root cause localization.

Dynamic service dependency graph construction uses service configuration information, service running logs and key performance indicators to mine the deployment-time and runtime dependencies between services and build the service dependency graph dynamically. Configuration information describes where a service is located and which resources and services it depends on. Each configuration item is a 2-tuple consisting of a configuration attribute and a configuration value. Running logs record the running state of the system, including the output of key variables and markers of key execution points. Running logs exist as time-stamped text and are converted in the invention into time series of log frequency. Key performance indicators monitor the running state of the system and whether an anomaly has occurred; they exist as time series. The service dependency graph is a directed graph that describes the dependencies between the services of the microservice system: its nodes are services and its edges are dependencies between services. The graph is built dynamically as system running data accumulates, so it adapts to iterations of the system.
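
The conversion of running logs into a log frequency time series can be sketched as a count of log lines per fixed-size time window. This is a minimal illustration: the function name, the 60-second window and the sample timestamps are hypothetical, and parsing timestamps out of raw log text is assumed to have been done already.

```python
from collections import Counter


def log_frequency_series(timestamps, start, window, n_windows):
    """Count log lines per fixed-size time window.

    timestamps: log line times in seconds.
    start: start time of the first window; window: window size in seconds.
    Lines outside the n_windows covered windows are ignored.
    """
    counts = Counter()
    for ts in timestamps:
        idx = int((ts - start) // window)
        if 0 <= idx < n_windows:
            counts[idx] += 1
    return [counts[i] for i in range(n_windows)]


# Five log lines; the last one falls outside the three windows covered.
series = log_frequency_series(
    [0.5, 3.0, 59.9, 61.0, 185.0], start=0, window=60, n_windows=3
)
```

The resulting list has one count per window and can be aligned with a KPI series sampled over the same windows.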

Anomaly propagation graph construction uses the input anomaly set and the service dependency graph to build the anomaly propagation graph for the subsequent propagation path analysis. Each anomaly item is a 2-tuple consisting of the service on which the anomaly occurred and the time at which it occurred. The anomaly propagation graph is also a directed graph and is a subgraph of the service dependency graph.

Anomaly root cause localization uses the anomaly propagation graph to find all possible root cause nodes and build the corresponding anomaly propagation paths, which are used to calculate the root cause scores of the nodes; the nodes are then sorted in descending order of score to produce the root cause report. The report contains the root cause service scores and the corresponding anomaly propagation paths.

In the above method, dynamic service dependency graph construction is performed as follows:

11) Extract deployment-time service dependencies: The principle is that two microservices deployed at the same location (for example, a host) or depending on the same resource (for example, a database) have a deployment-time dependency. The configuration information of each service, including deployment location information and associated resource information, is extracted from the configuration management database (CMDB). Define the set of configuration attributes to be collected as $K = (k_1, k_2, \ldots, k_n)$, so that every collected configuration item $c_i = (k_i, v_i)$ satisfies $k_i \in K$, where $k_i$ is the attribute (metadata) of the configuration item and $v_i$ is its value. Denote the configuration set of a service $s_i$ as $C_i = (c_1, c_2, \ldots, c_n)$. If the configuration sets of services $s_i$ and $s_j$ intersect, that is, $C_i \cap C_j \neq \varnothing$, then a deployment-time dependency exists between $s_i$ and $s_j$.

12) Extract runtime service dependencies: the principle is that if, as the system load changes, the log volumes or key performance indicators (KPIs) of two services are causally related, then a runtime dependency exists between the two services, and the strength of the causal relationship indicates the strength of the runtime dependency. The causal inference algorithm used here is the PC algorithm: given a reliable conditional independence test, it can handle many types of data distributions and causal relationships, and compared with other causal inference methods it has low complexity and good results. The PC algorithm determines whether variables are causally related through conditional independence tests and then uses the d-separation conditions to determine the direction of each causal relationship. The conditional independence test used here is the G² conditional cross-entropy measure shown in formula (1), which follows a χ² distribution with D degrees of freedom, where D is given by formula (2).

[Formula (1): equation image BDA0003958229260000061]

D = (N_X - 1)(N_Y - 1) \prod_{Z' \in Z} N_{Z'}    Formula (2)

where [equation image BDA0003958229260000062], Z is the set of the variables Z′, and m is the number of samples.
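Formula (1) survives only as an image reference, so the sketch below does not reproduce it. It uses the textbook form of the G² statistic for discrete data, G² = 2 Σ N_xyz ln(N_xyz N_z / (N_xz N_yz)), which is an assumption about what the formula denotes. The degrees of freedom follow formula (2), with N_X, N_Y, and N_Z′ read as the number of distinct observed values of each variable (also an assumption). The function name `g_square` is hypothetical, and continuous KPI series are assumed to have been discretized beforehand.

```python
import math
from collections import Counter


def g_square(x, y, z):
    """Return (G^2, degrees of freedom) for testing X independent of Y given Z.

    x, y: equal-length sequences of discrete values.
    z: list of conditioning sequences (may be empty).
    """
    m = len(x)
    # One tuple of conditioning values per sample; empty tuple if no Z.
    zs = list(zip(*z)) if z else [()] * m
    n_xyz = Counter(zip(x, y, zs))
    n_xz = Counter(zip(x, zs))
    n_yz = Counter(zip(y, zs))
    n_z = Counter(zs)
    g2 = 0.0
    for (xv, yv, zv), observed in n_xyz.items():
        expected = n_xz[(xv, zv)] * n_yz[(yv, zv)] / n_z[zv]
        g2 += 2.0 * observed * math.log(observed / expected)
    # Formula (2): D = (N_X - 1)(N_Y - 1) * product of N_Z' over Z' in Z.
    dof = (len(set(x)) - 1) * (len(set(y)) - 1)
    for column in z:
        dof *= len(set(column))
    return g2, dof


g2_indep, dof_indep = g_square([0, 0, 1, 1], [0, 1, 0, 1], [])
g2_dep, dof_dep = g_square([0, 0, 1, 1], [0, 0, 1, 1], [])
```

The p-value is then the upper tail of the χ² distribution with `dof` degrees of freedom evaluated at `g2`; with SciPy installed this could be obtained from `scipy.stats.chi2.sf(g2, dof)`.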

121) Determine runtime service dependency weights: the KPI used here is a service level objective (SLO) indicator, such as service request latency, which is used to assess whether a service is running normally. Denote the KPI time series of a service s_i as T_i = {t_1, t_2, ..., t_n} and its log frequency series as L_i = {l_1, l_2, ..., l_n}; the time window size is [equation image BDA0003958229260000072], and the total number of time windows is n. Initially, a causal relationship is assumed to exist between every pair of the m services. Using the time series data T, compute the G² value between any s_i and s_j and look up its p-value under the χ² distribution. If the p-value > ξ, the conditional independence hypothesis between s_i and s_j is accepted; otherwise it is rejected. If the hypothesis is accepted, record the condition Z in effect as the separation set S(s_i, s_j) of s_i and s_j. The same conditional independence test is then applied to s_i and s_j using the log frequency series data L. If both independence hypotheses for s_i and s_j, on T and on L, are accepted, it is concluded that there is no causal relationship between s_i and s_j; otherwise a causal relationship exists. All pairs (s_i, s_j) are traversed until the causal relationship of every pair has been determined.
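The decision rule of step 121) combines two tests per service pair: an edge is dropped only when independence is accepted on both the KPI series and the log frequency series. A minimal sketch of that rule follows. It takes the p-values as given rather than computing them, ignores the search over conditioning sets that a full PC implementation performs, and assumes a threshold ξ = 0.05, which the source does not specify. All function and service names are hypothetical.

```python
from itertools import combinations


def has_causal_link(p_kpi, p_log, xi):
    """Keep the link unless independence is accepted on both data sources."""
    kpi_independent = p_kpi > xi  # hypothesis accepted on KPI series T
    log_independent = p_log > xi  # hypothesis accepted on log series L
    return not (kpi_independent and log_independent)


def causal_skeleton(services, p_values, xi=0.05):
    """Return the undirected service pairs that remain causally linked.

    p_values maps a sorted pair (a, b) to (p-value on T, p-value on L).
    """
    edges = set()
    for a, b in combinations(sorted(services), 2):
        p_kpi, p_log = p_values[(a, b)]
        if has_causal_link(p_kpi, p_log, xi):
            edges.add((a, b))
    return edges


# Illustrative p-values only; they are not measurements.
p_values = {
    ("api", "auth"): (0.01, 0.40),
    ("api", "db"): (0.30, 0.20),
    ("auth", "db"): (0.02, 0.03),
}
skeleton = causal_skeleton(["api", "auth", "db"], p_values)
```

Only the pair `("api", "db")` is removed, because it is the only one whose independence hypothesis is accepted on both series.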

122) Determine the direction of runtime service dependencies: the d-separation conditions are then used to determine the direction of each causal relationship. There are four rules:

(1) For any two variables X and Y that are not adjacent (have no causal relationship) and share a common neighbor Z, if [equation image BDA0003958229260000071], orient X-Z-Y as X→Z←Y.

(2) If X→Y exists, orient every Y-Z as Y→Z.

(3) If X→Z→Y exists, orient every X-Y as X→Y.

(4) If X-Z_1→Y and X-Z_2→Y both exist, orient every X-Y as X→Y.

Rule (1) takes precedence over rules (2), (3), and (4): rules (2), (3), and (4) are applied only after every application of rule (1) has been completed. Rules (2), (3), and (4) may be applied in any order.

If the direction of the dependency between s_i and s_j still cannot be determined after applying these rules, a bidirectional dependency is added, that is, a dependency from s_i to s_j together with a dependency from s_j to s_i.
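The sketch below illustrates rule (1) and the bidirectional fallback only; rules (2) to (4) are omitted for brevity. The condition of rule (1) is available only as an image reference, so the code assumes the usual PC-algorithm condition that Z is not in the separation set S(X, Y). The function name `orient_edges` and the data layout (undirected edges as `frozenset` pairs, separation sets keyed by `frozenset` pairs) are hypothetical.

```python
from itertools import combinations


def orient_edges(edges, sepsets):
    """Orient an undirected skeleton into a set of directed (src, dst) edges.

    edges: set of frozenset({a, b}) undirected links.
    sepsets: dict mapping frozenset({x, y}) to the separation set S(x, y).
    """
    nodes = sorted({n for e in edges for n in e})
    neighbors = {
        n: {m for e in edges if n in e for m in e if m != n} for n in nodes
    }
    directed = set()
    # Rule (1): for non-adjacent X, Y with common neighbor Z, orient
    # X -> Z <- Y when Z is not in S(X, Y) (assumed condition).
    for z in nodes:
        for x, y in combinations(sorted(neighbors[z]), 2):
            pair = frozenset((x, y))
            if pair in edges:
                continue  # X and Y are adjacent, rule does not apply
            sep = sepsets.get(pair)
            if sep is not None and z not in sep:
                directed.add((x, z))
                directed.add((y, z))
    # Fallback: edges still undirected become bidirectional dependencies.
    for e in edges:
        a, b = sorted(e)
        if (a, b) not in directed and (b, a) not in directed:
            directed.add((a, b))
            directed.add((b, a))
    return directed


edges = {frozenset(p) for p in [("X", "Z"), ("Y", "Z"), ("Z", "W")]}
sepsets = {
    frozenset(("X", "Y")): set(),
    frozenset(("W", "X")): {"Z"},
    frozenset(("W", "Y")): {"Z"},
}
directed = orient_edges(edges, sepsets)
```

In this example X→Z←Y is oriented by rule (1), while the link between Z and W stays undetermined and is therefore added in both directions.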

13) Build the dynamic service dependency graph: each time an abnormal root cause location task runs, steps 11) and 12) are executed to obtain the deployment-time and runtime service dependencies dynamically. From these dependencies a service dependency graph G = <V, E> is built, where G is a directed graph (DG), V is the set of services (the nodes of G), and E is the set of dependencies (the edges of G). The construction rules are as follows. If a runtime dependency exists between services s_i and s_j, add a directed edge between s_i and s_j whose weight is the G² value between s_i and s_j. If a deployment-time dependency exists between s_i and s_j, add directed edges from s_i to s_j and from s_j to s_i, each weighted by the mean of the runtime dependency edge weights.
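The construction rules of step 13) can be sketched as follows. Two points are assumptions, because the text leaves them open: the "mean of the runtime dependency edge weights" is taken over all runtime edges of the graph, and when a pair has both kinds of dependency the runtime weight is kept. The function name `build_dependency_graph` and the sample weights are hypothetical.

```python
def build_dependency_graph(runtime_edges, deployment_pairs):
    """Merge runtime and deployment-time dependencies into one weighted digraph.

    runtime_edges: dict mapping (src, dst) to the G^2 weight of that edge.
    deployment_pairs: iterable of (a, b) pairs with a deployment dependency.
    Returns a dict mapping (src, dst) to the edge weight.
    """
    graph = dict(runtime_edges)
    if runtime_edges:
        mean_weight = sum(runtime_edges.values()) / len(runtime_edges)
    else:
        mean_weight = 0.0  # assumption: no runtime edges gives weight 0
    for a, b in deployment_pairs:
        for edge in ((a, b), (b, a)):
            # Assumption: an existing runtime weight is not overwritten.
            graph.setdefault(edge, mean_weight)
    return graph


runtime_edges = {("a", "b"): 4.0, ("b", "c"): 2.0}
graph = build_dependency_graph(runtime_edges, [("a", "c")])
```

The deployment pair `("a", "c")` yields two edges, a→c and c→a, each with weight 3.0, the mean of 4.0 and 2.0.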

In the above abnormal root cause location method based on the dynamic service dependency graph, the abnormal propagation graph is constructed by the following step:

21) Build the abnormal propagation graph: mark the input anomaly set A = {a_1, a_2, ..., a_n} on the service dependency graph G and extract the service nodes marked as abnormal to form a subgraph G′ of G, retaining the edges between these nodes. The value of each node is the reciprocal of the time at which its anomaly occurred. Here a_i = (s_i, t_i) denotes the i-th anomaly, s_i is the service on which anomaly a_i occurred, and t_i is the time at which it occurred. To simplify the calculation, the occurrence times are normalized: the time of the earliest anomaly is set to 1, and the times of the remaining anomalies increase from 1 in their order of occurrence.
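Step 21) amounts to ranking the anomalies by time and taking the induced subgraph. The sketch below reads "increase from 1 in order of occurrence" as ranks 1, 2, 3, and so on, keeps the earliest anomaly per service, and breaks ties arbitrarily; these are assumptions, as are the function name `build_propagation_graph` and the sample timestamps.

```python
def build_propagation_graph(graph, anomalies):
    """Return (node values, edges) of the abnormal propagation graph G'.

    graph: dict mapping (src, dst) to an edge weight.
    anomalies: list of (service, timestamp) pairs.
    """
    first_seen = {}
    for service, timestamp in sorted(anomalies, key=lambda a: a[1]):
        first_seen.setdefault(service, timestamp)  # keep earliest anomaly
    ordered = sorted(first_seen, key=first_seen.get)
    # Normalized time is the rank (1, 2, 3, ...); node value is its reciprocal.
    node_value = {s: 1.0 / rank for rank, s in enumerate(ordered, start=1)}
    # Keep only edges whose two endpoints are both abnormal services.
    edges = {
        (src, dst): w
        for (src, dst), w in graph.items()
        if src in node_value and dst in node_value
    }
    return node_value, edges


graph = {("a", "b"): 1.0, ("b", "c"): 2.0, ("c", "d"): 3.0}
anomalies = [("b", 105.0), ("a", 100.0), ("c", 230.0)]
node_value, sub_edges = build_propagation_graph(graph, anomalies)
```

Service `a` failed first and gets the value 1, `b` gets 1/2, and `c` gets 1/3; the edge c→d is dropped because `d` is not abnormal.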

In the above abnormal root cause location method based on the dynamic service dependency graph, the abnormal root cause is located by the following steps:

31) Construct abnormal propagation paths: on the abnormal propagation graph G′, randomly select an abnormal service s′_0 as the starting point and use depth-first search to find the candidate root cause nodes and their abnormal propagation paths.

311) Find the set of candidate root cause nodes: initialize s′_i = s′_0, o_k = ∅, and R = ∅, then execute the following recursive steps:

(1) If s′_i ∈ o_k, return to the calling level.

(2) Add s′_i to o_k.

(3) If s′_i has no adjacent abnormal node, add s′_i to the candidate root cause node set R.

(4) If s′_i has adjacent abnormal nodes s′_j, then for each s′_j, if [equation image BDA0003958229260000081], set s′_i = s′_j and execute step (1).

(5) Return to the calling level.

Steps (1) to (5) are executed recursively until no new abnormal service is added to R.
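The recursion of step 311) is a depth-first traversal that collects the nodes with no further abnormal neighbor. In the sketch below, "adjacent" is read as "reachable through one outgoing edge of G′", and the image-only condition of step (4) is assumed to be "s′_j has not been visited yet"; both readings are assumptions. The function name `find_candidate_roots` is hypothetical.

```python
def find_candidate_roots(edges, start):
    """Return the candidate root cause set R reachable from `start`.

    edges: iterable of (src, dst) pairs of the abnormal propagation graph.
    """
    successors = {}
    for src, dst in edges:
        successors.setdefault(src, []).append(dst)
    visited = set()  # plays the role of o_k
    roots = set()    # plays the role of R

    def visit(node):
        if node in visited:          # step (1)
            return
        visited.add(node)            # step (2)
        next_nodes = successors.get(node, [])
        if not next_nodes:           # step (3): no adjacent abnormal node
            roots.add(node)
        for nxt in next_nodes:       # step (4)
            if nxt not in visited:   # assumed condition
                visit(nxt)
        # step (5): return to the caller

    visit(start)
    return roots


roots = find_candidate_roots([("a", "b"), ("b", "c"), ("a", "d")], "a")
```

Starting from `a`, the traversal ends at `c` and `d`, which have no outgoing edges and therefore form the candidate set.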

312) Determine the abnormal propagation paths: for each s′_t ∈ R, denote the set of abnormal propagation paths from abnormal service s′_0 to abnormal service s′_t as Path(s′_0, s′_t) = {o_1, o_2, ..., o_n}, where o_k is one abnormal propagation path, o_k = {s′_0, ..., s′_t}. Initialize s′_i = s′_0, o_k = ∅, and Path(s′_0, s′_t) = ∅, then execute the following recursive steps:

(1) If s′_i ∈ o_k, return to the calling level.

(2) Add s′_i to o_k.

(3) If s′_i == s′_t, add o_k to Path(s′_0, s′_t).

(4) If s′_i has adjacent abnormal service nodes s′_j, then for each s′_j, if [equation image BDA0003958229260000082], set s′_i = s′_j and execute step (1).

(5) Return to the calling level.

Steps (1) to (5) are executed recursively until no new abnormal propagation path is added to Path(s′_0, s′_t).
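Step 312) enumerates the simple paths from s′_0 to a candidate root s′_t. The sketch below treats o_k as the path currently being explored and removes a node from it when the recursion returns, so that several distinct paths can be collected; this backtracking, the reading of "adjacent" as outgoing edges, and the function name `propagation_paths` are assumptions not stated in the text.

```python
def propagation_paths(edges, start, target):
    """Return all simple paths from `start` to `target` as lists of nodes.

    edges: iterable of (src, dst) pairs of the abnormal propagation graph.
    """
    successors = {}
    for src, dst in edges:
        successors.setdefault(src, []).append(dst)
    paths = []

    def visit(node, path):
        if node in path:                      # step (1)
            return
        path.append(node)                     # step (2)
        if node == target:                    # step (3)
            paths.append(list(path))
        for nxt in successors.get(node, []):  # step (4)
            visit(nxt, path)
        path.pop()                            # step (5), with backtracking

    visit(start, [])
    return paths


edges = [("a", "b"), ("b", "d"), ("a", "c"), ("c", "d"), ("a", "d")]
paths = propagation_paths(edges, "a", "d")
```

For this graph the three paths a→b→d, a→c→d, and a→d are found, so M = 3 for the target `d`.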

32) Compute the abnormal root cause score: for s′_0, use the abnormal propagation graph G′ and the abnormal propagation paths Path(s′_0, s′_t) to compute the abnormal root cause score of each s′_t ∈ R, as shown in formula (3).

[Formula (3): equation image BDA0003958229260000083]

where score(s′_t) is the abnormal root cause score of abnormal service s′_t, M is the number of abnormal propagation paths from s′_0 to s′_t, N is the number of hops in propagation path o_k, w_{k,i} is the dependency weight between node i and node i-1 on the k-th propagation path, and s_i is the value of node i, that is, the reciprocal of the time at which its anomaly occurred. The formula is designed on two principles: the earlier an anomaly occurs, the more likely it is the root cause, and the more propagation paths lead to an anomaly, the more likely it is the root cause.

33) Report the abnormal root causes and propagation paths: sort the abnormal root causes by score in descending order and report the abnormal propagation paths corresponding to each.

Fig. 2 is a structural block diagram of the abnormal root cause location system based on the dynamic service dependency graph provided by the present invention.

The present invention provides a system that implements the abnormal root cause location method based on the dynamic service dependency graph. The system takes configuration information, runtime logs, key performance indicators, and the anomaly set as input, and comprises a dynamic service dependency graph construction module, an abnormal propagation graph construction module, and an abnormal root cause location module.

Each module is described in detail below.

S1) Dynamic service dependency graph construction module

This module builds the dynamic service dependency graph from service configuration information, runtime logs, and key performance indicators. It contains four submodules:

S11) Deployment-time service dependency discoverer

The deployment-time service dependency discoverer uses service configuration information to mine the deployment-time dependencies between microservices that are deployed at the same location or depend on the same resource.

S12) Runtime service dependency discoverer

The runtime service dependency discoverer uses service runtime logs and key performance indicators to mine the runtime dependencies between microservices.

S13) Dynamic service dependency graph builder

The dynamic service dependency graph builder dynamically builds the service dependency graph from the deployment-time and runtime service dependencies.

S14) Service dependency graph store

The service dependency graph store keeps the service dependency graph in matrix form and provides high-performance queries on it.
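Matrix storage of a weighted directed graph can be sketched as an adjacency matrix with a name-to-index map, which gives constant-time weight lookups. The class name `DependencyMatrix`, its methods, and the convention that a weight of 0.0 means "no edge" are assumptions made for illustration; the source does not describe the store's interface.

```python
class DependencyMatrix:
    """Adjacency-matrix store for a weighted, directed service graph."""

    def __init__(self, services):
        self.index = {name: i for i, name in enumerate(services)}
        n = len(services)
        # weights[i][j] is the weight of edge i -> j; 0.0 means no edge.
        self.weights = [[0.0] * n for _ in range(n)]

    def add_edge(self, src, dst, weight):
        self.weights[self.index[src]][self.index[dst]] = weight

    def weight(self, src, dst):
        """Constant-time lookup of the edge weight from src to dst."""
        return self.weights[self.index[src]][self.index[dst]]

    def successors(self, src):
        """Return the services that src has an outgoing edge to."""
        row = self.weights[self.index[src]]
        return [name for name, j in self.index.items() if row[j] != 0.0]


store = DependencyMatrix(["a", "b", "c"])
store.add_edge("a", "b", 4.0)
store.add_edge("a", "c", 3.0)
store.add_edge("c", "a", 3.0)
```

A dense matrix suits graphs with a modest number of services; for very large, sparse graphs an adjacency list would use less memory at the cost of slower edge lookups.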

S2) Abnormal propagation graph construction module

This module builds the abnormal propagation graph from the service dependency graph and the input anomaly set. It contains two submodules:

S21) Abnormal propagation graph builder

The abnormal propagation graph builder marks the anomalies of the input anomaly set on the service dependency graph and extracts the resulting subgraph to form the abnormal propagation graph.

S22) Abnormal propagation graph store

The abnormal propagation graph store keeps the abnormal propagation graph in matrix form and provides high-performance queries on it.

S3) Abnormal root cause location module

This module uses the abnormal propagation graph to find all candidate root cause nodes, construct their abnormal propagation paths, compute the root cause scores, and generate the abnormal root cause report. It contains three submodules:

S31) Abnormal propagation path builder

The abnormal propagation path builder uses the abnormal propagation graph to find all candidate root cause nodes and construct the abnormal propagation paths.

S32) Abnormal root cause calculator

The abnormal root cause calculator uses the abnormal propagation paths to compute the root cause score of each anomaly from the number of propagation paths, the weights of the edges on each path, and the times at which the anomalies occurred.

S33) Abnormal root cause reporter

The abnormal root cause reporter sorts the anomalies by score in descending order and reports the abnormal propagation paths corresponding to each.

It should be noted that the embodiments are disclosed to aid further understanding of the present invention, and those skilled in the art will understand that various substitutions and modifications are possible without departing from the spirit and scope of the present invention and the appended claims. The present invention should therefore not be limited to the content disclosed in the embodiments; the scope of protection is defined by the claims.

Claims (7)

1. An abnormal root cause location method based on a dynamic service dependency graph, characterized by comprising constructing a dynamic service dependency graph, constructing an abnormal propagation graph, and locating the abnormal root cause, with the following specific steps:
1) Construct the dynamic service dependency graph by executing the following steps:
11) Extract deployment-time service dependencies;
12) Extract runtime service dependencies;
13) Build the dynamic service dependency graph G: each time the abnormal root cause location task is executed, steps 11) and 12) are executed to dynamically obtain the deployment-time and runtime service dependencies, and a service dependency graph G = <V, E> is built, where G is a directed graph, V represents the services and is the node set of G, and E represents the dependencies and is the edge set of G;
2) Construct the abnormal propagation graph by executing the following step:
Build the abnormal propagation graph: mark the input anomaly set A = {a_1, a_2, ..., a_n} on the service dependency graph G and extract the service nodes marked as abnormal to form a subgraph G′ of G while retaining the edges between the nodes, where the value of each node is the reciprocal of the time at which its anomaly occurred;
3) Locate the abnormal root cause by executing the following steps:
31) Construct abnormal propagation paths: on the abnormal propagation graph G′, randomly select an abnormal service s′_0 as the starting point and use depth-first search to find the candidate root cause nodes and their abnormal propagation paths;
32) Compute the abnormal root cause score: for s′_0, use the abnormal propagation graph G′ and the abnormal propagation paths Path(s′_0, s′_t) to compute the abnormal root cause score of each s′_t ∈ R;
33) Report the abnormal root causes and propagation paths: sort the abnormal root causes by score in descending order and report the abnormal propagation paths corresponding to each.
2. The abnormal root cause location method based on a dynamic service dependency graph according to claim 1, wherein in step 11) the attribute set of the configuration information to be collected is defined as K = (k_1, k_2, ..., k_n), so that every collected configuration item c_i = (k_i, v_i) satisfies k_i ∈ K, where k_i is the attribute of the configuration item and v_i is its value; the configuration set of a service s_i is denoted C_i = (c_1, c_2, ..., c_n); and if the configuration sets of services s_i and s_j intersect, that is, C_i ∩ C_j ≠ ∅, a deployment dependency exists between s_i and s_j.
3. The abnormal root cause location method based on a dynamic service dependency graph according to claim 1, wherein step 12) uses the PC algorithm, which determines the causal relationships between variables through conditional independence tests and then determines the direction of each causal relationship using the d-separation conditions; the conditional independence test used is the G² conditional cross-entropy measure shown in formula (1), which follows a χ² distribution with D degrees of freedom, as shown in formula (2):
[Formula (1): equation image FDA0003958229250000011]
D = (N_X - 1)(N_Y - 1) \prod_{Z' \in Z} N_{Z'}    Formula (2)
where [equation image FDA0003958229250000021], Z is the set of the variables Z′, and m is the number of samples.
4. The abnormal root cause location method based on a dynamic service dependency graph according to claim 1, wherein the construction rules in step 13) are: if a runtime dependency exists between services s_i and s_j, add a directed edge between s_i and s_j whose weight is the G² value between s_i and s_j; and if a deployment-time dependency exists between s_i and s_j, add directed edges from s_i to s_j and from s_j to s_i, each weighted by the mean of the runtime dependency edge weights.
5. The abnormal root cause location method based on a dynamic service dependency graph according to claim 1, wherein in step 2), a_i = (s_i, t_i) denotes the i-th anomaly, s_i is the service on which anomaly a_i occurred, and t_i is the time at which anomaly a_i occurred; the occurrence times are normalized, with the time of the earliest anomaly set to 1 and the times of the remaining anomalies increasing from 1 in their order of occurrence.
6. The abnormal root cause location method based on a dynamic service dependency graph according to claim 1, wherein the calculation in step 32) is given by formula (3):
[Formula (3): equation image FDA0003958229250000022]
where score(s′_t) is the abnormal root cause score of abnormal service s′_t, M is the number of abnormal propagation paths from s′_0 to s′_t, N is the number of hops in propagation path o_k, w_{k,i} is the dependency weight between node i and node i-1 on the k-th propagation path, and s_i is the value of node i, that is, the reciprocal of the time at which its anomaly occurred.
7. An abnormal root cause location system based on a dynamic service dependency graph, characterized by comprising a dynamic service dependency graph construction module, an abnormal propagation graph construction module, and an abnormal root cause location module, wherein:
the dynamic service dependency graph construction module comprises a deployment-time service dependency discoverer, a runtime service dependency discoverer, a dynamic service dependency graph builder, and a service dependency graph store; the deployment-time service dependency discoverer uses service configuration information to discover the deployment-time dependencies between microservices; the runtime service dependency discoverer uses service runtime logs and key performance indicators to discover the runtime dependencies between microservices; the dynamic service dependency graph builder uses the deployment-time and runtime service dependencies to dynamically build the service dependency graph; and the service dependency graph store stores the service dependency graph;
the abnormal propagation graph construction module comprises an abnormal propagation graph builder and an abnormal propagation graph store, wherein the abnormal propagation graph builder builds the abnormal propagation graph from the service dependency graph and the input anomaly set, and the abnormal propagation graph store stores the abnormal propagation graph;
the abnormal root cause location module comprises an abnormal propagation path builder, an abnormal root cause calculator, and an abnormal root cause reporter, wherein the abnormal propagation path builder uses the abnormal propagation graph to find all candidate root cause nodes and construct the abnormal propagation paths; the abnormal root cause calculator uses the abnormal propagation paths to compute the abnormal root cause score from the number of paths, the weights of the edges on each path, and the times at which the anomalies occurred; and the abnormal root cause reporter sorts the anomalies by score in descending order and reports the abnormal propagation paths corresponding to each.
CN202211470197.6A 2022-11-23 2022-11-23 A Method and System for Abnormal Root Cause Location Based on Dynamic Service Dependency Graph Active CN115756929B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211470197.6A CN115756929B (en) 2022-11-23 2022-11-23 A Method and System for Abnormal Root Cause Location Based on Dynamic Service Dependency Graph

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211470197.6A CN115756929B (en) 2022-11-23 2022-11-23 A Method and System for Abnormal Root Cause Location Based on Dynamic Service Dependency Graph

Publications (2)

Publication Number Publication Date
CN115756929A true CN115756929A (en) 2023-03-07
CN115756929B CN115756929B (en) 2023-06-02

Family

ID=85335430

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211470197.6A Active CN115756929B (en) 2022-11-23 2022-11-23 A Method and System for Abnormal Root Cause Location Based on Dynamic Service Dependency Graph

Country Status (1)

Country Link
CN (1) CN115756929B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103606042A (en) * 2013-11-18 2014-02-26 南京理工大学 Service combination instance migration effectiveness judgment method based on dynamic dependency graph
CN112698975A (en) * 2020-12-14 2021-04-23 北京大学 Fault root cause positioning method and system of micro-service architecture information system
CN112787841A (en) * 2019-11-11 2021-05-11 华为技术有限公司 Fault root cause positioning method and device and computer storage medium
US20220019495A1 (en) * 2020-07-14 2022-01-20 Microsoft Technology Licensing, Llc Machine learning-based techniques for providing focus to problematic compute resources represented via a dependency graph
EP3951598A1 (en) * 2020-08-07 2022-02-09 NEC Laboratories Europe GmbH Methods and systems for detecting anomalies in cloud services based on mining time-evolving graphs
CN115118621A (en) * 2022-06-27 2022-09-27 浙江大学 Micro-service performance diagnosis method and system based on dependency graph
CN115278741A (en) * 2022-06-15 2022-11-01 清华大学 Fault diagnosis method and device based on multi-mode data dependency relationship

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116450399A (en) * 2023-06-13 2023-07-18 西华大学 Fault diagnosis and root cause positioning method for micro service system
CN116450399B (en) * 2023-06-13 2023-08-22 西华大学 Microservice system fault diagnosis and root cause location method
CN116820826A (en) * 2023-08-28 2023-09-29 北京必示科技有限公司 Root cause positioning method, device, equipment and storage medium based on call chain
CN116820826B (en) * 2023-08-28 2023-11-24 北京必示科技有限公司 Root cause positioning method, device, equipment and storage medium based on call chain
CN117792696A (en) * 2023-12-07 2024-03-29 北京邮电大学 A method and device for log anomaly detection and location for distributed systems

Also Published As

Publication number Publication date
CN115756929B (en) 2023-06-02

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant