CN115756929A - Abnormal root cause positioning method and system based on dynamic service dependency graph - Google Patents
Abnormal root cause positioning method and system based on dynamic service dependency graph
- Publication number
- CN115756929A (application CN202211470197.6A)
- Authority
- CN
- China
- Prior art keywords
- abnormal
- service
- root cause
- graph
- propagation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Landscapes
- Debugging And Monitoring (AREA)
Abstract
Description
Technical Field
The invention belongs to the field of intelligent operation and maintenance, and in particular relates to an anomaly root cause localization method based on a dynamic service dependency graph.
Background Art
A system built on a microservice architecture (a microservice system for short) consists of a large number of fine-grained microservices that cooperate to implement the system's functions. Such systems are widely used in communications, transportation, logistics, finance and other fields, and modern Internet applications often use a microservice system as their underlying implementation. The core design idea of the microservice architecture is to split an application into a number of highly cohesive, loosely coupled, single-purpose microservices that can be developed, deployed and updated independently, thereby enabling agile development and continuous delivery. The microservice architecture makes application development and iteration more convenient, but it also brings new challenges to system operation and maintenance: the complexity of a microservice system makes applications prone to anomalies whose root causes are hard to diagnose.
The complexity of a microservice system shows in the large number of microservices, the dynamically changing dependencies between them, and their heterogeneous implementations. Because each microservice has a single function, a microservice system may need a very large number of microservices to deliver the system's functions. For example, Netflix's microservice system runs more than 6,000 microservice instances and handles more than 2 billion service requests per day; Uber's microservice system runs more than 4,000 microservice instances. At the same time, to keep up with changing user needs and application growth, a microservice system can deploy and register microservice instances dynamically, and new microservices are added continually, so the dependencies between microservices change dynamically and become intricate. In addition, the service-oriented interface encapsulation of a microservice system allows different microservices in the same system to use different programming languages, service frameworks and communication mechanisms to suit different application scenarios. This heterogeneity in design and implementation further increases the complexity of the microservice system.
The complexity of a microservice system makes it prone to anomalies whose root causes are hard to diagnose. An anomaly is a deviation of the system's running state from its designed expectation. Because microservices are loosely coupled and heterogeneously implemented, they usually cooperate through remote calls over the network, which easily gives rise to anomalies caused by network instability, version incompatibility, configuration errors, code defects and similar problems. The large number of microservices further raises the likelihood of such anomalies. Moreover, because the dependencies between services are complex and change dynamically, any minor anomaly may trigger a chain reaction in which multiple anomalies occur at the same time, with only one or a few of them being the root cause of the others. This happens because an anomaly spreads along service dependencies and forms an anomaly propagation chain; the microservice anomalies produced through these dependencies are causally related, and the source of the propagation chain is usually the root cause. To keep the system reliable, when an anomaly is detected, operation and maintenance personnel need to locate its root cause immediately and respond accordingly, for example by rolling back a change. Otherwise the anomaly may cause service interruption or degraded service quality and lead to serious losses. International Data Corporation has reported that one hour of service interruption causes an average loss of about 100,000 US dollars. However, although a microservice system can provide rich operation and maintenance data, such as logs, key performance indicators and call chains, the differing data formats, the sparsity of valuable information and the dynamically changing service dependencies make it hard for operation and maintenance personnel to locate the root cause of an anomaly in a complex microservice system. In severe cases the anomaly cannot be repaired in time, causing more serious economic and reputational damage to the enterprise.
Existing anomaly root cause localization methods for microservice systems still have shortcomings: they often ignore the dynamic changes in service dependencies and the interpretability of the localization result. Most existing methods rely on a service dependency graph. A service dependency graph describes the dependencies between services, and these dependencies can also describe how anomalies propagate between services, which helps operation and maintenance personnel locate the root cause. Localization based on a service dependency graph first constructs the graph from metrics, system logs or tracing data. When an anomaly occurs, it starts from the anomalous nodes in the graph, obtains a set of candidate root causes through algorithms such as graph search or random walk, and then ranks the candidates by anomaly score, correlation with the anomalous nodes, number of visits, or similar criteria. However, existing methods usually rely on a static service dependency graph and ignore the dynamic changes in service dependencies, so the constructed dependencies differ from the real ones, which limits localization accuracy. In addition, existing methods usually report only the root cause service or root cause metric and lack interpretability, making it hard for operation and maintenance personnel to understand the reported result quickly and judge whether the root cause is correct.
Summary of the Invention
To reduce the extra analysis and rollbacks caused by inaccurate root cause localization, lower operation and maintenance costs, improve the interpretability of anomaly localization, and safeguard the reliability of microservice systems, the invention provides an anomaly root cause localization method based on a dynamic service dependency graph. The method extracts deployment-time service dependencies from service configuration information, discovers runtime service dependencies from the correlation of key performance indicators between services and from log frequency information, and dynamically builds a service dependency graph; it automatically builds an anomaly propagation graph from the input anomalous services and the service dependency graph; and it constructs anomaly propagation paths by depth-first search, computes root cause scores, locates the root cause service, and reports the anomaly propagation paths.
In the invention, the construction of the dynamic service dependency graph, the construction of the anomaly propagation graph, and the root cause localization are all performed automatically without manual intervention, which saves labor costs. Because the localization is based on a dynamic service dependency graph, it captures runtime changes in service dependencies better, helps describe anomaly propagation more precisely, and improves root cause localization. The method applies to the root cause localization of many types of anomalies and is therefore general; it locates the anomalous service directly and provides the anomaly propagation path, which makes it practical and interpretable.
The technical solution provided by the invention is as follows:
An anomaly root cause localization method based on a dynamic service dependency graph, characterized in that it comprises dynamic service dependency graph construction, anomaly propagation graph construction, and anomaly root cause localization. The specific steps are:
1) Dynamic service dependency graph construction, which performs the following steps:
11) Extract deployment-time service dependencies. The underlying principle is that two microservices deployed at the same location (for example, a host) or depending on the same resource (for example, a database) have a deployment-time dependency. The configuration information of each service, including deployment location information and associated resource information, is extracted from a configuration management database (CMDB). Define the attribute set of the configuration information to be collected as $K = (k_1, k_2, \ldots, k_n)$, so that every collected configuration item $c_i = (k_i, v_i)$ satisfies $k_i \in K$, where $k_i$ is the attribute (metadata) of the configuration item and $v_i$ is its value. Denote the configuration set of a service $s_i$ as $C_i = (c_1, c_2, \ldots, c_n)$. If the configuration sets of services $s_i$ and $s_j$ intersect, that is, $C_i \cap C_j \neq \varnothing$, then there is a deployment-time dependency between $s_i$ and $s_j$.
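The intersection rule in step 11) can be illustrated with a minimal Python sketch. The service names, attribute keys and the `deployment_dependencies` helper are hypothetical and chosen only for illustration; the source does not specify how CMDB records are represented.

```python
from itertools import combinations


def deployment_dependencies(configs):
    """Return unordered service pairs whose configuration sets intersect.

    configs maps a service name to a set of (attribute, value) items,
    mirroring C_i = (c_1, ..., c_n) with c = (k, v).
    """
    pairs = set()
    for a, b in combinations(sorted(configs), 2):
        if configs[a] & configs[b]:  # C_a and C_b share at least one item
            pairs.add((a, b))
    return pairs


# Hypothetical CMDB extract; names and attribute keys are illustrative only.
configs = {
    "cart": {("host", "node-1"), ("database", "orders-db")},
    "payment": {("host", "node-2"), ("database", "orders-db")},
    "search": {("host", "node-3"), ("database", "catalog-db")},
}
deps = deployment_dependencies(configs)
```

Here `cart` and `payment` share the item `("database", "orders-db")`, so they are the only pair reported as having a deployment-time dependency.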
12) Extract runtime service dependencies. The underlying principle is that if, as the system load changes, the log volumes or key performance indicators (KPIs) produced by two services are causally related, then the two services have a runtime dependency, and the strength of the causal relationship indicates the strength of the runtime dependency. The causal inference algorithm used here is the PC algorithm: given a reliable conditional independence test, the PC algorithm can handle many types of data distributions and causal relationships, and compared with other causal inference methods it has low complexity and good results. The PC algorithm decides whether variables are causally related by conditional independence tests, and then uses the d-separation criterion to orient the causal relationships. The conditional independence test used here is the $G^2$ conditional cross-entropy measure, given by formula (1), which follows a $\chi^2$ distribution with $D$ degrees of freedom, given by formula (2).
$D = (N_X - 1)(N_Y - 1)\prod_{Z' \in Z} N_{Z'}$    Formula (2)
where $Z$ is the set of $Z'$ and $m$ is the number of samples.
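As a rough illustration of the test in step 12), the sketch below computes a conditional $G^2$ statistic and the degrees of freedom of formula (2) for discrete data. Formula (1) itself does not appear in this text, so the sketch falls back on the standard G-test statistic $G^2 = 2\sum O \ln(O/E)$ summed over the strata of $Z$; treating that as the intended measure is an assumption, as are the function names and the reading of $N_X$, $N_Y$, $N_{Z'}$ as numbers of distinct values.

```python
import math
from collections import Counter


def g_squared(xs, ys, zs):
    """Standard G statistic for X independent of Y given Z (assumed form).

    xs, ys: discrete observations; zs: one hashable conditioning value per
    sample (use () for every sample when there is no conditioning set).
    """
    n_xyz = Counter(zip(xs, ys, zs))
    n_xz = Counter(zip(xs, zs))
    n_yz = Counter(zip(ys, zs))
    n_z = Counter(zs)
    g2 = 0.0
    for (x, y, z), observed in n_xyz.items():
        expected = n_xz[(x, z)] * n_yz[(y, z)] / n_z[z]
        g2 += 2.0 * observed * math.log(observed / expected)
    return g2


def degrees_of_freedom(n_x, n_y, n_zs):
    """Formula (2): D = (N_X - 1)(N_Y - 1) * product of N_Z' over Z' in Z."""
    return (n_x - 1) * (n_y - 1) * math.prod(n_zs)


# Perfectly dependent series versus an independent pair, no conditioning set.
dependent = g_squared([0, 0, 1, 1], [0, 0, 1, 1], [()] * 4)
independent = g_squared([0, 0, 1, 1], [0, 1, 0, 1], [()] * 4)
```

A p-value would then be read from the $\chi^2$ distribution with $D$ degrees of freedom, for example with `scipy.stats.chi2.sf(g2, D)` if SciPy is available; continuous KPI series would first need to be discretized, which the source does not detail.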
13) Build the dynamic service dependency graph. Each time an anomaly root cause localization task is executed, steps 11) and 12) are run to obtain the deployment-time and runtime service dependencies dynamically. From these, a service dependency graph $G = \langle V, E \rangle$ is built, where $G$ is a directed graph (DG), $V$ is the set of services, which are the nodes of $G$, and $E$ is the set of dependencies, which are the edges of $G$. The construction rules are as follows. If there is a runtime dependency between services $s_i$ and $s_j$, a directed edge between $s_i$ and $s_j$ is added, weighted by the $G^2$ value between $s_i$ and $s_j$. If there is a deployment-time dependency between $s_i$ and $s_j$, directed edges from $s_i$ to $s_j$ and from $s_j$ to $s_i$ are added, each weighted by the mean of the runtime dependency edge weights.
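The construction rules of step 13) could be sketched as follows. The edge-dictionary representation and the `build_dependency_graph` name are illustrative. Two behaviors are assumptions not stated in the source: a runtime edge keeps its own $G^2$ weight when the same pair also has a deployment-time dependency, and the deployment weight falls back to 0.0 when no runtime edges exist.

```python
from statistics import mean


def build_dependency_graph(runtime_edges, deployment_pairs):
    """Return {(src, dst): weight} for the service dependency graph G.

    runtime_edges: {(src, dst): g2_value} from the runtime step.
    deployment_pairs: iterable of (a, b) pairs with a deployment-time dependency.
    """
    graph = dict(runtime_edges)
    # Deployment edges share one weight: the mean of the runtime edge weights.
    default_weight = mean(runtime_edges.values()) if runtime_edges else 0.0
    for a, b in deployment_pairs:
        for edge in ((a, b), (b, a)):
            # Assumption: an existing runtime weight is not overwritten.
            graph.setdefault(edge, default_weight)
    return graph


# Hypothetical inputs for illustration.
runtime = {("gateway", "cart"): 6.0, ("cart", "payment"): 2.0}
graph = build_dependency_graph(runtime, [("cart", "payment")])
```

In this example the mean runtime weight is 4.0, so the added reverse edge `payment -> cart` gets weight 4.0 while `cart -> payment` keeps its runtime weight of 2.0.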
2) Anomaly propagation graph construction, which performs the following steps:
21) Build the anomaly propagation graph. The input anomaly set $A = \{a_1, a_2, \ldots, a_n\}$ is marked on the service dependency graph $G$; the service nodes marked as anomalous are extracted, together with the edges between them, to form a subgraph $G'$ of $G$, and the value of each node is the reciprocal of the time at which its anomaly occurred. Here $a_i = (s_i, t_i)$ denotes the $i$-th anomaly, $s_i$ the service on which anomaly $a_i$ occurred, and $t_i$ the time at which it occurred. To simplify the calculation, the occurrence times are normalized: the time of the earliest anomaly is set to 1, and the times of the remaining anomalies increase from 1 in order of occurrence.
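A minimal sketch of step 21), assuming the graph is stored as an edge dictionary and each service appears at most once in the anomaly set. The function name and sample data are hypothetical, and giving simultaneous anomalies the same normalized time is an assumption the source does not address.

```python
def build_propagation_graph(graph, anomalies):
    """Return (node_values, edges) of the anomaly propagation graph G'.

    graph: {(src, dst): weight}; anomalies: list of (service, timestamp).
    """
    # Rank distinct timestamps so the earliest anomaly gets time 1, the next 2, ...
    distinct_times = sorted({t for _, t in anomalies})
    order = {t: rank for rank, t in enumerate(distinct_times, start=1)}
    # Node value is the reciprocal of the normalized occurrence time.
    node_values = {s: 1.0 / order[t] for s, t in anomalies}
    # Keep only the edges whose two endpoints are both anomalous.
    edges = {
        (u, v): w
        for (u, v), w in graph.items()
        if u in node_values and v in node_values
    }
    return node_values, edges


# Hypothetical dependency graph and anomaly set.
dep_graph = {("gateway", "cart"): 6.0, ("cart", "payment"): 2.0, ("cart", "search"): 3.0}
alerts = [("gateway", 1700000030), ("cart", 1700000012), ("payment", 1700000005)]
values, sub_edges = build_propagation_graph(dep_graph, alerts)
```

The earliest anomaly (`payment`) receives the value 1.0, `cart` receives 0.5, and the edge to the healthy `search` service is dropped from the subgraph.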
3) Anomaly root cause localization, which performs the following steps:
31) Construct anomaly propagation paths. On the anomaly propagation graph $G'$, a randomly selected anomalous service $s'_0$ is taken as the starting point, and depth-first search is used to find the candidate root cause nodes and their anomaly propagation paths.
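The depth-first search in step 31) might look like the sketch below. It assumes that edges point from a service to the service it depends on, that the search follows edge direction, and that a node is a candidate root cause when the search cannot be extended beyond it; none of these conventions is spelled out in the source, and `propagation_paths` is a hypothetical name.

```python
def propagation_paths(edges, start):
    """Enumerate simple paths from start to nodes where the DFS cannot go further.

    edges: iterable of (src, dst) pairs in G'. Returns {end_node: [path, ...]}.
    """
    successors = {}
    for u, v in edges:
        successors.setdefault(u, []).append(v)

    paths = {}

    def dfs(node, path):
        # Only visit nodes not already on the current path (avoids cycles).
        unvisited = [v for v in successors.get(node, []) if v not in path]
        if not unvisited:
            paths.setdefault(node, []).append(path)
        for v in unvisited:
            dfs(v, path + [v])

    dfs(start, [start])
    return paths


# Hypothetical propagation graph edges.
found = propagation_paths(
    [("gateway", "cart"), ("cart", "payment"), ("gateway", "payment")],
    "gateway",
)
```

In this example `payment` is reached along two distinct paths, which is the kind of path count that the scoring step uses.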
32) Calculate root cause scores. For $s'_0$, the anomaly propagation graph $G'$ and the anomaly propagation paths $Path(s'_0, s'_t)$ are used to calculate the root cause score of each candidate $s'_t \in R$. The calculation is given by formula (3).
where $score(s'_t)$ is the root cause score of the anomalous service $s'_t$, $M$ is the number of anomaly propagation paths from $s'_0$ to $s'_t$, $N$ is the number of hops in propagation path $o_k$, $w_{k,i}$ is the dependency weight between node $i$ and node $i-1$ on the $k$-th propagation path, and $s_i$ is the value of node $i$, that is, the reciprocal of the time at which its anomaly occurred. The score is designed on the principle that the earlier an anomaly occurs, the more likely it is the root cause, and the more propagation paths lead to an anomaly, the more likely it is the root cause.
33) Report root causes and propagation paths. The root causes are sorted in descending order of score and reported together with their corresponding anomaly propagation paths.
The invention further provides an anomaly root cause localization system based on a dynamic service dependency graph, characterized in that it comprises a dynamic service dependency graph construction module, an anomaly propagation graph construction module, and an anomaly root cause localization module.
The dynamic service dependency graph construction module comprises a deployment-time service dependency discoverer, a runtime service dependency discoverer, a dynamic service dependency graph builder, and a service dependency graph store. The deployment-time service dependency discoverer uses service configuration information to discover deployment-time dependencies between microservices; the runtime service dependency discoverer uses service run logs and key performance indicators to discover runtime dependencies between microservices; the dynamic service dependency graph builder uses the deployment-time and runtime service dependencies to construct the service dependency graph dynamically; and the service dependency graph store stores the service dependency graph.
The anomaly propagation graph construction module comprises an anomaly propagation graph builder and an anomaly propagation graph store. The anomaly propagation graph builder uses the service dependency graph and the input anomaly set to construct the anomaly propagation graph; the anomaly propagation graph store stores the anomaly propagation graph.
The anomaly root cause localization module comprises an anomaly propagation path builder, a root cause calculator, and a root cause reporter. The anomaly propagation path builder uses the anomaly propagation graph to find all possible root cause nodes and construct the anomaly propagation paths; the root cause calculator uses the anomaly propagation paths to compute root cause scores from the number of paths, the weights of the edges on the paths, and the times at which the anomalies occurred; the root cause reporter sorts the anomalies in descending order of score and reports their corresponding anomaly propagation paths.
Compared with the prior art, the beneficial effects of the invention are as follows:
The invention provides an anomaly root cause localization method and system based on a dynamic service dependency graph. The service dependency graph is built dynamically by reading service configuration information, service run logs and key performance indicators, and is stored for later generation of the anomaly propagation graph. After an anomaly occurs in the system, the anomaly propagation graph is built from the service dependency graph and the anomaly set. Depth-first search is then used to traverse all possible root cause anomalies and their propagation paths; the root cause score of each anomaly is computed from the number of propagation paths, the weights of the edges on the paths, and the times at which the anomalies occurred; and the anomalies are sorted in descending order of score and reported with their corresponding propagation paths. The invention thus automatically builds the dynamic service dependency graph, generates the anomaly propagation graph, constructs the anomaly propagation paths and computes the root cause scores, thereby locating the root cause and reporting the corresponding propagation paths. The invention has the following main features:
(1) The system and method provided by the invention automatically build a dynamic service dependency graph from service configuration information and runtime information (key performance indicators and logs).
(2) The system and method provided by the invention automatically build an anomaly propagation graph from the input anomalies and the service dependency graph.
(3) The system and method provided by the invention locate the root cause service and at the same time provide the anomaly propagation path, which makes them practical and interpretable.
With the technical solution of the invention, a dynamic service dependency graph can be built automatically, an anomaly propagation graph constructed, the root cause located, and the anomaly propagation path provided; the solution is suitable for microservice systems.
Brief Description of the Drawings
Fig. 1 shows the anomaly root cause localization method based on a dynamic service dependency graph provided by the invention;
Fig. 2 shows the anomaly root cause localization system based on a dynamic service dependency graph provided by the invention.
Detailed Description
The invention is further described below through embodiments with reference to the accompanying drawings, without limiting the scope of the invention in any way.
Fig. 1 is a flow chart of the anomaly root cause localization method based on a dynamic service dependency graph provided by the invention. The invention comprises dynamic service dependency graph construction, anomaly propagation graph construction, and anomaly root cause localization.
Dynamic service dependency graph construction uses service configuration information, service run logs and key performance indicators to mine the deployment-time and runtime dependencies between services and to construct the service dependency graph dynamically. Configuration information describes where a service is located and which resources and services it depends on; each configuration item is a 2-tuple consisting of a configuration attribute and a configuration value. Run logs record the running status of the system, including the output of key variables and markers of key execution points; they exist as time-ordered text and are converted in the invention into a time series of log frequencies. Key performance indicators monitor the running state of the system and whether an anomaly has occurred; they exist as time series. The service dependency graph is a directed graph that describes the dependencies between the services of the microservice system: its nodes are services and its edges are the dependencies between them. The graph is built dynamically as system operation data accumulates and therefore adapts to system iteration.
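The conversion of run logs into a log-frequency time series can be sketched as a simple windowed count. The `log_frequency_series` helper, the 60-second window and the sample timestamps are assumptions for illustration; the source does not state how log entries are bucketed.

```python
from collections import Counter


def log_frequency_series(timestamps, start, window, n_windows):
    """Count log entries per fixed-size window; returns a list of length n_windows.

    timestamps: log entry times in seconds; entries outside the covered
    range [start, start + window * n_windows) are ignored.
    """
    counts = Counter(int((t - start) // window) for t in timestamps)
    return [counts.get(i, 0) for i in range(n_windows)]


# Five hypothetical log timestamps bucketed into three 60-second windows.
series = log_frequency_series([0, 5, 59, 60, 125], start=0, window=60, n_windows=3)
```

The resulting list has one count per time window and can be aligned with a KPI series sampled over the same windows.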
Anomaly propagation graph construction uses the input anomaly set and the service dependency graph to build the anomaly propagation graph for the subsequent propagation path analysis. Each anomaly item is a 2-tuple consisting of the service on which the anomaly occurred and the time at which it occurred. The anomaly propagation graph is also a directed graph and is a subgraph of the service dependency graph.
Anomaly root cause localization uses the anomaly propagation graph to find all possible root cause nodes and construct their corresponding anomaly propagation paths, which are used to compute the root cause score of each node; the nodes are then sorted in descending order to produce a root cause report. The report contains the root cause service scores and their corresponding anomaly propagation paths.
For the above anomaly root cause localization method based on a dynamic service dependency graph, the dynamic service dependency graph construction performs the following steps:
11) Extract deployment-time service dependencies. The underlying principle is that two microservices deployed at the same location (for example, a host) or depending on the same resource (for example, a database) have a deployment-time dependency. The configuration information of each service, including deployment location information and associated resource information, is extracted from a configuration management database (CMDB). Define the attribute set of the configuration information to be collected as $K = (k_1, k_2, \ldots, k_n)$, so that every collected configuration item $c_i = (k_i, v_i)$ satisfies $k_i \in K$, where $k_i$ is the attribute (metadata) of the configuration item and $v_i$ is its value. Denote the configuration set of a service $s_i$ as $C_i = (c_1, c_2, \ldots, c_n)$. If the configuration sets of services $s_i$ and $s_j$ intersect, that is, $C_i \cap C_j \neq \varnothing$, then there is a deployment-time dependency between $s_i$ and $s_j$.
12) Extract runtime service dependencies. The underlying principle is that if, as the system load changes, the log volumes or key performance indicators (KPIs) of two services are causally related, then the two services have a runtime dependency, and the strength of the causal relationship reflects the strength of the dependency. The causal inference algorithm used here is the PC algorithm: given a reliable conditional independence test, it can handle various data distributions and causal relationships, and compared with other causal inference methods it offers low complexity and good results. The PC algorithm determines whether causal relationships exist between variables through conditional independence tests and then uses d-separation conditions to orient them. The conditional independence test used here is the G² conditional cross-entropy measure, shown in formula (1), which follows a χ² distribution with D degrees of freedom, where D is given by formula (2).
$$D = (N_X - 1)(N_Y - 1)\prod_{Z' \in Z} N_{Z'} \qquad (2)$$
where Z is the set of the variables Z′ and m is the number of samples.
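As an illustration of the test statistic, the sketch below computes the standard G² statistic for discrete samples together with the degrees of freedom of formula (2). It is not necessarily identical to the patent's formula (1), which is not reproduced in this text. It assumes that the KPI and log series have already been discretized, and that N_X, N_Y, and N_Z′ denote the numbers of distinct values of each variable; the function names are hypothetical.

```python
import math
from collections import Counter


def g_square(xs, ys, zs):
    """Standard G^2 statistic for testing X independent of Y given Z.

    xs, ys: sequences of discrete values, one entry per sample.
    zs: sequence of tuples with the conditioning values per sample
        (empty tuples when there is no conditioning set).
    """
    n_xyz = Counter(zip(xs, ys, zs))
    n_xz = Counter(zip(xs, zs))
    n_yz = Counter(zip(ys, zs))
    n_z = Counter(zs)
    g2 = 0.0
    for (x, y, z), n in n_xyz.items():
        expected_ratio = n * n_z[z] / (n_xz[(x, z)] * n_yz[(y, z)])
        g2 += 2.0 * n * math.log(expected_ratio)
    return g2


def degrees_of_freedom(n_x, n_y, n_z_list):
    """Formula (2): D = (N_X - 1)(N_Y - 1) * product of N_Z' over Z' in Z."""
    return (n_x - 1) * (n_y - 1) * math.prod(n_z_list)


no_cond = [()] * 4
g2_independent = g_square([0, 0, 1, 1], [0, 1, 0, 1], no_cond)
g2_dependent = g_square([0, 0, 1, 1], [0, 0, 1, 1], no_cond)
```

With an empty conditioning set the product in formula (2) is 1, so two binary variables give D = 1.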
121) Determine the runtime dependency weights. Specifically, the KPI used here is a service level objective (SLO) metric, such as service request latency, which is used to assess whether a service is running normally. Denote the KPI time series of service s_i as T_i = {t_1, t_2, ..., t_n} and its log frequency sequence as L_i = {l_1, l_2, ..., l_n}; the time window size is […] and the total number of time windows is n. Initially, assume that a causal relationship exists between every pair of the m services. Using the time series data T, compute the G² value between any s_i and s_j and look up its p-value under the χ² distribution. If the p-value > ξ, the conditional independence hypothesis between s_i and s_j is accepted; otherwise it is rejected. If it is accepted, record the current condition Z as the separation set S(s_i, s_j) of s_i and s_j. In the same way, test the conditional independence of s_i and s_j using the log frequency data L. If both independence hypotheses, the one for T and the one for L, are accepted, s_i and s_j are judged to have no causal relationship; otherwise they are judged to have one. Repeat over all pairs (s_i, s_j) until the causal relationship of every pair has been determined.
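The decision rule of step 121) can be sketched with the standard library alone. The sketch assumes that the G² values and their degrees of freedom for the KPI and log tests are already available, and it uses ξ = 0.05 as a placeholder threshold because the text does not state a value. The function names are hypothetical, and the χ² tail probability is computed through the usual recurrence of the regularized upper incomplete gamma function, which is adequate only for moderate degrees of freedom.

```python
import math


def chi2_sf(x, dof):
    """Upper-tail probability of a chi-square variable with integer dof >= 1."""
    y = x / 2.0
    if dof % 2 == 0:
        a, q = 1.0, math.exp(-y)             # Q(1, y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))  # Q(1/2, y)
    # Recurrence: Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1)
    while a < dof / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q


def causally_related(g2_kpi, dof_kpi, g2_log, dof_log, xi=0.05):
    """Two services are unrelated only if BOTH independence hypotheses are accepted."""
    independent_on_kpi = chi2_sf(g2_kpi, dof_kpi) > xi
    independent_on_log = chi2_sf(g2_log, dof_log) > xi
    return not (independent_on_kpi and independent_on_log)


related_weak = causally_related(0.3, 1, 0.8, 1)      # both p-values are large
related_strong = causally_related(12.0, 1, 0.8, 1)   # KPI test rejects independence
```

In practice a statistics library would normally supply the χ² tail probability; the hand-written version is used here only to keep the sketch self-contained.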
122) Determine the direction of the runtime dependencies. The d-separation conditions are then used to orient the causal relationships. There are four rules:
(1) For any two variables X and Y that are not adjacent (have no causal relationship) and share a common neighbor Z, if Z ∉ S(X, Y), orient X-Z-Y as X→Z←Y.
(2) If X→Y exists, orient every Y-Z as Y→Z.
(3) If X→Z→Y exists, orient every X-Y as X→Y.
(4) If both X-Z_1→Y and X-Z_2→Y exist, orient every X-Y as X→Y.
Rule (1) takes precedence over rules (2), (3), and (4): rules (2), (3), and (4) are applied only after every application of rule (1) has been completed. Rules (2), (3), and (4) may be applied in any order.
If the direction of the dependency between s_i and s_j still cannot be determined after applying these rules, add a bidirectional dependency, that is, add both a dependency from s_i to s_j and a dependency from s_j to s_i.
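The sketch below illustrates rule (1) and the bidirectional fallback only; rules (2) to (4) are left out for brevity. It assumes that the condition of rule (1) is the usual PC-algorithm collider condition, Z ∉ S(X, Y), and that the undirected skeleton and the separation sets from step 121) are available. All function names and node names are hypothetical.

```python
from itertools import combinations


def orient_colliders(nodes, skeleton, sepset):
    """Rule (1): orient X - Z - Y as X -> Z <- Y for non-adjacent X and Y.

    skeleton: set of frozenset({a, b}) undirected edges.
    sepset: dict mapping frozenset({x, y}) to the recorded separation set.
    Returns a set of directed edges (tail, head).
    """
    directed = set()
    for x, y in combinations(sorted(nodes), 2):
        if frozenset((x, y)) in skeleton:
            continue  # rule (1) applies to non-adjacent pairs only
        for z in sorted(nodes):
            if z in (x, y):
                continue
            if frozenset((x, z)) in skeleton and frozenset((y, z)) in skeleton:
                # Assumed condition (standard PC): Z is not in S(X, Y).
                if z not in sepset.get(frozenset((x, y)), set()):
                    directed.add((x, z))
                    directed.add((y, z))
    return directed


def add_bidirectional_fallback(skeleton, directed):
    """Give both directions to edges that no rule has oriented."""
    result = set(directed)
    for edge in skeleton:
        a, b = sorted(edge)
        if (a, b) not in directed and (b, a) not in directed:
            result.add((a, b))
            result.add((b, a))
    return result


nodes = {"A", "B", "C", "D", "E"}
skeleton = {frozenset(("A", "C")), frozenset(("B", "C")), frozenset(("D", "E"))}
sepset = {frozenset(("A", "B")): set()}
oriented = orient_colliders(nodes, skeleton, sepset)
final_edges = add_bidirectional_fallback(skeleton, oriented)
```

In this example C becomes a collider (A→C←B), while the isolated edge D-E cannot be oriented and therefore receives both directions.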
13) Build the dynamic service dependency graph. Each time an abnormal root cause location task is executed, steps 11) and 12) are performed to obtain the deployment-time and runtime service dependencies dynamically. From these dependencies, build the service dependency graph G = <V, E>, where G is a directed graph (DG), V is the set of nodes of G, each representing a service, and E is the set of edges of G, each representing a dependency. The construction rules are as follows. If a runtime dependency exists between services s_i and s_j, add a directed edge between s_i and s_j whose weight is the G² value between s_i and s_j. If a deployment-time dependency exists between s_i and s_j, add directed edges from s_i to s_j and from s_j to s_i, each weighted by the mean of the runtime dependency edge weights.
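The construction rules of step 13) can be sketched as below. The sketch assumes that "the mean of the runtime dependency edge weights" is taken over all runtime edges, and that an edge already added as a runtime dependency keeps its G² weight when the same pair also has a deployment-time dependency; the text settles neither point. The function name, service names, and weights are hypothetical.

```python
def build_dependency_graph(runtime_edges, deployment_pairs):
    """Build a weighted directed graph as a dict {(src, dst): weight}.

    runtime_edges: dict {(src, dst): g2_value} from the PC step.
    deployment_pairs: iterable of (a, b) pairs with a deployment-time dependency.
    """
    graph = dict(runtime_edges)
    if runtime_edges:
        mean_weight = sum(runtime_edges.values()) / len(runtime_edges)
    else:
        mean_weight = 0.0  # assumption: no runtime edges means weight 0
    for a, b in deployment_pairs:
        for edge in ((a, b), (b, a)):
            # Assumption: an existing runtime weight is not overwritten.
            graph.setdefault(edge, mean_weight)
    return graph


runtime_edges = {("gateway", "order"): 12.0, ("order", "db"): 8.0}
graph = build_dependency_graph(runtime_edges, [("order", "cart")])
```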
For the above abnormal root cause location method based on a dynamic service dependency graph, construction of the anomaly propagation graph comprises the following step:
21) Build the anomaly propagation graph. Mark the input anomaly set A = {a_1, a_2, ..., a_n} on the service dependency graph G and extract the service nodes marked as anomalous, together with the edges between them, to form a subgraph G′ of G. The value of each node is the reciprocal of the time at which its anomaly occurred. Here a_i = (s_i, t_i) denotes the i-th anomaly, s_i is the service on which a_i occurred, and t_i is the time at which a_i occurred. To simplify calculation, the occurrence times are normalized: the time of the earliest anomaly is set to 1, and the remaining anomalies are assigned increasing times, starting from 1, in their order of occurrence.
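A minimal sketch of step 21) follows. It assumes that normalized times increase in steps of 1 (1, 2, 3, ...), that only the earliest anomaly of each service is used, and that ties are broken by service name; none of these details are fixed by the text. The function name and the sample data are hypothetical.

```python
def build_propagation_graph(graph, anomalies):
    """Return (node_values, sub_edges) for the anomaly propagation graph G'.

    graph: dict {(src, dst): weight} for the service dependency graph G.
    anomalies: list of (service, timestamp) pairs.
    """
    first_seen = {}
    for service, ts in anomalies:
        if service not in first_seen or ts < first_seen[service]:
            first_seen[service] = ts
    # Normalize: earliest anomaly -> 1, then 2, 3, ... in order of occurrence.
    ordered = sorted(first_seen, key=lambda s: (first_seen[s], s))
    node_values = {s: 1.0 / rank for rank, s in enumerate(ordered, start=1)}
    # Keep only the edges whose two endpoints are both anomalous.
    sub_edges = {
        (src, dst): w
        for (src, dst), w in graph.items()
        if src in node_values and dst in node_values
    }
    return node_values, sub_edges


graph = {("gateway", "order"): 12.0, ("order", "db"): 8.0, ("order", "cart"): 10.0}
anomalies = [("order", 1700000050), ("db", 1700000020), ("gateway", 1700000090)]
node_values, sub_edges = build_propagation_graph(graph, anomalies)
```

The service "cart" has no anomaly, so it and its edge are dropped, and "db", the earliest anomalous service, receives the largest node value.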
For the above abnormal root cause location method based on a dynamic service dependency graph, abnormal root cause location comprises the following steps:
31) Construct the anomaly propagation paths. On the anomaly propagation graph G′, randomly select an anomalous service s′_0 as the starting point and use depth-first search to find the candidate root-cause nodes and their anomaly propagation paths.
311) Find the set of candidate root-cause nodes. Initialize s′_i = s′_0, o_k = ∅, and R = ∅, then perform the following recursive steps:
(1) If s′_i ∈ o_k, return to the calling level.
(2) Add s′_i to o_k.
(3) If s′_i has no adjacent anomalous node, add s′_i to the candidate root-cause node set R.
(4) If s′_i has adjacent anomalous nodes s′_j, then for each s′_j, if […], set s′_i = s′_j and execute step (1).
(5) Return to the calling level.
Steps (1) to (5) are executed recursively until no new anomalous service is added to R.
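The recursion of step 311) can be sketched as a depth-first search. The sketch assumes that "adjacent" means reachable along an outgoing edge of G′, and it ignores the additional per-neighbor condition of step (4), which is not recoverable from this text. The function name and the example graph are hypothetical.

```python
def find_candidate_roots(adjacency, start):
    """Depth-first search for candidate root-cause nodes.

    adjacency: dict mapping each anomalous service to the list of anomalous
    services it points to in G'.
    """
    visited = set()  # plays the role of o_k
    roots = []       # candidate set R, in discovery order

    def visit(node):
        if node in visited:       # step (1)
            return
        visited.add(node)         # step (2)
        neighbours = adjacency.get(node, [])
        if not neighbours:        # step (3): no adjacent anomalous node
            roots.append(node)
        for nxt in neighbours:    # step (4), extra condition omitted
            visit(nxt)
        # step (5): return to the caller

    visit(start)
    return roots


adjacency = {
    "gateway": ["order", "cart"],
    "order": ["db"],
    "cart": ["db", "cache"],
    "db": [],
    "cache": [],
}
roots = find_candidate_roots(adjacency, "gateway")
```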
312) Determine the anomaly propagation paths. For each s′_t ∈ R, denote the set of anomaly propagation paths from anomalous service s′_0 to anomalous service s′_t as Path(s′_0, s′_t) = {o_1, o_2, ..., o_n}, where each o_k = {s′_0, ..., s′_t} is one anomaly propagation path. Initialize s′_i = s′_0, o_k = ∅, and Path(s′_0, s′_t) = ∅, then perform the following recursive steps:
(1) If s′_i ∈ o_k, return to the calling level.
(2) Add s′_i to o_k.
(3) If s′_i == s′_t, add o_k to Path(s′_0, s′_t).
(4) If s′_i has adjacent anomalous service nodes s′_j, then for each s′_j, if […], set s′_i = s′_j and execute step (1).
(5) Return to the calling level.
Steps (1) to (5) are executed recursively until no new anomaly propagation path is added to Path(s′_0, s′_t).
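Step 312) amounts to enumerating the simple paths from s′_0 to s′_t. The sketch below makes two assumptions: the per-neighbor condition of step (4), which is not recoverable from this text, is ignored, and a node is removed from the current path when the recursion returns (backtracking), which the steps do not spell out but which is needed to find more than one path. The function name and the example graph are hypothetical.

```python
def find_paths(adjacency, start, target):
    """Enumerate all simple paths from start to target by depth-first search."""
    paths = []    # Path(s'_0, s'_t)
    current = []  # the path o_k being built

    def visit(node):
        if node in current:              # step (1): avoid revisiting a node
            return
        current.append(node)             # step (2)
        if node == target:               # step (3)
            paths.append(list(current))
        for nxt in adjacency.get(node, []):  # step (4), extra condition omitted
            visit(nxt)
        current.pop()                    # backtrack, then step (5)

    visit(start)
    return paths


adjacency = {
    "gateway": ["order", "cart"],
    "order": ["db"],
    "cart": ["db"],
    "db": [],
}
paths = find_paths(adjacency, "gateway", "db")
```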
32) Compute the root-cause scores. For s′_0, use the anomaly propagation graph G′ and the anomaly propagation paths Path(s′_0, s′_t) to compute the root-cause score of each s′_t ∈ R. The calculation is given by formula (3).
Here score(s′_t) is the root-cause score of anomalous service s′_t, M is the number of anomaly propagation paths from s′_0 to s′_t, N is the number of hops in propagation path o_k, w_{k,i} is the dependency weight between node i and node i-1 on the k-th propagation path, and s_i is the value of node i, that is, the reciprocal of the time at which its anomaly occurred. The formula is designed on two principles: the earlier an anomaly occurs, the more likely it is the root cause; and the more propagation paths an anomaly has, the more likely it is the root cause.
33) Report the root causes and propagation paths. Sort the candidate root causes in descending order of score and report the corresponding anomaly propagation paths.
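The reporting step 33) can be sketched as a sort over already computed scores. The scores below are placeholder numbers, not results of formula (3), and the report layout (a list of dictionaries) is an assumption made for illustration.

```python
def build_report(scores, paths):
    """Rank candidate root causes by score, highest first, with their paths.

    scores: dict {service: root-cause score}.
    paths: dict {service: list of propagation paths ending at that service}.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [
        {"service": service, "score": score, "paths": paths.get(service, [])}
        for service, score in ranked
    ]


# Placeholder scores and paths, for illustration only.
scores = {"cache": 0.4, "db": 1.7}
paths = {
    "db": [["gateway", "order", "db"], ["gateway", "cart", "db"]],
    "cache": [["gateway", "cart", "cache"]],
}
report = build_report(scores, paths)
```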
Fig. 2 is a structural block diagram of the abnormal root cause location system based on a dynamic service dependency graph provided by the present invention.
The present invention provides a system that implements the abnormal root cause location method based on a dynamic service dependency graph. The system takes configuration information, runtime logs, key performance indicators, and an anomaly set as input, and comprises a dynamic service dependency graph construction module, an anomaly propagation graph construction module, and an abnormal root cause location module.
Each module is described in detail below.
S1) Dynamic service dependency graph construction module
The dynamic service dependency graph construction module builds the dynamic service dependency graph from service configuration information, runtime logs, and key performance indicators. It contains four submodules:
S11) Deployment-time service dependency discoverer
The deployment-time service dependency discoverer uses service configuration information to mine the deployment-time dependencies between microservices that are deployed at the same location or depend on the same resource.
S12) Runtime service dependency discoverer
The runtime service dependency discoverer uses service runtime logs and key performance indicators to mine the runtime dependencies between microservices.
S13) Dynamic service dependency graph builder
The dynamic service dependency graph builder dynamically builds the service dependency graph from the deployment-time and runtime dependencies.
S14) Service dependency graph store
The service dependency graph store holds the service dependency graph in matrix form and supports high-performance queries over it.
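A matrix-form store of the kind described for S14) might look like the following sketch. It assumes a dense adjacency matrix in which a weight of 0.0 means "no edge"; the class name, method names, and sample services are hypothetical and are not taken from the patent.

```python
class MatrixGraphStore:
    """Dense adjacency-matrix store for a weighted directed graph."""

    def __init__(self, services):
        self.index = {name: i for i, name in enumerate(services)}
        n = len(services)
        self.matrix = [[0.0] * n for _ in range(n)]

    def add_edge(self, src, dst, weight):
        self.matrix[self.index[src]][self.index[dst]] = weight

    def weight(self, src, dst):
        """Constant-time lookup of the edge weight (0.0 if absent)."""
        return self.matrix[self.index[src]][self.index[dst]]

    def successors(self, src):
        """Services that src has an edge to, in insertion order."""
        row = self.matrix[self.index[src]]
        return [name for name, j in self.index.items() if row[j] != 0.0]


store = MatrixGraphStore(["gateway", "order", "db"])
store.add_edge("gateway", "order", 12.0)
store.add_edge("order", "db", 8.0)
```

A dense matrix gives constant-time weight lookups at the cost of memory quadratic in the number of services; a sparse layout may be preferable when the graph has few edges.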
S2) Anomaly propagation graph construction module
The anomaly propagation graph construction module builds the anomaly propagation graph from the service dependency graph and the input anomaly set. It contains two submodules:
S21) Anomaly propagation graph builder
The anomaly propagation graph builder marks the anomalies from the input anomaly set on the service dependency graph and extracts the resulting subgraph to construct the anomaly propagation graph.
S22) Anomaly propagation graph store
The anomaly propagation graph store holds the anomaly propagation graph in matrix form and supports high-performance queries over it.
S3) Abnormal root cause location module
The abnormal root cause location module uses the anomaly propagation graph to find all candidate root-cause nodes, construct their anomaly propagation paths, compute the root-cause scores, and generate the anomaly root-cause report. It contains three submodules:
S31) Anomaly propagation path builder
The anomaly propagation path builder uses the anomaly propagation graph to find all candidate root-cause nodes and to construct the anomaly propagation paths.
S32) Anomaly root-cause calculator
The anomaly root-cause calculator uses the anomaly propagation paths to compute the root-cause score of each anomaly from the number of propagation paths, the weights of the edges on those paths, and the times at which the anomalies occurred.
S33) Anomaly root-cause reporter
The anomaly root-cause reporter sorts the anomalies in descending order of score and reports the corresponding anomaly propagation paths.
It should be noted that the embodiments are disclosed to aid further understanding of the present invention. Those skilled in the art will appreciate that various substitutions and modifications are possible without departing from the spirit and scope of the present invention and the appended claims. The present invention should therefore not be limited to what the embodiments disclose; the scope of protection is defined by the claims.
Claims (7)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211470197.6A CN115756929B (en) | 2022-11-23 | 2022-11-23 | A Method and System for Abnormal Root Cause Location Based on Dynamic Service Dependency Graph |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115756929A true CN115756929A (en) | 2023-03-07 |
CN115756929B CN115756929B (en) | 2023-06-02 |
Family
ID=85335430
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211470197.6A Active CN115756929B (en) | 2022-11-23 | 2022-11-23 | A Method and System for Abnormal Root Cause Location Based on Dynamic Service Dependency Graph |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115756929B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103606042A (en) * | 2013-11-18 | 2014-02-26 | 南京理工大学 | Service combination instance migration effectiveness judgment method based on dynamic dependency graph |
CN112698975A (en) * | 2020-12-14 | 2021-04-23 | 北京大学 | Fault root cause positioning method and system of micro-service architecture information system |
CN112787841A (en) * | 2019-11-11 | 2021-05-11 | 华为技术有限公司 | Fault root cause positioning method and device and computer storage medium |
US20220019495A1 (en) * | 2020-07-14 | 2022-01-20 | Microsoft Technology Licensing, Llc | Machine learning-based techniques for providing focus to problematic compute resources represented via a dependency graph |
EP3951598A1 (en) * | 2020-08-07 | 2022-02-09 | NEC Laboratories Europe GmbH | Methods and systems for detecting anomalies in cloud services based on mining time-evolving graphs |
CN115118621A (en) * | 2022-06-27 | 2022-09-27 | 浙江大学 | Micro-service performance diagnosis method and system based on dependency graph |
CN115278741A (en) * | 2022-06-15 | 2022-11-01 | 清华大学 | Fault diagnosis method and device based on multi-mode data dependency relationship |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116450399A (en) * | 2023-06-13 | 2023-07-18 | 西华大学 | Fault diagnosis and root cause positioning method for micro service system |
CN116450399B (en) * | 2023-06-13 | 2023-08-22 | 西华大学 | Microservice system fault diagnosis and root cause location method |
CN116820826A (en) * | 2023-08-28 | 2023-09-29 | 北京必示科技有限公司 | Root cause positioning method, device, equipment and storage medium based on call chain |
CN116820826B (en) * | 2023-08-28 | 2023-11-24 | 北京必示科技有限公司 | Root cause positioning method, device, equipment and storage medium based on call chain |
CN117792696A (en) * | 2023-12-07 | 2024-03-29 | 北京邮电大学 | A method and device for log anomaly detection and location for distributed systems |
Also Published As
Publication number | Publication date |
---|---|
CN115756929B (en) | 2023-06-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN115756929B (en) | A Method and System for Abnormal Root Cause Location Based on Dynamic Service Dependency Graph | |
CN110888755A (en) | A method and device for finding abnormal root cause nodes in a microservice system | |
US8214372B2 (en) | Determining configuration parameter dependencies via analysis of configuration data from multi-tiered enterprise applications | |
US6697802B2 (en) | Systems and methods for pairwise analysis of event data | |
CN116450399B (en) | Microservice system fault diagnosis and root cause location method | |
CN106716352A (en) | Managing parameter sets | |
CN112528519A (en) | Method, system, readable medium and electronic device for engine quality early warning service | |
CN105117771B (en) | A kind of agricultural machinery fault recognition method based on correlation rule directed acyclic graph | |
US8954311B2 (en) | Arrangements for extending configuration management in large IT environments to track changes proactively | |
CN113326187B (en) | Data-driven memory leakage intelligent detection method and system | |
CN115237717A (en) | Micro-service abnormity detection method and system | |
CN110825817B (en) | Enterprise suspected association judgment method and system | |
CN105260742A (en) | Unified classification method for multiple types of data and system | |
CN116737436A (en) | Microservice system root cause location method and system for hybrid deployment scenarios | |
CN118210772B (en) | Log management method, device, electronic device and storage medium | |
CN114385397A (en) | Micro-service fault root cause positioning method based on fault propagation diagram | |
CN103455593A (en) | Service competitiveness realization system and method based on social contact network | |
CN107239498A (en) | A kind of method for excavating overlapping community's dynamic evolution correlation rule | |
Liu et al. | Social group query based on multi-fuzzy-constrained strong simulation | |
Abul-Basher | Multiple-query optimization of regular path queries | |
Natarajan et al. | A scalable and generic framework to mine top-k representative subgraph patterns | |
EP4339845A1 (en) | Method, apparatus and electronic device for detecting data anomalies, and readable storage medium | |
Javidian et al. | Learning LWF chain graphs: An order independent algorithm | |
Smetsers et al. | Bigger is not always better: on the quality of hypotheses in active automata learning | |
CN117150507A (en) | Vulnerability positioning system and method based on deep learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||