CN115756929B - Abnormal root cause positioning method and system based on dynamic service dependency graph - Google Patents

Publication number
CN115756929B
Authority
CN
China
Prior art keywords
service
abnormal
root cause
anomaly
graph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211470197.6A
Other languages
Chinese (zh)
Other versions
CN115756929A (en
Inventor
张齐勋
刘洪毅
杨勇
贾统
李影
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University filed Critical Peking University
Priority to CN202211470197.6A priority Critical patent/CN115756929B/en
Publication of CN115756929A publication Critical patent/CN115756929A/en
Application granted granted Critical
Publication of CN115756929B publication Critical patent/CN115756929B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The invention provides an abnormal root cause positioning method and system based on a dynamic service dependency graph, and belongs to the field of intelligent operation and maintenance. To locate the root cause of an anomaly, the method extracts deployment-time service dependencies from service configuration information, discovers runtime service dependencies from the correlation of key performance indicators between services and from log frequency information, and dynamically constructs a service dependency graph. It then automatically constructs an anomaly propagation graph from the input abnormal services and the dynamic service dependency graph, constructs anomaly propagation paths by depth-first search, calculates a root cause score for each anomaly, locates the root cause service, and reports the anomaly propagation paths. The invention better captures runtime changes in service dependencies, which helps describe anomaly propagation more accurately and improves the ability to locate the root cause. The method applies to root cause positioning for many types of anomalies and therefore generalizes well; it directly locates the abnormal service and provides an anomaly propagation path, which makes it practical and interpretable. No manual participation is required, which saves labor cost.

Description

Abnormal root cause positioning method and system based on dynamic service dependency graph
Technical Field
The invention belongs to the field of intelligent operation and maintenance, and particularly relates to an abnormal root cause positioning method based on a dynamic service dependency graph.
Background
The system of the micro-service architecture (called as micro-service system for short) consists of a large number of micro-services with fine granularity, and the micro-services work cooperatively to realize the system function, so that the system is widely applied to the fields of communication, traffic, logistics, finance and the like. Modern internet applications often use micro-service systems as the underlying implementation. The core design idea of the micro-service architecture is that the application is split into a plurality of micro-services which are high in cohesion, low in coupling and single in function and can be independently developed, deployed and updated, so that the agile development and continuous delivery of the application are realized. The micro-service architecture makes the development and iteration of the application more convenient, but also brings new challenges to the operation and maintenance of the system, namely, a complex micro-service system makes the application easy to be abnormal and difficult to diagnose the cause of the abnormality.
The complexity of a micro-service system shows in three ways: the number of micro-services is huge, the dependencies among micro-services change dynamically, and the micro-services are implemented heterogeneously. Because each micro-service has a single function, a micro-service system may need a huge number of micro-services to deliver the functions of the system. For example, Netflix's micro-service system runs 6,000 or more micro-service instances and must process more than 2 billion service requests per day; Uber's micro-service system also runs 4,000 micro-service instances. Meanwhile, to cope with changing user demands and the evolution of the application, a micro-service system dynamically deploys and registers micro-service instances and continuously adds new micro-services, so the dependencies between micro-services change dynamically and become complicated. In addition, the service interfaces of a micro-service system are encapsulated, which allows different micro-services in the system to adopt different programming languages, service frameworks, and communication mechanisms for different application scenarios. This heterogeneous design and implementation further increases the complexity of the micro-service system.
The complexity of the microservice system results in its susceptibility to anomalies and difficulty in diagnosing the root cause of the anomalies. An anomaly is a shift in the operating state of the system compared to the design expectations of the system. The loose coupling design and heterogeneous realization of the micro services enable the micro services to complete cooperation in a network remote call mode, and anomalies caused by the problems of network instability, version incompatibility, configuration errors, code defects and the like are easy to generate. The number of micro services is large, and the possibility of the occurrence of the abnormality is further enhanced. Moreover, because of the complex and dynamic dependency between services, any minor systematic anomalies may cause a chain reaction, resulting in multiple anomalies occurring simultaneously, with only one or a few anomalies being the root cause of the other anomalies. This is because the anomaly is propagated along with the dependency relationship between services to form an anomaly propagation chain, and there is a causal relationship between the micro-service anomalies generated by the dependency relationship, and the source of the anomaly propagation chain is often the cause of the anomaly. In order to ensure the reliability of the system, when detecting that the system is running abnormally, operation and maintenance personnel need to locate the root cause of the abnormality at the first time and make corresponding processing, such as rollback of change and the like. Otherwise, the anomaly may cause a service interruption or a degradation of service quality, with serious loss. International data corporation reported that an average loss of about $10 per million was caused by an hour of service outage. 
However, although the micro-service system can provide rich operation and maintenance data, such as logs, key performance indexes and call chains, differentiated data formats, sparse valuable information and dynamically-changing service dependency relationships make it difficult for operation and maintenance personnel to locate the root cause of the abnormality in the complex micro-service system, and the abnormality cannot be repaired in time when the abnormality is serious, and more serious economic damage and reputation damage are brought to enterprises.
The existing methods for locating abnormal root causes in micro-service systems still have shortcomings: they often ignore the dynamic changes of service dependencies and the interpretability of root cause positioning. Most existing methods rely on a service dependency graph. The service dependency graph describes the dependencies between services; because these dependencies also describe how anomalies propagate between services, the graph helps operation and maintenance personnel locate the abnormal root cause. Abnormal root cause positioning based on a service dependency graph first constructs the graph from metrics, system logs, or tracing data. When an anomaly occurs, it starts from the abnormal nodes in the graph, obtains a set of candidate root causes through algorithms such as graph search and random walk, and then ranks the candidates by anomaly score, correlation with the abnormal nodes, number of visits, and so on. However, existing methods often depend on a static service dependency graph and ignore the dynamic changes of service dependencies, so the constructed dependencies differ from the real ones, which limits the accuracy of these methods. In addition, existing methods report only the root cause service or root cause metric and lack interpretability, so operation and maintenance personnel find it hard to understand the reported results and to judge quickly whether the root cause is accurate.
Disclosure of Invention
In order to reduce the additional analysis and rollback caused by inaccurate positioning of an abnormal root cause, reduce operation and maintenance cost, improve the interpretability of anomaly positioning, and ensure the reliability of the micro-service system, the invention provides an abnormal root cause positioning method based on a dynamic service dependency graph. The method extracts deployment-time service dependencies from service configuration information, discovers runtime service dependencies from the correlation of key performance indicators between services and from log frequency information, and dynamically constructs a service dependency graph; it automatically constructs an anomaly propagation graph from the input abnormal services and the service dependency graph; and it constructs anomaly propagation paths by depth-first search, calculates the root cause score of each anomaly, locates the root cause service, and reports the anomaly propagation paths.
The dynamic service dependency graph construction, the abnormal propagation graph construction and the abnormal root cause positioning are all automatically carried out, manual participation is not needed, and labor cost is saved. The abnormal root cause positioning method can better capture the change of the service dependency relationship during operation based on the dynamic service dependency graph, is beneficial to describing more accurate abnormal propagation relationship, and improves the capacity of positioning the abnormal root cause; the method is suitable for root cause positioning of various types of anomalies and has good universality; the method can directly locate the abnormal service and provide an abnormal propagation path, and has good practicability and interpretability.
The technical scheme provided by the invention is as follows:
the abnormal root cause positioning method based on the dynamic service dependency graph is characterized by comprising the steps of dynamic service dependency graph construction, abnormal propagation graph construction and abnormal root cause positioning; the method comprises the following specific steps:
1) The dynamic service dependency graph construction specifically comprises the following steps:
11) Extracting deployment-time service dependencies: the principle here is that a deployment-time dependency exists between two micro-services that are deployed at the same location (e.g., host) or that depend on the same resource (e.g., database). Configuration information for each service, including deployment location information and associated resource information, is extracted from a configuration management database (Configuration Management Database, CMDB). Define the attribute set of the configuration information to be collected as $K = (k_1, k_2, \ldots, k_n)$, so that every collected configuration item $c_i = (k_i, v_i)$ must satisfy $k_i \in K$, where $k_i$ is the attribute (metadata) of the configuration item and $v_i$ is its value. Denote the configuration set of a service $s_i$ as $C_i = (c_1, c_2, \ldots, c_n)$. If the configuration sets of services $s_i$ and $s_j$ intersect, i.e. $C_i \cap C_j \neq \varnothing$, then there is a deployment-time dependency between $s_i$ and $s_j$.
12) Extracting runtime service dependencies: the principle here is that if there is a causal relationship between the log counts or the key performance indicators (Key Performance Indicator, KPI) that two services generate as the system load changes, then there is a runtime dependency between the two services, and the strength of the causal relationship represents the strength of the runtime dependency. The causal inference algorithm used here is the PC algorithm: given a reliable conditional independence test, the PC algorithm can handle various types of data distributions and causal relationships, and compared with other causal inference methods it has low complexity and good results. The PC algorithm first judges whether a causal relationship exists between variables through conditional independence tests, and then uses the d-separation conditions to determine the direction of each causal relationship. The conditional independence test used here is $G^2$, a conditional cross-entropy measure, shown in formula (1); it follows a $\chi^2$ distribution with $D$ degrees of freedom, where $D$ is given by formula (2).
[Formula (1): given as image BDA0003958229260000031 in the original; not reproduced in this text]
$$D = (N_X - 1)(N_Y - 1)\prod_{Z' \in Z} N_{Z'} \qquad (2)$$
where
[Definition given as image BDA0003958229260000032 in the original; not reproduced in this text]
$Z$ is the set of the variables $Z'$, and $m$ is the number of samples.
13) Building the dynamic service dependency graph: each time an abnormal root cause positioning task is performed, steps 11) and 12) are executed to dynamically acquire the deployment-time and runtime service dependencies, and a service dependency graph $G = \langle V, E \rangle$ is constructed from them, where $G$ is a directed graph (DG), $V$ is the set of services (the nodes of $G$), and $E$ is the set of dependencies (the edges of $G$). The construction rules are as follows. If there is a runtime dependency between services $s_i$ and $s_j$, a directed edge between $s_i$ and $s_j$ is added, with the $G^2$ value between $s_i$ and $s_j$ as its weight. If there is a deployment-time dependency between services $s_i$ and $s_j$, edges from $s_i$ to $s_j$ and from $s_j$ to $s_i$ are added, each weighted by the average of the runtime dependency edge weights.
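Steps 11) and 13) can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function names, the dictionary layout, and the sample services are hypothetical. Two points are assumptions, because the text does not settle them: the "average of the runtime dependency edge weights" is taken as the mean over all runtime edges, and a runtime edge that already exists keeps its own $G^2$ weight.

```python
from itertools import combinations
from statistics import mean


def deployment_pairs(configs):
    """configs maps a service name to its set of (attribute, value) items.

    Two services have a deployment-time dependency when their
    configuration sets intersect (step 11).
    """
    return [(a, b) for a, b in combinations(sorted(configs), 2)
            if configs[a] & configs[b]]


def build_dependency_graph(configs, runtime_edges):
    """runtime_edges maps (source, target) to a G^2 weight (step 12 output).

    Returns a dict (source, target) -> weight describing the directed graph.
    """
    graph = dict(runtime_edges)
    # Assumption: the deployment-time edge weight is the mean of all runtime weights.
    avg = mean(list(runtime_edges.values())) if runtime_edges else 0.0
    for a, b in deployment_pairs(configs):
        # Assumption: an existing runtime edge keeps its own G^2 weight.
        graph.setdefault((a, b), avg)
        graph.setdefault((b, a), avg)
    return graph


# Hypothetical sample data: cart and order share a database.
configs = {
    "cart": {("host", "node-1"), ("db", "mysql-a")},
    "order": {("host", "node-2"), ("db", "mysql-a")},
    "search": {("host", "node-3")},
}
runtime_edges = {("cart", "order"): 12.0, ("order", "search"): 8.0}
graph = build_dependency_graph(configs, runtime_edges)
```

In this sample only `cart` and `order` share a configuration item, so the sketch adds the reverse edge `order -> cart` with the mean runtime weight and leaves the two runtime edges unchanged.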
2) Constructing an abnormal propagation diagram, specifically executing the following steps:
21) Constructing the anomaly propagation graph: the input anomaly set $A = \{a_1, a_2, \ldots, a_n\}$ is marked on the service dependency graph $G$, and the service nodes marked as abnormal are extracted to form a subgraph $G'$ of $G$; the edges between these nodes are retained, and the value of each node is the reciprocal of the occurrence time of its anomaly. Here $a_i = (s_i, t_i)$ denotes the $i$-th anomaly, $s_i$ is the service on which anomaly $a_i$ occurs, and $t_i$ is the time at which $a_i$ occurs. For ease of calculation, the occurrence times are normalized: the time of the first anomaly is set to 1, and each later anomaly is incremented by 1 in order of occurrence.
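A minimal sketch of step 21), assuming the dependency graph is stored as a dictionary from (source, target) to weight and that each service appears at most once in the anomaly set. The function name and the sample data are hypothetical.

```python
def build_propagation_graph(graph, anomalies):
    """graph: (source, target) -> weight; anomalies: list of (service, time).

    Returns (node_value, edges) for the subgraph G' induced by the
    abnormal services.
    """
    # Normalize occurrence times: the earliest anomaly gets 1, the next 2, ...
    ordered = sorted(anomalies, key=lambda a: a[1])
    node_value = {}
    for rank, (service, _) in enumerate(ordered, start=1):
        # Node value is the reciprocal of the normalized occurrence time.
        node_value.setdefault(service, 1.0 / rank)
    # Keep only edges whose two endpoints are both abnormal.
    edges = {e: w for e, w in graph.items()
             if e[0] in node_value and e[1] in node_value}
    return node_value, edges


# Hypothetical sample data.
graph = {("cart", "order"): 12.0, ("order", "cart"): 10.0,
         ("order", "search"): 8.0}
anomalies = [("order", 1700000060), ("cart", 1700000000)]
node_value, edges = build_propagation_graph(graph, anomalies)
```

Here `cart` failed first, so it receives the value 1.0 and `order` receives 0.5; the edge to the healthy `search` service is dropped.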
3) The abnormal root cause positioning specifically comprises the following steps:
31) Constructing anomaly propagation paths: on the anomaly propagation graph $G'$, starting from a randomly selected abnormal service $s'_0$, candidate root cause nodes and their anomaly propagation paths are searched using depth-first search.
32) Calculating the anomaly root cause score: for $s'_0$, the anomaly propagation graph $G'$ and the anomaly propagation paths $\mathrm{Path}(s'_0, s'_t)$ are used to calculate the root cause score of each $s'_t \in R$, where $R$ is the candidate root cause node set. The calculation is given by formula (3).
[Formula (3): given as image BDA0003958229260000041 in the original; not reproduced in this text]
Here $\mathrm{score}(s'_t)$ is the root cause score of the abnormal service $s'_t$, $m$ is the number of anomaly propagation paths from $s'_0$ to $s'_t$, $n$ is the number of hops in the $k$-th anomaly propagation path $o_k$, $w_{k,i}$ is the dependency weight between node $i$ and node $i-1$ on the $k$-th propagation path, and $s_i$ is the value of node $i$, i.e., the reciprocal of the occurrence time of its anomaly. The scoring formula is designed on the principle that the earlier an anomaly occurs, the more likely it is a root cause, and the more propagation paths lead to an anomaly, the more likely it is a root cause.
33 Reporting the cause of the anomaly and propagation path: the anomaly root causes are arranged in descending order according to the scores, and the anomaly propagation paths corresponding to the anomaly root causes are reported.
The invention further provides an abnormal root cause positioning system based on the dynamic service dependency graph, which is characterized by comprising a dynamic service dependency graph construction module, an abnormal propagation graph construction module and an abnormal root cause positioning module.
The dynamic service dependency graph construction module comprises a deployment-time service dependency relationship finder, a runtime service dependency relationship finder, a dynamic service dependency graph constructor and a service dependency graph memory; the deployment-time service dependency relationship finder discovers deployment-time dependencies between micro-services by using the service configuration information; the runtime service dependency relationship finder discovers runtime dependencies between micro-services by using service running logs and key performance indicators; the dynamic service dependency graph constructor dynamically constructs a service dependency graph by using the deployment-time and runtime service dependencies; the service dependency graph memory is used for storing the service dependency graph;
the abnormal propagation diagram construction module comprises an abnormal propagation diagram constructor and an abnormal propagation diagram memory. An exception propagation graph constructor constructs an exception propagation graph by using the service dependency graph and the input exception set; the exception propagation map memory is used for storing the exception propagation map;
the abnormal root cause positioning module comprises an abnormal propagation path constructor, an abnormal root cause calculator and an abnormal root cause reporter. Using an abnormal propagation graph, an abnormal propagation path constructor searches all possible abnormal root cause nodes and constructs an abnormal propagation path; an anomaly root cause calculator uses an anomaly propagation path to calculate the root cause score of the anomaly according to the number of paths, the weight of the upper edge of the path and the moment when the anomaly occurs; the anomaly root reporter sorts anomalies in descending order of score and reports the corresponding anomaly propagation paths.
Compared with the prior art, the invention has the beneficial effects that:
the invention provides an abnormal root cause positioning method and system based on a dynamic service dependency graph, which dynamically construct the service dependency graph by reading service configuration information, service running logs and key performance indexes. And storing the service dependency graph for subsequent generation of an exception propagation graph. When the system is abnormal, an abnormal propagation diagram is constructed according to the service dependency diagram and the abnormal set. Then, traversing all possible root cause anomalies and propagation paths thereof by utilizing depth-first search, calculating root cause scores of anomalies by utilizing the number of the anomaly propagation paths, the weight of the upper edge of the paths and the occurrence time of the anomalies, sorting according to score descending order, and reporting the anomaly propagation paths corresponding to the anomaly. The invention can automatically construct a dynamic service dependency graph, generate an abnormal propagation graph, construct an abnormal propagation path, calculate root cause scores, locate the abnormal root cause and report the corresponding propagation path. The invention has the following characteristics:
the system and method provided by the invention automatically builds a dynamic service dependency graph based on service configuration information and runtime information (key performance indicators and logs).
The system and the method provided by the invention can automatically construct the exception propagation graph by using the input exception and service dependency graph.
The system and the method provided by the invention can provide an abnormal propagation path while locating the abnormal root cause service, and have good practicability and interpretability.
By utilizing the technical scheme of the invention, the dynamic service dependency graph can be automatically constructed, the abnormal propagation graph can be constructed, the abnormal root cause can be positioned, the abnormal propagation path can be provided, and the method is suitable for a micro service system.
Drawings
FIG. 1 is a dynamic service dependency graph-based anomaly root cause location method provided by the invention;
FIG. 2 is a dynamic service dependency graph based anomaly root cause location system provided by the present invention.
Detailed Description
The invention is further described by way of examples in the following with reference to the accompanying drawings, but in no way limit the scope of the invention.
FIG. 1 is a flow chart diagram of an anomaly root cause positioning method based on a dynamic service dependency graph. The method comprises the steps of dynamic service dependency graph construction, abnormal propagation graph construction and abnormal root cause positioning;
the dynamic service dependency graph is constructed by using service configuration information, service running logs and key performance indexes, mining the deployment dependency relationship and the running dependency relationship among services, and dynamically constructing the service dependency graph. The configuration information is used to describe the location of the service and the resources and services it depends on. Each configuration item exists in the form of a two-tuple comprising a configuration attribute and a configuration value. The operation log is used for recording the operation condition of the system, including the output of key variables, marks of key operation positions and the like. The running log exists in the form of time series text, which is converted into the time series form of log frequency in the present invention. The key performance index is used for monitoring the running state of the system and is used for monitoring whether the system is abnormal or not. The key performance indicators exist in time series. The service dependency graph is a directed graph, which characterizes the dependency relationship among various services in the micro-service system, the nodes of the graph are services, and the edges are the dependencies among the services. The service dependency graph is dynamically constructed along with the accumulation of system operation data, and has adaptability to system iteration.
The anomaly propagation graph construction uses the input anomaly set and the service dependency graph to construct an anomaly propagation graph for subsequent propagation path analysis. Each exception item exists in the form of a two-tuple, including the service at which the exception occurred and the time at which the exception occurred. The exception propagation graph is also in the form of a directed graph, which is a sub-graph of the service dependency graph.
Abnormal root cause positioning uses the anomaly propagation graph to find all possible anomaly root cause nodes, constructs the anomaly propagation paths corresponding to them, calculates the root cause score of each node, and generates an anomaly root cause report sorted in descending order of score. The report includes the score of each root cause service and its corresponding anomaly propagation path.
Aiming at the abnormal root cause positioning method based on the dynamic service dependency graph, the construction of the dynamic service dependency graph specifically comprises the following steps:
11) Extracting deployment-time service dependencies: the principle here is that a deployment-time dependency exists between two micro-services that are deployed at the same location (e.g., host) or that depend on the same resource (e.g., database). Configuration information for each service, including deployment location information and associated resource information, is extracted from a configuration management database (Configuration Management Database, CMDB). Define the attribute set of the configuration information to be collected as $K = (k_1, k_2, \ldots, k_n)$, so that every collected configuration item $c_i = (k_i, v_i)$ must satisfy $k_i \in K$, where $k_i$ is the attribute (metadata) of the configuration item and $v_i$ is its value. Denote the configuration set of a service $s_i$ as $C_i = (c_1, c_2, \ldots, c_n)$. If the configuration sets of services $s_i$ and $s_j$ intersect, i.e. $C_i \cap C_j \neq \varnothing$, then there is a deployment-time dependency between $s_i$ and $s_j$.
12) Extracting runtime service dependencies: the principle here is that if there is a causal relationship between the log counts or the key performance indicators (Key Performance Indicator, KPI) that two services generate in response to a change in system load, then there is a runtime dependency between the two services, and the strength of the causal relationship represents the strength of the runtime dependency. The causal inference algorithm used here is the PC algorithm: given a reliable conditional independence test, the PC algorithm can handle various types of data distributions and causal relationships, and compared with other causal inference methods it has low complexity and good results. The PC algorithm first judges whether a causal relationship exists between variables through conditional independence tests, and then uses the d-separation conditions to determine the direction of each causal relationship. The conditional independence test used here is $G^2$, a conditional cross-entropy measure, shown in formula (1); it follows a $\chi^2$ distribution with $D$ degrees of freedom, where $D$ is given by formula (2).
[Formula (1): given as image BDA0003958229260000061 in the original; not reproduced in this text]
$$D = (N_X - 1)(N_Y - 1)\prod_{Z' \in Z} N_{Z'} \qquad (2)$$
where
[Definition given as image BDA0003958229260000062 in the original; not reproduced in this text]
$Z$ is the set of the variables $Z'$, and $m$ is the number of samples.
121) Determining runtime service dependency weights: specifically, the KPIs used here are service level objective (Service Level Objective, SLO) indicators, such as service request latency, which evaluate whether a service is functioning properly. Denote the KPI time series of a service $s_i$ as $T_i = \{t_1, t_2, \ldots, t_n\}$ and its log frequency sequence as $L_i = \{l_1, l_2, \ldots, l_n\}$; the time window size is
[Time window size: given as image BDA0003958229260000072 in the original; not reproduced in this text]
and the total number of time windows is $n$. Initially, a causal relationship is assumed between any two of the $m$ services. Using the time series data $T$, the $G^2$ value between any $s_i$ and $s_j$ is calculated and its p-value under the $\chi^2$ distribution is looked up; if the p-value $> \zeta$, the conditional independence hypothesis between $s_i$ and $s_j$ is accepted, otherwise it is rejected. If the hypothesis is accepted, the conditioning set $Z$ at that point is recorded as the separation set $S(s_i, s_j)$ of $s_i$ and $s_j$. The same conditional independence test is then performed on $s_i$ and $s_j$ using the log frequency sequence data $L$. If the conditional independence hypothesis is accepted in both tests, on $T$ and on $L$, it is determined that there is no causal relationship between $s_i$ and $s_j$; otherwise, it is determined that there is a causal relationship between them. All pairs $s_i$ and $s_j$ are traversed to obtain the causal relationships between all service pairs.
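The test statistic used in step 121) can be illustrated with a short sketch. Formula (1) is not reproduced in this text, so the code below uses the standard count-based form of the $G^2$ statistic, $G^2 = 2\sum N_{xyz}\ln\frac{N_{xyz}N_z}{N_{xz}N_{yz}}$, which may differ in notation from the patent's formula; the degrees of freedom follow formula (2). The function names are hypothetical, and the inputs are assumed to be already discretized (KPI and log frequency series would first need binning, which is not shown).

```python
from collections import Counter
from math import log


def g2_statistic(x, y, z):
    """Standard G^2 statistic for X independent of Y given Z.

    x, y: equal-length sequences of discrete values.
    z: sequence of tuples holding the conditioning values per sample
       (use an empty tuple per sample for an unconditional test).
    """
    n_xyz = Counter(zip(x, y, z))
    n_xz = Counter(zip(x, z))
    n_yz = Counter(zip(y, z))
    n_z = Counter(z)
    return 2.0 * sum(
        n * log(n * n_z[c] / (n_xz[(a, c)] * n_yz[(b, c)]))
        for (a, b, c), n in n_xyz.items()
    )


def degrees_of_freedom(x, y, z_columns):
    """Formula (2): D = (N_X - 1)(N_Y - 1) * product of N_Z' over Z' in Z."""
    d = (len(set(x)) - 1) * (len(set(y)) - 1)
    for column in z_columns:
        d *= len(set(column))
    return d


# Hypothetical discretized samples with an empty conditioning set.
no_condition = [()] * 4
dependent = g2_statistic([0, 0, 1, 1], [0, 0, 1, 1], no_condition)
independent = g2_statistic([0, 0, 1, 1], [0, 1, 0, 1], no_condition)
dof = degrees_of_freedom([0, 0, 1, 1], [0, 0, 1, 1], [])
```

The p-value lookup is left out to keep the sketch free of third-party dependencies; with SciPy available, `scipy.stats.chi2.sf(g2, dof)` would give the upper-tail probability that is compared against the threshold $\zeta$.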
122) Determining the runtime service dependency direction: the d-separation conditions are then used to determine the direction of each causal relationship. There are four rules in total:
(1) For any two variables X and Y that are non-adjacent (have no causal relationship) and share a common neighbor variable Z, if
[Condition given as image BDA0003958229260000071 in the original; not reproduced in this text]
then X-Z-Y is given the direction X→Z←Y.
(2) If X→Y is present, all Y-Z are given the direction Y→Z.
(3) If X→Z→Y is present, all X-Y are given the direction X→Y.
(4) If X-$Z_1$→Y and X-$Z_2$→Y are both present, all X-Y are given the direction X→Y.
Rule (1) takes precedence over rules (2), (3), and (4): rules (2), (3), and (4) are executed only after every application of rule (1) has been completed. Rules (2), (3), and (4) may be executed in any order.
If the direction of the dependency between $s_i$ and $s_j$ still cannot be determined after the rules have been executed, a bidirectional dependency is added, i.e., a dependency from $s_i$ to $s_j$ and a dependency from $s_j$ to $s_i$ are both added.
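Rule (1) can be sketched as follows. The condition of rule (1) is given only as an image in the original, so this sketch assumes the usual PC-algorithm collider condition, namely that Z is not in the recorded separation set of X and Y. The function name and the data layout are hypothetical.

```python
from itertools import combinations


def orient_colliders(adjacent, sepset):
    """Apply the collider rule to an undirected skeleton.

    adjacent: node -> set of neighbouring nodes.
    sepset: frozenset({x, y}) -> set of nodes that separated x and y.
    Returns the set of directed edges (tail, head) produced by the rule.
    """
    directed = set()
    for z, neighbours in adjacent.items():
        for x, y in combinations(sorted(neighbours), 2):
            if y in adjacent[x]:
                continue  # X and Y are adjacent, so the rule does not apply
            # Assumed condition: Z is not in the separation set of X and Y.
            if z not in sepset.get(frozenset((x, y)), set()):
                directed.add((x, z))
                directed.add((y, z))
    return directed


# Hypothetical skeleton a - c - b, where a and b were separated by the empty set.
adjacent = {"a": {"c"}, "b": {"c"}, "c": {"a", "b"}}
colliders = orient_colliders(adjacent, {frozenset(("a", "b")): set()})
none_found = orient_colliders(adjacent, {frozenset(("a", "b")): {"c"}})
```

Rules (2) to (4) would then be applied repeatedly to the remaining undirected edges, and any edge still undirected afterwards becomes a bidirectional dependency.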
13) Building the dynamic service dependency graph: each time an abnormal root cause positioning task is performed, steps 11) and 12) are executed to dynamically acquire the deployment-time and runtime service dependencies, and a service dependency graph $G = \langle V, E \rangle$ is constructed from them, where $G$ is a directed graph (DG), $V$ is the set of services (the nodes of $G$), and $E$ is the set of dependencies (the edges of $G$). The construction rules are as follows. If there is a runtime dependency between services $s_i$ and $s_j$, a directed edge between $s_i$ and $s_j$ is added, with the $G^2$ value between $s_i$ and $s_j$ as its weight. If there is a deployment-time dependency between services $s_i$ and $s_j$, edges from $s_i$ to $s_j$ and from $s_j$ to $s_i$ are added, each weighted by the average of the runtime dependency edge weights.
In the abnormal root cause positioning method based on the dynamic service dependency graph, the construction of the anomaly propagation graph specifically comprises the following step:
21) Constructing the anomaly propagation graph: the input anomaly set A = {a_1, a_2, ..., a_n} is marked on the service dependency graph G, and the service nodes marked as abnormal are extracted to form a subgraph G′ of G. The edges among these nodes are retained, and the value of each node is the reciprocal of the occurrence time of its anomaly. Here a_i = (s_i, t_i) represents the i-th anomaly, s_i is the service on which anomaly a_i occurred, and t_i is the time at which anomaly a_i occurred. For ease of calculation, the anomaly occurrence times are normalized: the time of the first anomaly is set to 1, and the remaining anomalies are incremented from 1 in order of occurrence.
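The subgraph extraction and time normalization can be sketched as below. The sketch assumes at most one anomaly per service and gives anomalies with identical timestamps the same rank; neither case is addressed in the text. The function name and data layout are illustrative.

```python
def build_propagation_graph(graph, anomalies):
    """Return (node_value, edges) for the anomaly propagation graph G'.

    graph:     dict {(s_i, s_j): weight} for the service dependency graph G.
    anomalies: list of (service, timestamp) tuples, the anomaly set A.
    """
    # Normalize times: earliest anomaly -> 1, next -> 2, and so on.
    ordered_times = sorted({t for _, t in anomalies})
    rank = {t: position + 1 for position, t in enumerate(ordered_times)}
    # Node value is the reciprocal of the normalized occurrence time.
    node_value = {service: 1.0 / rank[t] for service, t in anomalies}
    # Keep only edges whose endpoints are both abnormal services.
    edges = {(u, v): w for (u, v), w in graph.items()
             if u in node_value and v in node_value}
    return node_value, edges
```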
In the abnormal root cause positioning method based on the dynamic service dependency graph, the abnormal root cause positioning specifically comprises the following steps:
31) Constructing the anomaly propagation paths: on the anomaly propagation graph G′, starting from a randomly selected abnormal service s′_0, the candidate root cause nodes and their anomaly propagation paths are found using depth-first search.
311) Searching for the candidate root cause node set: in the candidate root cause search stage, initialize s′_i = s′_0, o_k = ∅, R = ∅, and then perform the following recursive steps:
(1) If s′_i ∈ o_k, return to the upper-level call.
(2) Add s′_i to o_k.
(3) If s′_i has no adjacent abnormal node, add s′_i to the candidate root cause node set R.
(4) If s′_i has adjacent abnormal nodes s′_j, then for each s′_j, if
Figure BDA0003958229260000081
let s′_i = s′_j and go to step (1).
(5) Return to the upper-level call.
Steps (1) to (5) are executed recursively until no new abnormal service is added to R.
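The recursion of steps (1) to (5) can be sketched as a depth-first search. Because the condition in the formula image of step (4) is not reproduced in the text, the sketch exposes it as a hypothetical `follow(i, j)` predicate that defaults to always true. It also assumes that "adjacent" means the successors of a node along directed edges of G′. Both points are assumptions, as are the names used.

```python
def find_candidate_roots(successors, start, follow=lambda i, j: True):
    """Depth-first search for the candidate root cause set R.

    successors: dict {node: list of adjacent abnormal nodes} in G'.
    start:      the randomly chosen abnormal service s'_0.
    follow:     hypothetical stand-in for the step (4) condition.
    """
    visited = []  # plays the role of o_k
    roots = []    # candidate root cause set R

    def visit(node):
        if node in visited:                   # step (1)
            return
        visited.append(node)                  # step (2)
        neighbours = successors.get(node, [])
        if not neighbours:                    # step (3)
            roots.append(node)
        for nxt in neighbours:                # step (4)
            if follow(node, nxt):
                visit(nxt)
        # step (5): fall through and return to the caller

    visit(start)
    return roots
```

The visited list also guards against cycles introduced by bidirectional dependency edges.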
312) Determining the anomaly propagation paths: in the propagation path search stage, for each s′_t ∈ R, denote the set of anomaly propagation paths from abnormal service s′_0 to abnormal service s′_t as Path(s′_0, s′_t) = {o_1, o_2, ..., o_n}, where o_k is an anomaly propagation path, o_k = {s′_0, ..., s′_t}. Initialize s′_i = s′_0, o_k = ∅, Path(s′_0, s′_t) = ∅, and then perform the following recursive steps:
(1) If s′_i ∈ o_k, return to the upper-level call.
(2) Add s′_i to o_k.
(3) If s′_i == s′_t, add o_k to Path(s′_0, s′_t).
(4) If s′_i has adjacent abnormal service nodes s′_j, then for each s′_j, if
Figure BDA0003958229260000082
let s′_i = s′_j and go to step (1).
(5) Return to the upper-level call.
Steps (1) to (5) are executed recursively until no new anomaly propagation path is added to Path(s′_0, s′_t).
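Path enumeration follows the same recursion. In this sketch each recursive call works on its own copy of the current path o_k, so that sibling branches do not interfere; the text does not say how o_k is restored when a call returns, so this is an assumption. As before, the condition in the formula image of step (4) is represented by a hypothetical `follow(i, j)` predicate.

```python
def find_paths(successors, start, target, follow=lambda i, j: True):
    """Enumerate simple paths from s'_0 (start) to s'_t (target) in G'."""
    paths = []  # Path(s'_0, s'_t)

    def visit(node, path):
        if node in path:                       # step (1)
            return
        path = path + [node]                   # step (2), on a fresh copy of o_k
        if node == target:                     # step (3)
            paths.append(path)
        for nxt in successors.get(node, []):   # step (4)
            if follow(node, nxt):
                visit(nxt, path)
        # step (5): return to the caller

    visit(start, [])
    return paths
```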
32) Calculating the anomaly root cause score: for s′_0, the anomaly propagation graph G′ and the anomaly propagation paths Path(s′_0, s′_t) are used to calculate the anomaly root cause score of each s′_t ∈ R. The calculation formula is shown as formula (3).
Figure BDA0003958229260000083
Here score(s′_t) represents the root cause score of abnormal service s′_t, m represents the number of anomaly propagation paths from s′_0 to s′_t, n represents the number of hops of anomaly propagation path o_k, w_{k,i} represents the dependency weight between node i and node i-1 on the k-th propagation path, and s_i represents the value of node i, i.e., the reciprocal of the anomaly occurrence time. The design principle of the root cause score formula is that an anomaly that occurs earlier is more likely to be the root cause, and an anomaly with more propagation paths is more likely to be the root cause.
33) Reporting the anomaly root causes and propagation paths: the anomaly root causes are sorted in descending order of score, and the anomaly propagation paths corresponding to each root cause are reported.
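The scoring and reporting steps can be illustrated with a small sketch. The score formula itself appears only as an image, so the form used here is an assumption chosen to match the symbol definitions and the stated design principle: a sum over the m paths of the sum over hops of edge weight times node value. It should not be read as the patent's formula (3). Function names are illustrative.

```python
def root_cause_score(paths, edges, node_value):
    """Assumed score: sum over paths and hops of w_{k,i} * value(node i)."""
    total = 0.0
    for path in paths:
        for previous, node in zip(path, path[1:]):
            total += edges[(previous, node)] * node_value[node]
    return total


def report(paths_by_root, edges, node_value):
    """Return (score, root, paths) tuples sorted by descending score."""
    scored = [(root_cause_score(paths, edges, node_value), root, paths)
              for root, paths in paths_by_root.items()]
    return sorted(scored, key=lambda item: item[0], reverse=True)
```

Under this assumed form, a candidate reached by more paths accumulates more terms, and a candidate whose anomaly occurred earlier contributes a larger node value, which is consistent with the design principle stated above.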
FIG. 2 is a block diagram of the dynamic service dependency graph based anomaly root cause positioning system provided by the invention.
The invention provides a system implementing the abnormal root cause positioning method based on the dynamic service dependency graph. The system takes configuration information, running logs, key performance indicators, and an anomaly set as input, and comprises a dynamic service dependency graph construction module, an anomaly propagation graph construction module, and an abnormal root cause positioning module.
The modules are described below.
S1) dynamic service dependency graph construction module
The dynamic service dependency graph construction module has the function of constructing a dynamic service dependency graph based on service configuration information, running logs and key performance indexes. The module comprises four sub-modules:
S11) Deployment-time service dependency finder
The deployment-time service dependency relationship finder mines deployment-time dependency relationships among micro services deployed at the same position or depending on the same resource based on the service configuration information.
S12) runtime service dependency finder
The runtime service dependency relationship finder mines runtime dependency relationships between micro-services based on service execution logs and key performance indicators.
S13) dynamic service dependency graph constructor
The dynamic service dependency graph constructor dynamically builds the service dependency graph based on the deployment-time and runtime service dependencies.
S14) service dependency graph memory
The service dependency graph memory stores service dependency graphs in a matrix and provides high performance queries for the service dependency graphs.
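Matrix storage of this kind can be sketched as an adjacency matrix with a name-to-index map, which gives constant-time weight lookups. The class and method names are hypothetical, and treating a weight of 0.0 as "no edge" is an assumption made for the sketch.

```python
class GraphMatrix:
    """Adjacency-matrix store for a weighted service dependency graph."""

    def __init__(self, services):
        self.index = {name: i for i, name in enumerate(services)}
        size = len(services)
        # weights[i][j] is the weight of the edge i -> j; 0.0 means no edge.
        self.weights = [[0.0] * size for _ in range(size)]

    def add_edge(self, source, target, weight):
        self.weights[self.index[source]][self.index[target]] = weight

    def weight(self, source, target):
        """Constant-time lookup of a dependency weight."""
        return self.weights[self.index[source]][self.index[target]]

    def successors(self, source):
        """Services that `source` has an outgoing dependency edge to."""
        row = self.weights[self.index[source]]
        return [name for name, j in self.index.items() if row[j] != 0.0]
```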
S2) Anomaly propagation graph construction module
The function of the anomaly propagation graph construction module is to construct an anomaly propagation graph from the service dependency graph and the input anomaly set. The module comprises two sub-modules:
S21) Anomaly propagation graph constructor
The anomaly propagation graph constructor marks the anomalies in the input anomaly set on the service dependency graph, generates the corresponding subgraph, and thereby constructs the anomaly propagation graph.
S22) Anomaly propagation graph memory
The anomaly propagation graph memory stores anomaly propagation graphs in the form of a matrix and provides high-performance queries on them.
S3) Abnormal root cause positioning module
The function of the abnormal root cause positioning module is to find all possible anomaly root cause nodes from the anomaly propagation graph, construct the anomaly propagation paths of those nodes, calculate root cause scores, and generate an anomaly root cause report. The module is divided into three sub-modules:
S31) Anomaly propagation path builder
The anomaly propagation path builder uses the anomaly propagation graph to find all possible anomaly root cause nodes and builds their anomaly propagation paths.
S32) Anomaly root cause calculator
The anomaly root cause calculator uses the anomaly propagation paths to calculate the root cause score of each anomaly from the number of anomaly propagation paths, the weights of the edges on the paths, and the time at which the anomaly occurred.
S33) Anomaly root cause reporter
The anomaly root cause reporter sorts the anomalies in descending order of score and reports the corresponding anomaly propagation paths.
It should be noted that the purpose of the disclosed embodiments is to aid further understanding of the present invention, but those skilled in the art will appreciate that: various alternatives and modifications are possible without departing from the spirit and scope of the invention and the appended claims. Therefore, the invention should not be limited to the disclosed embodiments, but rather the scope of the invention is defined by the appended claims.

Claims (5)

1. The abnormal root cause positioning method based on the dynamic service dependency graph is characterized by comprising the steps of dynamic service dependency graph construction, abnormal propagation graph construction and abnormal root cause positioning; the method comprises the following specific steps:
1) The dynamic service dependency graph construction specifically comprises the following steps:
11) Extracting deployment-time service dependencies;
12) Extracting runtime service dependencies;
13) Building the dynamic service dependency graph G: each time an abnormal root cause positioning task is executed, step 11) and step 12) are executed to dynamically obtain the deployment-time service dependencies and the runtime service dependencies, and a service dependency graph G = <V, E> is constructed; wherein G is a directed graph, V represents the services and is the set of nodes of G, and E represents the dependencies and is the set of edges of G;
2) The anomaly propagation graph construction specifically comprises the following step:
constructing the anomaly propagation graph: the input anomaly set A = {a_1, a_2, ..., a_n} is marked on the service dependency graph G, the service nodes marked as abnormal are extracted to form a subgraph G′ of G, the edges among these nodes are retained, and the value of each node is the reciprocal of the occurrence time of its anomaly, wherein a_i = (s_i, t_i) represents the i-th anomaly, s_i is the service on which anomaly a_i occurred, and t_i is the time at which anomaly a_i occurred;
3) The abnormal root cause positioning specifically comprises the following steps:
31) Constructing the anomaly propagation paths: on the anomaly propagation graph G′, starting from a randomly selected abnormal service s′_0, the candidate root cause nodes and their anomaly propagation paths are found using depth-first search;
32) Calculating the anomaly root cause score: for s′_0, the anomaly propagation graph G′ and the anomaly propagation paths Path(s′_0, s′_t) are used to calculate the anomaly root cause score of each s′_t ∈ R, where R is the candidate root cause node set; the calculation formula is shown in formula (1):
Figure FDA0004187931700000011
wherein the set of anomaly propagation paths from abnormal service s′_0 to abnormal service s′_t is Path(s′_0, s′_t) = {o_1, o_2, ..., o_n}, o_k is an anomaly propagation path, o_k = {s′_0, ..., s′_t}, score(s′_t) represents the root cause score of abnormal service s′_t, m represents the number of anomaly propagation paths from s′_0 to s′_t, n represents the number of hops of anomaly propagation path o_k, w_{k,i} represents the dependency weight between node i and node i-1 on the k-th propagation path, and e_i represents the value of node i, i.e., the reciprocal of the anomaly occurrence time;
33) Reporting the anomaly root causes and propagation paths: the anomaly root causes are sorted in descending order of score, and the anomaly propagation paths corresponding to each root cause are reported.
2. The abnormal root cause positioning method based on the dynamic service dependency graph according to claim 1, wherein step 11) defines the attribute set of the configuration information to be collected as
Figure FDA0004187931700000012
so that each collected item of configuration information c_i = (k_i, v_i) must satisfy k_i ∈ K, where k_i represents the attribute of the configuration information and v_i represents its value; the configuration set of a service s_i is recorded as
Figure FDA0004187931700000013
and if the configuration sets of service s_i and service s_j intersect, i.e., C_i ∩ C_j ≠ ∅, then a deployment-time dependency exists between service s_i and service s_j.
3. The abnormal root cause positioning method based on the dynamic service dependency graph according to claim 1, wherein step 12) employs the PC algorithm, which determines the causal relationships between variables by conditional independence tests and then determines the directions of the causal relationships using the d-separation conditions; the conditional independence test used is the G² conditional cross-entropy measure shown in formula (2), which follows a χ² distribution with D degrees of freedom, D being given by formula (3):
Figure FDA0004187931700000021
D = (N_X − 1)(N_Y − 1) ∏_{Z′∈Z} N_{Z′}    (3)
wherein
Figure FDA0004187931700000022
Z is the set of Z′, and m is the number of samples.
4. The abnormal root cause positioning method based on the dynamic service dependency graph according to claim 3, wherein the construction rule in step 13) is as follows: if a runtime dependency exists between service s_i and service s_j, a directed edge between s_i and s_j is added, and the weight of the edge is the G² value between s_i and s_j; if a deployment-time dependency exists between service s_i and service s_j, an edge from s_i to s_j and an edge from s_j to s_i are both added, and the weight of each of these edges is the average of the runtime dependency edge weights.
5. The abnormal root cause positioning method based on the dynamic service dependency graph according to claim 1, wherein step 2) normalizes the anomaly occurrence times: the time of the first anomaly is set to 1, and the remaining anomalies are incremented from 1 in order of occurrence.
CN202211470197.6A 2022-11-23 2022-11-23 Abnormal root cause positioning method and system based on dynamic service dependency graph Active CN115756929B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211470197.6A CN115756929B (en) 2022-11-23 2022-11-23 Abnormal root cause positioning method and system based on dynamic service dependency graph

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211470197.6A CN115756929B (en) 2022-11-23 2022-11-23 Abnormal root cause positioning method and system based on dynamic service dependency graph

Publications (2)

Publication Number Publication Date
CN115756929A CN115756929A (en) 2023-03-07
CN115756929B true CN115756929B (en) 2023-06-02

Family

ID=85335430

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211470197.6A Active CN115756929B (en) 2022-11-23 2022-11-23 Abnormal root cause positioning method and system based on dynamic service dependency graph

Country Status (1)

Country Link
CN (1) CN115756929B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116450399B (en) * 2023-06-13 2023-08-22 西华大学 Fault diagnosis and root cause positioning method for micro service system
CN116820826B (en) * 2023-08-28 2023-11-24 北京必示科技有限公司 Root cause positioning method, device, equipment and storage medium based on call chain

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112698975A (en) * 2020-12-14 2021-04-23 北京大学 Fault root cause positioning method and system of micro-service architecture information system

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103606042B (en) * 2013-11-18 2016-08-17 南京理工大学 Services Composition instance migration availability deciding method based on dynamic dependency graph
CN112787841B (en) * 2019-11-11 2022-04-05 华为技术有限公司 Fault root cause positioning method and device and computer storage medium
US11704185B2 (en) * 2020-07-14 2023-07-18 Microsoft Technology Licensing, Llc Machine learning-based techniques for providing focus to problematic compute resources represented via a dependency graph
US11500888B2 (en) * 2020-08-07 2022-11-15 NEC Laboratories Europe GmbH Methods and systems for detecting anomalies in cloud services based on mining time-evolving graphs
CN115278741A (en) * 2022-06-15 2022-11-01 清华大学 Fault diagnosis method and device based on multi-mode data dependency relationship
CN115118621B (en) * 2022-06-27 2023-05-09 浙江大学 Dependency graph-based micro-service performance diagnosis method and system

Also Published As

Publication number Publication date
CN115756929A (en) 2023-03-07

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant