CN115756929A - Abnormal root cause positioning method and system based on dynamic service dependency graph - Google Patents

Info

Publication number
CN115756929A
Authority
CN
China
Prior art keywords
abnormal
service
root cause
graph
propagation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211470197.6A
Other languages
Chinese (zh)
Other versions
CN115756929B (en
Inventor
张齐勋
刘洪毅
杨勇
贾统
李影
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University filed Critical Peking University
Priority to CN202211470197.6A priority Critical patent/CN115756929B/en
Publication of CN115756929A publication Critical patent/CN115756929A/en
Application granted granted Critical
Publication of CN115756929B publication Critical patent/CN115756929B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Debugging And Monitoring (AREA)

Abstract

The invention provides an abnormal root cause positioning method and system based on a dynamic service dependency graph, and belongs to the field of intelligent operation and maintenance. The method comprises three stages. First, deployment-time service dependencies are extracted from service configuration information, runtime service dependencies are discovered from the correlation of key performance indicators (KPIs) and log frequency information among services, and a service dependency graph is constructed dynamically. Second, an anomaly propagation graph is constructed automatically from the input anomalous services and the dynamic service dependency graph. Third, anomaly propagation paths are constructed by depth-first search, abnormal root cause scores are calculated, the root cause service is located, and the anomaly propagation paths are reported. Because it captures changes in service dependencies at runtime, the method describes anomaly propagation more accurately and improves root cause positioning. It applies to the root causes of various kinds of anomalies and therefore generalizes well; it locates the anomalous service directly and provides the anomaly propagation paths, which makes it practical and interpretable. No manual participation is needed, which saves labor cost.

Description

Abnormal root cause positioning method and system based on dynamic service dependency graph
Technical Field
The invention belongs to the field of intelligent operation and maintenance, and particularly relates to an abnormal root cause positioning method based on a dynamic service dependency graph.
Background
A microservice architecture system (microservice system for short) is composed of a large number of fine-grained microservices that cooperate to realize the system's functions, and is widely used in communication, transportation, logistics, finance and other fields. Modern internet applications often use a microservice system as their underlying implementation. The core design idea of the microservice architecture is to split an application into multiple microservices that are highly cohesive, loosely coupled and single-purpose, and that can be developed, deployed and updated independently, thereby enabling agile development and continuous delivery. The architecture makes development and iteration more convenient, but it also brings a new challenge for operation and maintenance: the complexity of a microservice system makes applications prone to anomalies whose root causes are difficult to diagnose.
The complexity of a microservice system lies in the large number of microservices, the dynamically changing dependencies among them, and their heterogeneous implementations. Because each microservice has a single function, a microservice system may need a huge number of microservices to provide its full functionality. For example, Netflix's microservice system runs 6000 microservice instances and must process more than 20 million service requests per day; Uber's microservice system also runs 4000 microservice instances. Meanwhile, to keep up with changing user requirements and application evolution, a microservice system dynamically deploys and registers microservice instances and continuously adds new microservices, so the dependencies between microservices change dynamically and become complicated. Moreover, because service interfaces are encapsulated, different microservices in the system may adopt different programming languages, service frameworks and communication mechanisms to suit different application scenarios. This heterogeneous design and implementation further increases the complexity of the microservice system.
The complexity of the microservice system makes it prone to anomalies and makes their root causes difficult to diagnose. An anomaly is a deviation of the system's running state from its design expectations. Because of their loosely coupled design and heterogeneous implementation, microservices cooperate through remote network calls, which are prone to anomalies caused by network instability, version incompatibility, configuration errors, code defects and similar problems. The large number of microservices further increases the likelihood of such anomalies. Moreover, because the dependencies between services are complex and change dynamically, even a tiny system anomaly can trigger a chain reaction in which multiple anomalies occur at the same time, of which only one or a few are the root causes of the others. This happens because an anomaly spreads along the dependencies among services and forms an anomaly propagation chain; the microservice anomalies linked by these dependencies are causally related, and the source of the propagation chain is usually the root cause. To ensure system reliability, when an anomaly is detected, operation and maintenance personnel need to locate its root cause immediately and take corresponding action, such as rolling back a change. Otherwise, the anomaly may cause service interruption or degraded service quality and lead to serious losses. International Data Corporation has reported that a one-hour service outage results, on average, in a loss of about $10 million.
Although a microservice system provides rich operation and maintenance data such as logs, key performance indicators and call chains, the heterogeneous data formats, the sparsity of valuable information and the dynamically changing service dependencies make it difficult for operation and maintenance personnel to locate the root cause of an anomaly in a complex microservice system. In serious cases the anomaly cannot be repaired in time, which brings greater economic and reputational damage to the enterprise.
Existing abnormal root cause positioning methods for microservice systems still have shortcomings: they often ignore the dynamic change of service dependencies and the interpretability of root cause positioning. Most existing methods rely on a service dependency graph. The service dependency graph describes the dependencies among services; these dependencies can also describe how anomalies propagate between services, which helps operation and maintenance personnel locate root causes. Root cause positioning based on a service dependency graph first constructs the graph from metrics, system logs or tracing data. When an anomaly occurs, it starts from the anomalous node in the graph, obtains a set of candidate root causes through algorithms such as graph search and random walk, and then ranks the candidates by, for example, anomaly score, correlation with the anomalous node, or number of visits. However, existing methods often depend on a static service dependency graph and ignore dynamic changes in service dependencies, so the constructed dependencies differ from the actual situation and positioning accuracy is limited. Meanwhile, existing methods report only the root cause service or root cause metric and lack interpretability, so operation and maintenance personnel find it difficult to quickly understand the reported result and judge whether the root cause is correct.
Disclosure of Invention
To reduce the additional analysis and rollback caused by inaccurate root cause positioning, reduce operation and maintenance cost, improve the interpretability of anomaly positioning and guarantee the reliability of microservice systems, the invention provides an abnormal root cause positioning method based on a dynamic service dependency graph. The method extracts deployment-time service dependencies from service configuration information, discovers runtime service dependencies from the correlation of key performance indicators and log frequency information among services, and dynamically constructs a service dependency graph; it automatically constructs an anomaly propagation graph from the input anomalous services and the service dependency graph; and it constructs anomaly propagation paths by depth-first search, calculates abnormal root cause scores, locates the root cause service, and reports the anomaly propagation paths.
Dynamic service dependency graph construction, anomaly propagation graph construction and abnormal root cause positioning are all carried out automatically, without manual participation, which saves labor cost. Because the method is based on a dynamic service dependency graph, it better captures changes in service dependencies at runtime, describes anomaly propagation more accurately, and improves root cause positioning. The method applies to the root causes of various kinds of anomalies and generalizes well. It locates anomalous services directly and provides anomaly propagation paths, which makes it practical and interpretable.
The technical scheme provided by the invention is as follows:
An abnormal root cause positioning method based on a dynamic service dependency graph, characterized by comprising three stages: constructing a dynamic service dependency graph, constructing an anomaly propagation graph, and positioning the abnormal root cause. The specific steps are as follows:
1) Constructing a dynamic service dependency graph, and specifically executing the following steps:
11) Extract deployment-time service dependencies: the principle is that two microservices deployed at the same location (e.g., a host) or relying on the same resource (e.g., a database) have a deployment-time dependency. Configuration information of each service, including deployment location information and associated resource information, is extracted from a Configuration Management Database (CMDB). Define the attribute set of the configuration information to be collected as $K = (k_1, k_2, \ldots, k_n)$, so that every collected configuration item $c_i = (k_i, v_i)$ satisfies $k_i \in K$, where $k_i$ is the attribute (metadata) of the configuration item and $v_i$ is its value. Denote the configuration set of a service $s_i$ as $C_i = (c_1, c_2, \ldots, c_n)$. If the configuration sets of services $s_i$ and $s_j$ intersect, i.e. $C_i \cap C_j \neq \emptyset$, then a deployment-time dependency exists between $s_i$ and $s_j$.
12) Extract runtime service dependencies: the principle is that if the log volumes or the Key Performance Indicators (KPIs) of two services are causally related as the system load changes, a runtime dependency exists between the two services, and the strength of the causal relationship indicates the strength of the runtime dependency. The causal inference algorithm used here is the PC algorithm: given a reliable conditional independence test, it can handle various types of data distributions and causal relationships, and compared with other causal inference methods it has low complexity and good results. The PC algorithm first determines which variables are causally related through conditional independence tests and then determines the direction of each causal relationship using d-separation conditions. The conditional independence test used here is the $G^2$ conditional cross entropy measure shown in formula (1); the statistic follows a $\chi^2$ distribution with $D$ degrees of freedom, where $D$ is given by formula (2).
[Formula (1): equation image not available in this text]
$D = (N_X - 1)(N_Y - 1)\prod_{Z' \in Z} N_{Z'}$    Formula (2)
where [equation image not available in this text], $Z$ is the set of the $Z'$, and $m$ is the number of samples.
13) Build the dynamic service dependency graph: steps 11) and 12) are executed each time the abnormal root cause positioning task runs, so that deployment-time and runtime service dependencies are acquired dynamically. A service dependency graph $G = \langle V, E \rangle$ is then constructed from them, where $G$ is a directed graph (DG), $V$ is the set of services (the nodes of $G$) and $E$ is the set of dependencies (the edges of $G$). The construction rules are as follows. If a runtime dependency exists between services $s_i$ and $s_j$, a directed edge is added between $s_i$ and $s_j$ whose weight is the $G^2$ value between $s_i$ and $s_j$. If a deployment-time dependency exists between $s_i$ and $s_j$, edges from $s_i$ to $s_j$ and from $s_j$ to $s_i$ are added, and the weight of each is the mean of the runtime dependency edge weights.
2) Constructing an anomaly propagation graph, and specifically executing the following steps:
21) Construct the anomaly propagation graph: the input anomaly set $A = \{a_1, a_2, \ldots, a_n\}$ is marked on the service dependency graph $G$; the service nodes marked as anomalous are extracted, together with the edges among them, to form a subgraph $G'$ of $G$, and the value of each node is the reciprocal of its anomaly occurrence time. Here $a_i = (s_i, t_i)$ denotes the $i$-th anomaly, $s_i$ is the service on which anomaly $a_i$ occurred and $t_i$ is the time at which it occurred. For ease of calculation, the anomaly occurrence times are normalized: the time of the first anomaly is set to 1, and the remaining anomalies, taken in order of occurrence, are assigned times that increase from 1.
3) Positioning the abnormal root cause, and specifically executing the following steps:
31) Construct anomaly propagation paths: starting from a randomly selected anomalous service $s'_0$ on the anomaly propagation graph $G'$, candidate root cause nodes and their anomaly propagation paths are searched using depth-first search.
32) Compute abnormal root cause scores: for $s'_0$, the anomaly propagation graph $G'$ and the anomaly propagation paths $Path(s'_0, s'_t)$ are used to calculate the abnormal root cause score of each $s'_t \in R$, as shown in formula (3).
[Formula (3): equation image not available in this text]
where $Score(s'_t)$ is the abnormal root cause score of the anomalous service $s'_t$, $M$ is the number of propagation paths from $s'_0$ to $s'_t$, $N$ is the number of hops in propagation path $o_k$, $w_{k,i}$ is the dependency weight between node $i$ and node $i-1$ on the $k$-th propagation path, and $s_i$ is the value of node $i$, i.e. the reciprocal of its anomaly occurrence time. The formula is designed on the principle that an anomaly that occurs earlier is more likely to be a root cause, and an anomaly with more propagation paths is more likely to be a root cause.
33) Report abnormal root causes and propagation paths: the abnormal root causes are sorted in descending order of score, and the anomaly propagation paths corresponding to each are reported.
The invention further provides an abnormal root cause positioning system based on the dynamic service dependency graph, which is characterized by comprising a dynamic service dependency graph building module, an abnormal propagation graph building module and an abnormal root cause positioning module.
The dynamic service dependency graph building module comprises a deployment-time service dependency finder, a runtime service dependency finder, a dynamic service dependency graph builder and a service dependency graph memory. The deployment-time service dependency finder discovers deployment-time dependencies between microservices from service configuration information; the runtime service dependency finder discovers runtime dependencies between microservices from service operation logs and key performance indicators; the dynamic service dependency graph builder dynamically constructs the service dependency graph from the deployment-time and runtime dependencies; and the service dependency graph memory stores the service dependency graph.
The anomaly propagation graph building module comprises an anomaly propagation graph builder and an anomaly propagation graph memory. The builder constructs the anomaly propagation graph from the service dependency graph and the input anomaly set; the memory stores the anomaly propagation graph.
The abnormal root cause positioning module comprises an anomaly propagation path builder, an abnormal root cause calculator and an abnormal root cause reporter. The path builder searches the anomaly propagation graph for all possible root cause nodes and constructs the anomaly propagation paths; the calculator uses the propagation paths to compute abnormal root cause scores from the number of paths, the weights of the edges on each path and the anomaly occurrence times; and the reporter sorts the anomalies in descending order of score and reports the corresponding anomaly propagation paths.
Compared with the prior art, the invention has the beneficial effects that:
The invention provides an abnormal root cause positioning method and system based on a dynamic service dependency graph. The service dependency graph is constructed dynamically by reading service configuration information, service operation logs and key performance indicators, and is stored for subsequent generation of the anomaly propagation graph. When the system becomes anomalous, the anomaly propagation graph is constructed from the service dependency graph and the anomaly set. Depth-first search traverses all possible root cause anomalies and their propagation paths; abnormal root cause scores are calculated from the number of anomaly propagation paths, the weights of the edges on each path and the anomaly occurrence times; the root causes are sorted in descending order of score, and the corresponding anomaly propagation paths are reported. The invention thus automates construction of the dynamic service dependency graph, generation of the anomaly propagation graph, construction of anomaly propagation paths, calculation of root cause scores, positioning of the abnormal root cause and reporting of the corresponding propagation paths. The invention mainly has the following characteristics:
First, the system and method automatically construct the dynamic service dependency graph from service configuration information and runtime information (key performance indicators and logs).
Second, the system and method automatically construct the anomaly propagation graph from the input anomalies and the service dependency graph.
Third, the system and method provide the anomaly propagation paths while locating the root cause service, which makes them practical and interpretable.
By utilizing the technical scheme of the invention, the automatic construction of the dynamic service dependency graph, the construction of the abnormal propagation graph, the positioning of the abnormal root cause and the provision of the abnormal propagation path can be realized, and the method is suitable for a micro-service system.
Drawings
FIG. 1 is a flow chart of the abnormal root cause positioning method based on a dynamic service dependency graph provided by the invention;
FIG. 2 shows the abnormal root cause positioning system based on a dynamic service dependency graph provided by the invention.
Detailed Description
The invention will be further described by way of examples, without in any way limiting the scope of the invention, with reference to the accompanying drawings.
FIG. 1 is a flow chart of the abnormal root cause positioning method based on a dynamic service dependency graph according to the invention. The method comprises three stages: dynamic service dependency graph construction, anomaly propagation graph construction, and abnormal root cause positioning.
Dynamic service dependency graph construction uses service configuration information, service operation logs and key performance indicators to mine the deployment-time and runtime dependencies among services and dynamically construct the service dependency graph. Configuration information describes where a service is located and the resources and services it depends on; each configuration item is a two-tuple of a configuration attribute and a configuration value. The operation log records the running condition of the system, including the output of key variables, markers of key execution points, and so on. The operation log is time-series text, which the invention converts into a time series of log frequency. Key performance indicators, which also take the form of time series, are used to monitor the running state of the system and to detect whether it is anomalous. The service dependency graph is a directed graph describing the dependencies among the services in the microservice system: its nodes are services and its edges are dependencies between services. The graph is constructed dynamically as system operation data accumulates and therefore adapts to system iteration.
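By way of a non-limiting illustration, the conversion of an operation log into a log frequency time series can be sketched as follows. The function name, the use of timestamps in seconds and the fixed window length are assumptions made for the example; the text above does not prescribe them.

```python
from collections import Counter

def log_frequency_series(timestamps, start, window, n_windows):
    """Count log lines per fixed-size time window (hypothetical helper).

    timestamps: times of individual log lines, in seconds
    start:      start time of the first window
    window:     window length in seconds
    n_windows:  total number of windows n
    """
    counts = Counter()
    for ts in timestamps:
        idx = int((ts - start) // window)
        if 0 <= idx < n_windows:      # ignore lines outside the observed range
            counts[idx] += 1
    # Windows without any log line get frequency 0.
    return [counts[i] for i in range(n_windows)]

series = log_frequency_series([0.5, 1.2, 1.9, 7.0, 12.4], start=0, window=5, n_windows=3)
print(series)
```

Each service would contribute one such series, aligned on the same windows as its KPI series.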
Anomaly propagation graph construction uses the input anomaly set and the service dependency graph to build the anomaly propagation graph, which is later used to analyze propagation paths. Each anomaly item is a two-tuple of the service on which the anomaly occurred and the time at which it occurred. The anomaly propagation graph is also a directed graph and is a subgraph of the service dependency graph.
Abnormal root cause positioning uses the anomaly propagation graph to search for all possible root cause nodes, constructs the corresponding anomaly propagation paths, calculates the root cause score of each node, and sorts the nodes in descending order of score to generate an abnormal root cause report. The report contains the score of each root cause service and its corresponding anomaly propagation paths.
In the above abnormal root cause positioning method based on a dynamic service dependency graph, dynamic service dependency graph construction specifically executes the following steps:
11) Extract deployment-time service dependencies: the principle is that two microservices deployed at the same location (e.g., a host) or relying on the same resource (e.g., a database) have a deployment-time dependency. Configuration information of each service, including deployment location information and associated resource information, is extracted from a Configuration Management Database (CMDB). Define the attribute set of the configuration information to be collected as $K = (k_1, k_2, \ldots, k_n)$, so that every collected configuration item $c_i = (k_i, v_i)$ satisfies $k_i \in K$, where $k_i$ is the attribute (metadata) of the configuration item and $v_i$ is its value. Denote the configuration set of a service $s_i$ as $C_i = (c_1, c_2, \ldots, c_n)$. If the configuration sets of services $s_i$ and $s_j$ intersect, i.e. $C_i \cap C_j \neq \emptyset$, then a deployment-time dependency exists between $s_i$ and $s_j$.
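By way of a non-limiting illustration, step 11) can be sketched as a set intersection over (attribute, value) pairs. The dictionary layout of `configs`, the service names and the attribute names are hypothetical; in practice the items would be read from the CMDB.

```python
from itertools import combinations

def deployment_dependencies(configs, attributes):
    """Return the unordered service pairs whose configuration sets intersect.

    configs:    {service: iterable of (attribute, value) items}  (assumed layout)
    attributes: the attribute set K; items with other attributes are ignored
    """
    # Keep only configuration items c = (k, v) with k in K.
    filtered = {
        service: {(k, v) for k, v in items if k in attributes}
        for service, items in configs.items()
    }
    # s_i and s_j are deployment-dependent when C_i and C_j share an item.
    return {
        frozenset((a, b))
        for a, b in combinations(sorted(filtered), 2)
        if filtered[a] & filtered[b]
    }

configs = {
    "cart": [("host", "node-1"), ("database", "orders-db")],
    "checkout": [("host", "node-2"), ("database", "orders-db")],
    "search": [("host", "node-3"), ("version", "1.4")],
}
deps = deployment_dependencies(configs, attributes={"host", "database"})
print(deps)
```

Here `cart` and `checkout` share the item `("database", "orders-db")`, so they form the only deployment-time dependency.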
12) Extract runtime service dependencies: the principle is that if the log volumes or the Key Performance Indicators (KPIs) of two services are causally related as the system load changes, a runtime dependency exists between the two services, and the strength of the causal relationship indicates the strength of the runtime dependency. The causal inference algorithm used here is the PC algorithm: given a reliable conditional independence test, it can handle various types of data distributions and causal relationships, and compared with other causal inference methods it has low complexity and good results. The PC algorithm first determines which variables are causally related through conditional independence tests and then determines the direction of each causal relationship using d-separation conditions. The conditional independence test used here is the $G^2$ conditional cross entropy measure shown in formula (1); the statistic follows a $\chi^2$ distribution with $D$ degrees of freedom, where $D$ is given by formula (2).
[Formula (1): equation image not available in this text]
$D = (N_X - 1)(N_Y - 1)\prod_{Z' \in Z} N_{Z'}$    Formula (2)
where [equation image not available in this text], $Z$ is the set of the $Z'$, and $m$ is the number of samples.
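By way of a non-limiting illustration, a $G^2$ conditional independence statistic and the degrees of freedom of formula (2) can be computed as sketched below. Because formula (1) is not available in this text, the sketch uses the standard count-based form $G^2 = 2\sum_{x,y,z} N_{xyz}\ln\frac{N_{xyz}N_z}{N_{xz}N_{yz}}$, which is an assumption and may differ in presentation from the formula of the invention. It also assumes that the series have been discretized and reads $N_X$, $N_Y$ and $N_{Z'}$ as numbers of distinct values.

```python
import math
from collections import Counter

def g2_test(xs, ys, zs):
    """Return (G2, D) for testing X independent of Y given Z on discrete samples.

    xs, ys: equal-length sequences of discrete values
    zs:     one tuple per sample holding the conditioning values
            (use () for an empty conditioning set)
    """
    n_xyz = Counter(zip(xs, ys, zs))
    n_xz = Counter(zip(xs, zs))
    n_yz = Counter(zip(ys, zs))
    n_z = Counter(zs)
    g2 = 0.0
    for (x, y, z), observed in n_xyz.items():
        # Expected count if X and Y were independent within stratum z.
        expected = n_xz[(x, z)] * n_yz[(y, z)] / n_z[z]
        g2 += 2.0 * observed * math.log(observed / expected)
    # Formula (2), reading N as "number of distinct values" (assumption).
    dof = (len(set(xs)) - 1) * (len(set(ys)) - 1)
    for column in zip(*zs):
        dof *= len(set(column))
    return g2, dof

g2_dep, d_dep = g2_test([0, 0, 1, 1], [0, 0, 1, 1], [()] * 4)   # Y copies X
g2_ind, d_ind = g2_test([0, 0, 1, 1], [0, 1, 0, 1], [()] * 4)   # balanced table
print(g2_dep, d_dep, g2_ind, d_ind)
```

The p-value would then be the upper tail of the $\chi^2$ distribution with $D$ degrees of freedom evaluated at $G^2$, for example via `scipy.stats.chi2.sf(g2, dof)` if SciPy is available.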
121) Determine the runtime service dependency weights: the KPIs used here are Service Level Objective (SLO) indicators, such as service request latency, which are used to evaluate whether a service is operating normally. Denote the KPI time series of a service $s_i$ as $T_i = \{t_1, t_2, \ldots, t_n\}$ and its log frequency series as $L_i = \{l_1, l_2, \ldots, l_n\}$; the time window size is [symbol image not available in this text] and the total number of time windows is $n$. Initially, a causal relationship is assumed between every two of the $m$ services. The $G^2$ value between any $s_i$ and $s_j$ is calculated from the time series data $T$ and its p-value under the $\chi^2$ distribution is looked up; if p-value $> \xi$, the conditional independence hypothesis between $s_i$ and $s_j$ is accepted, otherwise it is rejected. If the hypothesis is accepted, the conditioning set $Z$ at that point is recorded as the separation set $S(s_i, s_j)$ of $s_i$ and $s_j$. The same conditional independence test is performed on $s_i$ and $s_j$ using the log frequency series data $L$. If the conditional independence hypothesis is accepted in both tests, $s_i$ and $s_j$ are determined to have no causal relationship; otherwise they are determined to have a causal relationship. All pairs of $s_i$ and $s_j$ are traversed until the causal relationship of every pair has been determined.
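By way of a non-limiting illustration, the decision rule of step 121) can be sketched as follows, taking the p-values of the two tests as given. The function name, the dictionary layout and the default threshold of 0.05 for $\xi$ are assumptions made for the example.

```python
from itertools import combinations

def causal_skeleton(services, p_kpi, p_log, xi=0.05):
    """Return the undirected pairs judged to be causally related.

    p_kpi, p_log: {frozenset((a, b)): p-value} of the G^2 test on the KPI
                  series and on the log frequency series (assumed layout)
    xi:           significance threshold; 0.05 is an assumed default
    """
    edges = set()
    for a, b in combinations(services, 2):
        pair = frozenset((a, b))
        # Independence is concluded only when BOTH tests accept it.
        independent = p_kpi[pair] > xi and p_log[pair] > xi
        if not independent:
            edges.add(pair)
    return edges

ab, ac, bc = frozenset("ab"), frozenset("ac"), frozenset("bc")
edges = causal_skeleton(
    ["a", "b", "c"],
    p_kpi={ab: 0.01, ac: 0.40, bc: 0.30},
    p_log={ab: 0.20, ac: 0.50, bc: 0.02},
)
print(edges)
```

In this example the pair (a, c) is dropped because both tests accept independence, while (a, b) and (b, c) are kept because one of the two tests rejects it.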
122) Determine the runtime service dependency directions: the direction of each causal relationship is then determined using the d-separation conditions, which comprise four rules:
(1) For any two variables X and Y that are not adjacent (no causal relationship) and share a common neighbor variable Z, if [condition given as an equation image, not available in this text], then X - Z - Y is oriented as X → Z ← Y.
(2) If X → Y exists, all Y - Z are oriented as Y → Z.
(3) If X → Z → Y exists, all X - Y are oriented as X → Y.
(4) If X - $Z_1$ → Y and X - $Z_2$ → Y both exist, all X - Y are oriented as X → Y.
Rule (1) takes priority over rules (2), (3) and (4): rules (2), (3) and (4) are applied only after every application of rule (1) has been carried out, and they have no order of precedence among themselves.
If the direction of the dependency between $s_i$ and $s_j$ still cannot be determined after the above rules have been applied, a bidirectional dependency is added, i.e. a dependency from $s_i$ to $s_j$ and a dependency from $s_j$ to $s_i$.
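By way of a non-limiting illustration, rule (1) and the bidirectional fallback can be sketched as below. The condition of rule (1) is not available in this text; the sketch assumes the standard collider condition of the PC algorithm, namely that Z is not in the separation set S(X, Y). Rules (2) to (4) are omitted for brevity, and all names are hypothetical.

```python
def orient(edges, sepset):
    """Orient an undirected skeleton; returns a set of (src, dst) tuples.

    edges:  set of frozenset pairs (the undirected causal relationships)
    sepset: {frozenset((x, y)): set of separating variables} for the
            non-adjacent pairs (assumed layout)
    """
    nodes = sorted({n for e in edges for n in e})
    neighbors = {n: {m for e in edges if n in e for m in e if m != n} for n in nodes}
    directed = set()
    # Rule (1), assumed condition: X - Z - Y with X, Y non-adjacent and
    # Z outside S(X, Y) becomes X -> Z <- Y.
    for z in nodes:
        for x in neighbors[z]:
            for y in neighbors[z]:
                if x < y and frozenset((x, y)) not in edges:
                    if z not in sepset.get(frozenset((x, y)), set()):
                        directed.add((x, z))
                        directed.add((y, z))
    # Fallback: an edge that is still undirected gets both directions.
    for e in edges:
        a, b = sorted(e)
        if (a, b) not in directed and (b, a) not in directed:
            directed.update({(a, b), (b, a)})
    return directed

skeleton = {frozenset("ac"), frozenset("bc"), frozenset("cd")}
sepset = {frozenset("ab"): set(), frozenset("ad"): {"c"}, frozenset("bd"): {"c"}}
result = orient(skeleton, sepset)
print(sorted(result))
```

In this example a and b collide at c, while the edge between c and d is left undetermined by rule (1) and therefore receives both directions; a full implementation would first try rules (2) to (4) on that edge.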
13) Build the dynamic service dependency graph: steps 11) and 12) are executed each time the abnormal root cause positioning task runs, so that deployment-time and runtime service dependencies are acquired dynamically. A service dependency graph $G = \langle V, E \rangle$ is then constructed from them, where $G$ is a directed graph (DG), $V$ is the set of services (the nodes of $G$) and $E$ is the set of dependencies (the edges of $G$). The construction rules are as follows. If a runtime dependency exists between services $s_i$ and $s_j$, a directed edge is added between $s_i$ and $s_j$ whose weight is the $G^2$ value between $s_i$ and $s_j$. If a deployment-time dependency exists between $s_i$ and $s_j$, edges from $s_i$ to $s_j$ and from $s_j$ to $s_i$ are added, and the weight of each is the mean of the runtime dependency edge weights.
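By way of a non-limiting illustration, the construction rules of step 13) can be sketched with a plain dictionary of weighted directed edges. The representation is hypothetical, and two details are assumptions not settled by the text: an existing runtime edge keeps its own $G^2$ weight when a deployment-time dependency covers the same pair, and the mean weight falls back to 0.0 when no runtime edge exists.

```python
def build_dependency_graph(runtime_edges, deployment_pairs):
    """Merge runtime and deployment-time dependencies into one directed graph.

    runtime_edges:    {(src, dst): G^2 weight}
    deployment_pairs: iterable of (a, b) service pairs
    Returns {(src, dst): weight}.
    """
    graph = dict(runtime_edges)
    # Mean of the runtime dependency edge weights (0.0 if there are none: assumption).
    mean_w = sum(runtime_edges.values()) / len(runtime_edges) if runtime_edges else 0.0
    for a, b in deployment_pairs:
        for edge in ((a, b), (b, a)):
            # Assumption: do not overwrite a runtime edge that already exists.
            graph.setdefault(edge, mean_w)
    return graph

graph = build_dependency_graph(
    runtime_edges={("a", "b"): 4.0, ("b", "c"): 2.0},
    deployment_pairs=[("a", "c"), ("a", "b")],
)
print(graph)
```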
For the above abnormal root cause positioning method based on the dynamic service dependency graph, the construction of the abnormal propagation graph specifically executes the following steps:
21) Construct the abnormal propagation graph: the input anomaly set A = {a_1, a_2, ..., a_n} is marked on the service dependency graph G, and the service nodes marked as abnormal are taken out to form a subgraph G' of G, with the edges among those nodes retained; the value of each node is the reciprocal of its anomaly occurrence time. Here a_i = (s_i, t_i) denotes the i-th anomaly, s_i denotes the service on which anomaly a_i occurs, and t_i denotes the time at which anomaly a_i occurs. For convenience of calculation, the anomaly occurrence times are normalized: the time of the first anomaly is set to 1, and the occurrence times of the remaining anomalies, taken in order of occurrence, increase from 1.
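A minimal Python sketch of this step is given below. The function name is hypothetical, and the normalization t - t_min + 1 is only one possible reading of the rule above (a rank-based numbering 1, 2, 3, ... would also fit). When a service appears in several anomalies, the sketch keeps its earliest time, which is likewise an assumption.

```python
def build_propagation_graph(graph, anomalies):
    """Extract the abnormal subgraph G' from G.

    graph     -- dict {(src, dst): weight} describing G
    anomalies -- list of (service, time) pairs, i.e. the anomaly set A
    Returns (node_values, edges): node value = 1 / normalized occurrence time.
    """
    if not anomalies:
        return {}, {}
    first_seen = {}
    for service, t in anomalies:
        # ASSUMPTION: keep the earliest anomaly time per service.
        if service not in first_seen or t < first_seen[service]:
            first_seen[service] = t
    t_min = min(first_seen.values())
    # ASSUMPTION: normalized time = t - t_min + 1, so the first anomaly gets 1.
    node_values = {s: 1.0 / (t - t_min + 1) for s, t in first_seen.items()}
    # Keep only the edges whose endpoints are both abnormal.
    edges = {(u, v): w for (u, v), w in graph.items()
             if u in node_values and v in node_values}
    return node_values, edges
```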
For the above abnormal root cause positioning method based on the dynamic service dependency graph, the abnormal root cause positioning specifically executes the following steps:
31) Construct the abnormal propagation paths: starting from an abnormal service s'_0 selected at random on the abnormal propagation graph G', the candidate root cause nodes and their abnormal propagation paths are found using depth-first search.
311) Find the set of candidate root cause nodes: in the candidate root cause node search stage, initialize s'_i = s'_0, o_k = ∅, R = ∅, and then perform the following recursive steps:
(1) If s'_i ∈ o_k, return to the calling level.
(2) Add s'_i to o_k.
(3) If s'_i has no adjacent abnormal node, add s'_i to the candidate root cause node set R.
(4) If s'_i has adjacent abnormal nodes s'_j, then for each s'_j, if
Figure BDA0003958229260000081
set s'_i = s'_j and execute step (1).
(5) Return to the calling level.
Steps (1) to (5) are executed recursively until no new abnormal service is added to R.
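The recursion of step 311) can be sketched in Python as follows. The names find_candidate_roots and may_follow are hypothetical. The condition tested in step (4) appears only in a figure that is not reproduced here, so the sketch takes it as a caller-supplied predicate that accepts every neighbor by default. The adjacency dictionary is also supplied by the caller, so the sketch does not assume in which direction the edges of G' are followed.

```python
def find_candidate_roots(adjacent, start, may_follow=lambda cur, nxt: True):
    """Depth-first search for candidate root cause nodes.

    adjacent   -- dict {node: list of adjacent abnormal nodes} derived from G'
    start      -- the starting abnormal service s'_0
    may_follow -- ASSUMPTION: stands in for the unreproduced condition of step (4)
    Returns the candidate root cause set R as a list, in discovery order.
    """
    visited = set()   # plays the role of o_k
    roots = []        # plays the role of R

    def visit(node):
        if node in visited:            # step (1)
            return
        visited.add(node)              # step (2)
        neighbors = adjacent.get(node, [])
        if not neighbors:              # step (3)
            roots.append(node)
        for nxt in neighbors:          # step (4)
            if may_follow(node, nxt):
                visit(nxt)
        # step (5): returning from visit() returns to the calling level

    visit(start)
    return roots
```

The recursion stops by itself once every reachable node has been visited, which corresponds to the point at which no new service is added to R.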
312) Determine the abnormal propagation paths: in the propagation path search stage, for each s'_t ∈ R, the set of abnormal propagation paths from abnormal service s'_0 to abnormal service s'_t is denoted Path(s'_0, s'_t) = {o_1, o_2, ..., o_n}, where o_k = {s'_0, ..., s'_t} is an abnormal propagation path. Initialize s'_i = s'_0, o_k = ∅, Path(s'_0, s'_t) = ∅, and then perform the following recursive steps:
(1) If s'_i ∈ o_k, return to the calling level.
(2) Add s'_i to o_k.
(3) If s'_i == s'_t, add o_k to Path(s'_0, s'_t).
(4) If s'_i has adjacent abnormal service nodes s'_j, then for each s'_j, if
Figure BDA0003958229260000082
set s'_i = s'_j and execute step (1).
(5) Return to the calling level.
Steps (1) to (5) are executed recursively until no new abnormal propagation path is added to Path(s'_0, s'_t).
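Step 312) can be sketched in the same style. The function name find_propagation_paths and the predicate may_follow are hypothetical; the predicate again stands in for the condition shown only in the figure. Removing a node from the current path when the recursion returns (backtracking) is an assumption needed so that several distinct paths can be enumerated; the steps above do not spell it out.

```python
def find_propagation_paths(adjacent, start, target,
                           may_follow=lambda cur, nxt: True):
    """Enumerate the simple paths from s'_0 (start) to s'_t (target).

    adjacent   -- dict {node: list of adjacent abnormal nodes} derived from G'
    may_follow -- ASSUMPTION: stands in for the unreproduced condition of step (4)
    Returns Path(s'_0, s'_t) as a list of node lists.
    """
    paths = []
    current = []      # plays the role of o_k

    def visit(node):
        if node in current:                 # step (1): avoid cycles
            return
        current.append(node)                # step (2)
        if node == target:                  # step (3)
            paths.append(list(current))
        for nxt in adjacent.get(node, []):  # step (4)
            if may_follow(node, nxt):
                visit(nxt)
        current.pop()                       # ASSUMPTION: backtrack before step (5)

    visit(start)
    return paths
```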
32) Calculate the abnormal root cause score: for s'_0, the abnormal root cause score of each s'_t ∈ R is calculated using the abnormal propagation graph G' and the abnormal propagation paths Path(s'_0, s'_t). The calculation is given by formula (3).
Figure BDA0003958229260000083
where Score(s'_t) denotes the root cause score of abnormal service s'_t, m denotes the number of propagation paths from s'_0 to s'_t, n denotes the number of hops in propagation path o_k, w_{k,i} denotes the dependency weight between node i and node i-1 in the k-th propagation path, and s_i denotes the value of node i, i.e., the reciprocal of its anomaly occurrence time. The design principle of the root cause score formula is that the earlier an anomaly occurs, the more likely it is to be a root cause, and the more propagation paths lead to an anomaly, the more likely it is to be a root cause.
33) Report the abnormal root causes and propagation paths: the abnormal root causes are sorted in descending order of score, and the abnormal propagation paths corresponding to each are reported.
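The reporting step can be sketched as below. The function name and the report layout are hypothetical, and the scores are treated as already computed inputs; the sketch does not attempt to reproduce formula (3).

```python
def report_root_causes(scores, paths):
    """Rank candidate root causes and attach their propagation paths.

    scores -- dict {service: root cause score}, assumed to be precomputed
    paths  -- dict {service: list of propagation paths to that service}
    Returns a list of (service, score, paths) sorted by descending score.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(service, score, paths.get(service, [])) for service, score in ranked]
```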
Fig. 2 is a block diagram of an anomaly root cause positioning system based on a dynamic service dependency graph according to the present invention.
The invention provides a system implementing the abnormal root cause positioning method based on the dynamic service dependency graph. The system takes configuration information, running logs, key performance indicators and an anomaly set as input, and comprises a dynamic service dependency graph building module, an abnormal propagation graph building module and an abnormal root cause positioning module.
The modules are described in detail below.
S1) Dynamic service dependency graph building module
The function of the dynamic service dependency graph building module is to build a dynamic service dependency graph from service configuration information, running logs and key performance indicators. The module contains four sub-modules:
S11) Deployment-time service dependency finder
The deployment-time service dependency finder uses the service configuration information to mine deployment-time dependencies between microservices that are deployed at the same location or depend on the same resource.
S12) Runtime service dependency finder
The runtime service dependency finder mines runtime dependencies between microservices from the service running logs and key performance indicators.
S13) Dynamic service dependency graph builder
The dynamic service dependency graph builder dynamically builds the service dependency graph from the deployment-time and runtime service dependencies.
S14) Service dependency graph memory
The service dependency graph memory stores the service dependency graph in matrix form and provides high-performance queries over it.
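As an illustration of matrix-form storage, the following Python sketch keeps the edge weights in an adjacency matrix so that a weight lookup is a constant-time index operation. The class name DependencyMatrix and its methods are hypothetical, and using 0.0 to mean "no edge" is an assumption of the sketch.

```python
class DependencyMatrix:
    """Adjacency-matrix store for a weighted directed service graph."""

    def __init__(self, services):
        self.services = list(services)
        self.index = {s: i for i, s in enumerate(self.services)}
        n = len(self.services)
        # ASSUMPTION: a weight of 0.0 encodes the absence of an edge.
        self.weights = [[0.0] * n for _ in range(n)]

    def add_edge(self, src, dst, weight):
        """Record a dependency edge src -> dst with the given weight."""
        self.weights[self.index[src]][self.index[dst]] = weight

    def weight(self, src, dst):
        """Return the weight of src -> dst (0.0 if there is no edge)."""
        return self.weights[self.index[src]][self.index[dst]]

    def successors(self, src):
        """Return the services that src has an outgoing edge to."""
        row = self.weights[self.index[src]]
        return [self.services[j] for j, w in enumerate(row) if w != 0.0]
```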
S2) Abnormal propagation graph building module
The function of the abnormal propagation graph building module is to build an abnormal propagation graph from the service dependency graph and the input anomaly set. The module contains two sub-modules:
S21) Abnormal propagation graph builder
The abnormal propagation graph builder marks the input anomalies on the service dependency graph and extracts the resulting subgraph, thereby constructing the abnormal propagation graph.
S22) Abnormal propagation graph memory
The abnormal propagation graph memory stores the abnormal propagation graph in matrix form and provides high-performance queries over it.
S3) Abnormal root cause positioning module
The function of the abnormal root cause positioning module is to find all possible abnormal root cause nodes on the abnormal propagation graph, construct their abnormal propagation paths, calculate root cause scores and generate an abnormal root cause report. The module is divided into three sub-modules:
S31) Abnormal propagation path builder
The abnormal propagation path builder uses the abnormal propagation graph to find all possible abnormal root cause nodes and builds their abnormal propagation paths.
S32) Abnormal root cause calculator
The abnormal root cause calculator uses the abnormal propagation paths to calculate the root cause score of each anomaly from the number of abnormal propagation paths, the weights on the paths and the time at which the anomaly occurred.
S33) Abnormal root cause reporter
The abnormal root cause reporter sorts the abnormal root causes in descending order of score and reports the abnormal propagation paths corresponding to each.
It is noted that the disclosed embodiments are intended to aid in further understanding of the invention, but those skilled in the art will appreciate that: various substitutions and modifications are possible without departing from the spirit and scope of the invention and appended claims. Therefore, the invention should not be limited to the embodiments disclosed, but the scope of the invention is defined by the appended claims.

Claims (7)

1. An abnormal root cause positioning method based on a dynamic service dependency graph, characterized by comprising constructing a dynamic service dependency graph, constructing an abnormal propagation graph, and positioning the abnormal root cause; the specific steps are as follows:
1) Constructing the dynamic service dependency graph, specifically executing the following steps:
11) Extracting deployment-time service dependencies;
12) Extracting runtime service dependencies;
13) Building the dynamic service dependency graph G: step 11) and step 12) are executed each time the abnormal root cause positioning task is executed, the deployment-time service dependencies and the runtime service dependencies are acquired dynamically, and a service dependency graph G = <V, E> is constructed; wherein G is a directed graph, V represents the services and is the node set of G, and E represents the dependencies and is the edge set of G;
2) Constructing the abnormal propagation graph, specifically executing the following step:
constructing the abnormal propagation graph: the input anomaly set A = {a_1, a_2, ..., a_n} is marked on the service dependency graph G, and the service nodes marked as abnormal are taken out to form a subgraph G' of G, with the edges among those nodes retained, wherein the value of each node is the reciprocal of its anomaly occurrence time;
3) Positioning the abnormal root cause, specifically executing the following steps:
31) Constructing the abnormal propagation paths: starting from an abnormal service s'_0 selected at random on the abnormal propagation graph G', finding the candidate root cause nodes and their abnormal propagation paths by depth-first search;
32) Calculating the abnormal root cause score: for s'_0, calculating the abnormal root cause score of each s'_t ∈ R using the abnormal propagation graph G' and the abnormal propagation paths Path(s'_0, s'_t);
33) Reporting the abnormal root causes and propagation paths: sorting the abnormal root causes in descending order of score, and reporting the abnormal propagation paths corresponding to each.
2. The abnormal root cause positioning method based on the dynamic service dependency graph according to claim 1, wherein in step 11) an attribute set K = (k_1, k_2, ..., k_n) of the configuration information to be collected is defined, so that each collected configuration item c_i = (k_i, v_i) must satisfy k_i ∈ K, where k_i represents the attribute of the configuration information and v_i represents its value; the configuration set of a service s_i is denoted C_i = (c_1, c_2, ..., c_n); if the configuration sets of service s_i and service s_j intersect, i.e., C_i ∩ C_j ≠ ∅, then there is a deployment-time dependency between service s_i and service s_j.
3. The abnormal root cause positioning method based on the dynamic service dependency graph according to claim 1, wherein step 12) employs the PC algorithm, which judges the causal relationships between variables through conditional independence tests and then determines the direction of each causal relationship using d-separation conditions; the conditional independence test used is the G² conditional cross-entropy measure, shown in formula (1), which follows a χ² distribution with D degrees of freedom, where D is given by formula (2):
Figure FDA0003958229250000011
D = (N_X - 1)(N_Y - 1) ∏_{Z'∈Z} N_{Z'}    formula (2)
wherein,
Figure FDA0003958229250000021
Z is the set of variables Z', and m is the number of samples.
4. The abnormal root cause positioning method based on the dynamic service dependency graph according to claim 1, wherein in step 13) the construction rules are as follows: if there is a runtime dependency between service s_i and service s_j, a directed edge between s_i and s_j is added, whose weight is the G² value between s_i and s_j; if there is a deployment-time dependency between service s_i and service s_j, an edge from s_i to s_j and an edge from s_j to s_i are added, the weight of each being the mean of the runtime dependency edge weights.
5. The abnormal root cause positioning method based on the dynamic service dependency graph according to claim 1, wherein in step 2), a_i = (s_i, t_i) denotes the i-th anomaly, s_i denotes the service on which anomaly a_i occurs, and t_i denotes the time at which anomaly a_i occurs; the anomaly occurrence times are normalized: the time of the first anomaly is set to 1, and the occurrence times of the remaining anomalies, taken in order of occurrence, increase from 1.
6. The abnormal root cause positioning method based on the dynamic service dependency graph according to claim 1, wherein the calculation in step 32) is given by formula (3):
Figure FDA0003958229250000022
wherein Score(s'_t) denotes the root cause score of abnormal service s'_t, m denotes the number of propagation paths from s'_0 to s'_t, n denotes the number of hops in propagation path o_k, w_{k,i} denotes the dependency weight between node i and node i-1 in the k-th propagation path, and s_i denotes the value of node i, i.e., the reciprocal of its anomaly occurrence time.
7. An abnormal root cause positioning system based on a dynamic service dependency graph, characterized by comprising a dynamic service dependency graph building module, an abnormal propagation graph building module and an abnormal root cause positioning module, wherein:
the dynamic service dependency graph building module comprises a deployment-time service dependency finder, a runtime service dependency finder, a dynamic service dependency graph builder and a service dependency graph memory; the deployment-time service dependency finder uses the service configuration information to find the deployment-time dependencies between microservices; the runtime service dependency finder uses the service running logs and key performance indicators to find the runtime dependencies between microservices; the dynamic service dependency graph builder uses the deployment-time and runtime service dependencies to dynamically construct the service dependency graph; the service dependency graph memory is used for storing the service dependency graph;
the abnormal propagation graph building module comprises an abnormal propagation graph builder and an abnormal propagation graph memory, wherein the abnormal propagation graph builder constructs the abnormal propagation graph using the service dependency graph and the input anomaly set; the abnormal propagation graph memory is used for storing the abnormal propagation graph;
the abnormal root cause positioning module comprises an abnormal propagation path builder, an abnormal root cause calculator and an abnormal root cause reporter, wherein the abnormal propagation path builder uses the abnormal propagation graph to find all possible abnormal root cause nodes and build their abnormal propagation paths; the abnormal root cause calculator uses the abnormal propagation paths and calculates the abnormal root cause score from the number of paths, the weights on the path edges and the anomaly occurrence time; and the abnormal root cause reporter sorts the abnormal root causes in descending order of score and reports the abnormal propagation paths corresponding to each.
CN202211470197.6A 2022-11-23 2022-11-23 Abnormal root cause positioning method and system based on dynamic service dependency graph Active CN115756929B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211470197.6A CN115756929B (en) 2022-11-23 2022-11-23 Abnormal root cause positioning method and system based on dynamic service dependency graph

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211470197.6A CN115756929B (en) 2022-11-23 2022-11-23 Abnormal root cause positioning method and system based on dynamic service dependency graph

Publications (2)

Publication Number Publication Date
CN115756929A true CN115756929A (en) 2023-03-07
CN115756929B CN115756929B (en) 2023-06-02

Family

ID=85335430

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211470197.6A Active CN115756929B (en) 2022-11-23 2022-11-23 Abnormal root cause positioning method and system based on dynamic service dependency graph

Country Status (1)

Country Link
CN (1) CN115756929B (en)



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant