CN112698975A

CN112698975A - Fault root cause positioning method and system of micro-service architecture information system

Info

Publication number: CN112698975A
Application number: CN202011468424.2A
Authority: CN
Inventors: 王平; 潘宜城; 马萌
Original assignee: Peking University
Current assignee: Peking University
Priority date: 2020-12-14
Filing date: 2020-12-14
Publication date: 2021-04-23
Anticipated expiration: 2040-12-14
Also published as: CN112698975B

Abstract

The invention discloses a fault root cause positioning method and system for a micro-service architecture information system. The system comprises an index data collection module, an abnormality detection module, a micro-service dependency graph construction module and a reverse tracking module. By establishing a dynamic association analysis method among micro-services and designing a root cause positioning algorithm based on a fault propagation chain model, the invention identifies the propagation process of the related faults while locating the fault root cause service, which improves the interpretability of fault positioning and diagnosis. Applied to a micro-service architecture information system, the invention improves the accuracy of dynamic association modeling. Because the method is driven by micro-service performance index data, the fault diagnosis tool is also more convenient to use and takes less time and effort to deploy.

Description

Fault root cause positioning method and system of micro-service architecture information system

Technical Field

The invention belongs to the technical field of information, relates to a fault diagnosis technology of an information system, and particularly relates to a fault root cause positioning method and system of a micro-service architecture information system.

Background

Fault diagnosis for existing micro-service architecture information systems mainly relies on constructing a dependency graph of the micro-services. Related work includes ADD [1], Orion [2], MonitorRank [3], Sieve [4], Microscope [5] and CloudRanger [6]. Orion [2] constructs correlations by analyzing the network traffic delay distributions among services, and uses them to diagnose failures of the services and instances of the system. MonitorRank [3] and Sieve [4] both use service call records and performance index data; the former diagnoses service failures with correlation coefficients and a second-order random walk, and the latter uses the Granger causal test [7]. ADD [1] analyzes service associations through active perturbation and regression analysis. Microscope [5] analyzes network traffic data, constructs associations with the PC algorithm [8], and performs fault diagnosis by depth-first search. Similarly, CloudRanger [6] extracts associations from the performance index data of the services with the PC algorithm [8] and locates the fault root cause with a second-order random walk.

The existing technology adopts the dependency graph method and can only generate static service dependency relations. Modern micro-service architecture information systems often use technologies such as load balancing and automatic scaling, so the dependency relations among services change dynamically, and this dynamic property is also reflected in the fault propagation process. Because the existing methods assume static service dependencies and do not consider their dynamic property, they cannot detect the dynamic propagation process of a fault in a modern micro-service system. Meanwhile, the existing micro-service fault root cause positioning algorithms can only locate the fault root cause service and cannot find the specific propagation process of the fault in the micro-service system, so their interpretability is insufficient.

Reference documents:

[1] Brown, G. Kar, and A. Keller, "An active approach to characterizing dynamic dependencies for problem determination in a distributed environment," in 2001 IEEE/IFIP International Symposium on Integrated Network Management Proceedings. Integrated Network Management VII. Integrated Management Strategies for the New Millennium (Cat. No. 01EX470), 2001, pp. 377-390: IEEE.

[2] X. Chen, M. Zhang, Z. M. Mao, and P. Bahl, "Automating Network Application Dependency Discovery: Experiences, Limitations, and New Solutions," in OSDI, 2008, vol. 8, pp. 117-130.

[3] M. Kim, R. Sumbaly, and S. Shah, "Root cause detection in a service-oriented architecture," ACM SIGMETRICS Performance Evaluation Review, vol. 41, no. 1, pp. 93-104, 2013.

[4] J. Thalheim, A. Rodrigues, I. E. Akkus, P. Bhatotia, R. Chen, B. Viswanath, L. Jiao, C. Fetzer, "Sieve: actionable insights from monitored metrics in distributed systems," in Proceedings of the 18th ACM/IFIP/USENIX Middleware Conference, 2017: ACM, pp. 14-27.

[5] J. Lin, P. Chen, and Z. Zheng, "Microscope: Pinpoint Performance Issues with Causal Graphs in Micro-service Environments," in International Conference on Service-Oriented Computing, 2018: pp. 3-20.

[6] Wang, Ping, et al. "Cloudranger: Root cause identification for cloud native systems." 2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID). IEEE, 2018.

[7] C. W. Granger, "Investigating causal relations by econometric models and cross-spectral methods," Econometrica: Journal of the Econometric Society, pp. 424-438, 1969.

[8] P. Spirtes, C. N. Glymour, and R. Scheines, "Causation, prediction, and search", MIT press, 2000.

Disclosure of Invention

To overcome the defects of the prior art, the invention provides a fault root cause positioning method and system for a micro-service architecture information system. The method establishes a new dynamic association analysis method between micro-services, which solves the problem that existing fault diagnosis techniques for micro-service architecture information systems can only model static service dependency relations. It also designs a root cause positioning algorithm based on a fault propagation chain model, which locates the fault root cause service and at the same time provides the specific propagation process of the related faults, improving the interpretability of fault diagnosis.

The method of the invention can run in an information system that adopts a micro-service architecture. The system consists of one or more Linux servers that are connected in a network through a router and managed together as a Docker Swarm cluster. Different micro-services are deployed through Docker. The deployed micro-services are HTTP services implemented in languages such as Java, Python or Go, and can be accessed over HTTP. The micro-services communicate with each other through HTTP APIs or message queues. The information system is provided with an index collection tool that acquires the performance indexes of each micro-service, such as request delay. The performance index data is input into the method of the invention for fault root cause positioning, which finds the root cause micro-service that causes the abnormality of the front-end micro-service. The host running that micro-service can then be located from the deployment information of the micro-service, and checking the state of the host (CPU occupancy, memory occupancy and disk read-write condition) shows whether the abnormality of the micro-service is caused by a hardware-level fault.

To address the problem that the service dependency relationship in existing fault diagnosis of micro-service architecture information systems can only be static, the invention provides a dynamic micro-service association analysis method based on the Granger causal test and a sliding window, which mines the dynamic dependency relations between services from the index data of the micro-services. On this basis, the invention designs a micro-service fault root cause positioning algorithm based on the fault propagation chain, which detects the root cause of a fault of the micro-service architecture information system and generates an explanatory fault propagation chain.

For convenience, the following term definitions are used in the description of the present disclosure:

Table 1. Definition of terms

The Granger causal test is a statistical method for detecting whether a causal association exists between two time series. Its calculation is illustrated by the following example. Suppose that, for two nodes v_x, v_y in the micro-service node set V, the index sequences collected within the abnormal interval are denoted X and Y. Two linear regression models M_self and M_full are constructed:

$$M_{self}: \; Y_t = a_0 + \sum_{k=1}^{lag} a_k Y_{t-k} + \varepsilon_t$$

$$M_{full}: \; Y_t = b_0 + \sum_{k=1}^{lag} b_k Y_{t-k} + \sum_{k=1}^{lag} c_k X_{t-k} + \eta_t$$

The independent variables of M_self are Y_t-1, …, Y_t-lag and its dependent variable is Y_t; the independent variables of M_full are Y_t-1, …, Y_t-lag, X_t-1, …, X_t-lag and its dependent variable is Y_t. Here a_k, b_k and c_k are the regression coefficients and ε_t, η_t are the residual terms. The difference between the two models is whether the index sequence of micro-service node v_x is added as an independent variable of the regression model. Both models are fitted to the index sequences X and Y by least squares, and the sums of squared errors of the fitted models are recorded as SSE_self and SSE_full. If there is no causal association between the time series X and Y, it can be shown statistically that

$$F = \frac{(SSE_{self} - SSE_{full}) / (d_{full} - d_{self})}{SSE_{full} / (T - d_{full} - 1)}$$

obeys an F distribution with parameters (d_full - d_self, T - d_full - 1), where d_self and d_full are the numbers of independent variables of M_self and M_full and T is the number of samples. The association can therefore be judged by a hypothesis test on the F distribution. The null hypothesis is that there is no causal association, and the p-value p is calculated from the F distribution. When p is less than the significance level α, the null hypothesis is rejected, that is, an association v_x→v_y exists between micro-services v_x and v_y and the fault propagates from v_x to v_y; otherwise there is no fault association between v_x and v_y.
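The F statistic of the Granger causal test described above can be sketched in plain Python as follows. This is a minimal illustration rather than the implementation of the invention: the helper names `least_squares_sse` and `granger_f_statistic` are hypothetical, the models are assumed to include an intercept, and T is assumed to be the number of usable samples after the first `lag` points are dropped.

```python
import random


def least_squares_sse(rows, targets):
    """Fit ordinary least squares via the normal equations and return the SSE."""
    n_cols = len(rows[0])
    # Augmented normal-equation matrix [A^T A | A^T y].
    aug = [[sum(r[i] * r[j] for r in rows) for j in range(n_cols)]
           + [sum(r[i] * y for r, y in zip(rows, targets))]
           for i in range(n_cols)]
    # Gauss-Jordan elimination with partial pivoting (assumes a non-singular system).
    for col in range(n_cols):
        pivot = max(range(col, n_cols), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n_cols):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    coef = [aug[i][n_cols] / aug[i][i] for i in range(n_cols)]
    return sum((y - sum(c * v for c, v in zip(coef, r))) ** 2
               for r, y in zip(rows, targets))


def granger_f_statistic(x, y, lag):
    """Return (F, d1, d2) for the null hypothesis 'x does not Granger-cause y'."""
    targets = y[lag:]
    self_rows, full_rows = [], []
    for t in range(lag, len(y)):
        y_lags = [y[t - k] for k in range(1, lag + 1)]
        x_lags = [x[t - k] for k in range(1, lag + 1)]
        self_rows.append([1.0] + y_lags)           # M_self: intercept + lags of Y
        full_rows.append([1.0] + y_lags + x_lags)  # M_full: additionally lags of X
    sse_self = least_squares_sse(self_rows, targets)
    sse_full = least_squares_sse(full_rows, targets)
    d_self, d_full = lag, 2 * lag                  # numbers of independent variables
    d1, d2 = d_full - d_self, len(targets) - d_full - 1
    f_stat = ((sse_self - sse_full) / d1) / (sse_full / d2)
    return f_stat, d1, d2


# Synthetic example: y follows x with a one-step delay, so x should "cause" y.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(200)]
y = [0.0] + [0.8 * x[t - 1] + 0.1 * rng.gauss(0, 1) for t in range(1, 200)]
f_xy, d1, d2 = granger_f_statistic(x, y, lag=2)
f_yx, _, _ = granger_f_statistic(y, x, lag=2)
```

The p-value compared with α is the upper tail probability of the F(d1, d2) distribution at the computed statistic; if SciPy is available, `scipy.stats.f.sf(f_stat, d1, d2)` gives this value.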

In consideration of the dynamic property of the micro-service association relationship, the invention expands the Granger causal test into a multi-round test on a sliding window to model the dynamic association relationship.

The technical scheme of the invention is as follows:

A fault root cause positioning method of a micro-service architecture information system, characterized in that a dynamic association analysis method among micro-services is established and a root cause positioning algorithm based on a fault propagation chain model is designed, so that the propagation process of the related faults is identified while the fault root cause service is located; this improves the interpretability of fault positioning and diagnosis, and the method can be used in a micro-service architecture information system.

In specific implementation, the invention is applied to a micro-service architecture information system, which consists of one or more Linux servers, is connected in a network through a router, and is managed together by a Docker Swarm cluster. After the micro-service is deployed and operated in the system, the abnormal interval detection algorithm provided by the invention can detect the state of the micro-service in real time. When the micro service is found to have faults, the invention collects the performance index data of the micro service by using a log analysis tool or Prometheus and sends the performance index data to a server running a fault root cause positioning algorithm. The algorithm firstly constructs a service dependency graph of the micro-service architecture information system on the basis of Granger causal test and a sliding window, then restores a possible fault propagation chain through back tracking, and finally positions a fault root cause of the micro-service architecture information system and outputs the fault root cause to a terminal for operation and maintenance personnel to check.

The fault root cause positioning method of the micro-service architecture information system comprises the following steps:

A. Collect micro-service performance index data of the micro-service architecture information system. The performance index data includes request delay time series data.

Two implementation methods are available. The first is log extraction. While a micro-service runs, it sends its request log to the Docker management process for storage. The invention uses a script to extract the delay of each request and averages the request delays within each 1-second interval, which yields the per-second request delay time series of each micro-service. The second is the Prometheus tool. The invention deploys a Prometheus collector in the micro-service system and collects the performance indexes of all micro-services at a fixed sampling interval. These index data are then exported through the Prometheus interface for fault analysis. To facilitate later calculation, the invention normalizes the output micro-service request delay time series.
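The log extraction path of step A can be sketched as follows. The record layout (timestamp in seconds, delay) and the function names are hypothetical, and min-max scaling is only an assumed normalization, since the text does not state which normalization is used.

```python
from collections import defaultdict


def per_second_latency(records):
    """Average the request delays that fall into the same 1-second bucket."""
    buckets = defaultdict(list)
    for timestamp, latency in records:
        buckets[int(timestamp)].append(latency)
    return {sec: sum(vals) / len(vals) for sec, vals in sorted(buckets.items())}


def min_max_normalize(values):
    """Scale values to [0, 1]; a constant series maps to all zeros (assumption)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical parsed log records for one micro-service: (timestamp, delay in ms).
records = [(0.1, 10.0), (0.7, 30.0), (1.2, 40.0), (2.5, 20.0), (2.9, 60.0)]
series = per_second_latency(records)
normalized = min_max_normalize(list(series.values()))
```

Seconds without any request produce no bucket in this sketch; a real collector would have to decide how to fill such gaps before the time series are compared.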

B. And detecting abnormal intervals of the micro-services, and identifying whether the micro-service architecture information system is abnormal or not.

Abnormal interval detection adopts a method based on the standard deviation. The degree of abnormality of each micro-service is measured, and the weighted sum of these values gives the abnormal level of the whole micro-service architecture information system; when the abnormal level exceeds a set threshold, the system is judged to be abnormal. For convenient analysis, the collected index data of micro-service v_i is recorded as a time series M_i(t). The abnormal interval of the micro-services is detected by the following steps, as shown in fig. 1:

B1. Calculate the moving standard deviation σ_i(t) of each micro-service index within a sliding window of length L_w; it indicates the degree of abnormality of the micro-service.

B2. Weight all micro-services according to their importance levels λ_i and calculate the abnormal level of the whole system at time t:

$$S_{ab}(t) = \sum_{i=1}^{N} \lambda_i \sigma_i(t)$$

where S_ab(t) is the abnormal level of the system, σ_i(t) is the abnormal level of micro-service v_i, λ_i is the importance level of micro-service v_i, and N is the number of micro-services.

B3. When the abnormal level S_ab(t) of the whole system exceeds the given threshold θ_ab·N (θ_ab is the threshold parameter for abnormal interval detection and N is the number of micro-services), the micro-service system is determined to be in an abnormal state at that time, and fault diagnosis is required. The importance levels λ_i and θ_ab are parameters that can be set as required; the value range of λ_i is [0, +∞) and that of θ_ab is (0, 1]. Among all time points in the abnormal state, the time with the highest abnormal level is found and recorded as t_e. The abnormal interval is then the time interval [t_e - L_pre : t_e + L_post] of the performance index data, where L_pre and L_post determine the size of the interval and are chosen so that the interval does not exceed the available data. The data in this interval is used by the subsequent algorithm and is referred to as the abnormal interval data of micro-service v_i.
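Steps B1 to B3 can be sketched as follows. The function names are hypothetical, the moving standard deviation is assumed to be taken over a trailing window, and the interval is simply clipped to the available data.

```python
from statistics import pstdev


def moving_std(series, window):
    """B1: standard deviation over the trailing `window` samples at each time step."""
    return [pstdev(series[max(0, t - window + 1): t + 1]) for t in range(len(series))]


def detect_abnormal_interval(metrics, weights, window, theta_ab, l_pre, l_post):
    """Return (t_e, start, end) of the abnormal interval, or None if no time is abnormal."""
    n = len(metrics)
    length = len(metrics[0])
    stds = [moving_std(m, window) for m in metrics]
    # B2: S_ab(t) is the importance-weighted sum of the moving standard deviations.
    s_ab = [sum(w * s[t] for w, s in zip(weights, stds)) for t in range(length)]
    # B3: times at which the system abnormal level exceeds theta_ab * N.
    abnormal = [t for t in range(length) if s_ab[t] > theta_ab * n]
    if not abnormal:
        return None
    t_e = max(abnormal, key=lambda t: s_ab[t])
    # Clip [t_e - L_pre, t_e + L_post] so it never exceeds the available data.
    return t_e, max(0, t_e - l_pre), min(length - 1, t_e + l_post)


# Two normalized delay series; only the first one shows a disturbance.
metrics = [
    [0.1] * 10 + [0.3, 1.0, 0.2] + [0.1] * 7,
    [0.2] * 20,
]
result = detect_abnormal_interval(metrics, weights=[1.0, 1.0], window=3,
                                  theta_ab=0.1, l_pre=5, l_post=4)
quiet = detect_abnormal_interval([[0.2] * 20, [0.3] * 20], [1.0, 1.0], 3, 0.1, 5, 4)
```

In this example the abnormal level peaks at t_e = 13, so the abnormal interval covers samples 8 to 17, while the two constant series in the second call never exceed the threshold.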

C. And constructing a micro service dependency graph of the micro service architecture information system. Comprising the following steps, as shown in fig. 2:

C1. Perform time series dynamic association analysis to obtain the dynamic association relations between the micro-services.

The dynamic association between each pair of micro-services is analyzed first. Assume the analysis target is the micro-service relation edge v_i→v_j. On the abnormal interval data obtained in step B3, whose time length is T, all possible sub-intervals are enumerated with minimum step length L_b and denoted as sliding windows [s_b, e_b], b = 1, …, n_win. A vector C_ij of length T with all values 0 is then initialized to represent the dynamic association curve of v_i→v_j. The Granger causal test is performed on each sliding window [s_b, e_b]. If the p-value p obtained by the test is less than the significance level α, the association v_i→v_j is considered to exist on this sliding window, and the values of the dynamic association curve over the window, C_ij[s_b:e_b], are increased by 1; otherwise nothing is changed. After all pairs of micro-services and all sliding windows have been processed, the vectors C_ij represent the dynamic association curves between the micro-services and are used for further analysis.
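The construction of one dynamic association curve in step C1 can be sketched as follows. The window enumeration (start and end points on a grid of spacing L_b) is one possible reading of "all possible sub-intervals", and `p_value_test` is a hypothetical callable standing in for the Granger causal test; the stub used in the example is not a statistical test.

```python
def enumerate_windows(length, step):
    """All sub-intervals [s, e) whose end points lie on a grid of spacing `step`."""
    grid = list(range(0, length + 1, step))
    return [(s, e) for i, s in enumerate(grid) for e in grid[i + 1:]]


def dynamic_association_curve(x, y, step, alpha, p_value_test):
    """Count, for each time step, how many significant windows cover it."""
    curve = [0] * len(y)
    for s, e in enumerate_windows(len(y), step):
        # p_value_test returns the p-value of 'x does not cause y' on this window.
        if p_value_test(x[s:e], y[s:e]) < alpha:
            for t in range(s, e):
                curve[t] += 1
    return curve


def stub_test(xs, ys):
    """Placeholder: pretends the association is significant while x is 'active'."""
    return 0.01 if all(v == 1 for v in xs) else 0.5


x = [1, 1, 1, 1, 0, 0, 0, 0]
y = [0, 1, 1, 1, 1, 0, 0, 0]
windows = enumerate_windows(len(y), 2)
curve = dynamic_association_curve(x, y, step=2, alpha=0.05, p_value_test=stub_test)
```

Time steps covered by many significant windows accumulate a high count, so the curve shows when the association v_i→v_j was active, not only whether it exists.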

C2. Set an adaptive threshold, judge whether fault association exists between the micro-services, and generate the micro-service dependency graph of the micro-service system.

After the dynamic association relations between the micro-services are obtained, a thresholding method is used to generate concrete association edges, which gives a qualitative description of whether fault association exists between micro-services. For micro-service v_i, in order to judge whether it is associated with each of the other micro-services, the invention sums every dynamic association curve C_ij, j = 1, …, N, that starts from v_i, where N is the number of micro-services, and records these statistics in a vector h of length N, i.e.:

$$h_j = \sum_t C_{ij}(t)$$

where C_ij(t) is the dynamic association curve from micro-service v_i to v_j. The adaptive threshold of micro-service v_i is then computed as τ_i = θ_e·max(h). For each edge v_i→v_j, j = 1, …, N, if h_j ≥ τ_i, the association strength on this edge is considered large enough, and the edge is added to the finally generated micro-service dependency graph G(V, E, W), where the edge weight W_ij is set to h_j/τ_i.
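The adaptive thresholding of step C2 can be sketched as follows. The nested-list layout of `curves` and the function name are hypothetical; skipping self-loops and skipping a micro-service whose curves are all zero are assumptions made here to avoid a zero threshold.

```python
def build_dependency_graph(curves, theta_e):
    """curves[i][j] is the dynamic association curve C_ij as a list, or None if absent."""
    n = len(curves)
    edges = {}
    for i in range(n):
        # h_j = sum over t of C_ij(t); self-loops are skipped (assumption).
        h = [sum(curves[i][j]) if j != i and curves[i][j] else 0 for j in range(n)]
        if max(h) == 0:
            continue  # no significant window at all, so v_i gets no outgoing edge
        tau = theta_e * max(h)  # adaptive threshold tau_i
        for j in range(n):
            if j != i and h[j] >= tau:
                edges[(i, j)] = h[j] / tau  # edge weight W_ij = h_j / tau_i
    return edges


zero = [0, 0, 0, 0]
curves = [
    [None, [2, 2, 0, 0], [1, 0, 0, 0]],
    [zero, None, [1, 1, 1, 1]],
    [zero, zero, None],
]
edges = build_dependency_graph(curves, theta_e=0.5)
```

Because the threshold is relative to the strongest outgoing curve of each micro-service, the weights of the kept edges lie between 1 and 1/θ_e.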

D. Use a reverse tracking root cause analysis method to obtain the root cause of the abnormality of the front-end micro-service, thereby finding the fault root cause micro-service.

After the micro-service dependency graph of the micro-service architecture information system is acquired, the invention uses a reverse tracking root cause analysis algorithm to score the abnormal degree of each micro-service and sets an abnormal score threshold; a micro-service whose abnormal degree score is higher than the threshold is considered a root cause of the abnormality of the front-end micro-service v_fe. The abnormal degree score of a micro-service combines the path association strength and the correlation coefficient association strength, and is calculated as follows:

D1. Path association strength. The path association strength measures the possibility that a micro-service causes the failure of the front-end micro-service through the dependency topology of the micro-service system, so the invention calculates it with a reverse tracking algorithm that starts from the faulty front-end micro-service v_fe. The specific steps of the reverse tracking algorithm are as follows:

Step 1: With the faulty front-end micro-service v_fe as the end point, perform a reverse breadth-first search on the micro-service dependency graph G(V, E, W) of the micro-service architecture system (V is the set of micro-service nodes, E is the set of association edges between micro-services, and W is the weights of the association edges) to obtain a series of paths that may represent the fault propagation process in the micro-service system, namely the fault propagation chains P_i. To avoid search space explosion and cyclic search, the invention limits each service to appear at most once in each path, and limits the number of generated fault propagation chains to 10000 or another user-selected value.

Step 2: Estimate the existence probability of each fault propagation chain. For a fault propagation chain P_i = {i_1→…→i_n}, the invention uses the harmonic mean to average the weights of the edges on the chain, that is:

$$Pr(P_i) = \frac{n-1}{\sum_{k=1}^{n-1} 1 / W_{i_k i_{k+1}}}$$

where W_{i_k i_{k+1}} is the weight of the edge i_k→i_{k+1} on the fault propagation chain, and n is the length of the fault propagation chain, counted in micro-services, so that the chain has n - 1 edges.

Step 3: Sort all fault propagation chains P_i by existence probability in descending order, and select the top k chains P_r1, P_r2, …, P_rk. Take the n_lead leading micro-services of each selected chain, count the number of occurrences of each leading micro-service, and divide it by k·n_lead to obtain the path association strength S_path(v_i) of that leading micro-service.
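Steps 1 to 3 of the path association strength can be sketched as follows. The function names and the example graph are hypothetical. The sketch assumes that every simple path ending at v_fe counts as a candidate fault propagation chain, and that the leading micro-services of a chain are its first n_lead nodes.

```python
from collections import Counter, deque


def backtrack_chains(edges, v_fe, max_chains=10000):
    """Step 1: enumerate simple paths ending at v_fe by reverse breadth-first search."""
    predecessors = {}
    for (src, dst) in edges:
        predecessors.setdefault(dst, []).append(src)
    chains, queue = [], deque([(v_fe,)])
    while queue and len(chains) < max_chains:
        path = queue.popleft()
        for src in predecessors.get(path[0], []):
            if src in path:
                continue  # each service appears at most once per chain
            extended = (src,) + path
            chains.append(extended)
            queue.append(extended)
            if len(chains) >= max_chains:
                break
    return chains


def harmonic_mean_weight(chain, edges):
    """Step 2: harmonic mean of the edge weights along a chain."""
    weights = [edges[(a, b)] for a, b in zip(chain, chain[1:])]
    return len(weights) / sum(1.0 / w for w in weights)


def path_strength(edges, v_fe, k, n_lead):
    """Step 3: occurrences of each leading service in the top-k chains / (k * n_lead)."""
    chains = backtrack_chains(edges, v_fe)
    ranked = sorted(chains, key=lambda c: harmonic_mean_weight(c, edges), reverse=True)
    counts = Counter()
    for chain in ranked[:k]:
        counts.update(chain[:n_lead])
    return {v: c / (k * n_lead) for v, c in counts.items()}


# Hypothetical dependency graph: (source, destination) -> edge weight.
edges = {
    ("db", "api"): 2.0, ("db", "auth"): 2.0,
    ("api", "web"): 1.5, ("auth", "web"): 1.5,
    ("cache", "api"): 1.0,
}
chains = backtrack_chains(edges, "web")
strengths = path_strength(edges, "web", k=2, n_lead=1)
```

In this example the two most probable chains both start at `db`, so `db` receives the full path association strength of 1.0.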

D2. Correlation coefficient association strength. This strength is obtained by calculating the absolute value of the correlation coefficient between each micro-service and the faulty front-end micro-service v_fe, i.e.:

$$S_{corr}(v_i) = \left| corr\left(\tilde{M}_i, \tilde{M}_{fe}\right) \right|$$

where $\tilde{M}_i$ represents the abnormal interval index data of micro-service v_i, and $\tilde{M}_{fe}$ represents the abnormal interval index data of the front-end micro-service v_fe.

D3. Calculate the micro-service abnormal degree score. The abnormal degree score of each micro-service is the weighted sum of its path association strength and its correlation coefficient association strength, i.e., c_path·S_path(v_i) + c_corr·S_corr(v_i), with weighting coefficients c_path and c_corr. Finally, the invention sorts the micro-services by abnormal degree score in descending order and generates a micro-service list v_γ1, v_γ2, …, v_γN as the candidate fault root cause services, thereby achieving fault root cause positioning. The candidate fault root cause services and the generated fault propagation chains can be used to assist fault diagnosis. The deployment information of a micro-service can further be used to locate the host that runs the faulty micro-service, and checking the state of the host (CPU occupancy, memory occupancy and disk read-write condition) shows whether a hardware-level fault causes the abnormality of the micro-service.
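Steps D2 and D3 can be sketched as follows. The function names and example data are hypothetical, the correlation coefficient is assumed to be the Pearson coefficient, and the default coefficients c_path = c_corr = 0.5 are illustrative values only.

```python
from math import sqrt


def abs_pearson(a, b):
    """D2: absolute Pearson correlation coefficient; 0.0 if either series is constant."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return abs(cov / sqrt(var_a * var_b))


def rank_root_causes(data, v_fe, s_path, c_path=0.5, c_corr=0.5):
    """D3: sort services by c_path * S_path + c_corr * S_corr, highest score first."""
    scores = {}
    for name, series in data.items():
        s_corr = abs_pearson(series, data[v_fe])
        scores[name] = c_path * s_path.get(name, 0.0) + c_corr * s_corr
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Abnormal interval data per micro-service and path strengths from step D1.
data = {"web": [1, 2, 3, 4], "db": [2, 4, 6, 8], "cache": [1, 0, 1, 0]}
ranking = rank_root_causes(data, v_fe="web", s_path={"db": 1.0})
```

Here `db` ranks first because it has both the highest path association strength and a perfect correlation with the front-end series.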

The fault propagation chain associated with the fault root cause micro-service helps to judge the problem of the micro-service architecture information system more accurately. If the micro-services on the fault propagation chain are all deployed on the same host or in the same network, the fault may be a hardware problem of that host or a problem of the network equipment (switch, router), so operation and maintenance personnel can check the hardware.
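The host check described above can be sketched as follows. The deployment mapping (service name to the hosts that run it) and the function name are hypothetical; in a Docker Swarm cluster this information would come from the cluster's own deployment records.

```python
def shared_hosts(chain, deployment):
    """Hosts that run every micro-service on the chain (empty set if there is none)."""
    host_sets = [set(deployment.get(service, ())) for service in chain]
    return set.intersection(*host_sets) if host_sets else set()


deployment = {
    "db": ["node-1"],
    "api": ["node-1", "node-2"],
    "web": ["node-1"],
    "cache": ["node-3"],
}
suspect = shared_hosts(("db", "api", "web"), deployment)
unrelated = shared_hosts(("cache", "api", "web"), deployment)
```

A non-empty result only suggests where to start checking the hardware; it does not by itself show that the host is at fault.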

The invention also provides a fault root cause positioning system of the micro-service architecture information system, which comprises the following modules: the system comprises an index data collection module, an abnormality detection module, a micro-service dependency graph construction module and a reverse tracking module. The system architecture is shown in fig. 3, and the functions of the modules are as follows:

Index data collection module: this module interfaces with the micro-service architecture information system that needs fault diagnosis, obtains the request delay index of each micro-service by analyzing the micro-service call logs or by using the Prometheus monitoring tool, and forms the corresponding time series data for subsequent analysis.

Anomaly detection module: this module analyzes the request delay index data of the micro-services and detects whether the micro-service architecture information system is in an abnormal state. When an abnormal state is detected, the module collects the micro-service index data that reflects the abnormality, forms the abnormal interval data, and provides it to the subsequent modules for further fault analysis.

Micro-service dependency graph construction module: this module starts its analysis when the micro-service architecture information system is abnormal. By performing dynamic association analysis on the micro-services pair by pair, it constructs the micro-service dependency graph of the system at the time of the fault and restores the possible fault propagation modes, so that the subsequent module can perform fine-grained fault root cause positioning and fault chain extraction.

Reverse tracking module: this module starts its analysis when the micro-service architecture information system is abnormal. Taking the abnormal front-end micro-service as the entry point, it performs a reverse path search on the generated micro-service dependency graph, estimates the probability of each path, and forms the highly probable fault propagation chains. It then estimates the root cause probability of each micro-service by also using the correlation coefficients of the micro-service index data, and generates the final fault root cause list.

Compared with the prior art, the invention has the beneficial effects that:

The invention provides a fault root cause positioning method and system for a micro-service architecture information system. The dynamic association analysis method improves the accuracy of dynamic association modeling in the micro-service architecture information system. Because the method is driven by micro-service performance index data, the fault diagnosis tool is simpler to use and saves deployment time and effort. The fault propagation chains generated during fault diagnosis improve both the interpretability and the accuracy of fault diagnosis of the micro-service architecture information system.

Drawings

FIG. 1 is a schematic flow chart of an abnormal interval detection algorithm in the present invention;

where σ_i represents the moving standard deviation of micro-service v_i, λ_i is its importance weight, N is the number of micro-services of the micro-service architecture information system, and θ_ab is the threshold parameter for abnormal interval detection.

Fig. 2 is a schematic flow chart of the construction of the microservice dependency graph in the present invention.

FIG. 3 is a schematic diagram of a fault root cause location system of the present invention.

Fig. 4 is a graph of 4 microservice request latency data after normalization in an embodiment of the invention.

FIG. 5 is a graph of system anomaly score in an embodiment of the present invention;

where L_pre and L_post together determine the size of the abnormal interval, N is the number of micro-services in the embodiment, and θ_ab is the threshold parameter of the abnormal interval detection algorithm.

FIG. 6 is a diagram of the micro-service architecture information system dependency in an embodiment of the present invention, wherein the numbered circles are micro-services.

FIG. 7 shows a fault propagation chain and its dynamic association curves in an embodiment of the present invention, where the number i and API No. i both represent micro-service v_i, and the values of the dynamic association curves are normalized.

Detailed Description

The invention will be further described by way of examples, without in any way limiting the scope of the invention, with reference to the accompanying drawings.

The invention provides a fault root cause positioning method and a fault root cause positioning system of a micro-service architecture information system.

In specific implementation, the invention is applied to a micro-service architecture information system, which consists of one or more Linux servers, is connected in a network through a router, and is managed together by a Docker Swarm cluster. In implementation, a distributed software containing a plurality of micro-services is deployed in the micro-service architecture information system, and request log data of the micro-services is collected. The invention establishes a dynamic micro-service correlation analysis method based on Granger causal test and a sliding window, mines the dynamic dependency relationship between micro-services from index data of the micro-services, detects the root cause of the micro-service architecture information system fault through a micro-service fault root cause positioning algorithm based on a fault propagation chain, and generates an explanatory fault propagation chain.

The method comprises the following steps:

A. Acquire micro-service performance index data of the micro-service architecture information system; the performance index data includes request delay time series data; in the embodiment, the performance index data is extracted from the micro-service request log data.

B. Detect the micro-service abnormal interval data and identify whether the micro-service architecture information system is abnormal; the abnormal interval data of each micro-service v_i is recorded;

C. Constructing a micro-service dependency graph of a micro-service architecture information system; the method comprises the following steps:

C1. Analyze the time series dynamic association to obtain the dynamic association relations between the micro-services;

let v_i and v_j be two micro-services; for the micro-service relation edge v_i→v_j, take the abnormal interval data, whose time length is T, and enumerate all possible sub-intervals with minimum step length L_b, denoted as sliding windows [s_b, e_b], b = 1, …, n_win; initialize a vector C_ij of length T and value 0 to represent the dynamic association curve of v_i→v_j;

perform the Granger causal test on each sliding window [s_b, e_b]; if the p-value p obtained by the test is less than the significance level α, the association v_i→v_j exists on this sliding window, and the values of the dynamic association curve over the window, C_ij[s_b:e_b], are increased by 1; otherwise nothing is changed;

when all pairs of micro-services and all sliding windows have been processed, the obtained vectors C_ij represent the dynamic association curves between the micro-services;

C2. Set an adaptive threshold for judging whether fault association exists between the micro-services, and generate the micro-service dependency graph of the micro-service system; this comprises the following steps:

C21. Calculate the adaptive threshold;

for micro-service v_i, sum every dynamic association curve C_ij, j = 1, …, N, that starts from v_i, where N is the number of micro-services, and record the statistics in a vector h of length N, whose j-th component is calculated as:

$$h_j = \sum_t C_{ij}(t)$$

where C_ij(t) is the dynamic association curve from micro-service v_i to v_j;

then compute the adaptive threshold of micro-service v_i as τ_i = θ_e·max(h);

C22. Generate the association edges among the micro-services by thresholding, and generate the micro-service dependency graph;

for each edge v_i→v_j, j = 1, …, N, if h_j ≥ τ_i, the association strength of the edge is large enough, and the edge is added to the finally generated micro-service dependency graph G(V, E, W), where V is the set of micro-service nodes, E is the set of association edges between micro-services, and W is the weights of the association edges, with W_ij set to h_j/τ_i;

D. Obtaining root cause causing the abnormality of the front-end micro-service by adopting a reverse tracking root cause analysis method, and finding out fault root cause service;

scoring the abnormal degree of each micro-service by adopting a back-tracking root cause analysis algorithm, wherein the scoring comprises path correlation strength and correlation coefficient correlation strength; setting an abnormal grade threshold value, and considering that the micro-service with the abnormal degree grade higher than the threshold value is the micro-service v causing the front end_feA root cause of the abnormality; the score of the micro-service abnormal degree is calculated according to the following method:

D1. Calculating the path association strength with a back-tracking algorithm that starts from the failed front-end microservice v_fe;

The path association strength measures how likely a microservice is to cause the failure of the front-end microservice through the dependency topology of the microservice system; it is calculated as follows:

Step 1: With the failed front-end microservice v_fe as the end point, a reverse breadth-first search is performed on the microservice dependency graph G(V, E, W) of the microservice architecture system to obtain a series of paths, each of which may represent how the fault propagated through the microservice system; these paths are the fault propagation chains P_i, P_i = {i₁→…→i_n};

Step 2: estimating the existence probability of each fault propagation chain;

The existence probability is the probability that the fault propagated along this fault propagation chain. Because the path actually taken by the fault cannot be known exactly, the way the fault propagated can only be estimated approximately. The weight of an edge in the service dependency graph represents the probability that the fault propagates along that edge, i.e., a fault in one microservice affects its adjacent microservices with a certain probability. The invention takes the harmonic mean of the weights of all edges on a fault propagation chain to obtain an estimate of the overall probability.

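The harmonic mean of the weights of the edges on fault propagation chain P_i is expressed as:

p(P_i) = N / ∑_{k=1}^{N} (1 / w_{i_k i_{k+1}})

where w_{i_k i_{k+1}} is the weight of the edge i_k→i_{k+1} on the fault propagation chain, and N is the length of the fault propagation chain, i.e., the number of edges on it;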
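As a non-limiting illustration, the harmonic-mean estimate for a single chain might be sketched in Python as follows. The function name and the dictionary layout `weights[(a, b)]` for the weight of edge a→b are assumptions made for this sketch.

```python
from statistics import harmonic_mean
from typing import Dict, List, Tuple


def chain_probability(chain: List[int], weights: Dict[Tuple[int, int], float]) -> float:
    """Harmonic mean of the weights of the consecutive edges along `chain`."""
    edge_weights = [weights[(a, b)] for a, b in zip(chain, chain[1:])]
    return harmonic_mean(edge_weights)
```

For a chain with two edges of weights 1.0 and 2.0, the estimate is 2 / (1/1.0 + 1/2.0) = 4/3. The harmonic mean is dominated by the weakest edge, so one weak link lowers the estimate of the whole chain more than an arithmetic mean would.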

Step 3: Sorting all fault propagation chains P_i by existence probability in descending order and selecting the top k chains P_r1, P_r2, …, P_rk; taking the n_lead leading microservices of each selected chain, counting the number of occurrences of each leading microservice, and dividing the count by k·n_lead; the result is the path association strength of the leading microservice, denoted S_path(v_i);
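As a non-limiting illustration, steps 1 to 3 might be sketched together in Python as follows. Several points are assumptions made for this sketch: chains are stored in propagation order and end at v_fe; only paths that cannot be extended further (or that reach the hypothetical cap `max_len`) are kept as chains; and the "leading" microservices of a chain are taken to be its first n_lead nodes, excluding v_fe itself.

```python
from collections import Counter, deque
from statistics import harmonic_mean
from typing import Dict, List, Tuple

Edge = Tuple[int, int]


def reverse_chains(weights: Dict[Edge, float], v_fe: int,
                   max_len: int = 11, max_chains: int = 10000) -> List[List[int]]:
    """Reverse breadth-first search from v_fe; returns chains that end at v_fe."""
    preds: Dict[int, List[int]] = {}
    for src, dst in weights:
        preds.setdefault(dst, []).append(src)
    chains: List[List[int]] = []
    queue = deque([[v_fe]])
    while queue and len(chains) < max_chains:
        path = queue.popleft()
        # Extend backwards through predecessors, avoiding cycles.
        nxt = [p for p in preds.get(path[0], []) if p not in path]
        if nxt and len(path) < max_len:
            queue.extend([p] + path for p in nxt)
        elif len(path) > 1:
            chains.append(path)
    return chains


def path_strength(chains: List[List[int]], weights: Dict[Edge, float],
                  k: int, n_lead: int) -> Dict[int, float]:
    """S_path: occurrences among the leading nodes of the top-k chains / (k * n_lead)."""
    def prob(chain: List[int]) -> float:
        return harmonic_mean([weights[e] for e in zip(chain, chain[1:])])

    top = sorted(chains, key=prob, reverse=True)[:k]
    counts = Counter(v for chain in top for v in chain[:-1][:n_lead])
    return {v: c / (k * n_lead) for v, c in counts.items()}
```

Microservices that never appear among the leading nodes of the selected chains are simply absent from the result, which corresponds to a path association strength of 0.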

D2. Calculating correlation coefficient correlation strength;

The correlation-coefficient association strength is obtained by calculating the absolute value of the correlation coefficient between each microservice and the failed front-end microservice v_fe, i.e.:

S_corr(v_i) = |corr(X_i, X_fe)|

where X_i represents the abnormal-interval index data of microservice v_i, and X_fe represents the abnormal-interval index data of the front-end microservice v_fe;
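As a non-limiting illustration, the correlation-coefficient association strength might be sketched in Python as follows. The text does not state which correlation coefficient is used; the Pearson coefficient is assumed here, and the function name and the handling of constant series are likewise choices made for this sketch.

```python
from math import sqrt
from typing import Sequence


def corr_strength(x: Sequence[float], y: Sequence[float]) -> float:
    """Absolute Pearson correlation between two equally long series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant series carries no correlation evidence
    return abs(cov / sqrt(var_x * var_y))
```

Taking the absolute value means that a microservice whose delay moves opposite to the front-end delay is treated as just as strongly related as one that moves with it.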

D3. Calculating the microservice abnormality score: the path association strength and the correlation-coefficient association strength are combined as a weighted sum, namely c_path·S_path(v_i) + c_corr·S_corr(v_i);

The microservices are then ranked by abnormality score in descending order, and the resulting list v_γ1, v_γ2, …, v_γN gives the candidate fault root cause services, thereby achieving fault root cause localization.
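As a non-limiting illustration, the scoring and ranking of step D3 might be sketched in Python as follows. The function name and the dictionary inputs are assumptions made for this sketch; a microservice missing from either input is treated as having strength 0 for that component.

```python
from typing import Dict, List, Tuple


def rank_root_causes(s_path: Dict[int, float], s_corr: Dict[int, float],
                     c_path: float = 1.0, c_corr: float = 1.0) -> List[Tuple[int, float]]:
    """Return (microservice, score) pairs sorted by descending abnormality score."""
    services = set(s_path) | set(s_corr)
    scores = {
        v: c_path * s_path.get(v, 0.0) + c_corr * s_corr.get(v, 0.0)
        for v in services
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

With hypothetical inputs `s_path = {30: 0.3, 31: 0.2}` and `s_corr = {30: 0.25, 31: 0.3, 6: 0.4}`, the ranking places microservice 30 first, then 31, then 6.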

The invention also provides a system for implementing the fault root cause positioning method of the microservice architecture information system, comprising the following modules: an index data collection module, an anomaly detection module, a microservice dependency graph construction module, and a reverse tracking module; wherein:

the index data collection module interfaces with the microservice architecture information system that requires fault diagnosis, acquires the request delay index of each microservice, and forms the corresponding time series data;

the anomaly detection module analyzes the request delay index data of the microservices and detects whether the microservice architecture information system is in an abnormal state; when an abnormal state is detected, it collects the microservice index data reflecting the anomaly, forms the abnormal interval data, and provides it to the subsequent modules for further fault analysis;

the microservice dependency graph construction module performs its analysis when the microservice architecture information system is abnormal; by performing pairwise dynamic association analysis on the microservices, it constructs the microservice dependency graph of the system at the time of the fault and restores the possible fault propagation patterns, for use by the subsequent module in fine-grained fault root cause positioning and fault chain extraction;

the reverse tracking module analyzes the generated microservice dependency graph, performs a reverse path search with the abnormal front-end microservice as the entry point, estimates the probability of each path to form the most likely fault propagation chains, estimates the root cause probability of each microservice in combination with the correlation coefficients of the microservice index data, and generates the final fault root cause list.

The following shows the process of fault diagnosis of the present invention on a commercial microservice system containing 33 microservices.

According to step A, the invention obtains the request delay information of each microservice in the microservice system from request log data, collecting 7199 seconds of data in total; Fig. 4 shows the normalized request delay data of 4 microservices.
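As a non-limiting illustration, deriving a per-second request delay series from log records and normalizing it might be sketched in Python as follows. The record layout `(timestamp_in_seconds, delay)`, the value 0.0 for seconds without requests, and the use of min-max normalization are assumptions made for this sketch; the text does not specify the normalization method.

```python
from typing import Iterable, List, Tuple


def per_second_delay(records: Iterable[Tuple[float, float]], duration: int) -> List[float]:
    """Average the delay of all requests that fall into each 1-second bucket."""
    sums = [0.0] * duration
    counts = [0] * duration
    for timestamp, delay in records:
        second = int(timestamp)
        if 0 <= second < duration:
            sums[second] += delay
            counts[second] += 1
    # Assumption: a second without any request is reported as 0.0.
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


def min_max_normalize(series: List[float]) -> List[float]:
    """Scale a series linearly into [0, 1]; a constant series maps to all zeros."""
    lo, hi = min(series), max(series)
    if hi == lo:
        return [0.0] * len(series)
    return [(v - lo) / (hi - lo) for v in series]
```

Normalizing each microservice separately keeps a service with large absolute delays from dominating the later anomaly and association analysis.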

According to step B, the invention calculates the abnormality degree of each individual microservice and of the system as a whole, where the parameter L_w is 50 seconds, λ_i is 1.0, and θ_ab is 0.3. Fig. 5 shows the overall abnormality level of the microservice system over the 7199 seconds. The calculation finally selects time point 4653 as the anomaly time of the microservice system, and with L_pre = 0 and L_post = 280 the output fault interval is [4653, 4933].
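As a non-limiting illustration, the anomaly interval detection of step B might be sketched in Python as follows. Two points are assumptions made for this sketch: the system abnormality level is taken to be the importance-weighted sum of the per-microservice moving standard deviations, and the moving window is a trailing window of L_w samples. The function names are hypothetical as well.

```python
from statistics import pstdev
from typing import List, Optional, Sequence, Tuple


def moving_std(series: Sequence[float], window: int) -> List[float]:
    """Standard deviation over a trailing window of at most `window` samples."""
    return [pstdev(series[max(0, t - window + 1): t + 1]) for t in range(len(series))]


def abnormal_interval(series_by_service: List[Sequence[float]], window: int,
                      theta_ab: float, l_pre: int, l_post: int,
                      importance: Optional[List[float]] = None
                      ) -> Optional[Tuple[int, int]]:
    """Return (start, end) around the most abnormal time point, or None."""
    n = len(series_by_service)
    length = len(series_by_service[0])
    lam = importance or [1.0] * n
    sigmas = [moving_std(s, window) for s in series_by_service]
    # Assumed form of the system level: weighted sum of per-service levels.
    level = [sum(lam[i] * sigmas[i][t] for i in range(n)) for t in range(length)]
    abnormal = [t for t in range(length) if level[t] > theta_ab * n]
    if not abnormal:
        return None
    t_e = max(abnormal, key=lambda t: level[t])
    # Clip the interval to the range of available data.
    return max(0, t_e - l_pre), min(length - 1, t_e + l_post)
```

With the values of this example, t_e = 4653, L_pre = 0, and L_post = 280 give the interval [4653 − 0, 4653 + 280] = [4653, 4933].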

According to step C, sliding-window-based dynamic association analysis is performed on the fault interval data produced in step B, and the dependency graph of the microservice system is finally generated. In this example, the parameter L_b is 70, α is 0.1, and θ_e is 0.5. Fig. 6 shows the dependency graph of the microservice system.

According to step D, the invention performs reverse path tracking on the dependency graph of the microservice system, generates a series of candidate fault propagation chains, and finally produces the list of fault root cause services. The search is limited to 10000 fault propagation chains, the parameter k is 50, n_lead is 3, and c_path and c_corr are both 1.0. The top 10 fault propagation chains obtained in this example and their estimated probabilities are shown in Table 2. The resulting fault root cause service list and the corresponding abnormality scores are shown in Table 3.

Fault propagation chain	Estimated existence probability
[14,21,5,17,27]	0.3976
[14,21,5,17,16,20,11,29,2,12,7]	0.3645
[14,21,5,17,16,20,25,29,2,12,7]	0.3645
[14,21,5,17,16,20,25,6,31,30,33]	0.3619
[14,21,5,17,16,20,11,6,31,30,12]	0.3619
[14,21,5,17,16,20,11,6,31,30,33]	0.3619
[14,21,5,17,16,20,25,6,31,30,12]	0.3619
[14,21,5,17,16,20,11,6,30,12,7]	0.3568
[14,21,5,17,16,20,25,6,30,12,7]	0.3568
[14,21,5,17,28,30,12,7,19,29,2]	0.3551

Table 2. Fault propagation chains in the example system

Fault service	Abnormality score
30	0.5618049
31	0.5271997
6	0.4575075
28	0.4018228
33	0.3257960
12	0.3107594
7	0.2501501
29	0.2021703
19	0.1946196
3	0.1944521
27	0.1828436
17	0.0949014
5	0.0850567
2	0.0765803

Table 3. Fault root cause service results in the example system

To demonstrate that the dynamic association curve proposed by the invention can describe fault propagation, Fig. 7 shows the dynamic association curves along one fault propagation chain. For the propagation process v₂₇→v₂₂→v₂₁→v₁₄ of the fault in the microservice architecture information system, it can be seen that the dynamic association curves also reflect the gradual movement of the fault along the time dimension.

The deployment information of the microservices can be used to further locate the host on which the faulty microservice runs; checking the state of that host (CPU utilization, memory utilization, and disk read/write activity) shows whether a hardware-layer fault caused the microservice anomaly. If the hardware running the microservice has no problem, the parameters of the container management platform Docker Swarm used to deploy the microservice are checked next to determine whether a misconfiguration caused the microservice to fail. If neither check reveals a problem, the fault is considered to lie in the software code of the microservice, which then requires further analysis. Taking into account the fault propagation chain associated with the root cause microservice allows the problem of the microservice architecture information system to be judged more accurately. If the microservices on the fault propagation chain are all deployed on one host or in one network, the fault may be a hardware problem of that host or a problem of the network devices (switches, routers); operation and maintenance personnel can then inspect these hardware devices.
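As a non-limiting illustration, the check of whether all microservices on a fault propagation chain share one host might be sketched in Python as follows. The `deployment` mapping from microservice identifier to host name and the host names used in the usage example are hypothetical.

```python
from typing import Dict, List, Optional


def shared_host(chain: List[int], deployment: Dict[int, str]) -> Optional[str]:
    """Return the common host if every microservice on the chain runs on it."""
    hosts = {deployment[v] for v in chain}
    return hosts.pop() if len(hosts) == 1 else None
```

For example, `shared_host([27, 22, 21, 14], {27: "host-a", 22: "host-a", 21: "host-a", 14: "host-a"})` returns `"host-a"`, which would point the inspection toward that host's hardware before the software code is examined.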

After verification with the enterprise operation and maintenance personnel of the microservice system, the fault root cause diagnosis achieved 100% accuracy on this system: the 4 actual fault root cause services are all among the first 4 entries of the resulting fault service list.

It is noted that the disclosed embodiments are intended to aid in further understanding of the invention, but those skilled in the art will appreciate that: various substitutions and modifications are possible without departing from the spirit and scope of the invention and appended claims. Therefore, the invention should not be limited to the embodiments disclosed, but the scope of the invention is defined by the appended claims.

Claims

1. A fault root cause positioning method of a micro-service architecture information system is characterized in that a root cause positioning algorithm based on a fault propagation chain model is designed by establishing a dynamic correlation analysis method among micro-services, the propagation process of related faults is identified while fault root cause services are positioned, a fault propagation chain is generated, the interpretability of fault positioning and diagnosis is improved, and the method can be used in the micro-service architecture information system; the method comprises the following steps:

A. acquiring micro-service performance index data of a micro-service architecture information system; the performance indicator data includes request delay time series data;

B. detecting and obtaining microservice abnormal interval data, and identifying whether the microservice architecture information system is abnormal; the abnormal interval data of microservice v_i is recorded, and its time length is T;

all possible sub-intervals are enumerated with minimum step length L_b and denoted as sliding windows [s_b, e_b], b = 1, …, n_win, where s_b and e_b are respectively the start point and the end point of a sliding window, and n_win is the total number of enumerated sliding windows; a vector C_ij of length T with all values 0 is initialized to represent the dynamic association curve of v_i→v_j;

for each sliding window [s_b, e_b], a Granger causality test is performed; if the null-hypothesis probability p obtained by the test is less than the significance level α, an association v_i→v_j exists on that sliding window, and the values of the dynamic association curve C_ij over the sliding window interval (C_ij[s_b:e_b]) are increased by 1; otherwise, nothing is done;

C2. setting an adaptive threshold to judge whether a fault association exists between microservices, and generating the microservice dependency graph of the microservice system;

C21. generating association edges between the microservices by thresholding;

for microservice v_i, a statistic is computed over all dynamic association curves C_ij, j = 1, …, N, that start from v_i, where N is the number of microservices, and the statistic is recorded in a vector h of length N, expressed as:

h_j = ∑_t C_ij(t)

where C_ij(t) is the dynamic association curve from microservice v_i to v_j;

C22. computing the adaptive threshold τ_i = θ_e·max(h) for microservice v_i, judging the association strength of the edges, and generating the microservice dependency graph G;

for each edge v_i→v_j, j = 1, …, N, if h_j ≥ τ_i, the association strength of the edge is large enough, and the edge is added to the finally generated microservice dependency graph G(V, E, W), where V is the set of microservice nodes, E is the set of association edges between microservices, and W holds the weights of those edges, with W_ij set to h_j/τ_i;

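Step 2: estimating the existence probability of each fault propagation chain as the harmonic mean of the weights of the edges on the chain:

p(P_i) = N / ∑_{k=1}^{N} (1 / w_{i_k i_{k+1}})

where w_{i_k i_{k+1}} is the weight of the edge i_k→i_{k+1} on the fault propagation chain, and N is the length of the fault propagation chain;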

Step 3: sorting all fault propagation chains P_i by existence probability in descending order and selecting the top k chains P_r1, P_r2, …, P_rk, where P_rk is the chain ranked k-th; taking the n_lead leading microservices of each selected chain, counting the number of occurrences of each leading microservice, and dividing the count by k·n_lead; the result is the path association strength of the leading microservice, denoted S_path(v_i);

D2. calculating the correlation-coefficient association strength:

S_corr(v_i) = |corr(X_i, X_fe)|

where X_i represents the abnormal-interval index data of microservice v_i, and X_fe represents the abnormal-interval index data of the failed front-end microservice v_fe;

D3. calculating the microservice abnormality score: combining the path association strength and the correlation-coefficient association strength as a weighted sum, namely c_path·S_path(v_i) + c_corr·S_corr(v_i);

2. The method for locating the fault root cause of the microservice architecture information system of claim 1, wherein each microservice in the microservice architecture information system is deployed in a container by Docker, different microservices communicate with each other through an HTTP API or a message queue, and a tool for collecting performance index data is provided to acquire the performance index of each microservice from user request logs or by active sampling.

3. The method for locating the fault root cause of the microservice architecture information system as claimed in claim 1, wherein the method for obtaining microservice performance indicator data of the microservice architecture information system comprises:

extracting delay information of each access request from a request log of the micro-service architecture information system, and averaging the request delay within 1 second to obtain request delay time sequence data of each micro-service per second;

or a Prometheus tool is adopted to directly obtain the request delay data per second from the microservice of the microservice architecture information system and construct a request delay time sequence;

and then, normalizing the acquired and output micro-service request delay time sequence.

4. The method as claimed in claim 1, wherein the collected microservice index data is recorded as a time series M_i(t), and the abnormal interval of the microservices is detected by:

B1. calculating, for each microservice index, the moving standard deviation σ_i(t) within a sliding window L_w, which indicates the abnormality degree of the microservice;

where S_ab(t) is the abnormal level of the system; σ_i(t) denotes the abnormal level of microservice v_i; λ_i is the importance of microservice v_i, with value range [0, +∞);

B3. when the abnormal level S_ab(t) of the system exceeds the given threshold θ_ab·N, where θ_ab is the threshold for abnormal interval detection with value range (0, 1] and N is the number of microservices, judging that the microservice system is in an abnormal state at that moment and requires fault diagnosis;

among all abnormal-state time points, finding the time with the highest abnormal level, recorded as t_e; the abnormal interval is then the time interval [t_e − L_pre : t_e + L_post] of the performance index data, where L_pre and L_post represent the size of the interval and must not exceed the range of all available data; the data in this interval is recorded as the abnormal interval data.

5. A system for implementing the method for locating a fault root cause of the micro-service architecture information system of claim 1, comprising the following modules: the system comprises an index data collection module, an abnormality detection module, a micro-service dependency graph construction module and a reverse tracking module; wherein:

the index data collection module is used for interfacing with the microservice architecture information system that requires fault diagnosis, acquiring the request delay index of each microservice, and forming the corresponding time series data;

the anomaly detection module is used for analyzing the request delay index data of the micro-service and detecting whether the micro-service architecture information system is in an abnormal state; when the micro-service architecture system is detected to be in an abnormal state, collecting micro-service index data reflecting the abnormality, forming abnormal interval data, and providing the abnormal interval data for a subsequent module for further fault analysis;

the micro-service dependency graph construction module is used for analyzing when the micro-service architecture information system is abnormal, constructing a micro-service dependency graph of the micro-service architecture information system when a fault occurs by performing dynamic association analysis on the micro-services in pairs, and restoring a possible fault propagation mode for fine-grained fault root cause positioning and fault chain extraction of a subsequent module;

the reverse tracking module is used for analyzing the generated micro-service dependency graph, carrying out reverse path search by taking the abnormal front-end micro-service as an entrance, estimating the probability of each path, forming a high-possibility fault propagation chain, estimating the root cause probability of the micro-service by combining the index data correlation coefficient of the micro-service, and generating a final fault root cause list.

6. The system of claim 5, wherein the index data collection module is further configured to obtain the request delay index for each microservice by analyzing a microservice call log or by using a Prometheus monitoring tool.