CN112698975A - Fault root cause positioning method and system of micro-service architecture information system - Google Patents

Publication number
CN112698975A
Authority
CN
China
Prior art keywords
micro
service
fault
abnormal
root cause
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011468424.2A
Other languages
Chinese (zh)
Other versions
CN112698975B (en)
Inventor
王平
潘宜城
马萌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University
Priority to CN202011468424.2A
Publication of CN112698975A
Application granted
Publication of CN112698975B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079 Root cause analysis, i.e. error or fault diagnosis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/18 Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • General Engineering & Computer Science (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Operations Research (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biomedical Technology (AREA)
  • Quality & Reliability (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Algebra (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Test And Diagnosis Of Digital Computers (AREA)
  • Debugging And Monitoring (AREA)
  • Computer And Data Communications (AREA)

Abstract

The invention discloses a fault root cause positioning method and system for a micro-service architecture information system. The system comprises an index data collection module, an abnormality detection module, a micro-service dependency graph construction module and a reverse tracking module. By establishing a dynamic association analysis method among micro-services and designing a root cause positioning algorithm based on a fault propagation chain model, the invention identifies the propagation process of related faults while positioning the fault root cause service, which improves the interpretability of fault positioning and diagnosis. Used in a micro-service architecture information system, the invention improves the accuracy of dynamic association modeling; because the method is driven by micro-service performance index data, the fault diagnosis tool is also more convenient to use and takes less time and effort to deploy.

Description

Fault root cause positioning method and system of micro-service architecture information system
Technical Field
The invention belongs to the technical field of information, relates to a fault diagnosis technology of an information system, and particularly relates to a fault root cause positioning method and system of a micro-service architecture information system.
Background
Fault diagnosis for existing micro-service architecture information systems mainly relies on constructing a dependency graph of the micro-services. Related work includes ADD [1], Orion [2], MonitorRank [3], Sieve [4], Microscope [5] and CloudRanger [6]. Orion [2] constructs correlations by analyzing the network traffic delay distributions among services, and uses them to diagnose failures of services and instances of the system. MonitorRank [3] and Sieve [4] both use service call records and performance index data; the former diagnoses service failures with correlation coefficients and a second-order random walk, and the latter uses the Granger causality test [7]. ADD [1] analyzes service associations using active perturbation and regression analysis. Microscope [5] analyzes network traffic data, constructs the associations with the PC algorithm [8], and performs fault diagnosis by depth-first search. Similarly, CloudRanger [6] extracts the associations from the performance index data of the services with the PC algorithm [8] and locates the fault root cause with a second-order random walk.
The existing techniques adopt a dependency graph method that can only generate static service dependencies. Modern micro-service architecture information systems often use technologies such as load balancing and automatic scaling, so the dependencies among services change dynamically, and this dynamic behavior is also reflected in the fault propagation process. Because the existing methods assume static service dependencies and do not consider their dynamic nature, they cannot detect the dynamic propagation of a fault in a modern micro-service system. In addition, existing micro-service fault root cause positioning algorithms can only locate the fault root cause service and cannot reveal how the fault actually propagated through the micro-service system, so their interpretability is insufficient.
References:
[1] A. Brown, G. Kar, and A. Keller, "An active approach to characterizing dynamic dependencies for problem determination in a distributed environment," in 2001 IEEE/IFIP International Symposium on Integrated Network Management Proceedings. Integrated Network Management VII. Integrated Management Strategies for the New Millennium (Cat. No. 01EX470), IEEE, 2001, pp. 377-390.
[2] X. Chen, M. Zhang, Z. M. Mao, and P. Bahl, "Automating Network Application Dependency Discovery: Experiences, Limitations, and New Solutions," in OSDI, 2008, vol. 8, pp. 117-130.
[3] M. Kim, R. Sumbaly, and S. Shah, "Root cause detection in a service-oriented architecture," ACM SIGMETRICS Performance Evaluation Review, vol. 41, no. 1, pp. 93-104, 2013.
[4] J. Thalheim, A. Rodrigues, I. E. Akkus, P. Bhatotia, R. Chen, B. Viswanath, L. Jiao, and C. Fetzer, "Sieve: actionable insights from monitored metrics in distributed systems," in Proceedings of the 18th ACM/IFIP/USENIX Middleware Conference, ACM, 2017, pp. 14-27.
[5] J. Lin, P. Chen, and Z. Zheng, "Microscope: Pinpoint Performance Issues with Causal Graphs in Micro-service Environments," in International Conference on Service-Oriented Computing, 2018, pp. 3-20.
[6] P. Wang et al., "Cloudranger: Root cause identification for cloud native systems," in 2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), IEEE, 2018.
[7] C. W. Granger, "Investigating causal relations by econometric models and cross-spectral methods," Econometrica: Journal of the Econometric Society, pp. 424-438, 1969.
[8] P. Spirtes, C. N. Glymour, and R. Scheines, "Causation, prediction, and search," MIT Press, 2000.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a fault root cause positioning method and system for a micro-service architecture information system. The method establishes a dynamic association analysis method between micro-services, which solves the problem that the service dependencies in existing fault diagnosis techniques for micro-service architecture information systems can only be static. It also designs a root cause positioning algorithm based on a fault propagation chain model, which locates the fault root cause service and at the same time provides the specific propagation process of the related faults, improving the interpretability of fault diagnosis.
The method of the present invention can run in an information system that adopts a micro-service architecture. The system consists of one or more Linux servers that are networked through a router and managed together as a Docker Swarm cluster. The different micro-services are deployed through Docker. The deployed micro-services are HTTP services implemented in languages such as Java, Python or Go, and can be accessed over HTTP. The micro-services communicate with each other through HTTP APIs or message queues. The information system is equipped with an index collection tool that can acquire performance indexes of each micro-service, such as request delay. The performance index data are input into the method of the invention for fault root cause positioning, which finds the root cause micro-service responsible for the abnormality of the front-end micro-service. Using the deployment information of the micro-services, the host running that micro-service can then be located, and by checking the state of the host (CPU occupancy, memory occupancy and disk read-write condition) it can be judged whether the abnormality of the micro-service is caused by a hardware-level fault.
To address the problem that service dependencies in existing fault diagnosis for micro-service architecture information systems can only be static, the invention provides a dynamic micro-service association analysis method based on the Granger causality test and a sliding window, which mines the dynamic dependencies between services from the index data of the micro-services. It further designs a micro-service fault root cause positioning algorithm based on fault propagation chains, which detects the root cause of a fault in the micro-service architecture information system and generates an explanatory fault propagation chain.
For convenience, the following term definitions are used in the description of the present disclosure:
Table 1. Definition of terms
[Table 1 appears only as an image in the source: Figure BDA0002833848170000031]
The Granger causality test is a statistical method for detecting whether a causal association exists between two time series; its calculation is illustrated below. Suppose that, for two nodes $v_x, v_y$ in the given set of micro-service nodes $V$, the index sequences collected within the abnormal interval are denoted $X$ and $Y$. Two linear regression models $M_{self}$ and $M_{full}$ are constructed:
$$M_{self}:\quad Y_t = a_0 + \sum_{k=1}^{lag} a_k Y_{t-k} + \varepsilon_t$$
$$M_{full}:\quad Y_t = b_0 + \sum_{k=1}^{lag} b_k Y_{t-k} + \sum_{k=1}^{lag} c_k X_{t-k} + \varepsilon'_t$$
The independent variables of $M_{self}$ are $Y_{t-1}, \dots, Y_{t-lag}$ and its dependent variable is $Y_t$; the independent variables of $M_{full}$ are $Y_{t-1}, \dots, Y_{t-lag}, X_{t-1}, \dots, X_{t-lag}$ and its dependent variable is also $Y_t$. The only difference between the two models is whether the index sequence of micro-service node $v_x$ is added as an independent variable of the regression model. Both models are fitted to the index sequences $X$ and $Y$ by least squares, and the sum of squared errors of each fitted model,
$$SSE = \sum_t \left(Y_t - \hat{Y}_t\right)^2,$$
is recorded as $SSE_{self}$ and $SSE_{full}$ respectively. If there is no causal association between the time series $X$ and $Y$, it can be shown statistically that
$$F = \frac{(SSE_{self} - SSE_{full}) / (d_{full} - d_{self})}{SSE_{full} / (T - d_{full} - 1)}$$
follows an F distribution with parameters $(d_{full} - d_{self},\ T - d_{full} - 1)$, where $d_{self}$ and $d_{full}$ are the numbers of independent variables of the two models and $T$ is the number of samples. The association can therefore be judged by a hypothesis test on the F distribution. Here the null hypothesis is that there is no causal association, and the p-value computed from the F distribution is $p$. When $p$ is less than the significance level $\alpha$, the null hypothesis is rejected, that is, the association $v_x \to v_y$ exists between micro-services $v_x$ and $v_y$ and a fault can propagate from $v_x$ to $v_y$; otherwise there is no fault association between $v_x$ and $v_y$.
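As a rough illustration of the test described above, the following Python sketch fits the two regression models by least squares and converts the F statistic into a p-value. It assumes NumPy and SciPy are available; the function names, the intercept term and the default `lag=2` are illustrative choices, not details taken from the patent.

```python
import numpy as np
from scipy import stats


def _lagged(series, lag):
    """Columns [s[t-1], ..., s[t-lag]] for t = lag .. len(series)-1."""
    n = len(series)
    return np.column_stack([series[lag - k:n - k] for k in range(1, lag + 1)])


def _sse(design, target):
    """Sum of squared errors of a least-squares fit with an intercept."""
    design = np.column_stack([np.ones(len(target)), design])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ coef
    return float(resid @ resid)


def granger_p_value(x, y, lag=2):
    """p-value of the null hypothesis 'X does not Granger-cause Y'."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    target = y[lag:]                      # dependent variable Y_t
    t = len(target)                       # number of usable samples
    d_self, d_full = lag, 2 * lag         # independent variables per model
    sse_self = _sse(_lagged(y, lag), target)
    sse_full = _sse(np.column_stack([_lagged(y, lag), _lagged(x, lag)]), target)
    dof1, dof2 = d_full - d_self, t - d_full - 1
    f_stat = ((sse_self - sse_full) / dof1) / (sse_full / dof2)
    return float(stats.f.sf(f_stat, dof1, dof2))
```

A caller would compare the returned value with the significance level, for example treating `granger_p_value(x, y) < 0.05` as evidence for the edge from the service producing `x` to the service producing `y`.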
Considering the dynamic nature of micro-service associations, the invention extends the Granger causality test into a multi-round test over sliding windows to model the dynamic association relationships.
The technical scheme of the invention is as follows:
A root cause positioning method for a micro-service architecture information system establishes a dynamic association analysis method among micro-services and designs a root cause positioning algorithm based on a fault propagation chain model. It identifies the propagation process of related faults while positioning the fault root cause service, improves the interpretability of fault positioning and diagnosis, and can be used in a micro-service architecture information system.
In a specific implementation, the invention is applied to a micro-service architecture information system that consists of one or more Linux servers networked through a router and managed together as a Docker Swarm cluster. After the micro-services are deployed and running in the system, the abnormal interval detection algorithm provided by the invention monitors their state in real time. When a fault is detected, the invention collects the performance index data of the micro-services using a log analysis tool or Prometheus and sends the data to the server running the fault root cause positioning algorithm. The algorithm first constructs a service dependency graph of the micro-service architecture information system based on the Granger causality test and a sliding window, then restores the possible fault propagation chains through reverse tracking, and finally locates the fault root cause and outputs it to a terminal for operation and maintenance personnel to check.
The fault root cause positioning method of the micro-service architecture information system comprises the following steps:
A. Collect micro-service performance index data of the micro-service architecture information system; the performance index data includes request delay time series data.
Two collection methods are available. The first is log extraction. While running, each micro-service sends its request logs to the Docker management process for storage. The invention uses a script to extract the delay of each request and averages the request delays within each 1-second interval, obtaining per-second request delay time series data for each micro-service. The second uses the Prometheus tool. The invention deploys a Prometheus collector in the micro-service system, which collects the performance indexes of all micro-services at a fixed sampling interval, and then exports these index data through the Prometheus interface for fault analysis. To facilitate later calculation, the invention normalizes the resulting micro-service request delay time series.
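The log extraction path can be sketched as follows, assuming the request log has already been parsed into (timestamp in seconds, delay) pairs. The function names are hypothetical, and min-max scaling is only one possible reading of "normalization", which the text does not specify.

```python
from collections import defaultdict


def per_second_latency(records):
    """records: iterable of (timestamp_seconds, latency) pairs.

    Returns (seconds, mean_latencies), one entry per second that has data.
    """
    buckets = defaultdict(list)
    for ts, latency in records:
        buckets[int(ts)].append(latency)   # group requests by whole second
    seconds = sorted(buckets)
    return seconds, [sum(buckets[s]) / len(buckets[s]) for s in seconds]


def min_max_normalize(values):
    """Scale a series to [0, 1]; a constant series maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

Seconds without any request are simply absent from the output here; a real pipeline would have to decide how to fill such gaps before running the later time series analysis.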
B. Detect abnormal intervals of the micro-services and identify whether the micro-service architecture information system is abnormal.
The abnormal interval detection uses a method based on standard deviation: the abnormal degree of each micro-service is measured, and the results are weighted and summed to obtain the anomaly level of the whole micro-service architecture information system. When the anomaly level exceeds a set threshold, the system is judged to be abnormal. For convenience of analysis, the collected index data of micro-service $v_i$ is recorded as a time series $M_i(t)$. The abnormal interval is detected by the following steps, as shown in FIG. 1:
B1. Calculate the moving standard deviation $\sigma_i(t)$ of each micro-service index within a sliding window of length $L_w$; it indicates the abnormal degree of the micro-service.
B2. Weight all micro-services by their importance levels $\lambda_i$ and calculate the anomaly level of the whole system at time $t$:
$$S_{ab}(t) = \sum_{i=1}^{N} \lambda_i \sigma_i(t)$$
where $S_{ab}(t)$ is the anomaly level of the system, $\sigma_i(t)$ is the anomaly level of micro-service $v_i$, and $\lambda_i$ is the importance level of micro-service $v_i$.
B3. When the anomaly level $S_{ab}(t)$ of the whole system exceeds the given threshold $\theta_{ab} \cdot N$ ($\theta_{ab}$ is the threshold for abnormal interval detection and $N$ is the number of micro-services), the micro-service system is judged to be in an abnormal state at that time, and fault diagnosis is required. The importance levels $\lambda_i$ and the threshold $\theta_{ab}$ are parameters that can be set as needed; $\lambda_i$ takes values in $[0, +\infty)$ and $\theta_{ab}$ in $(0, 1]$. Among all time points in the abnormal state, the time with the highest anomaly level is found and recorded as $t_e$. The abnormal interval is then the time interval $[t_e - L_{pre} : t_e + L_{post}]$ of the performance index data, where $L_{pre}$ and $L_{post}$ determine the size of the interval and may take any values that keep the interval within the available data. The data in this interval are used by the subsequent algorithm and are denoted as the abnormal interval data $M_i^{ab}$ of micro-service $v_i$.
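Steps B1 to B3 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the moving standard deviation is taken over a trailing window, the interval is clipped to the available data, and all function and parameter names are hypothetical.

```python
import statistics


def moving_std(series, window):
    """Population standard deviation over the trailing `window` samples."""
    out = []
    for t in range(len(series)):
        chunk = series[max(0, t - window + 1): t + 1]
        out.append(statistics.pstdev(chunk))
    return out


def detect_abnormal_interval(metrics, weights, window, theta_ab, l_pre, l_post):
    """metrics: list of N equally long, normalized series M_i(t).

    Returns (start, end) indices of the abnormal interval, or None when the
    system anomaly level never exceeds theta_ab * N.
    """
    n = len(metrics)
    length = len(metrics[0])
    sigmas = [moving_std(m, window) for m in metrics]             # step B1
    s_ab = [sum(w * s[t] for w, s in zip(weights, sigmas))        # step B2
            for t in range(length)]
    abnormal = [t for t in range(length) if s_ab[t] > theta_ab * n]
    if not abnormal:
        return None
    t_e = max(abnormal, key=lambda t: s_ab[t])                    # step B3
    return max(0, t_e - l_pre), min(length - 1, t_e + l_post)
```

The returned index pair selects the slice of every metric series that serves as the abnormal interval data for the later steps.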
C. Construct a micro-service dependency graph of the micro-service architecture information system. This comprises the following steps, as shown in FIG. 2:
C1. Perform time series dynamic association analysis to obtain the dynamic association relationships between the micro-services.
The dynamic association between each pair of micro-services is analyzed first. Suppose the analysis target is the micro-service relationship edge $v_i \to v_j$. For the abnormal interval data $M_i^{ab}$, $M_j^{ab}$ obtained in step B3 (of time length $T$), all possible sub-intervals are enumerated with minimum step length $L_b$ and denoted as sliding windows $[s_b, e_b]$, $b = 1, \dots, n_{win}$. A vector $C_{ij}$ of length $T$ with all values 0 is then initialized to represent the dynamic association curve of $v_i \to v_j$. A Granger causality test is performed on each sliding window $[s_b, e_b]$. If the null hypothesis probability $p$ obtained by the test is less than the significance level $\alpha$, the association $v_i \to v_j$ is considered to exist on that sliding window, and the values of the dynamic association curve $C_{ij}$ over the sliding window interval ($C_{ij}[s_b : e_b]$) are increased by 1; otherwise nothing is done. After all pairs of micro-services and all sliding windows have been processed, the vectors $C_{ij}$ represent the dynamic association curves between the micro-services and are used for further analysis.
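The window enumeration and curve accumulation of step C1 might look like the sketch below. The causality test is passed in as a callable `p_value_fn(x_window, y_window)` so that any Granger test implementation can be plugged in; that parameter, the `min_len` guard and the function names are assumptions made for the illustration.

```python
def enumerate_windows(total_len, step, min_len):
    """All sub-intervals [s, e) whose endpoints lie on a grid of size `step`.

    `min_len` (an assumption) drops windows too short for a regression.
    """
    points = list(range(0, total_len, step)) + [total_len]
    return [(s, e)
            for i, s in enumerate(points)
            for e in points[i + 1:]
            if e - s >= min_len]


def dynamic_association_curve(x, y, windows, p_value_fn, alpha=0.05):
    """Count, for every time index, the windows in which x -> y is significant."""
    curve = [0] * len(x)
    for s, e in windows:
        if p_value_fn(x[s:e], y[s:e]) < alpha:
            for t in range(s, e):
                curve[t] += 1
    return curve
```

High values of the curve mark the time ranges in which the association was confirmed by many overlapping windows, which is what makes the relationship "dynamic" rather than a single yes/no edge.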
C2. Set an adaptive threshold, judge whether fault association exists between the micro-services, and generate the micro-service dependency graph of the micro-service system.
After the dynamic association relationships between the micro-services are obtained, a thresholding method is used to obtain a qualitative description of whether fault association exists between micro-services and to generate the specific association edges. For micro-service $v_i$, to judge whether it is associated with each of the other micro-services, the invention aggregates all dynamic association curves $C_{ij}$, $j = 1, \dots, N$, starting from $v_i$, where $N$ is the number of micro-services, and records the statistics in a vector $h$ of length $N$, i.e.:
$$h_j = \sum_t C_{ij}(t)$$
where $C_{ij}(t)$ is the dynamic association curve from micro-service $v_i$ to $v_j$. The adaptive threshold for micro-service $v_i$ is then computed as $\tau_i = \theta_e \cdot \max(h)$. For each edge $v_i \to v_j$, $j = 1, \dots, N$, if $h_j \geq \tau_i$, the association strength on this edge is considered large enough, and the edge is added to the finally generated micro-service dependency graph $G(V, E, W)$, where the edge weight $W_{ij}$ is set to $h_j$, the statistic computed for $v_i$.
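The thresholding in step C2 can be illustrated with the following sketch. It assumes the curves are stored as a nested list `curves[i][j]` (with `None` where no curve exists), skips self-loops, and uses the raw sum $h_j$ as the edge weight; these storage and weighting details are assumptions, not something the text fixes.

```python
def build_dependency_graph(curves, theta_e):
    """curves[i][j] is the dynamic association curve C_ij (a list) or None.

    Returns {(i, j): weight} for the retained edges v_i -> v_j.
    """
    edges = {}
    for i, row in enumerate(curves):
        h = [sum(c) if c is not None else 0 for c in row]   # h_j = sum_t C_ij(t)
        if max(h) == 0:
            continue                                        # no association from v_i
        tau = theta_e * max(h)                              # adaptive threshold tau_i
        for j, h_j in enumerate(h):
            if j != i and h_j >= tau:
                edges[(i, j)] = h_j
    return edges
```

Because the threshold is relative to the strongest outgoing association of each service, a service with uniformly weak associations can still contribute edges, while weak edges next to a dominant one are pruned.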
D. Using a reverse tracking root cause analysis method, obtain the root cause of the abnormality of the front-end micro-service, thereby finding the fault root cause micro-service.
After acquiring the micro-service dependency graph of the micro-service architecture information system, the invention uses a reverse tracking root cause analysis algorithm to score the abnormal degree of each micro-service and sets an abnormal score threshold; a micro-service whose abnormal degree score is higher than the threshold is considered a root cause of the abnormality of the front-end micro-service $v_{fe}$. The abnormal degree score of a micro-service combines the path association strength and the correlation coefficient association strength, calculated as follows:
D1. Path association strength. The path association strength measures the possibility that a micro-service causes the failure of the front-end micro-service through the dependency topology of the micro-service system. The invention therefore runs a reverse tracking algorithm starting from the faulty front-end micro-service $v_{fe}$, with the following specific steps:
Step 1: Taking the faulty front-end micro-service $v_{fe}$ as the destination, perform a reverse breadth-first search on the service dependency graph $G(V, E, W)$ of the micro-service architecture system ($V$ is the set of micro-service nodes, $E$ is the set of association edges between micro-services, and $W$ holds the weights of those edges) to obtain a series of paths that may represent fault propagation processes in the micro-service system, namely the fault propagation chains $P_i$. To avoid search space explosion and cyclic search, the invention limits each service to appear at most once in each path, and limits the number of generated fault propagation chains to 10000 or another user-selected value.
Step 2: Estimate the existence probability of each fault propagation chain. For a fault propagation chain $P_i = \{i_1 \to \dots \to i_n\}$, the invention uses the harmonic mean to average the weights of the edges on the chain (a chain of $n$ micro-services has $n - 1$ edges), that is:
$$P(P_i) = \frac{n - 1}{\sum_{k=1}^{n-1} \frac{1}{W_{i_k i_{k+1}}}}$$
where $W_{i_k i_{k+1}}$ is the weight of the edge $i_k \to i_{k+1}$ on the fault propagation chain, and $n$ is the length of the fault propagation chain.
Step 3: Sort all fault propagation chains $P_i$ by existence probability in descending order and select the top $k$ chains $P_{r_1}, P_{r_2}, \dots, P_{r_k}$. Take the $n_{lead}$ leading micro-services of these chains, count the number of occurrences of each leading micro-service, and divide it by $k \cdot n_{lead}$ to obtain the path association strength $S_{path}(v_i)$ of that micro-service.
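The three steps above can be sketched as follows. The sketch makes several assumptions that the text leaves open: every partial path found by the reverse search counts as a chain, the "leading" micro-services are the first `n_lead` nodes of each selected chain (the end farthest from the front end), and edges are stored as a dictionary `{(src, dst): weight}` with positive weights. All names are illustrative.

```python
from collections import Counter, deque
from statistics import harmonic_mean


def fault_propagation_chains(edges, front_end, max_chains=10000):
    """Reverse breadth-first search from `front_end`.

    Each chain is a node list ordered from candidate root cause to front_end.
    """
    preds = {}
    for (src, dst) in edges:
        preds.setdefault(dst, []).append(src)
    chains, queue = [], deque([[front_end]])
    while queue and len(chains) < max_chains:
        path = queue.popleft()
        for src in preds.get(path[0], []):
            if src in path:            # each service at most once per chain
                continue
            new_path = [src] + path
            chains.append(new_path)
            queue.append(new_path)
            if len(chains) >= max_chains:
                break
    return chains


def chain_probability(chain, edges):
    """Harmonic mean of the edge weights along the chain."""
    return harmonic_mean([edges[(a, b)] for a, b in zip(chain, chain[1:])])


def path_strength(edges, front_end, k=3, n_lead=1):
    """Occurrences of leading services in the top-k chains, over k * n_lead."""
    chains = fault_propagation_chains(edges, front_end)
    top = sorted(chains, key=lambda c: chain_probability(c, edges), reverse=True)[:k]
    counts = Counter(node for c in top for node in c[:n_lead])
    return {node: cnt / (k * n_lead) for node, cnt in counts.items()}
```

The harmonic mean penalizes chains that contain even one weak edge, so a long chain is ranked highly only if every hop on it is well supported.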
D2. Correlation coefficient association strength. This strength is obtained by calculating the absolute value of the correlation coefficient between each micro-service and the faulty front-end micro-service $v_{fe}$, that is:
$$S_{corr}(v_i) = \left| \mathrm{corr}\left(M_i^{ab}, M_{fe}^{ab}\right) \right|$$
where $M_i^{ab}$ represents the abnormal interval index data of micro-service $v_i$, and $M_{fe}^{ab}$ represents the abnormal interval index data of the front-end micro-service $v_{fe}$.
D3. Calculate the micro-service abnormal degree score. The abnormal degree score of each micro-service is a weighted combination of its path association strength and its correlation coefficient association strength, i.e., $c_{path} S_{path}(v_i) + c_{corr} S_{corr}(v_i)$. Finally, the invention sorts the micro-services by abnormal degree score in descending order and generates the micro-service list $v_{\gamma_1}, v_{\gamma_2}, \dots, v_{\gamma_N}$ as the candidate fault root cause services, thereby achieving fault root cause positioning. The candidate fault root cause services and the generated fault propagation chains can be used to assist fault diagnosis. The deployment information of the micro-services can further be used to locate the host running the root cause micro-service, and by checking the state of that host (CPU occupancy, memory occupancy and disk read-write condition) it can be judged whether a hardware-level fault caused the abnormality of the micro-service.
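Steps D2 and D3 can be combined in a short sketch. It assumes the Pearson correlation coefficient (the text only says "correlation coefficient"), equal default weights `c_path = c_corr = 0.5`, and a dictionary `abnormal_data` mapping each service name to its abnormal interval series; all of these are illustrative assumptions.

```python
import math


def abs_correlation(a, b):
    """Absolute Pearson correlation; 0.0 if either series is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return abs(cov / math.sqrt(var_a * var_b))


def rank_root_causes(s_path, abnormal_data, front_end, c_path=0.5, c_corr=0.5):
    """Return [(service, score), ...] sorted by descending abnormal degree score."""
    scores = {}
    for node, series in abnormal_data.items():
        s_corr = abs_correlation(series, abnormal_data[front_end])
        scores[node] = c_path * s_path.get(node, 0.0) + c_corr * s_corr
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Note that the front-end service always correlates perfectly with itself, so in practice one may want to interpret or filter its own entry in the ranking separately.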
Based on the fault propagation chains associated with the fault root cause micro-service, the problem in the micro-service architecture information system can be judged more accurately. If the micro-services on a fault propagation chain are all deployed on the same host or in the same network, the fault may be a hardware problem of that host or a problem with network equipment (switch, router), so operation and maintenance personnel can check the hardware.
The invention also provides a fault root cause positioning system of the micro-service architecture information system, which comprises the following modules: the system comprises an index data collection module, an abnormality detection module, a micro-service dependency graph construction module and a reverse tracking module. The system architecture is shown in fig. 3, and the functions of the modules are as follows:
Index data collection module: this module interfaces with the micro-service architecture information system that needs fault diagnosis, obtains the request delay index of each micro-service by parsing the micro-service call logs or by using the Prometheus monitoring tool, and forms the corresponding time series data for subsequent analysis.
An anomaly detection module: the abnormity detection module analyzes the request delay index data of the micro service and detects whether the micro service architecture information system is in an abnormal state. When the micro-service architecture system is detected to be in an abnormal state, the module collects micro-service index data reflecting the abnormality and forms abnormal interval data, and the abnormal interval data is provided for a subsequent module to carry out further fault analysis.
The micro-service dependency graph building module: the module starts analysis when the micro-service architecture information system is abnormal, constructs a micro-service dependency graph of the micro-service architecture information system when a fault occurs by performing dynamic correlation analysis on the micro-services one by one, and restores a possible fault propagation mode for a follow-up module to perform fine-grained fault root cause positioning and fault chain extraction.
Reverse tracking module: this module starts its analysis when the micro-service architecture information system is abnormal. Taking the abnormal front-end micro-service as the starting point, it performs a reverse path search on the generated micro-service dependency graph, estimates the probability of each path, and forms the most likely fault propagation chains. It then estimates the root cause probability of each micro-service by also using the correlation coefficients of the micro-service index data, and generates the final fault root cause list.
Compared with the prior art, the invention has the beneficial effects that:
The invention provides a fault root cause positioning method and system for a micro-service architecture information system. Establishing a dynamic association analysis method improves the accuracy of dynamic association modeling in such a system. Because the method is driven by micro-service performance index data, the fault diagnosis tool is simpler to use and takes less time and effort to deploy. The fault propagation chains generated during diagnosis improve both the interpretability and the accuracy of fault diagnosis for the micro-service architecture information system.
Drawings
FIG. 1 is a schematic flow chart of the abnormal interval detection algorithm in the present invention;
wherein $\sigma_i$ represents the moving standard deviation of micro-service $v_i$, $\lambda_i$ is its importance weight, $N$ is the number of micro-services of the micro-service architecture information system, and $\theta_{ab}$ is the threshold parameter for abnormal interval detection.
FIG. 2 is a schematic flow chart of the construction of the micro-service dependency graph in the present invention.
FIG. 3 is a schematic diagram of the fault root cause positioning system of the present invention.
FIG. 4 is a graph of the request delay data of 4 micro-services after normalization in an embodiment of the invention.
FIG. 5 is a graph of the system anomaly score in an embodiment of the present invention;
wherein $L_{pre}$ and $L_{post}$ together give the size of the abnormal interval, $N$ is the number of micro-services in the embodiment, and $\theta_{ab}$ is the threshold parameter of the abnormal interval detection algorithm.
FIG. 6 is the dependency graph of the micro-service architecture information system in an embodiment of the present invention, wherein the numbered circles are micro-services.
FIG. 7 shows a fault propagation chain and its dynamic association curves in an embodiment of the present invention, where the number i and "API No. i" both denote micro-service $v_i$, and the values of the dynamic association curves are normalized.
Detailed Description
The invention will be further described by way of examples, without in any way limiting the scope of the invention, with reference to the accompanying drawings.
The invention provides a fault root cause positioning method and a fault root cause positioning system of a micro-service architecture information system.
In a specific implementation, the invention is applied to a micro-service architecture information system that consists of one or more Linux servers networked through a router and managed together as a Docker Swarm cluster. In the implementation, distributed software containing a plurality of micro-services is deployed in the micro-service architecture information system, and the request log data of the micro-services is collected. The invention establishes a dynamic micro-service association analysis method based on the Granger causality test and a sliding window, mines the dynamic dependencies between micro-services from their index data, detects the root cause of a fault in the micro-service architecture information system through a micro-service fault root cause positioning algorithm based on fault propagation chains, and generates an explanatory fault propagation chain.
The method comprises the following steps:
A. Acquiring micro-service performance index data of the micro-service architecture information system; the performance index data includes request delay time series data; in the embodiment, the performance index data is extracted from micro-service request log data.
B. Detecting micro-service abnormal interval data, and identifying whether the micro-service architecture information system is abnormal; the abnormal interval data of micro-service $v_i$ is recorded as $M_i^{ab}$;
C. Constructing a micro-service dependency graph of a micro-service architecture information system; the method comprises the following steps:
C1. analyzing the time sequence dynamic association to obtain a dynamic association relation between the micro services;
Let $v_i$ and $v_j$ be two microservices. For the microservice relationship edge $v_i \to v_j$, obtain the abnormal interval data (Figure BDA0002833848170000101) of time length $T$, and enumerate all possible sub-intervals according to the minimum step length $L_b$; these are the sliding windows $[s_b, e_b]$, $b = 1, \dots, n_{win}$, where $s_b$ and $e_b$ are the start and end points of a window and $n_{win}$ is the total number of enumerated windows. Initialize a zero vector $C_{ij}$ of length $T$ to represent the dynamic correlation curve of $v_i \to v_j$;
for each sliding window $[s_b, e_b]$, perform a Granger causality test; if the null-hypothesis probability $p$ obtained by the test is less than the significance level $\alpha$, the correlation $v_i \to v_j$ exists on that sliding window, and the values of the dynamic correlation curve over the window interval ($C_{ij}[s_b : e_b]$) are increased by 1; otherwise nothing is done;
when all microservice pairs and all sliding windows have been processed, the resulting vectors $C_{ij}$ represent the dynamic correlation curves between microservices;
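The accumulation in step C1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the window enumeration (all sub-intervals whose endpoints are multiples of $L_b$) is one possible reading of "minimum step length", and the Granger test is passed in as a hypothetical callable `granger_p_value` so that the sketch stays independent of any statistics library. In practice a function such as `grangercausalitytests` from statsmodels could be wrapped to supply the p-value.

```python
from typing import Callable, List, Sequence, Tuple


def enumerate_windows(T: int, L_b: int) -> List[Tuple[int, int]]:
    """Enumerate sub-intervals [s, e) whose endpoints are multiples of L_b.

    Assumption: this is one reading of "enumerate all possible sub-intervals
    according to the minimum step length L_b". A tail shorter than L_b is ignored.
    """
    points = list(range(0, T + 1, L_b))
    return [(s, e) for i, s in enumerate(points) for e in points[i + 1:]]


def dynamic_correlation_curve(
    x_i: Sequence[float],
    x_j: Sequence[float],
    L_b: int,
    alpha: float,
    granger_p_value: Callable[[Sequence[float], Sequence[float]], float],
) -> List[int]:
    """Build C_ij: for every window where the test is significant, add 1 over that window."""
    T = len(x_i)
    C_ij = [0] * T
    for s, e in enumerate_windows(T, L_b):
        # granger_p_value is a hypothetical stand-in for the test "x_i Granger-causes x_j".
        p = granger_p_value(x_i[s:e], x_j[s:e])
        if p < alpha:
            for t in range(s, e):
                C_ij[t] += 1
    return C_ij
```

Time points covered by many significant windows accumulate high values, which is what lets the curve show when the correlation between two microservices is active.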
C2. setting a self-adaptive threshold value for judging whether fault association exists between the micro services and generating a micro service dependency graph of the micro service system; the method comprises the following steps:
C21. calculating an adaptive threshold;
for microservice $v_i$, all dynamic correlation curves $C_{ij}$, $j = 1, \dots, N$, starting from $v_i$ are summarized, where $N$ is the number of microservices; the statistic is recorded in a vector $h$ of length $N$, whose $j$-th component is calculated as:
$$h_j = \sum_t C_{ij}(t)$$
where $C_{ij}(t)$ is the dynamic correlation curve from microservice $v_i$ to $v_j$;
then the adaptive threshold of microservice $v_i$ is computed as $\tau_i = \theta_e \cdot \max(h)$;
C22. Generating association edges among the micro services by adopting a thresholding method, and generating a micro service dependency graph;
for each edge $v_i \to v_j$, $j = 1, \dots, N$, if $h_j \ge \tau_i$, the correlation strength of the edge is large enough, and the edge is added to the finally generated microservice dependency graph $G(V, E, W)$, where $V$ is the set of microservice nodes, $E$ is the set of correlation edges between microservices, and $W$ is the set of weights of those edges, with $w_{ij}$ set as $h_{ji}$;
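The thresholding in step C2 can be sketched as below. The function name `build_dependency_edges` and the dictionary layout are hypothetical. Three choices are assumptions of this sketch rather than statements of the method: self-loops are skipped, a microservice whose statistics are all zero contributes no edges, and the raw statistic $h_j$ is stored as the edge weight because the source expression for $w_{ij}$ is not fully legible.

```python
from typing import Dict, List, Tuple


def build_dependency_edges(
    curves: Dict[Tuple[int, int], List[float]],
    n_services: int,
    theta_e: float,
) -> Dict[Tuple[int, int], float]:
    """Keep edge v_i -> v_j when h_j >= tau_i, with tau_i = theta_e * max(h)."""
    edges: Dict[Tuple[int, int], float] = {}
    for i in range(n_services):
        # h_j = sum over t of C_ij(t); a missing curve counts as all zeros.
        h = [sum(curves.get((i, j), [])) for j in range(n_services)]
        if max(h) == 0:
            continue  # assumption: no significant window at all means no outgoing edge
        tau_i = theta_e * max(h)
        for j, h_j in enumerate(h):
            if j != i and h_j >= tau_i:
                edges[(i, j)] = h_j  # assumption: raw h_j used as the weight
    return edges
```

Because the threshold is relative to each microservice's strongest outgoing association, a weakly connected microservice can still keep its best edges.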
D. Obtaining root cause causing the abnormality of the front-end micro-service by adopting a reverse tracking root cause analysis method, and finding out fault root cause service;
The abnormality of each microservice is scored with a reverse tracking root cause analysis algorithm; the score comprises the path correlation strength and the correlation coefficient correlation strength. An abnormality score threshold is set, and a microservice whose abnormality score exceeds the threshold is considered a root cause of the abnormality of the front-end microservice $v_{fe}$. The microservice abnormality score is calculated as follows:
D1. Calculate the path correlation strength using a reverse tracking algorithm starting from the faulty front-end microservice $v_{fe}$;
the path correlation strength is used for measuring the possibility that the micro service causes the failure of the front-end micro service through the dependent topology of the micro service system; the specific calculation comprises the following steps:
Step 1: With the faulty front-end microservice $v_{fe}$ as the end point, perform a reverse breadth-first search on the microservice dependency graph $G(V, E, W)$ of the microservice architecture system to obtain a series of paths that may represent the propagation process of the fault in the microservice system, i.e., fault propagation chains $P_i$, $P_i = \{i_1 \to \dots \to i_n\}$;
Step 2: estimating the existence probability of each fault propagation chain;
The existence probability is the probability that the fault propagated along this fault propagation chain. Since the exact way in which the actual fault propagated cannot be known, it can only be estimated approximately. The weight of an edge of the service dependency graph represents the probability of the fault propagating along that edge, i.e., the fault of one microservice has a certain probability of affecting the microservices adjacent to it. The invention averages the weights of all edges on a fault propagation chain using the harmonic mean, thereby obtaining an estimate of the overall probability.
The average of the edge weights on fault propagation chain $P_i$, taken with the harmonic mean, is expressed as:
Figure BDA0002833848170000111
where Figure BDA0002833848170000112 is the weight of edge Figure BDA0002833848170000113 on the fault propagation chain, and $N$ is the length of the fault propagation chain;
Step 3: Sort all fault propagation chains $P_i$ by existence probability in descending order, select the top $k$ chains $P_{r_1}, P_{r_2}, \dots, P_{r_k}$, take the $n_{lead}$ leading microservices of each selected chain, count the number of occurrences of each leading microservice, and divide by $k \cdot n_{lead}$; the result is the path correlation strength of that leading microservice, denoted $S_{path}(v_i)$;
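Steps 1 to 3 of D1 can be sketched together as follows. All function names are hypothetical, and several details are assumptions of this sketch: chains are stored in propagation order (suspected source first, $v_{fe}$ last), every simple path with at least one edge counts as a chain, the harmonic mean is taken over the edges of the chain, and the "leading" microservices are read as the first $n_{lead}$ nodes of a chain, excluding $v_{fe}$ itself.

```python
from collections import Counter, deque
from typing import Dict, List, Tuple

Edge = Tuple[int, int]


def reverse_chains(weights: Dict[Edge, float], v_fe: int,
                   max_chains: int = 10000) -> List[List[int]]:
    """Breadth-first search backwards from v_fe over simple paths ending at v_fe."""
    preds: Dict[int, List[int]] = {}
    for u, v in weights:
        preds.setdefault(v, []).append(u)
    chains: List[List[int]] = []
    queue = deque([[v_fe]])
    while queue and len(chains) < max_chains:
        chain = queue.popleft()
        if len(chain) > 1:
            chains.append(chain)
        for u in preds.get(chain[0], []):
            if u not in chain:  # skip cycles
                queue.append([u] + chain)
    return chains


def chain_probability(chain: List[int], weights: Dict[Edge, float]) -> float:
    """Harmonic mean of the (positive) edge weights along the chain."""
    ws = [weights[(u, v)] for u, v in zip(chain, chain[1:])]
    return len(ws) / sum(1.0 / w for w in ws)


def path_strength(weights: Dict[Edge, float], v_fe: int,
                  k: int, n_lead: int) -> Dict[int, float]:
    """S_path: occurrences among leading nodes of the top-k chains, divided by k * n_lead."""
    chains = reverse_chains(weights, v_fe)
    chains.sort(key=lambda c: chain_probability(c, weights), reverse=True)
    counts: Counter = Counter()
    for chain in chains[:k]:
        counts.update(chain[:-1][:n_lead])  # assumption: v_fe is not a candidate
    return {v: c / (k * n_lead) for v, c in counts.items()}
```

The harmonic mean is dominated by the weakest edge, so a chain containing one unlikely hop receives a low estimate even when its other edges are strong.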
D2. Calculating correlation coefficient correlation strength;
the correlation coefficient correlation strength is obtained by calculating the absolute correlation coefficient between each microservice and the faulty front-end microservice $v_{fe}$, i.e.:
Figure BDA0002833848170000114
where Figure BDA0002833848170000115 denotes the abnormal-interval index data of microservice $v_i$, and Figure BDA0002833848170000116 denotes the abnormal-interval index data of the front-end microservice $v_{fe}$;
D3. Calculate the microservice abnormality score: the weighted path correlation strength and correlation coefficient correlation strength are added, namely $c_{path} S_{path}(v_i) + c_{corr} S_{corr}(v_i)$;
The microservices are then ranked in descending order of abnormality score; the resulting microservice list $v_{\gamma_1}, v_{\gamma_2}, \dots, v_{\gamma_N}$ gives the candidate fault root cause services, thereby achieving fault root cause positioning.
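Steps D2 and D3 can be sketched as below. The text only says "absolute correlation coefficient"; using the Pearson coefficient is an assumption of this sketch, as are the names `abs_pearson` and `rank_root_causes`, the treatment of a constant series as zero correlation, and the exclusion of $v_{fe}$ from the ranking.

```python
import math
from typing import Dict, List, Sequence, Tuple


def abs_pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Absolute Pearson correlation; a constant series is treated as uncorrelated."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return abs(cov / math.sqrt(vx * vy))


def rank_root_causes(
    metrics: Dict[int, Sequence[float]],
    v_fe: int,
    s_path: Dict[int, float],
    c_path: float = 1.0,
    c_corr: float = 1.0,
) -> List[Tuple[int, float]]:
    """Score = c_path * S_path + c_corr * S_corr, sorted from high to low."""
    scores: Dict[int, float] = {}
    for v, series in metrics.items():
        if v == v_fe:
            continue  # assumption: the front end itself is not ranked
        s_corr = abs_pearson(series, metrics[v_fe])
        scores[v] = c_path * s_path.get(v, 0.0) + c_corr * s_corr
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

A microservice that never appears as a leading node of a top-ranked chain still receives a score from its correlation term, which is why the list can contain services outside the selected chains.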
The invention also discloses a system for implementing the fault root cause positioning method of the microservice architecture information system, which comprises the following modules: an index data collection module, an anomaly detection module, a microservice dependency graph construction module, and a reverse tracking module; wherein:
the index data collection module interfaces with the microservice architecture information system that needs fault diagnosis, acquires the request delay index of each microservice, and forms the corresponding time series data;
the anomaly detection module analyzes the request delay index data of the microservices and detects whether the microservice architecture information system is in an abnormal state; when the system is detected to be in an abnormal state, it collects the microservice index data reflecting the abnormality, forms the abnormal interval data, and provides it to the subsequent modules for further fault analysis;
the microservice dependency graph construction module performs its analysis when the microservice architecture information system is abnormal; by performing pairwise dynamic correlation analysis on the microservices, it constructs the microservice dependency graph of the system at the time of the fault and restores the possible fault propagation modes, for use by the subsequent module in fine-grained fault root cause positioning and fault chain extraction;
the reverse tracking module analyzes the generated microservice dependency graph, performs a reverse path search with the abnormal front-end microservice as the entry point, estimates the probability of each path to form high-probability fault propagation chains, estimates the root cause probability of each microservice by also using the correlation coefficients of the microservice index data, and generates the final fault root cause list.
The following shows the process of fault diagnosis of the present invention on a commercial microservice system containing 33 microservices.
According to step A, the invention obtains the request delay information of each microservice of the microservice system from the request log data and collects data covering 7199 seconds in total. FIG. 4 shows the normalized request delay data of 4 microservices.
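As stated in claim 3, the request delays are averaged within each second and the resulting series is normalized. A minimal sketch of that preprocessing follows. The record layout (timestamp in seconds, delay), the value 0.0 for seconds without any request, and the choice of min-max normalization are all assumptions, since the source does not specify them.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def per_second_delay(records: Iterable[Tuple[float, float]], duration: int) -> List[float]:
    """Average the delays of all requests that fall into the same one-second bucket."""
    buckets: Dict[int, List[float]] = defaultdict(list)
    for ts, delay in records:
        buckets[int(ts)].append(delay)
    # Assumption: a second with no requests is reported as 0.0.
    return [sum(buckets[t]) / len(buckets[t]) if t in buckets else 0.0
            for t in range(duration)]


def min_max_normalize(series: List[float]) -> List[float]:
    """Scale a series to [0, 1]; a constant series maps to all zeros."""
    lo, hi = min(series), max(series)
    if hi == lo:
        return [0.0 for _ in series]
    return [(x - lo) / (hi - lo) for x in series]
```

For example, two requests at 0.1 s and 0.9 s with delays 10 and 30 produce a single value of 20 for second 0.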
According to step B, the invention calculates the degree of abnormality of the individual microservices and of the system as a whole, where the parameter $L_w$ is 50 seconds, $\lambda_i$ is 1.0, and $\theta_{ab}$ is 0.3. FIG. 5 shows the overall anomaly level of the microservice system over the 7199 seconds. The calculation finally selects time point 4653 as the anomaly time point of the microservice system, and with $L_{pre}, L_{post} = 0, 280$ the output fault interval is [4653, 4933].
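The detection rule of step B, as detailed in claim 4, can be sketched as follows. The formula for the system anomaly level is only available as an image in the source, so the weighted sum $\sum_i \lambda_i \sigma_i(t)$ used here is an assumption that matches the surrounding definitions. The trailing-window standard deviation and the clipping of the interval to the available data are likewise choices of this sketch, and the function names are hypothetical.

```python
import statistics
from typing import List, Optional, Sequence, Tuple


def moving_std(series: Sequence[float], L_w: int) -> List[float]:
    """Population standard deviation over the trailing window of up to L_w samples."""
    return [statistics.pstdev(series[max(0, t - L_w + 1): t + 1])
            for t in range(len(series))]


def detect_abnormal_interval(
    metrics: Sequence[Sequence[float]],
    lambdas: Sequence[float],
    L_w: int,
    theta_ab: float,
    L_pre: int,
    L_post: int,
) -> Optional[Tuple[int, int]]:
    """Return (t_e - L_pre, t_e + L_post) clipped to the data, or None if never abnormal."""
    n, T = len(metrics), len(metrics[0])
    sigmas = [moving_std(m, L_w) for m in metrics]
    # Assumed form of the system anomaly level: weighted sum of per-service levels.
    s_ab = [sum(lam * sig[t] for lam, sig in zip(lambdas, sigmas)) for t in range(T)]
    abnormal = [t for t in range(T) if s_ab[t] > theta_ab * n]
    if not abnormal:
        return None
    t_e = max(abnormal, key=lambda t: s_ab[t])  # most abnormal time point
    return max(0, t_e - L_pre), min(T - 1, t_e + L_post)
```

With the parameters of this embodiment, a most abnormal time point of 4653 together with $L_{pre} = 0$ and $L_{post} = 280$ yields the interval [4653, 4933] reported above.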
According to step C, sliding-window dynamic correlation analysis is performed on the fault interval data generated in step B, and the correlation graph of the microservice system is finally generated. In this example, the parameter $L_b$ is 70, $\alpha$ is 0.1, and $\theta_e$ is 0.5. FIG. 6 shows the dependency graph of the microservice system.
According to step D, the invention performs reverse path tracking on the dependency graph of the microservice system, generates a series of candidate fault propagation chains, and finally provides a list of fault root cause services. The search is limited to 10000 fault propagation chains, the parameter $k$ is 50, $n_{lead}$ is 3, and $c_{path}$ and $c_{corr}$ are both 1.0. The top 10 fault propagation chains obtained in this example and their estimated probabilities are shown in Table 2. The resulting fault root cause service list and the corresponding abnormality scores are shown in Table 3.
Fault propagation chain               Estimated existence probability
[14,21,5,17,27]                       0.3976
[14,21,5,17,16,20,11,29,2,12,7]       0.3645
[14,21,5,17,16,20,25,29,2,12,7]       0.3645
[14,21,5,17,16,20,25,6,31,30,33]      0.3619
[14,21,5,17,16,20,11,6,31,30,12]      0.3619
[14,21,5,17,16,20,11,6,31,30,33]      0.3619
[14,21,5,17,16,20,25,6,31,30,12]      0.3619
[14,21,5,17,16,20,11,6,30,12,7]       0.3568
[14,21,5,17,16,20,25,6,30,12,7]       0.3568
[14,21,5,17,28,30,12,7,19,29,2]       0.3551
Table 2: Fault propagation chains in the example system
Fault service    Abnormality score
30               0.5618049
31               0.5271997
6                0.4575075
28               0.4018228
33               0.3257960
12               0.3107594
7                0.2501501
29               0.2021703
19               0.1946196
3                0.1944521
27               0.1828436
17               0.0949014
5                0.0850567
2                0.0765803
Table 3: Fault root cause service results in the example system
In order to demonstrate the capability of the dynamic correlation curve proposed by the present invention to describe fault propagation, FIG. 7 shows the dynamic correlation curves on a fault propagation chain. For the propagation process of the fault in the microservice architecture information system ($v_{27} \to v_{22} \to v_{21} \to v_{14}$), it can be seen that the dynamic correlation curves also reflect the gradual movement of the fault along the time dimension.
The deployment information of the microservices can then be used to locate the host on which the faulty microservice runs. By checking the state of that host (CPU utilization, memory utilization, and disk read/write activity), it can be judged whether a hardware-level fault caused the microservice abnormality. If the hardware on which the microservice runs has no problem, the parameters of the container management platform Docker Swarm used to deploy the microservice are checked to determine whether a misconfiguration caused the microservice to fail. If neither shows a problem, the fault is considered to lie in the software code of the microservice, which then requires further analysis. Combining this with the fault propagation chain associated with the root cause microservice allows the problem of the microservice architecture information system to be judged more accurately. If the microservices on the fault propagation chain are all deployed on one host or in one network, the fault may be a hardware problem of that host or a problem of the network devices (switch, router), and the operation and maintenance personnel can then inspect these hardware devices.
After verification with the enterprise operation and maintenance personnel of the microservice system, the fault root cause diagnosis achieved 100% accuracy in this case: all 4 actual fault root cause services are among the first 4 entries of the resulting fault service list.
It is noted that the disclosed embodiments are intended to aid in further understanding of the invention, but those skilled in the art will appreciate that: various substitutions and modifications are possible without departing from the spirit and scope of the invention and appended claims. Therefore, the invention should not be limited to the embodiments disclosed, but the scope of the invention is defined by the appended claims.

Claims (6)

1. A fault root cause positioning method of a micro-service architecture information system is characterized in that a root cause positioning algorithm based on a fault propagation chain model is designed by establishing a dynamic correlation analysis method among micro-services, the propagation process of related faults is identified while fault root cause services are positioned, a fault propagation chain is generated, the interpretability of fault positioning and diagnosis is improved, and the method can be used in the micro-service architecture information system; the method comprises the following steps:
A. acquiring micro-service performance index data of a micro-service architecture information system; the performance indicator data includes request delay time series data;
B. detecting to obtain micro-service abnormal interval data, and identifying whether the micro-service architecture information system is abnormal or not; the abnormal interval data of microservice $v_i$ is recorded as Figure FDA0002833848160000011;
C. Constructing a micro-service dependency graph of a micro-service architecture information system; the method comprises the following steps:
C1. analyzing the time sequence dynamic association to obtain a dynamic association relation between the micro services;
$v_i$ and $v_j$ are two microservices; for the microservice relationship edge $v_i \to v_j$, obtaining the abnormal interval data Figure FDA0002833848160000012 of time length $T$, and enumerating all possible sub-intervals according to the minimum step length $L_b$, denoted as sliding windows $[s_b, e_b]$, $b = 1, \dots, n_{win}$, wherein $s_b$ and $e_b$ are respectively the start point and end point of a sliding window and $n_{win}$ is the total number of enumerated sliding windows; initializing a zero vector $C_{ij}$ of length $T$ to represent the dynamic correlation curve of $v_i \to v_j$;
for each sliding window $[s_b, e_b]$, performing a Granger causality test; if the null-hypothesis probability $p$ obtained by the test is less than the significance level $\alpha$, the correlation $v_i \to v_j$ exists on that sliding window, and the values of the dynamic correlation curve over the window interval ($C_{ij}[s_b : e_b]$) are increased by 1; otherwise nothing is done;
when all microservice pairs and all sliding windows have been processed, the resulting vectors $C_{ij}$ represent the dynamic correlation curves between microservices;
C2. setting a self-adaptive threshold value for judging whether fault association exists between the micro services and generating a micro service dependency graph of the micro service system;
C21. adopting a thresholding method to generate an association edge between the micro services;
for microservice $v_i$, summarizing all dynamic correlation curves $C_{ij}$, $j = 1, \dots, N$, starting from $v_i$, wherein $N$ is the number of microservices, and recording the statistic in a vector $h$ of length $N$, expressed as:
$$h_j = \sum_t C_{ij}(t)$$
wherein $C_{ij}(t)$ is the dynamic correlation curve from microservice $v_i$ to $v_j$;
C22. computing the adaptive threshold of microservice $v_i$, $\tau_i = \theta_e \cdot \max(h)$, judging the correlation strength of the edges, and generating the microservice dependency graph $G$;
for each edge $v_i \to v_j$, $j = 1, \dots, N$, if $h_j \ge \tau_i$, the correlation strength of the edge is large enough, and the edge is added to the finally generated microservice dependency graph $G(V, E, W)$, wherein $V$ is the set of microservice nodes, $E$ is the set of correlation edges between microservices, and $W$ is the set of weights of those edges, with $w_{ij}$ set as $h_{ji}$;
D. Obtaining root cause causing the abnormality of the front-end micro-service by adopting a reverse tracking root cause analysis method, and finding out fault root cause service;
scoring the abnormality of each microservice with a reverse tracking root cause analysis algorithm, wherein the score comprises the path correlation strength and the correlation coefficient correlation strength; setting an abnormality score threshold, and considering a microservice whose abnormality score exceeds the threshold to be a root cause of the abnormality of the front-end microservice $v_{fe}$; the microservice abnormality score is calculated as follows:
D1. calculating the path correlation strength using a reverse tracking algorithm starting from the faulty front-end microservice $v_{fe}$;
the path correlation strength is used for measuring the possibility that the micro service causes the failure of the front-end micro service through the dependent topology of the micro service system; the specific calculation comprises the following steps:
step 1: with the faulty front-end microservice $v_{fe}$ as the end point, performing a reverse breadth-first search on the microservice dependency graph $G(V, E, W)$ of the microservice architecture system to obtain a series of paths that may represent the propagation process of the fault in the microservice system, namely fault propagation chains $P_i$, $P_i = \{i_1 \to \dots \to i_n\}$;
Step 2: estimating the existence probability of each fault propagation chain;
averaging the edge weights on fault propagation chain $P_i$ using the harmonic mean, expressed as:
Figure FDA0002833848160000021
wherein Figure FDA0002833848160000022 is the weight of edge Figure FDA0002833848160000023 on the fault propagation chain, and $N$ is the length of the fault propagation chain;
step 3: sorting all fault propagation chains $P_i$ by existence probability in descending order, and selecting the top $k$ chains $P_{r_1}, P_{r_2}, \dots, P_{r_k}$, wherein $P_{r_k}$ is the $r_k$-th fault propagation chain; taking the $n_{lead}$ leading microservices of each selected chain, counting the number of occurrences of each leading microservice, and dividing by $k \cdot n_{lead}$ to obtain the path correlation strength of that leading microservice, denoted $S_{path}(v_i)$;
D2. Calculating correlation coefficient correlation strength;
the correlation coefficient correlation strength is obtained by calculating the absolute correlation coefficient between each microservice and the faulty front-end microservice $v_{fe}$, namely:
Figure FDA0002833848160000024
wherein Figure FDA0002833848160000031 represents the abnormal-interval index data of microservice $v_i$, and Figure FDA0002833848160000032 represents the abnormal-interval index data of the front-end microservice $v_{fe}$;
D3. calculating the microservice abnormality score: adding the path correlation strength and the correlation coefficient correlation strength, namely $c_{path} S_{path}(v_i) + c_{corr} S_{corr}(v_i)$;
then ranking the microservices in descending order of abnormality score; the resulting microservice list $v_{\gamma_1}, v_{\gamma_2}, \dots, v_{\gamma_N}$ gives the candidate fault root cause services, thereby achieving fault root cause positioning.
2. The method for locating the fault root cause of the microservice architecture information system of claim 1, wherein each microservice in the microservice architecture information system is configured in a container by a Docker, different microservices are communicated by means of an HTTP API or a message queue, and a tool for collecting performance index data is provided for acquiring the performance index of each microservice according to a user request log or active sampling.
3. The method for locating the fault root cause of the microservice architecture information system as claimed in claim 1, wherein the method for obtaining microservice performance indicator data of the microservice architecture information system comprises:
extracting delay information of each access request from a request log of the micro-service architecture information system, and averaging the request delay within 1 second to obtain request delay time sequence data of each micro-service per second;
or a Prometheus tool is adopted to directly obtain the request delay data per second from the microservice of the microservice architecture information system and construct a request delay time sequence;
and then, normalizing the acquired and output micro-service request delay time sequence.
4. The method as claimed in claim 1, wherein the collected microservice index data is recorded as a time series $M_i(t)$, and the abnormal interval of the microservices is detected by:
B1. calculating, for each microservice index, the moving standard deviation $\sigma_i(t)$ within a sliding window $L_w$, which indicates the degree of abnormality of the microservice;
B2. weighting all microservices according to importance level $\lambda_i$, and calculating the anomaly level of the whole system at time $t$:
Figure FDA0002833848160000033
wherein $S_{ab}(t)$ is the abnormal level of the system, $\sigma_i(t)$ denotes the abnormal level of microservice $v_i$, and $\lambda_i$ is the importance of microservice $v_i$, with value range $[0, +\infty)$;
B3. when the abnormal level $S_{ab}(t)$ of the system exceeds a given threshold $\theta_{ab} N$, wherein $\theta_{ab}$ is the threshold for abnormal interval detection with value range $(0, 1]$ and $N$ is the number of microservices, judging that the microservice system is in an abnormal state at that moment and needs fault diagnosis;
finding, among all time points in the abnormal state, the time with the highest abnormal level, recorded as $t_e$; the abnormal interval is then the time interval $[t_e - L_{pre} : t_e + L_{post}]$ of the performance index data, wherein $L_{pre}$ and $L_{post}$ represent the size of the interval and must not extend beyond the range of the available data; the data in this interval is recorded as the abnormal interval data
Figure FDA0002833848160000041
5. A system for implementing the method for locating a fault root cause of the micro-service architecture information system of claim 1, comprising the following modules: the system comprises an index data collection module, an abnormality detection module, a micro-service dependency graph construction module and a reverse tracking module; wherein:
the index data collection module is used for interfacing with a micro-service architecture information system which needs fault diagnosis, acquiring a request delay index of each micro-service and forming corresponding time sequence data;
the anomaly detection module is used for analyzing the request delay index data of the micro-service and detecting whether the micro-service architecture information system is in an abnormal state; when the micro-service architecture system is detected to be in an abnormal state, collecting micro-service index data reflecting the abnormality, forming abnormal interval data, and providing the abnormal interval data for a subsequent module for further fault analysis;
the micro-service dependency graph construction module is used for analyzing when the micro-service architecture information system is abnormal, constructing a micro-service dependency graph of the micro-service architecture information system when a fault occurs by performing dynamic association analysis on the micro-services in pairs, and restoring a possible fault propagation mode for fine-grained fault root cause positioning and fault chain extraction of a subsequent module;
the reverse tracking module is used for analyzing the generated micro-service dependency graph, carrying out reverse path search by taking the abnormal front-end micro-service as an entrance, estimating the probability of each path, forming a high-possibility fault propagation chain, estimating the root cause probability of the micro-service by combining the index data correlation coefficient of the micro-service, and generating a final fault root cause list.
6. The system of claim 5, wherein the index data collection module is further configured to obtain the request delay index for each microservice by analyzing a microservice call log or by using a Prometheus monitoring tool.
CN202011468424.2A 2020-12-14 2020-12-14 Fault root cause positioning method and system of micro-service architecture information system Active CN112698975B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011468424.2A CN112698975B (en) 2020-12-14 2020-12-14 Fault root cause positioning method and system of micro-service architecture information system

Publications (2)

Publication Number Publication Date
CN112698975A true CN112698975A (en) 2021-04-23
CN112698975B CN112698975B (en) 2022-09-27

Family

ID=75507890

Country Status (1)

Country Link
CN (1) CN112698975B (en)

CN117196651B (en) * 2023-08-09 2024-05-03 首都经济贸易大学 Enterprise abnormity monitoring method and device based on data asynchronous processing and storage medium
CN117520040A (en) * 2024-01-05 2024-02-06 中国民航大学 Micro-service fault root cause determining method, electronic equipment and storage medium
CN117520040B (en) * 2024-01-05 2024-03-08 中国民航大学 Micro-service fault root cause determining method, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN112698975B (en) 2022-09-27

Similar Documents

Publication Publication Date Title
CN112698975B (en) Fault root cause positioning method and system of micro-service architecture information system
CN109933452B (en) Micro-service intelligent monitoring method facing abnormal propagation
US11500757B2 (en) Method and system for automatic real-time causality analysis of end user impacting system anomalies using causality rules and topological understanding of the system to effectively filter relevant monitoring data
EP3745272B1 (en) An application performance analyzer and corresponding method
US9389946B2 (en) Operation management apparatus, operation management method, and program
US8086708B2 (en) Automated and adaptive threshold setting
US7693982B2 (en) Automated diagnosis and forecasting of service level objective states
US8635498B2 (en) Performance analysis of applications
US8650137B2 (en) Method and apparatus for creating state estimation models in machine condition monitoring
US8677191B2 (en) Early detection of failing computers
US20150219530A1 (en) Systems and methods for event detection and diagnosis
Hoffmann et al. Advanced failure prediction in complex software systems
US11250043B2 (en) Classification of log data
CN113852603B (en) Abnormality detection method and device for network traffic, electronic equipment and readable medium
US20060293777A1 (en) Automated and adaptive threshold setting
JP6183450B2 (en) System analysis apparatus and system analysis method
JP6564799B2 (en) Threshold determination device, threshold determination method and program
CN115237717A (en) Micro-service abnormity detection method and system
CN104639368A (en) Method and device for processing faults of communications network equipment
CN107426019A (en) Network failure determines method, computer equipment and computer-readable recording medium
US9235463B2 (en) Device and method for fault management of smart device
CN115599077B (en) Vehicle fault delimiting method and device, electronic equipment and storage medium
JP2016045556A (en) Inter-log cause-and-effect estimation device, system abnormality detector, log analysis system, and log analysis method
WO2022059720A1 (en) Structure diagnosis system, structure diagnosis method, and structure diagnosis program
CN115118621A (en) Micro-service performance diagnosis method and system based on dependency graph

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant