CN109933452B - Micro-service intelligent monitoring method facing abnormal propagation - Google Patents


Info

Publication number
CN109933452B
Authority
CN
China
Prior art keywords
service
abnormal
calling
monitoring
interface
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910220179.4A
Other languages
Chinese (zh)
Other versions
CN109933452A (en)
Inventor
王焘
张文博
薛晓东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Software of CAS
Original Assignee
Institute of Software of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Software of CAS
Priority to CN201910220179.4A
Publication of CN109933452A
Application granted
Publication of CN109933452B
Legal status: Active
Anticipated expiration

Landscapes

  • Debugging And Monitoring (AREA)

Abstract

The invention relates to an intelligent monitoring method for micro-services oriented to anomaly propagation. The method monitors service call information using agent (proxy) technology and builds a micro-service call topology graph to depict how anomalies propagate among micro-services; it models the correlation between interface calls and metrics with Lasso regression and detects abnormal micro-services by monitoring changes in the correlation model; and it evaluates the degree of abnormality of micro-services and their called interfaces with the PageRank algorithm. The invention thereby achieves transparent service monitoring, automatic metric prediction for discovering abnormal services, and intelligent evaluation of the degree of abnormality of nodes in the graph to locate the root cause of a problem.

Description

Micro-service intelligent monitoring method facing abnormal propagation
Technical Field
The invention relates to a fault diagnosis method for micro-service software systems, in particular to an intelligent micro-service monitoring method oriented to anomaly propagation, and belongs to the field of software technology.
Background
Monolithic and SOA architectures are the forms commonly adopted by software companies. After more than a decade of development, such software systems have become extremely complex, are hard to extend and maintain, and carry heavy technical debt. Competition on today's internet is fierce, and user requirements and market conditions change rapidly; under these conditions the scalability and flexibility of traditional software architectures are clearly insufficient, and the costs of design, development, testing, and operations rise markedly. The micro-service concept was proposed in response: it is a software architecture that builds a single application as a suite of small services, each running in its own process and communicating with the others through lightweight protocols. These characteristics suit agile development and continuous integration, address the pain points of traditional architectures, and have attracted wide attention and research from academia and industry.
After a software system is split into micro-services, maintainability and flexibility improve, but the dependencies between services become complicated, and both the probability of faults and the losses they cause increase. For example, on a high-traffic website, latency in a single service component may exhaust all application resources and trigger a so-called avalanche effect, which in severe cases brings down the whole system. Effective monitoring and rapid localization of fault causes are therefore key technologies for guaranteeing the reliability and performance of micro-services.
Existing work on micro-service fault diagnosis falls mainly into the following categories: (1) Diagnosis based on metric monitoring. This approach collects system runtime metrics such as CPU, memory, and network usage to reflect the current state of the application and its trend over a period of time. If a metric exceeds a preset threshold, the system is considered faulty and an alarm is triggered; an administrator then resolves the problem using the monitoring data together with personal experience (Wang T, Zhang W, Ye C, Wei J, Zhong H, Huang T. FD4C: Automatic Fault Diagnosis Framework for Web Applications in Cloud Computing. IEEE Transactions on Systems, Man, and Cybernetics: Systems. 2016, 46(1): 61-75; M. Farshchi, J. G. Schneider, I. Weber, and J. Grundy, "Metric selection and anomaly detection for cloud operations using log and metric correlation analysis," Journal of Systems and Software, 2018, 137: 531-549). (2) Monitoring and analysis based on logs. Logs clearly record the running state of the system, are easy to persist, and are easy to search, so they are generally an effective means of finding the cause of a fault and can also support further business goals (ELK. https://www.elastic.co/). (3) Monitoring and diagnosis based on distributed request tracing. This approach obtains request execution paths by tag-based methods and finds system faults by analyzing or comparing those paths (A. Nandi, A. Mandal, S. Atreja, G. B. Dasgupta, and S. Bhattacharya, "Anomaly Detection Using Program Control Flow Graph Mining from Execution Logs," 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, California, USA, 2016; T. Jia, P. Chen, L. Yang, Y. Li, F. Meng, and J. Xu, "An Approach for Anomaly Diagnosis Based on Hybrid Graph Model with Logs for Distributed Services," IEEE, pp. 25-32).
Fault diagnosis based on metrics and logs is simple to implement, but it cannot reflect the overall state of the system or trace a business flow, and it usually localizes a fault only to the level of a service component; in complex micro-service interactions, an administrator spends a great deal of time searching for and locating problems. Diagnosis based on distributed request tracing monitors request traces through logs or code instrumentation and uses them as a reference for fault diagnosis, but its monitoring scales poorly, it is not transparent to the application, and it does not consider anomaly propagation.
Disclosure of Invention
Technical problem solved by the invention: to overcome the shortcomings of the prior art and provide an efficient fault diagnosis system for micro-services. Transparent monitoring of service calls improves the scalability of the system and reduces the impact of monitoring on the running micro-services; analysis of the monitoring data enables fine-grained, interface-level localization of fault root causes.
The technical scheme of the invention is as follows: an intelligent monitoring method for micro-services oriented to anomaly propagation, comprising the following steps:
Step one, service call monitoring: service call information is monitored using agent (proxy) technology, and each service call relation is recorded as a tuple N_i = (requestUID, serviceUID, spanUID, parentUID, info), where requestUID is the request identifier, generated at the request entry point; serviceUID is the service identifier, uniquely determined by the service component and the instance number; spanUID is the identifier of the service call span; parentUID is the identifier of the parent span, and a value of -1 indicates that the current span is a root span; info holds the remaining information, represented by the tuple info = (serviceUID, startTime, endTime, duration), where startTime and endTime are the start and end times of the service call and duration is its execution time. Based on the monitored service call information, the service call topology graph is constructed as follows:
(1) In the initial stage, the topology graph G is empty and the set S contains the collected call information;
(2) Take from the set S the tuples that belong to the same request and have a call relation, add the service instances identified by serviceUID in those tuples to G as nodes, and add each call relation to G as a directed edge; a node or edge that already exists is not added again;
(3) If the set S is not empty, continue with (2); otherwise, the algorithm ends.
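The graph construction in step one can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Span` type, its field names, and the sample service names are hypothetical, and it assumes that a call relation exists whenever a span's parentUID matches the spanUID of another span in the same request.

```python
from collections import defaultdict
from typing import NamedTuple


class Span(NamedTuple):
    """One monitored service call (hypothetical field names)."""
    request_uid: str
    service_uid: str   # service component plus instance number
    span_uid: str
    parent_uid: str    # "-1" marks a root span
    duration_ms: float


def build_topology(spans):
    """Return (nodes, edges) of the service call topology graph G.

    Nodes are service instances. A directed edge (caller, callee) is added
    when a span's parent span is found within the same request. Sets are
    used so that an existing node or edge is never added twice.
    """
    by_request = defaultdict(dict)
    for s in spans:
        by_request[s.request_uid][s.span_uid] = s

    nodes, edges = set(), set()
    for request_spans in by_request.values():
        for s in request_spans.values():
            nodes.add(s.service_uid)
            parent = request_spans.get(s.parent_uid)  # None for root spans
            if parent is not None:
                edges.add((parent.service_uid, s.service_uid))
    return nodes, edges


# Sample data (invented for illustration).
spans = [
    Span("r1", "front-end#0", "s1", "-1", 30.0),
    Span("r1", "orders#0", "s2", "s1", 20.0),
    Span("r1", "orders-db#0", "s3", "s2", 5.0),
    Span("r2", "front-end#0", "s4", "-1", 12.0),
    Span("r2", "catalogue#0", "s5", "s4", 8.0),
    Span("r3", "front-end#0", "s6", "-1", 25.0),
    Span("r3", "orders#0", "s7", "s6", 18.0),   # repeats an existing edge
]
nodes, edges = build_topology(spans)
print(sorted(nodes))
print(sorted(edges))
```

Grouping by requestUID first keeps parent lookups local to one request, which mirrors the "same request and have a call relation" condition in (2).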
Step two, abnormal service detection: a correlation model is constructed between the number of calls to the interfaces in a service and the monitored metrics of the service, as follows:
(1) Collect the monitoring data of the metrics within the service and the call counts of all interfaces in the service. For a metric m within a service, the vector
$$X_i(t) = \big(x_{i1}(t), x_{i2}(t), \ldots, x_{iq}(t)\big)$$
represents the number of calls made by service i to the q interfaces within the service at time t, where
$$x_{i1}(t)$$
is the number of times service i calls interface 1 within the service at time t. The counts are standardized and used as the explanatory variables of the Lasso regression model. $Y_t$ denotes the monitored value of metric m at time t and is used as the response variable of the Lasso regression model;
(2) Construct a Lasso regression model from these data. The independent variable of the model is the vector of service interface call counts obtained in (1), and the dependent variable is the monitored value of metric m at time t. The regression model is:
$$Y_t = \sum_{i=1}^{p} \sum_{j=1}^{q} \omega_{ij}\, x_{ij}(t) + \alpha$$
where
$$\omega_{ij}$$
are the regression coefficients, α is the random error term, p is the number of services that initiate calls to the service, and q is the number of interfaces within the service. Under the constraint
$$\sum_{i=1}^{p} \sum_{j=1}^{q} |\omega_{ij}| \le c$$
the coordinate descent method is used to find the regression coefficients and error term that minimize
$$\sum_{t} \Big( Y_t - \sum_{i=1}^{p} \sum_{j=1}^{q} \omega_{ij}\, x_{ij}(t) - \alpha \Big)^2$$
(3) Select the tuning parameter c by generalized cross-validation, which has the form:
$$\mathrm{GCV}(c) = \frac{\mathrm{RSS}(c)}{N \big(1 - p(c)/N\big)^2}$$
where RSS(c) is the residual sum of squares:
$$\mathrm{RSS}(c) = \sum_{t} \big( Y_t - \hat{Y}_t \big)^2$$
$\hat{Y}_t$ is the value predicted by the Lasso model, p(c) is the number of effective regression coefficients in the Lasso regression, and N is the number of monitored values;
(4) While the service is running, predict the metric value with the Lasso regression model and compute the residual:
$$R_t = Y_t - \hat{Y}_t$$
When the absolute value of the residual exceeds a set threshold, the metric is determined to be abnormal, and the service is in turn considered abnormal;
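Step two can be illustrated with a small, dependency-free sketch. Several points are assumptions rather than details taken from the patent: the code solves the penalized form of the Lasso (penalty `lam` on the sum of absolute coefficients), which is the usual way coordinate descent is applied and corresponds to the constrained form for a suitable c; generalized cross-validation is computed as RSS / (N (1 - p/N)^2) with p taken as the number of non-zero coefficients; standardization is omitted; and all function names, the toy data, and the threshold are invented for illustration.

```python
def soft_threshold(z, lam):
    """Soft-thresholding operator used by the Lasso coordinate update."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def predict(X, w, alpha):
    """Predicted metric values alpha + sum_j w_j * x_j for each row of X."""
    return [alpha + sum(wj * xj for wj, xj in zip(w, row)) for row in X]


def fit_lasso(X, y, lam, n_iter=500):
    """Coordinate descent on (1/2N) * RSS + lam * sum(|w_j|)."""
    n, d = len(X), len(X[0])
    w, alpha = [0.0] * d, sum(y) / n
    for _ in range(n_iter):
        for j in range(d):
            rho = z = 0.0
            for row, yt in zip(X, y):
                # Residual with the contribution of coordinate j removed.
                partial = yt - alpha - sum(w[k] * row[k] for k in range(d) if k != j)
                rho += row[j] * partial
                z += row[j] ** 2
            w[j] = soft_threshold(rho / n, lam) / (z / n) if z > 0 else 0.0
        # The intercept is not penalized: set it to the mean of y - Xw.
        alpha = sum(yt - yh + alpha for yt, yh in zip(y, predict(X, w, alpha))) / n
    return w, alpha


def gcv(X, y, w, alpha):
    """Generalized cross-validation score RSS / (N * (1 - p/N)^2)."""
    n = len(y)
    rss = sum((yt - yh) ** 2 for yt, yh in zip(y, predict(X, w, alpha)))
    p_eff = sum(1 for wj in w if wj != 0.0)  # non-zero coefficients
    return rss / (n * (1 - p_eff / n) ** 2)


def select_model(X, y, grid):
    """Fit one model per penalty in grid and keep the lowest GCV score."""
    fits = [(lam,) + fit_lasso(X, y, lam) for lam in grid]
    return min(fits, key=lambda f: gcv(X, y, f[1], f[2]))


def is_abnormal(x, y_observed, w, alpha, threshold):
    """Flag a metric when |observed - predicted| exceeds the threshold."""
    residual = y_observed - predict([x], w, alpha)[0]
    return abs(residual) > threshold


# Toy training data: two interfaces; the metric depends only on the first.
X = [[1, 0], [2, 1], [3, 0], [4, 1], [5, 0], [6, 1], [7, 0], [8, 1]]
y = [1 + 2 * row[0] for row in X]
lam, w, alpha = select_model(X, y, grid=[0.01, 0.1, 1.0])
print(lam, w, alpha)
print(is_abnormal([5, 0], 11.0, w, alpha, threshold=1.0))  # close to the prediction
print(is_abnormal([5, 0], 20.0, w, alpha, threshold=1.0))  # far above the prediction
```

The L1 penalty drives the coefficient of the irrelevant second interface to exactly zero, which is the property the method relies on to keep only the interface calls that most influence a metric.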
Step three, fault service diagnosis: based on the data obtained in the first two steps, a fault propagation subgraph is constructed for all abnormal services according to their call relations. In the subgraph, the PageRank algorithm is used to score the degree of abnormality of each service, as follows:
(1) In the initial stage, the proportion of abnormal metrics in a service is used as the initial PR value of that service, and P = [p_0, p_1, ..., p_n]^T is the column vector formed by the initial PR values of the services, where p_i is the proportion of abnormal metrics in service i;
(2) Compute the PR value of service p_i as
$$P_k(p_i) = \frac{1-q}{N} + q \sum_{p_j \in I(p_i)} \frac{P_{k-1}(p_j)}{|O(p_j)|}$$
where P_k(p_i) is the score of service p_i at the k-th iteration, I(p_i) is the set of nodes that point to p_i, O(p_j) is the set of nodes that p_j points to, N is the number of services in the subgraph, and q is the damping coefficient, introduced to guarantee convergence of the algorithm;
(3) If |P_k - P_{k-1}| < δ, the iteration ends; otherwise, continue with (2);
(4) The services are ranked by score, and the service with the highest score is considered to be the one that caused the fault. Within that service, the degree of abnormality of each interface call is further scored using the established Lasso model, as follows:
(41) For the j-th interface, normalize the parameter ω_i in the Lasso model of each abnormal metric associated with it and the prediction residual of that abnormal metric, obtaining new values a_i and b_i;
(42) The anomaly score of the j-th interface is then
Figure BDA0002003349220000041
where n is the number of abnormal metrics associated with the j-th interface;
(43) The interfaces are sorted by degree of abnormality according to the anomaly scores calculated in (42).
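The service-level part of step three can be sketched as below. The code uses the standard damped PageRank iteration, which is an assumption about the exact form intended here, and it builds the fault propagation subgraph by keeping only the call edges whose two endpoints are both abnormal, which is also an assumption. Function names, service names, and the abnormal-metric proportions are invented for illustration; nodes without outgoing edges are not given any special treatment.

```python
def fault_subgraph(edges, abnormal):
    """Keep only call edges whose caller and callee are both abnormal."""
    return {(u, v) for (u, v) in edges if u in abnormal and v in abnormal}


def pagerank(nodes, edges, init, q=0.85, delta=1e-6, max_iter=100):
    """Iterate P_k(v) = (1 - q)/N + q * sum(P_{k-1}(u) / outdeg(u)) over callers u.

    init maps each node to its initial score (here, the proportion of
    abnormal metrics in the service). Iteration stops when the L1 change
    between two successive score vectors drops below delta.
    """
    nodes = sorted(nodes)
    n = len(nodes)
    out_deg = {v: 0 for v in nodes}
    incoming = {v: [] for v in nodes}
    for u, v in edges:
        out_deg[u] += 1
        incoming[v].append(u)

    score = {v: init[v] for v in nodes}
    for _ in range(max_iter):
        new = {
            v: (1 - q) / n + q * sum(score[u] / out_deg[u] for u in incoming[v])
            for v in nodes
        }
        diff = sum(abs(new[v] - score[v]) for v in nodes)
        score = new
        if diff < delta:
            break
    return score


# Call edges (caller, callee) and abnormal-metric proportions, both invented.
call_edges = {
    ("front-end#0", "orders#0"),
    ("orders#0", "orders-db#0"),
    ("front-end#0", "catalogue#0"),
}
abnormal = {"front-end#0": 0.25, "orders#0": 0.5, "orders-db#0": 0.75}

sub_edges = fault_subgraph(call_edges, abnormal)
scores = pagerank(abnormal.keys(), sub_edges, abnormal)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Because edges run from caller to callee, score accumulates on the services that abnormal callers depend on, so the most downstream abnormal service in this toy chain ranks first.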
The principle of the invention is as follows. Because micro-services are typically written in multiple languages, an agent-based mechanism is used to monitor the call relations among services, which makes call monitoring transparent to the services. When a service invokes an interface, the corresponding system resources are consumed and the monitored metric values change accordingly, so a correlation model between interface calls and metric values is built to describe this influence. To reduce model complexity and retain the interface calls that most affect a metric, Lasso regression is used to build the correlation model between interface call counts and metrics; abnormal metrics are identified from the model, and abnormal services are then identified from the proportion of abnormal metrics in each service. When a service becomes abnormal, it is likely to cause related services to become abnormal for a period of time. The service call topology graph is therefore used to describe the propagation of anomalies among services, and the PageRank algorithm is used to score the degree of abnormality of the services and find the service that caused the anomaly. Within the faulty service, the degree of abnormality of each interface is scored using the regression model between interface calls and metrics, and the faulty interface is finally located.
Compared with the prior art, the invention has the following advantages:
(1) Transparent service monitoring: service call monitoring is implemented with agent technology and is transparent to the services. Service developers do not need to make any modification, and the impact of call monitoring on application performance is kept to a minimum.
(2) Automatic abnormal service detection: a regression model between metrics and interface calls is built with Lasso regression. While the service is running, the system automatically predicts metric values with the regression model; if the absolute value of the residual exceeds a threshold, the service is considered abnormal, so abnormal services are discovered automatically.
(3) Fault root cause localization: a fault subgraph is constructed from the detected abnormal services and the service call topology graph. The subgraph reflects how the anomaly propagates, and the PageRank algorithm is then used to score the degree of abnormality of the services in the graph. Because PageRank reflects the influence of nodes in a graph, the services most likely to have caused the anomaly can be found.
Drawings
FIG. 1 is a flow chart of an implementation of the method of the present invention;
FIG. 2 is an environment of use of an example method of the invention.
Detailed Description
The present invention will be described in detail below with reference to specific embodiments and the accompanying drawings.
As shown in fig. 1, the anomaly-propagation-oriented micro-service fault diagnosis method of the invention comprises the following steps: (1) an agent is deployed in each service instance to collect the service call relations and the metric monitoring data of the service, and the data are persisted in a database; (2) in the cold start stage, a service call topology graph is constructed from the collected service call information, and a Lasso regression model is constructed from the collected metric data and service interface call counts; (3) in the service operation stage, the constructed Lasso regression model is used to monitor whether a service is abnormal; (4) when a service is abnormal, the service most likely to have caused the anomaly is found with the PageRank algorithm, and the abnormal interface call is located inside that service.
Fig. 2 shows the environment in which the method of the embodiment is used. The target micro-service application is Sock-Shop, and Kubernetes is used as the underlying runtime environment, with service instances deployed in pods: each of the 10 core services has one instance, the MongoDB service has three instances, and MySQL has one instance. An agent is deployed in each pod to monitor service call information and metric changes within the service. The load generator simulates user requests and generates load; the fault injector injects faults into the system through preset scripts in order to test the diagnostic performance of the fault diagnosis system; the fault diagnosis system performs fault diagnosis based on the collected data. The method provided by the invention is implemented in the fault diagnosis system.
The method of the embodiment of the invention comprises the following steps:
(1) Collect the monitored metric values of each service instance through the agent deployed in that instance, including CPU utilization, memory occupancy, disk I/O rate, requests per second, and service interface call counts, together with the service request call information;
(2) In the cold start stage, generate load with the load generator, collect the service request call information, record it as tuples N_i = (requestUID, serviceUID, spanUID, parentUID, info), and add the tuples to the set S;
(3) In the set S, group the tuples by requestUID and find the call relations between services within the same request among the tuples that share a requestUID. Add the services that have a call relation to the topology graph G, where a node of the graph is a service instance and an edge represents a call relation between services; a node or edge that already exists in the graph is not added again. Repeat this process until the set S is empty;
(4) Collect the monitored values of the metrics within the service and the call counts of the interfaces within the service as the response variable and the explanatory variables of the Lasso regression model, respectively. $Y_t$ denotes the monitored value of metric m at time t and is the response variable of the Lasso regression model, and the vector
$$X_i(t) = \big(x_{i1}(t), x_{i2}(t), \ldots, x_{iq}(t)\big)$$
represents the number of calls made by service i to the q interfaces within the service at time t and is the explanatory variable, where
$$x_{i1}(t)$$
is the number of times service i calls interface 1 within the service at time t. Finally, the data are standardized;
(5) Construct a Lasso regression model from these data, with the expression:
$$Y_t = \sum_{i=1}^{p} \sum_{j=1}^{q} \omega_{ij}\, x_{ij}(t) + \alpha$$
where $Y_t$ is the monitored value of metric m at time t, p is the number of services that initiate calls to the service, q is the number of interfaces within the service,
$$\omega_{ij}$$
are the regression coefficients,
$$x_{ij}(t)$$
is the number of times service i calls interface j within the service at time t, and α is the random error term. Under the constraint
$$\sum_{i=1}^{p} \sum_{j=1}^{q} |\omega_{ij}| \le c$$
the coordinate descent method is used to minimize
$$\sum_{t} \Big( Y_t - \sum_{i=1}^{p} \sum_{j=1}^{q} \omega_{ij}\, x_{ij}(t) - \alpha \Big)^2$$
where c is the tuning parameter;
(6) Select the tuning parameter c by generalized cross-validation, which has the form:
$$\mathrm{GCV}(c) = \frac{\mathrm{RSS}(c)}{N \big(1 - p(c)/N\big)^2}$$
where RSS(c) is the residual sum of squares:
$$\mathrm{RSS}(c) = \sum_{t} \big( Y_t - \hat{Y}_t \big)^2$$
$Y_t$ is the monitored value of metric m at time t, $\hat{Y}_t$ is the value predicted by the Lasso model, p(c) is the number of effective regression coefficients in the Lasso regression, and N is the number of monitored values;
(7) While the service is running, predict the metric value with the Lasso regression model and compute the residual:
$$R_t = Y_t - \hat{Y}_t$$
where $Y_t$ is the monitored value of metric m at time t. When the absolute value of the residual exceeds a set threshold, the metric is determined to be abnormal, and the service is in turn considered abnormal;
(8) Construct an anomaly propagation subgraph from the service call topology graph obtained in step (3) and the set of abnormal services obtained in step (7), and locate the faulty service with the PageRank algorithm;
(9) In the initial stage, the proportion of abnormal metrics in a service is used as the initial PR value of that service, and P = [p_0, p_1, ..., p_n]^T is the column vector formed by the initial PR values of the services, where p_i is the proportion of abnormal metrics in service i;
(10) Compute the PR value of each service with the formula
$$P_k(p_i) = \frac{1-q}{N} + q \sum_{p_j \in I(p_i)} \frac{P_{k-1}(p_j)}{|O(p_j)|}$$
where q is the damping coefficient, I(p_i) is the set of nodes that point to p_i, O(p_j) is the set of nodes that p_j points to, N is the number of services in the subgraph, and P_k(p_i) is the score of service p_i at the k-th iteration;
(11) After a number of iterations, when |P_k - P_{k-1}| < δ, the iteration ends;
(12) Rank the services by degree of abnormality according to their anomaly scores; the service with the highest score is the one most likely to have caused the anomaly. Within that service, score the degree of abnormality of each interface according to the Lasso model constructed in step (5);
(13) For the j-th interface, normalize the parameter ω_i in the Lasso model of each abnormal metric associated with it and the prediction residual of that abnormal metric, obtaining new values a_i and b_i;
(14) The anomaly score of the j-th interface is then
Figure BDA0002003349220000071
where n is the number of abnormal metrics associated with the j-th interface;
(15) Sort the interfaces by degree of abnormality according to the anomaly scores obtained in step (14). In this way, the fault root cause service and the abnormal interface within that service are found.
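The interface-level scoring of steps (13) to (15) can be sketched as follows. The scoring formula itself appears only as an image in the source, so the combination used here, a sum of a_i * b_i over the abnormal metrics associated with an interface, is purely an assumption made for illustration, as is the normalization by the largest absolute value. The function names, metric names, interface names, and numbers are all hypothetical.

```python
def scale_by_max(values):
    """Normalize absolute values to [0, 1] by the largest one (assumed scheme)."""
    top = max(abs(v) for v in values)
    return [abs(v) / top if top > 0 else 0.0 for v in values]


def interface_scores(coefs, residuals):
    """Score interfaces inside the faulty service.

    coefs[m][j]   Lasso coefficient of interface j in the model of abnormal metric m
    residuals[m]  prediction residual of abnormal metric m
    ASSUMPTION: score_j = sum over associated abnormal metrics of a * b, where
    a is the normalized coefficient and b the normalized residual.
    """
    metrics = sorted(coefs)
    pairs = [(m, j) for m in metrics for j in sorted(coefs[m])]
    a = dict(zip(pairs, scale_by_max([coefs[m][j] for m, j in pairs])))
    b = dict(zip(metrics, scale_by_max([residuals[m] for m in metrics])))

    scores = {}
    for m, j in pairs:
        if coefs[m][j] != 0.0:  # metric m is associated with interface j
            scores[j] = scores.get(j, 0.0) + a[(m, j)] * b[m]
        else:
            scores.setdefault(j, 0.0)
    return scores


# Invented coefficients and residuals for two abnormal metrics of one service.
coefs = {
    "cpu": {"GET /orders": 0.8, "POST /orders": 0.1},
    "memory": {"GET /orders": 0.4, "POST /orders": 0.0},
}
residuals = {"cpu": 6.0, "memory": 3.0}

scores = interface_scores(coefs, residuals)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking, scores)
```

Under this assumed formula, an interface ranks high when it has large coefficients in the models of metrics that also show large residuals, which matches the intent of locating the interface call most responsible for the abnormal metrics.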
In summary, the invention monitors service call information using agent technology and builds a micro-service call topology graph to depict how anomalies propagate among micro-services; it models the correlation between interface calls and metrics with Lasso regression and detects abnormal micro-services by monitoring changes in the correlation model; and it evaluates the degree of abnormality of micro-services and their called interfaces with the PageRank algorithm. The invention thereby achieves transparent service monitoring, automatic metric prediction for discovering abnormal services, and intelligent evaluation of the degree of abnormality of nodes in the graph to locate the root cause of a problem.

Claims (2)

1. An intelligent micro-service monitoring method oriented to anomaly propagation, characterized by comprising the following steps:
Step one, service call monitoring: service call information is monitored using agent technology, and service call relations are recorded with the tuple N = (requestUID, serviceUID, spanUID, parentUID, info), wherein requestUID is the request identifier and is generated at the request entry point; serviceUID is the service identifier; a span represents one service call; spanUID is the identifier of the service call span; parentUID is the identifier of the parent span, and if it is -1 the current span is a root span; info is the other related information contained in the span, info = (serviceUID, startTime, endTime, duration), wherein startTime and endTime are the start time and end time of the service call and duration is the execution time of the service call; a service call topology graph is constructed from the monitored service call information to depict anomaly propagation;
Step two, abnormal service detection: a correlation model between the call counts of the service interfaces and the monitored metrics of the service is constructed, and all abnormal services are detected, with the following specific steps:
(1) monitoring service interface calls:
$$X_i(t) = \big(x_{i1}(t), x_{i2}(t), \ldots, x_{iq}(t)\big)$$
is the vector of the call counts of the q service interfaces in service i at time t, where
$$x_{i1}(t)$$
is the number of calls to service interface 1 in service i at time t;
(2) modeling resources with Lasso regression: the independent variable of the regression model is the vector of service interface call counts obtained in step (1), the dependent variable is the monitored value of a metric m at time t, and the constructed regression model is:
$$Y_t = \sum_{i=1}^{p} \sum_{j=1}^{q} \omega_{ij}\, x_{ij}(t) + \alpha$$
wherein
$$\omega_{ij}$$
are the regression coefficients and α is the random error term; under the constraint
$$\sum_{i=1}^{p} \sum_{j=1}^{q} |\omega_{ij}| \le c$$
the coordinate descent method is used to solve for the regression coefficients and error term that minimize
$$\sum_{t} \Big( Y_t - \sum_{i=1}^{p} \sum_{j=1}^{q} \omega_{ij}\, x_{ij}(t) - \alpha \Big)^2$$
wherein c is the tuning parameter; p is the number of services; q is the number of interfaces in a service; $Y_t$ is the monitored value of a metric m at time t; i is the service index; j is the interface number within the service;
(3) detecting abnormal resources: while the service is running, the resource metric value of the service is predicted with the Lasso regression model constructed in step (2), and the residual is calculated:
$$R_i(t) = Y_i(t) - \hat{Y}_i(t)$$
wherein $Y_i(t)$ is the monitored value of the metric and
$$\hat{Y}_i(t)$$
is the metric value predicted by the Lasso model; when the absolute value of the residual is greater than a set threshold, the metric is determined to be abnormal and the service to which the metric belongs is detected as abnormal, and finally all abnormal services are detected;
Step three, fault service diagnosis: a fault propagation subgraph is constructed from all the abnormal services detected in step two and the service call topology graph constructed in step one, and the degree of abnormality of each service is evaluated with the PageRank algorithm;
Step four, within the faulty service, based on the parameters of the constructed Lasso regression model
$$\omega_{ij}$$
and the prediction residual $R_i(t)$, the interface call that caused the anomaly is further found.
2. The intelligent micro-service monitoring method oriented to anomaly propagation according to claim 1, characterized in that: in step four, within the faulty service, the interface call that caused the anomaly is found based on the parameters and the prediction residual of the constructed Lasso regression model, specifically as follows:
(41) for the j-th interface, the parameter ω_i in the Lasso model of each abnormal metric associated with it and the prediction residual R_i(t) of that abnormal metric are normalized to obtain new values a_i and b_i;
(42) the anomaly score of the j-th interface is then
Figure FDA0002459973990000021
wherein n is the number of abnormal metrics associated with the j-th interface;
(43) the interfaces are sorted by degree of abnormality according to the anomaly scores calculated in step (42), thereby finding the interface call that caused the anomaly.
CN201910220179.4A 2019-03-22 2019-03-22 Micro-service intelligent monitoring method facing abnormal propagation Active CN109933452B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910220179.4A CN109933452B (en) 2019-03-22 2019-03-22 Micro-service intelligent monitoring method facing abnormal propagation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910220179.4A CN109933452B (en) 2019-03-22 2019-03-22 Micro-service intelligent monitoring method facing abnormal propagation

Publications (2)

Publication Number Publication Date
CN109933452A CN109933452A (en) 2019-06-25
CN109933452B true CN109933452B (en) 2020-06-19

Family

ID=66988052

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910220179.4A Active CN109933452B (en) 2019-03-22 2019-03-22 Micro-service intelligent monitoring method facing abnormal propagation

Country Status (1)

Country Link
CN (1) CN109933452B (en)

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110427275B (en) * 2019-07-11 2022-11-18 复旦大学 Micro-service potential error and fault source prediction method based on track log learning
CN112231187B (en) * 2019-07-15 2022-07-26 华为技术有限公司 Micro-service abnormity analysis method and device
CN110442641B (en) * 2019-08-06 2022-07-12 中国工商银行股份有限公司 Link topology graph display method and device, storage medium and equipment
CN112667457A (en) * 2019-10-16 2021-04-16 烽火通信科技股份有限公司 Method and system for monitoring service call under micro-service architecture
CN110825589B (en) * 2019-11-07 2024-01-05 字节跳动有限公司 Abnormality detection method and device for micro-service system and electronic equipment
CN112817785A (en) * 2019-11-15 2021-05-18 亚信科技(中国)有限公司 Anomaly detection method and device for micro-service system
CN111190756B (en) * 2019-11-18 2023-04-28 中山大学 Root cause positioning algorithm based on call chain data
CN111309567B (en) * 2020-01-23 2024-03-29 阿里巴巴集团控股有限公司 Data processing method, device, database system, electronic equipment and storage medium
CN111597070B (en) * 2020-07-27 2020-11-27 北京必示科技有限公司 Fault positioning method and device, electronic equipment and storage medium
CN112118127B (en) * 2020-08-07 2021-11-09 中国科学院软件研究所 Service reliability guarantee method based on fault similarity
CN112698975B (en) * 2020-12-14 2022-09-27 北京大学 Fault root cause positioning method and system of micro-service architecture information system
CN112615743B (en) * 2020-12-18 2022-09-16 南京云柜网络科技有限公司 Topological graph drawing method and device
CN113190373B (en) * 2021-05-31 2022-04-05 中国人民解放军国防科技大学 Micro-service system fault root cause positioning method based on fault feature comparison
CN113626288B (en) * 2021-08-12 2023-08-25 杭州朗和科技有限公司 Fault processing method, system, device, storage medium and electronic equipment
CN114024837B (en) * 2022-01-06 2022-04-05 杭州乘云数字技术有限公司 Fault root cause positioning method of micro-service system
CN114598742A (en) * 2022-03-04 2022-06-07 北京北信源软件股份有限公司 Micro-service importance determination method, device, electronic equipment and storage medium
CN115314559B (en) * 2022-08-03 2023-09-29 苏州创意云网络科技有限公司 Network service system, abnormal response method thereof, service unit, scheduling processing unit, electronic device and computer storage medium
CN115396341B (en) * 2022-08-16 2023-12-05 度小满科技(北京)有限公司 Service stability evaluation method and device, storage medium and electronic device
CN117520040B (en) * 2024-01-05 2024-03-08 中国民航大学 Micro-service fault root cause determining method, electronic equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103986625A (en) * 2014-05-29 2014-08-13 中国科学院软件研究所 Cloud application fault diagnosis system based on statistical monitoring
CN107766205A (en) * 2017-10-10 2018-03-06 武汉大学 Monitoring system and method for tracing the micro-service invocation process
CN108322351A (en) * 2018-03-05 2018-07-24 北京奇艺世纪科技有限公司 Method and device for generating a topology graph, and fault determination method and device
CN108762908A (en) * 2018-05-31 2018-11-06 阿里巴巴集团控股有限公司 System call anomaly detection method and device
CN109144724A (en) * 2018-07-27 2019-01-04 众安信息技术服务有限公司 Micro-service resource scheduling system and method
CN109213616A (en) * 2018-09-25 2019-01-15 江苏润和软件股份有限公司 Anomaly detection method for micro-service software systems based on call graph analysis
CN109254865A (en) * 2018-09-25 2019-01-22 江苏润和软件股份有限公司 Root cause localization method for cloud data center service anomalies based on statistical analysis

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9599531B1 (en) * 2015-12-21 2017-03-21 International Business Machines Corporation Topological connectivity and relative distances from temporal sensor measurements of physical delivery system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Dapper, a Large-Scale Distributed Systems Tracing Infrastructure; Sigelman, Benjamin H. et al.; Google Technical Report; 2010-04-30; pp. 1-14 *

Also Published As

Publication number Publication date
CN109933452A (en) 2019-06-25

Similar Documents

Publication Publication Date Title
CN109933452B (en) Micro-service intelligent monitoring method facing abnormal propagation
Wu et al. MicroRCA: Root cause localization of performance issues in microservices
Gainaru et al. Fault prediction under the microscope: A closer look into HPC systems
Jiang et al. Modeling and tracking of transaction flow dynamics for fault detection in complex systems
Li et al. Practical root cause localization for microservice systems via trace analysis
US9280436B2 (en) Modeling a computing entity
EP3745272B1 (en) An application performance analyzer and corresponding method
Zhang et al. Ensembles of models for automated diagnosis of system performance problems
Notaro et al. A survey of AIOps methods for failure management
US8655623B2 (en) Diagnostic system and method
CN112698975A (en) Fault root cause positioning method and system of micro-service architecture information system
Soualhia et al. Infrastructure fault detection and prediction in edge cloud environments
Jiang et al. Efficient fault detection and diagnosis in complex software systems with information-theoretic monitoring
Wu et al. Performance diagnosis in cloud microservices using deep learning
CN115348159B (en) Micro-service fault positioning method and device based on self-encoder and service dependency graph
Ehlers et al. A self-adaptive monitoring framework for component-based software systems
CN114201326A (en) Micro-service anomaly diagnosis method based on attribute relation graph
Li et al. Fighting the fog of war: Automated incident detection for cloud systems
Fu et al. Performance issue diagnosis for online service systems
CN115118621B (en) Dependency graph-based micro-service performance diagnosis method and system
Wu et al. Causal inference techniques for microservice performance diagnosis: Evaluation and guiding recommendations
Pan et al. Faster, deeper, easier: crowdsourcing diagnosis of microservice kernel failure from user space
Yu et al. TraceRank: Abnormal service localization with dis‐aggregated end‐to‐end tracing data in cloud native systems
Ding et al. Online prediction and improvement of reliability for service oriented systems
Munawar et al. Adaptive monitoring in enterprise software systems

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant