CN109933452B

CN109933452B - Micro-service intelligent monitoring method facing abnormal propagation

Info

Publication number: CN109933452B
Application number: CN201910220179.4A
Authority: CN
Inventors: 王焘; 张文博; 薛晓东
Original assignee: Institute of Software of CAS
Current assignee: Institute of Software of CAS
Priority date: 2019-03-22
Filing date: 2019-03-22
Publication date: 2020-06-19
Anticipated expiration: 2039-03-22
Also published as: CN109933452A

Abstract

The invention relates to an intelligent microservice monitoring method oriented to anomaly propagation. The method monitors service call information using agent (proxy) technology and builds a microservice call topology graph to describe how anomalies propagate among microservices; it uses Lasso regression to model the correlation between interface calls and metrics, and detects abnormal microservices by monitoring changes in that correlation model; and it uses the PageRank algorithm to evaluate the degree of abnormality of each microservice and of its called interfaces. The invention thereby achieves transparent service monitoring, automatic prediction of metric values to discover abnormal services, and intelligent evaluation of the abnormality of nodes in the graph to locate the root cause of a problem.

Description

Micro-service intelligent monitoring method facing abnormal propagation

Technical Field

The invention relates to a fault diagnosis method for microservice software systems, in particular to an intelligent microservice monitoring method oriented to anomaly propagation, and belongs to the field of software technology.

Background

Monolithic architecture and service-oriented architecture (SOA) are the architectural styles most commonly adopted by software companies. After more than a decade of development, software systems built this way have become extremely complex, are hard to extend and maintain, and carry heavy technical debt. Internet competition is fierce, and user requirements and market conditions change rapidly; for such applications the scalability and flexibility of traditional software architectures are clearly insufficient, and the costs of design, development, testing, and operations rise markedly. The microservice concept was proposed in response: it is a software architecture that builds a single application as a suite of small services, each running in its own process and communicating with the others through lightweight protocols. These characteristics suit agile development and continuous integration and address the pain points of traditional software architectures, and microservices have therefore attracted wide attention and research in both academia and industry.

Moving a software system to microservices improves maintainability and flexibility, but it also makes the dependencies between services more complex and increases both the probability of faults and the losses they cause. For example, on a high-traffic website, latency in a single service component can exhaust all application resources and trigger a so-called avalanche effect that, in severe cases, brings down the whole system. An effective monitoring system and rapid localization of fault causes are therefore key technologies for guaranteeing the reliability and performance of microservices.

Existing work on microservice fault diagnosis falls mainly into the following categories:

(1) diagnosis based on metric monitoring. This approach collects system runtime metrics such as CPU, memory, and network usage to reflect the current state of the application and its trend over a period of time. If a metric exceeds a preset threshold, the system is considered faulty and an alarm is triggered; an administrator then resolves the problem using the monitoring data and personal experience (Wang T, Zhang W, Ye C, Wei J, Zhong H, Huang T. FD4C: Automatic Fault Diagnosis Framework for Web Applications in Cloud Computing. IEEE Transactions on Systems, Man, and Cybernetics: Systems. 2016, 46(1): 61-75; M. Farshchi, J.G. Schneider, I. Weber, and J. Grundy, "Metric selection and anomaly detection for cloud operations using log and metric correlation analysis," Journal of Systems and Software, 2018, 137: 531-549);

(2) log-based monitoring and analysis. Logs clearly record how the system runs, are easy to persist and search, and are generally an effective means of finding the cause of a fault and supporting further business goals (ELK. https://www.elastic.co/);

(3) monitoring and diagnosis based on distributed request tracing. The execution path of a request is obtained by a tag-based method, and system faults are found by analyzing or comparing execution paths (A. Nandi, A. Mandal, S. Atreja, G.B. Dasgupta, and S. Bhattacharya, "Anomaly Detection Using Program Control Flow Graph Mining from Execution Logs," 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, California, USA, 2016; T. Jia, P. Chen, L. Yang, Y. Li, F. Meng, and J. Xu, "An Approach for Anomaly Diagnosis Based on Hybrid Graph Model with Logs for Distributed Services," IEEE International Conference on Web Services, 2017, pp. 25-32).

Metric-based and log-based monitoring and fault diagnosis are simple to implement, but they cannot reflect the overall state of the system or trace the service flow. The fault is usually located only at the level of a service component, and with complex microservice interactions an administrator spends a great deal of time searching for and locating problems. Diagnosis based on distributed request tracing monitors request traces through logs or code instrumentation as a reference for fault diagnosis, but it scales poorly, is not transparent to the application, and does not consider anomaly propagation.

Disclosure of Invention

Technical problem solved by the invention: to overcome the shortcomings of the prior art by providing an efficient fault diagnosis system for microservices. Monitoring service calls transparently improves the scalability of the system and reduces the impact of monitoring on the running microservices, and analyzing the monitoring data achieves fine-grained, interface-level localization of fault root causes.

The technical scheme of the invention is as follows: an intelligent monitoring method for micro-services facing abnormal propagation comprises the following implementation steps:

Step one, service call monitoring: service call information is monitored using agent technology, and each service call relationship is recorded as a tuple $N_i$ = (requestUID, serviceUID, spanUID, parentUID, info), where requestUID is the request identifier, generated at the request entry point; serviceUID is the service identifier; spanUID is the identifier of the service call span; parentUID is the identifier of the parent span, and a parentUID of -1 means that the current span is a root span; and info holds the remaining information, info = (serviceUID, startTime, endTime, duration), where serviceUID is uniquely determined by the service component and instance number, startTime and endTime are the start and end times of the service call, and duration is the execution time of the service call. The service call topology graph is constructed from the monitored service call information as follows:

(1) in the initial stage, the topological graph G is empty, and the set S contains the collected calling information;

(2) taking out the tuples belonging to the same request and having a calling relationship from the set S, taking the service instance represented by the serviceUID in the tuples as a point, adding the calling relationship into G as a directed edge, and if the point or the edge already exists, not repeatedly adding;

(3) if the set S is not empty, then execution continues with (2). Otherwise, the algorithm ends.
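The graph construction in steps (1) to (3) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `Span` type, the `build_topology` function, and the sample service names are hypothetical, and edges are assumed to point from the calling service to the called service.

```python
from collections import defaultdict
from typing import NamedTuple


class Span(NamedTuple):
    """One monitored call: (requestUID, serviceUID, spanUID, parentUID, info)."""
    request_uid: str
    service_uid: str
    span_uid: str
    parent_uid: str  # "-1" marks a root span
    info: dict


def build_topology(spans):
    """Return (nodes, edges) of the service call topology graph G."""
    nodes, edges = set(), set()
    # Group spans by request so parent lookups stay within one request.
    by_request = defaultdict(dict)
    for span in spans:
        by_request[span.request_uid][span.span_uid] = span
    for request_spans in by_request.values():
        for span in request_spans.values():
            nodes.add(span.service_uid)
            parent = request_spans.get(span.parent_uid)
            if parent is not None:  # root spans have no caller
                # Assumed direction: caller -> callee.
                edges.add((parent.service_uid, span.service_uid))
    return nodes, edges


# Hypothetical spans from two requests.
spans = [
    Span("r1", "front-end#0", "s1", "-1", {}),
    Span("r1", "orders#0", "s2", "s1", {}),
    Span("r1", "payment#0", "s3", "s2", {}),
    Span("r2", "front-end#0", "s4", "-1", {}),
    Span("r2", "orders#0", "s5", "s4", {}),
]
nodes, edges = build_topology(spans)
```

Because nodes and edges are kept in sets, a node or edge that already exists is not added twice, which mirrors step (2).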

Step two, abnormal service detection: a correlation model is constructed between the call counts of the interfaces within a service and the monitored metrics of that service, specifically as follows:

(1) the monitoring data of the metrics within the service and the call counts of all interfaces within the service are collected. For a metric m within a service S, the vector

$$X_t = (x_{i,1}^t, x_{i,2}^t, \ldots, x_{i,q}^t)$$

represents the number of calls made by service i to the q interfaces within the service at time t, where $x_{i,j}^t$ is the number of times service i calls the interface numbered j within the service at time t. The call counts are standardized and used as the explanatory variables of the Lasso regression model. $Y_t$ denotes the monitored value of metric m at time t and is used as the response variable of the Lasso regression model;

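(2) a Lasso regression model is constructed from the above data. The independent variable of the model is the vector of service interface call counts obtained in step (1), and the dependent variable is the monitored value of a metric m at time t. The regression model is:

$$Y_t = \sum_{i=1}^{p} \sum_{j=1}^{q} \omega_{ij}\, x_{i,j}^t + \alpha$$

where $\omega_{ij}$ are the regression coefficients, $x_{i,j}^t$ is the number of times service i calls the interface numbered j at time t, p is the number of services that call the service, q is the number of interfaces within the service, and α is the random error term. Under the constraint

$$\sum_{i=1}^{p} \sum_{j=1}^{q} |\omega_{ij}| \le c$$

where c is the tuning parameter, the coordinate descent method is used to find the regression coefficients and error term that minimize

$$\sum_{t} \Bigl( Y_t - \sum_{i=1}^{p} \sum_{j=1}^{q} \omega_{ij}\, x_{i,j}^t - \alpha \Bigr)^2$$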

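(3) the tuning parameter c is selected by generalized cross-validation, which has the form:

$$GCV(c) = \frac{1}{N} \cdot \frac{rss(c)}{\bigl(1 - p(c)/N\bigr)^2}$$

where rss(c) is the residual sum of squares:

$$rss(c) = \sum_{t} \bigl( Y_t - \hat{Y}_t \bigr)^2$$

$\hat{Y}_t$ is the value predicted by the Lasso model fitted with tuning parameter c, p(c) is the number of effective regression coefficients in the Lasso regression, and N is the number of monitored metrics;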

(4) while the service is running, the metric value is predicted with the Lasso regression model and the residual is calculated:

$$R_t = Y_t - \hat{Y}_t$$

where $Y_t$ is the monitored value of the metric at time t and $\hat{Y}_t$ is the value predicted by the Lasso model. When the absolute value of the residual exceeds a set threshold, the metric is determined to be abnormal, and the service is in turn considered abnormal;
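Steps (1) to (4) can be illustrated with a small pure-Python sketch. It makes several assumptions that are not in the text above: the Lasso problem is solved in its penalized form (penalty weight `lam`) rather than the constrained form with bound c, which is a standard equivalent formulation; N in the generalized cross-validation score is taken to be the number of training samples; and the data, candidate penalties, and threshold are invented for illustration. All function names are hypothetical.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam (the Lasso coordinate update)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def predict(w, alpha, row):
    """Predicted metric value for one vector of interface call counts."""
    return alpha + sum(wk * xk for wk, xk in zip(w, row))


def fit_lasso(X, y, lam, n_iter=100):
    """Coordinate descent for (1/2N) * RSS + lam * sum(|w_j|)."""
    n, d = len(X), len(X[0])
    w, alpha = [0.0] * d, sum(y) / n
    for _ in range(n_iter):
        for j in range(d):
            rho = norm = 0.0
            for row, target in zip(X, y):
                # Prediction with interface j left out.
                partial = alpha + sum(w[k] * row[k] for k in range(d) if k != j)
                rho += row[j] * (target - partial)
                norm += row[j] ** 2
            w[j] = soft_threshold(rho / n, lam) / (norm / n) if norm else 0.0
        # Refit the unpenalized intercept.
        alpha = sum(t - sum(wk * xk for wk, xk in zip(w, row))
                    for row, t in zip(X, y)) / n
    return w, alpha


def gcv(X, y, w, alpha):
    """GCV = (1/N) * RSS / (1 - p/N)^2, with p the number of nonzero coefficients."""
    n = len(y)
    rss = sum((t - predict(w, alpha, row)) ** 2 for row, t in zip(X, y))
    p_eff = sum(1 for wk in w if wk != 0.0)
    return rss / (n * (1 - p_eff / n) ** 2)


def is_abnormal(w, alpha, row, observed, threshold):
    """Flag the metric when |residual| exceeds the threshold."""
    return abs(observed - predict(w, alpha, row)) > threshold


# Invented, already standardized call counts of two interfaces;
# the metric depends on interface 0 only.
X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [54.0, 54.0, 46.0, 46.0]

# Fit one model per candidate penalty and keep the one with the lowest GCV.
fits = {lam: fit_lasso(X, y, lam) for lam in (0.01, 0.1, 1.0)}
best_lam = min(fits, key=lambda lam: gcv(X, y, *fits[lam]))
w, alpha = fits[best_lam]

normal = is_abnormal(w, alpha, [1.0, 1.0], observed=54.2, threshold=2.0)
spike = is_abnormal(w, alpha, [1.0, 1.0], observed=70.0, threshold=2.0)
```

In this toy data the coefficient of the second interface is shrunk to exactly zero, which illustrates why Lasso is a natural fit here: it keeps only the interface calls that actually explain the metric.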

Step three, faulty service diagnosis: based on the data obtained in the first two steps, a fault propagation subgraph is constructed for all abnormal services according to their call relationships. Within the subgraph, the PageRank algorithm is used to score the degree of abnormality of each service, as follows:

(1) initially, the proportion of abnormal metrics in a service is used as the initial PR value of that service. $P = [p_0, p_1, \ldots, p_n]^T$ is the column vector of the initial PR values of the services, where $p_i$ is the proportion of abnormal metrics in service i;

(2) the PR value of service $p_i$ is computed as

$$P^{k+1}(p_i) = \frac{1-q}{N} + q \sum_{p_j \in I(p_i)} \frac{P^k(p_j)}{|O(p_j)|}$$

where $P^k(p_i)$ is the score of service $p_i$ at the k-th iteration, $I(p_i)$ is the set of nodes pointing to $p_i$, $O(p_j)$ is the set of nodes that $p_j$ points to, N is the number of services in the subgraph, and q is the damping coefficient, which guarantees convergence of the algorithm;

(3) if $|P^k - P^{k-1}| < \delta$, the iteration ends; otherwise step (2) is repeated;

(4) the services are ranked by score, and the service with the highest score is taken to be the one that caused the fault. Within that service, the degree of abnormality of each service interface call is further scored using the established Lasso model, as follows:

(41) for the j-th interface, the parameters $\omega_i$ in the Lasso models of the abnormal metrics associated with it, and the prediction residuals of those abnormal metrics, are normalized to obtain the new values $a_i$ and $b_i$;

(42) the anomaly score of the j-th interface is then

Wherein n is the number of anomaly metrics associated with the jth interface;

(43) the interfaces are ranked by degree of abnormality according to the anomaly scores calculated in step (42).
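The subgraph construction and PageRank scoring of step three can be sketched as follows. This is an illustration under assumptions rather than the method as claimed: it uses the common PageRank form with a (1 - q)/N teleport term, measures convergence with the L1 norm, and assumes edges run from caller to callee so that score flows toward the services that abnormal callers depend on. The function names, service names, and initial proportions are hypothetical.

```python
def fault_subgraph(edges, abnormal):
    """Keep only call edges whose two endpoints are both abnormal services."""
    return {(u, v) for (u, v) in edges if u in abnormal and v in abnormal}


def pagerank(nodes, edges, init, q=0.85, delta=1e-6, max_iter=100):
    """Iterate P(i) = (1 - q)/N + q * sum(P(j) / |O(j)|) over callers j of i."""
    out_degree = {node: 0 for node in nodes}
    incoming = {node: [] for node in nodes}
    for u, v in edges:
        out_degree[u] += 1
        incoming[v].append(u)
    score = dict(init)  # initial PR value: proportion of abnormal metrics
    n = len(nodes)
    for _ in range(max_iter):
        new = {
            i: (1 - q) / n + q * sum(score[j] / out_degree[j] for j in incoming[i])
            for i in nodes
        }
        diff = sum(abs(new[i] - score[i]) for i in nodes)  # assumed L1 norm
        score = new
        if diff < delta:
            break
    return score


# Hypothetical call graph (caller -> callee) and detection results.
edges = {
    ("front-end#0", "orders#0"),
    ("orders#0", "payment#0"),
    ("front-end#0", "catalogue#0"),
}
abnormal = {"front-end#0", "orders#0", "payment#0"}
init = {"front-end#0": 0.25, "orders#0": 0.5, "payment#0": 0.75}

sub_edges = fault_subgraph(edges, abnormal)
scores = pagerank(abnormal, sub_edges, init)
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this toy chain the service at the end of the abnormal call path receives the highest score, matching the intuition that an anomaly in a called service propagates back to its callers.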

The principle of the invention is as follows. Because microservices are written in multiple languages, an agent-based mechanism is used to monitor the call relationships among services, which makes call monitoring transparent to the services. When the interfaces of a service are called, system resources are consumed and the monitored metric values change accordingly, so a correlation model between interface calls and metric values is built to describe how the former influence the latter. To reduce model complexity while retaining the interface calls that influence a metric most, Lasso regression is used to build the correlation model between interface call counts and metrics; abnormal metrics are identified from this model, and abnormal services are then identified from the proportion of abnormal metrics in each service. When a service becomes abnormal, it is likely to cause related services to become abnormal for a period of time. The service call topology graph is therefore used to describe how anomalies propagate among services, and the PageRank algorithm is used to score the degree of abnormality of each service and find the service that caused the anomaly. Within the faulty service, the degree of abnormality of each interface is scored using the regression model between interface calls and metrics, which finally locates the faulty interface.

Compared with the prior art, the invention has the following advantages:

(1) Transparent service monitoring: service call monitoring is implemented with agent technology and is transparent to the services. Service developers need not make any changes, and the impact of call monitoring on application performance is kept to a minimum.

(2) Automatic abnormal service detection: a regression model relating metrics to interface calls is built with Lasso regression. While a service runs, the system automatically predicts metric values with the regression model; if the absolute value of the residual exceeds a threshold, the service is considered abnormal, so abnormal services are discovered automatically.

(3) Fault root cause localization: a fault subgraph is built from the detected abnormal services and the service call topology graph. The subgraph reflects how the anomaly propagates, and the PageRank algorithm is then used to score the degree of abnormality of the services in it. Because PageRank reflects the influence of nodes in a graph, the services most likely to have caused the anomaly can be found.

Drawings

FIG. 1 is a flow chart of an implementation of the method of the present invention;

FIG. 2 is an environment of use of an example method of the invention.

Detailed Description

The present invention will be described in detail below with reference to specific embodiments and the accompanying drawings.

As shown in FIG. 1, the anomaly-propagation-oriented microservice fault diagnosis method of the invention comprises the following steps: (1) an agent is deployed in each service instance to collect the service call relationships and metric monitoring data of the service, and the data is persisted in a database; (2) in the cold-start stage, a service call topology graph is built from the collected service call information, and a Lasso regression model is built from the collected metric data and service interface call counts; (3) in the service operation stage, the Lasso regression model is used to monitor whether a service is abnormal; (4) when a service is abnormal, the service most likely to have caused the anomaly is found with the PageRank algorithm, and the abnormal interface call is located within that service.

FIG. 2 shows the environment in which the method of the embodiment is used. The target microservice application is Sock-Shop, and Kubernetes serves as the base runtime environment, with service instances deployed in pods: each of the 10 core services has one instance, the MongoDB service has three instances, and MySQL has one instance. An agent is deployed in each pod to monitor service call information and metric changes within the service. The load generator simulates user requests to generate load; the fault injector injects faults into the system through preset scripts to test the effectiveness of the fault diagnosis system; and the fault diagnosis system performs fault diagnosis on the collected data. The method provided by the invention is implemented in the fault diagnosis system.

The method of the embodiment of the invention comprises the following steps:

(1) the agent deployed in each service instance collects the metric monitoring values of that instance, including CPU utilization, memory usage, disk I/O rate, requests per second, and service interface call counts, together with service request call information;

(2) in the cold-start stage, the load generator generates load, and the service request call information is collected, recorded as tuples $N_i$ = (requestUID, serviceUID, spanUID, parentUID, info), and added to the set S;

(3) the tuples in the set S are grouped by requestUID; among tuples with the same requestUID, the call relationships between services within that request are identified, and the services involved are added to the topology graph G. Nodes in the graph are service instances and edges represent call relationships between services; a node or edge that already exists in the graph is not added again. This process is repeated until the set S is empty;

(4) the monitored values of the metrics within the service and the call counts of the interfaces within the service are collected as the response variable and the explanatory variables of the Lasso regression model, respectively. $Y_t$ denotes the monitored value of metric m at time t and is the response variable; the vector

$$X_t = (x_{i,1}^t, x_{i,2}^t, \ldots, x_{i,q}^t)$$

denotes the number of calls made by service i to the q interfaces within the service at time t and is the explanatory variable, where $x_{i,j}^t$ is the number of times service i calls the interface numbered j within the service at time t. Finally, the data is standardized;

(5) a Lasso regression model is constructed from the above data, expressed as:

$$Y_t = \sum_{i=1}^{p} \sum_{j=1}^{q} \omega_{ij}\, x_{i,j}^t + \alpha$$

where $Y_t$ is the monitored value of metric m at time t, p is the number of services that initiate calls to the service, q is the number of interfaces within the service, $\omega_{ij}$ are the regression coefficients, $x_{i,j}^t$ is the number of times service i calls the interface numbered j within the service at time t, and α is the random error term. Under the constraint

$$\sum_{i=1}^{p} \sum_{j=1}^{q} |\omega_{ij}| \le c$$

the coordinate descent method is used to minimize

$$\sum_{t} \Bigl( Y_t - \sum_{i=1}^{p} \sum_{j=1}^{q} \omega_{ij}\, x_{i,j}^t - \alpha \Bigr)^2$$

where c is the tuning parameter;

(6) the tuning parameter c is selected by generalized cross-validation, which has the form:

$$GCV(c) = \frac{1}{N} \cdot \frac{rss(c)}{\bigl(1 - p(c)/N\bigr)^2}$$

where rss(c) is the residual sum of squares:

$$rss(c) = \sum_{t} \bigl( Y_t - \hat{Y}_t \bigr)^2$$

$Y_t$ is the monitored value of metric m at time t, $\hat{Y}_t$ is the value predicted by the Lasso model fitted with tuning parameter c, p(c) is the number of effective regression coefficients in the Lasso regression, and N is the number of monitored metrics;

(7) while the service is running, the metric value is predicted with the Lasso regression model and the residual is calculated:

$$R_t = Y_t - \hat{Y}_t$$

where $Y_t$ is the monitored value of metric m at time t and $\hat{Y}_t$ is the value predicted by the Lasso model. When the absolute value of the residual exceeds a set threshold, the metric is determined to be abnormal, and the service is in turn considered abnormal;

(8) constructing an abnormal propagation subgraph based on the service calling topological graph obtained in the step (3) and the abnormal service set obtained in the step (7), and positioning fault service by using a PageRank algorithm;

(9) initially, the proportion of abnormal metrics in a service is used as the initial PR value of that service. $P = [p_0, p_1, \ldots, p_n]^T$ is the column vector of the initial PR values of the services, where $p_i$ is the proportion of abnormal metrics in service i;

(10) the PR value of each service is calculated by the formula

$$P^{k+1}(p_i) = \frac{1-q}{N} + q \sum_{p_j \in I(p_i)} \frac{P^k(p_j)}{|O(p_j)|}$$

where q is the damping coefficient, N is the number of services in the subgraph, $I(p_i)$ is the set of nodes pointing to $p_i$, $O(p_j)$ is the set of nodes that $p_j$ points to, and $P^k(p_i)$ is the score of service $p_i$ at the k-th iteration;

(11) after a number of iterations, when $|P^k - P^{k-1}| < \delta$ is satisfied, the iteration ends;

(12) the services are ranked by degree of abnormality according to their anomaly scores; the service with the highest score is the one most likely to have caused the anomaly. Within that abnormal service, the degree of abnormality of each interface is scored using the Lasso model constructed in step (5);

(13) for the j-th interface, the parameters $\omega_i$ in the Lasso models of the abnormal metrics associated with it, and the prediction residuals of those abnormal metrics, are normalized to obtain the new values $a_i$ and $b_i$;

(14) the anomaly score of the j-th interface is then

Wherein n is the number of anomaly metrics associated with the jth interface;

(15) the interfaces are ranked by degree of abnormality according to the anomaly scores obtained in step (14). In this way the root-cause service of the fault, and the abnormal interface within that service, can be found.
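The interface ranking of steps (13) to (15) can be sketched as follows. The scoring formula itself is not reproduced in this text, so the combination used here is purely an assumption for illustration: each coefficient is normalized within its metric's Lasso model to give a_i, each residual is normalized across the abnormal metrics to give b_i, and the score of an interface is the average of a_i * b_i over the n abnormal metrics associated with it. The function names, metric names, interface names, and numbers are all hypothetical.

```python
def normalize(values):
    """Scale absolute values so that they sum to 1 (assumed normalization)."""
    total = sum(abs(v) for v in values)
    return [abs(v) / total if total else 0.0 for v in values]


def interface_scores(coefs, residuals):
    """Score interfaces of the faulty service.

    coefs[metric][interface] is the Lasso coefficient of that interface in the
    model of an abnormal metric; residuals[metric] is its prediction residual.
    """
    metrics = list(residuals)
    # b_i: residuals normalized across the abnormal metrics.
    b = dict(zip(metrics, normalize([residuals[m] for m in metrics])))
    # a_i: coefficients normalized within each metric's model.
    a = {}
    for m in metrics:
        names = list(coefs[m])
        a[m] = dict(zip(names, normalize([coefs[m][j] for j in names])))
    interfaces = {j for m in metrics for j in coefs[m]}
    scores = {}
    for j in interfaces:
        # Metrics whose model kept a nonzero coefficient for interface j.
        related = [m for m in metrics if a[m].get(j, 0.0) > 0.0]
        n = len(related)
        # ASSUMPTION: average of a_i * b_i over the n associated metrics.
        scores[j] = sum(a[m][j] * b[m] for m in related) / n if n else 0.0
    return scores


# Hypothetical coefficients and residuals for two abnormal metrics.
coefs = {
    "cpu": {"GET /orders": 3.0, "POST /orders": 1.0},
    "memory": {"GET /orders": 0.0, "POST /orders": 2.0},
}
residuals = {"cpu": 6.0, "memory": 2.0}

scores = interface_scores(coefs, residuals)
ranked = sorted(scores, key=scores.get, reverse=True)
```

Under these assumed weights, an interface ranks high when it has a large share of the coefficients in the model of a metric whose residual is large; a different normalization or aggregation would change the ranking.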

In summary, the invention monitors service call information using agent technology and builds a microservice call topology graph to describe how anomalies propagate among microservices; it uses Lasso regression to model the correlation between interface calls and metrics, and detects abnormal microservices by monitoring changes in that correlation model; and it uses the PageRank algorithm to evaluate the degree of abnormality of each microservice and of its called interfaces. The invention thereby achieves transparent service monitoring, automatic prediction of metric values to discover abnormal services, and intelligent evaluation of the abnormality of nodes in the graph to locate the root cause of a problem.

Claims

1. An intelligent micro-service monitoring method facing abnormal propagation is characterized by comprising the following steps:

step one, service call monitoring: service call information is monitored using agent technology, and each service call relationship is recorded as a tuple N = (requestUID, serviceUID, spanUID, parentUID, info), wherein requestUID is the request identifier, generated at the request entry point; serviceUID is the service identifier; a span represents one service call, and spanUID is the identifier of the service call span; parentUID is the identifier of the parent span, and a value of -1 indicates that the current span is a root span; info is the other related information contained in the span, info = (serviceUID, startTime, endTime, duration), wherein startTime and endTime are the start and end times of the service call and duration is the execution time of the service call; and a service call topology graph is constructed from the monitored service call information to describe anomaly propagation;

step two, abnormal service detection: constructing a correlation model between the calling times of the service interface and the service monitoring measurement, and detecting to obtain all abnormal services, wherein the specific steps are as follows:

(1) monitoring calls of the service interfaces:

$$X_t = (x_{i,1}^t, x_{i,2}^t, \ldots, x_{i,q}^t)$$

is the vector of the call counts of the q service interfaces in service i at time t, wherein $x_{i,j}^t$ is the call count of the service interface numbered j in service i at time t;

(2) establishing a Lasso regression model for resource metrics: the independent variable of the regression model is the vector of service interface call counts obtained in step (1), the dependent variable is the monitored value of a metric m at time t, and the constructed regression model is:

$$Y_t = \sum_{i=1}^{p} \sum_{j=1}^{q} \omega_{ij}\, x_{i,j}^t + \alpha$$

wherein $\omega_{ij}$ are the regression coefficients and α is the random error term; under the constraint

$$\sum_{i=1}^{p} \sum_{j=1}^{q} |\omega_{ij}| \le c$$

the coordinate descent method is used to solve for the regression coefficients and error term that minimize

$$\sum_{t} \Bigl( Y_t - \sum_{i=1}^{p} \sum_{j=1}^{q} \omega_{ij}\, x_{i,j}^t - \alpha \Bigr)^2$$

wherein c is the tuning parameter; p is the number of services, and q is the number of interfaces in a given service; $Y_t$ is the monitored value of a metric m at time t; i is the service index; and j is the interface number within the service;

(3) detecting abnormal resource metrics: while the service is running, the resource metric values of the service are predicted with the Lasso regression model constructed in step (2), and the residual is calculated:

$$R_i(t) = Y_i(t) - \hat{Y}_i(t)$$

wherein $Y_i(t)$ is the monitored value of the metric and $\hat{Y}_i(t)$ is the value of the metric predicted by the Lasso model; when the absolute value of the residual exceeds a set threshold, the metric is determined to be abnormal and the service to which it belongs is detected as abnormal, so that all abnormal services are finally detected;

step three, faulty service diagnosis: a fault propagation subgraph is constructed from all the abnormal services detected in step two and the service call topology graph constructed in step one, and the degree of abnormality of each service is evaluated with the PageRank algorithm;

step four, within the faulty service, the interface call that caused the anomaly is further found based on the parameters $\omega_{ij}$ of the constructed Lasso regression model and the prediction residuals $R_i(t)$.

2. The intelligent microservice monitoring method for abnormal propagation according to claim 1, characterized in that step four, in which the interface call that caused the anomaly is found within the faulty service based on the parameters and prediction residuals of the constructed Lasso regression model, specifically comprises:

(41) for the j-th interface, the parameters $\omega_i$ in the Lasso models of the abnormal metrics associated with it, and the prediction residuals $R_i(t)$ of those abnormal metrics, are normalized to obtain the new values $a_i$ and $b_i$;

(42) the anomaly score of the j-th interface is then

Wherein n is the number of anomaly metrics associated with the jth interface;

(43) the interfaces are ranked by degree of abnormality according to the anomaly scores calculated in step (42), thereby finding the interface call that caused the anomaly.