CN109933452A

CN109933452A - An intelligent microservice monitoring method oriented to anomaly propagation

Info

Publication number: CN109933452A
Application number: CN201910220179.4A
Authority: CN
Inventors: 王焘; 张文博; 薛晓东
Original assignee: Institute of Software of CAS
Current assignee: Institute of Software of CAS
Priority date: 2019-03-22
Filing date: 2019-03-22
Publication date: 2019-06-25
Anticipated expiration: 2039-03-22
Also published as: CN109933452B

Abstract

The present invention relates to an intelligent microservice monitoring method oriented to anomaly propagation. Service invocation information is monitored using agent technology, and a microservice call topology graph is built to characterize the anomaly propagation relationships between microservices; Lasso regression is used to model the association between interface calls and metrics, and anomalous microservices are detected by monitoring changes in the association model; the anomaly degree of microservices and of their called interfaces is evaluated with the PageRank algorithm. The invention achieves transparent service monitoring, automatic prediction of metric values to discover anomalous services, and intelligent evaluation of the anomaly degree of the nodes in the graph to detect the root cause of a problem.

Description

An intelligent microservice monitoring method oriented to anomaly propagation

Technical field

The present invention relates to fault diagnosis methods for microservice software systems, and in particular to an intelligent microservice monitoring method oriented to anomaly propagation. It belongs to the field of software technology.

Background technique

Monolithic architecture and SOA are the architectural styles that software companies have generally adopted. After more than a decade of development, software systems have become extremely complex, with poor scalability and maintainability, and enterprises carry heavy technical debt. Competition on the Internet is now fierce, and user requirements and market conditions change rapidly. Faced with today's Internet applications, traditional software architectures clearly lack scalability and flexibility, and the costs of design, development, testing, and operations rise significantly. The concept of microservices was proposed in response. Microservices are an architectural style that builds a single application as a suite of software services, each running in its own process and communicating with the others over lightweight protocols. These characteristics make the microservice architecture well suited to agile development and continuous integration, address the pain points of traditional software architectures, and have attracted wide attention and research from academia and industry.

Converting a software system to microservices improves maintainability and flexibility, but it also makes the dependencies between services intricate, increasing both the likelihood of failures and the losses they cause. For example, on a high-traffic website, a delay in a single service component may exhaust all application resources, producing the so-called avalanche effect and, in severe cases, paralyzing the whole system. Effectively monitoring the system and quickly locating the cause of a failure is therefore one of the key technologies for ensuring the reliability and performance of microservices.

Existing work on microservice fault diagnosis falls mainly into the following categories: (1) Diagnosis methods based on monitoring metrics. These methods collect system operating metrics such as CPU, memory, and network usage to reflect the current state of the application and its trend over a period of time. If a metric exceeds a preset threshold, the system is considered to have a problem and an alarm is triggered; the administrator then resolves the problem using the monitoring data together with personal experience (Wang T, Zhang W, Ye C, Wei J, Zhong H, Huang T. FD4C: Automatic Fault Diagnosis Framework for Web Applications in Cloud Computing. IEEE Transactions on Systems, Man, and Cybernetics: Systems. 2016, 46(1): 61-75; M. Farshchi, J. G. Schneider, I. Weber, and J. Grundy, "Metric selection and anomaly detection for cloud operations using log and metric correlation analysis," Journal of Systems and Software, 2018, 137, pp. 531-549.); (2) Monitoring and analysis methods based on logs. Logs explicitly record the running state of the system, are easy to persist, and can be searched easily, so they are usually an effective means of finding the cause of a failure and of supporting further business goals (ELK. https://www.elastic.co/); (3) Monitoring and diagnosis methods based on distributed request tracing. These obtain the execution path of a request by tagging it, and discover system faults by analyzing or comparing execution paths (A. Nandi, A. Mandal, S. Atreja, G. B. Dasgupta, and S. Bhattacharya, "Anomaly Detection Using Program Control Flow Graph Mining From Execution Logs," 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, California, USA, 2016; T. Jia, P. Chen, L. Yang, Y. Li, F. Meng and J. Xu, "An Approach for Anomaly Diagnosis Based on Hybrid Graph Model with Logs for Distributed Services," IEEE International Conference on Web Services, Honolulu, HI, 2017, pp. 25-32.).

Among these, the metric-based and log-based approaches are simple to implement, but they cannot reflect the overall state of the system or trace business flows, and they usually localize faults only at the level of a service component; given the complex interactions between microservices, administrators must spend a great deal of time finding and locating problems. Methods based on distributed request tracing record request traces through logs or code instrumentation as the basis for fault diagnosis, but their monitoring scales poorly, they are not transparent to the application, and they do not take anomaly propagation into account.

Summary of the invention

Technical problem solved by the invention: to overcome the shortcomings of the prior art by providing an efficient fault diagnosis system for microservices. Call monitoring that is transparent to the services improves the scalability of the system and reduces the impact of monitoring on running microservices; analysis of the monitoring data achieves fine-grained, interface-level root cause localization.

Technical solution of the invention: an intelligent microservice monitoring method oriented to anomaly propagation, implemented in the following steps:

First step, service call monitoring: service invocation information is monitored using agent technology, and service call relationships are recorded as tuples $N_i$ = (requestUID, serviceUID, spanUID, parentUID, info), where requestUID is the request identifier, generated at the request entry point; serviceUID is the service identifier; spanUID is the identifier of the service call span; parentUID is the identifier of the parent span, and a value of -1 indicates that the current span is the root span; info contains additional information, expressed as the tuple info = (serviceUID, startTime, endTime, duration), where serviceUID uniquely identifies a service instance by its service component and instance number, startTime and endTime are the start and end times of the service call, and duration is the execution time of the service call. From the monitored service invocation information, the service call topology graph is constructed as follows:

(1) Initially, the topology graph G is empty and the set S contains the collected invocation information;

(2) Take from S the tuples that belong to the same request and have a call relationship; add the service instances identified by the serviceUID in those tuples to G as nodes and the call relationships as directed edges. A node or edge that already exists is not added again;

(3) If S is not empty, repeat (2); otherwise, the algorithm terminates.
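The graph construction in steps (1) to (3) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `Span` and `build_topology` are hypothetical names, the info fields are omitted, and the edge direction (caller to callee) is an assumption, since the text only states that a call relationship becomes a directed edge.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """One monitored service call (hypothetical record type, info fields omitted)."""
    request_uid: str
    service_uid: str
    span_uid: str
    parent_uid: str  # "-1" marks the root span of a request


def build_topology(spans):
    """Return (nodes, edges) of the service call topology graph G."""
    # Group the collected tuples by the request they belong to.
    by_request = defaultdict(list)
    for span in spans:
        by_request[span.request_uid].append(span)

    nodes, edges = set(), set()  # sets ensure nodes and edges are not added twice
    for group in by_request.values():
        service_of = {s.span_uid: s.service_uid for s in group}
        for s in group:
            nodes.add(s.service_uid)
            caller = service_of.get(s.parent_uid)  # None for a root span
            if caller is not None:
                # Assumed direction: calling service -> called service.
                edges.add((caller, s.service_uid))
    return nodes, edges
```

Called with the spans of two requests that both pass from a front-end instance to an orders instance, the function returns each instance once as a node and the shared call once as an edge.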

Second step, anomalous service detection: build an association model between the call counts of the interfaces within a service and the service's monitoring metrics, as follows:

(1) Collect the monitoring data of the metrics in the service and the call counts of all interfaces in the service. For a metric m of a service S, the vector $X_t^i = (x_{t1}^i, x_{t2}^i, \ldots, x_{tq}^i)$ denotes the numbers of calls made by service i at time t to the q interfaces of that service, where $x_{t1}^i$ is the number of calls made by service i at time t to the interface numbered 1 in that service. The vector is standardized and used as the explanatory variables of the Lasso regression model. $Y_t$ denotes the monitored value of metric m at time t and is the response variable of the Lasso regression model;

(2) Build a Lasso regression model from the above data. The independent variables are the vectors of service interface call counts obtained in (1), and the dependent variable is the monitored value of a metric m at time t. The regression model is $Y_t = \sum_{i=1}^{p}\sum_{j=1}^{q} \omega_{ij} x_{tj}^i + \alpha$, where p is the number of services that call this service, $\omega_{ij}$ are the regression coefficients, and $\alpha$ is the random error term. Under the constraint $\sum_{i=1}^{p}\sum_{j=1}^{q} |\omega_{ij}| \le c$, where c is the tuning parameter, coordinate descent is used to find the regression coefficients and error term that minimize $\sum_{t}\left(Y_t - \sum_{i=1}^{p}\sum_{j=1}^{q} \omega_{ij} x_{tj}^i - \alpha\right)^2$;

(3) Select the tuning parameter c by generalized cross-validation, which takes the form $GCV(c) = \frac{1}{N}\,\frac{RSS(c)}{\left(1 - p(c)/N\right)^2}$, where RSS(c) is the residual sum of squares $RSS(c) = \sum_{t=1}^{N}(Y_t - \hat{Y}_t)^2$, p(c) is the number of effective regression coefficients in the Lasso regression, and N is the number of monitored values of the metric;

(4) While the service is running, predict the metric value with the Lasso regression model and compute the residual $R_t = Y_t - \hat{Y}_t$. When the absolute value of the residual exceeds the set threshold, the metric is judged anomalous, and the service is in turn considered anomalous;
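The model fitting and residual check of the second step can be illustrated with a short pure-Python sketch. It rests on several assumptions that are not in the text: the function names are hypothetical, the Lasso problem is solved in its penalized form with a weight `lam` rather than with the constraint bound c (for a suitable pairing of `lam` and c the two forms have the same solution), the data are centered internally so that the constant term can be recovered afterwards, and the selection of the tuning parameter by generalized cross-validation is not shown.

```python
def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values within [-lam, lam] become exactly 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def fit_lasso(X, y, lam, n_sweeps=200):
    """Coordinate descent for (1/2N) * sum(residual^2) + lam * sum(|w_j|).

    X holds one row of interface call counts per time step, y the metric values.
    Returns (alpha, w): the constant term and the regression coefficients.
    """
    n, d = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(d)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(d)] for row in X]  # centered columns
    z = [sum(Xc[i][j] ** 2 for i in range(n)) / n for j in range(d)]
    resid = [yi - y_mean for yi in y]  # residual while all coefficients are zero
    w = [0.0] * d
    for _ in range(n_sweeps):
        for j in range(d):
            if z[j] == 0.0:
                continue  # a constant column carries no information
            # Correlation of column j with the residual that excludes column j.
            rho = sum(Xc[i][j] * (resid[i] + w[j] * Xc[i][j]) for i in range(n)) / n
            new_w = soft_threshold(rho, lam) / z[j]
            for i in range(n):
                resid[i] -= (new_w - w[j]) * Xc[i][j]
            w[j] = new_w
    alpha = y_mean - sum(w[j] * x_mean[j] for j in range(d))
    return alpha, w


def predict(alpha, w, x):
    """Predicted metric value for one vector of interface call counts."""
    return alpha + sum(wj * xj for wj, xj in zip(w, x))


def is_anomalous(alpha, w, x, observed, threshold):
    """A metric is anomalous when |observed - predicted| exceeds the threshold."""
    return abs(observed - predict(alpha, w, x)) > threshold
```

Because the soft-thresholding step sets weak coefficients to exactly zero, interfaces with little influence on a metric drop out of the model, which is the reason the text gives for choosing Lasso regression.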

Third step, faulty service diagnosis: based on the data obtained in the first two steps, build a fault propagation subgraph from the services in which anomalies have occurred, according to their call relationships. In the subgraph, score the anomaly degree of each service with the PageRank algorithm, as follows:

(1) Initially, use the proportion of anomalous metrics in each service as that service's initial PR value. $P = [p_0, p_1, \ldots, p_n]^T$ is the column vector of the initial PR values of the services, where $p_i$ is the proportion of anomalous metrics in service i;

(2) Compute the PR value of each service $p_i$ iteratively by the PageRank update formula, in which $P^k(p_i)$ is the score of service $p_i$ at the k-th iteration, $I(p_i)$ is the set of nodes that point to $p_i$, $O(p_j)$ is the set of nodes that $p_j$ points to, and q is the damping coefficient, whose purpose is to guarantee that the algorithm converges;

(3) If $|P^k - P^{k-1}| < \delta$, the iteration terminates; otherwise, repeat (2).

(4) Rank the services by score; the service with the highest score is considered the one that caused the failure. Within that service, the anomaly degree of each interface call is further scored using the Lasso models already built, as follows:

(41) For the j-th interface, normalize the parameter $\omega_i$ in the Lasso model of each anomalous metric related to it and the prediction residual of that anomalous metric, obtaining new values $a_i$ and $b_i$;

(42) The anomaly score of the j-th interface is then computed from $a_i$ and $b_i$ over its related anomalous metrics, where n is the number of anomalous metrics related to the j-th interface;

(43) Rank the anomaly degree of the interfaces according to the anomaly scores computed in (42).
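The service scoring in (1) to (3) of the third step can be sketched as below. The exact update formula is not legible in this text, so the sketch assumes the common damped PageRank update, $P^k(p_i) = \frac{1-q}{N} + q \sum_{p_j \in I(p_i)} \frac{P^{k-1}(p_j)}{|O(p_j)|}$ with N the number of services in the subgraph; the function name `pagerank` and the default values of `q`, `delta`, and `max_iter` are likewise assumptions. Edges are assumed to run from the calling service to the called service.

```python
def pagerank(edges, initial, q=0.85, delta=1e-6, max_iter=100):
    """Score the services of a fault propagation subgraph.

    edges   -- set of (caller, callee) pairs between anomalous services
    initial -- dict mapping each service to its initial PR value
               (the proportion of anomalous metrics in that service)
    """
    nodes = list(initial)
    n = len(nodes)
    out_degree = {v: 0 for v in nodes}
    incoming = {v: [] for v in nodes}
    for src, dst in edges:
        out_degree[src] += 1
        incoming[dst].append(src)

    score = dict(initial)
    for _ in range(max_iter):
        new_score = {
            v: (1 - q) / n + q * sum(score[u] / out_degree[u] for u in incoming[v])
            for v in nodes
        }
        # Stop once no score changed by delta or more since the last iteration.
        change = max(abs(new_score[v] - score[v]) for v in nodes)
        score = new_score
        if change < delta:
            break
    return score


edges = {("front-end", "orders"), ("orders", "payment"), ("front-end", "payment")}
initial = {"front-end": 0.2, "orders": 0.4, "payment": 0.6}
scores = pagerank(edges, initial)
ranked = sorted(scores, key=scores.get, reverse=True)  # most suspicious service first
```

With caller-to-callee edges, a service that is called by many anomalous services accumulates score, so in this example the downstream payment service is ranked first.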

Principle of the invention: because microservices are typically written in multiple languages, an agent-based mechanism monitors the call relationships between services, so that service call monitoring is transparent to the services. When a service handles an interface call, it occupies system resources, so the monitored metric values change accordingly; an association model between interface calls and metric values is therefore built to characterize how one influences the other. To reduce model complexity and retain only the interface calls with the greatest influence on a metric, the association model between interface call counts and metrics is built with Lasso regression. Anomalous metrics are identified with this model, and anomalous services are then identified from the proportion of anomalous metrics in each service. When one service becomes anomalous, services related to it are likely to become anomalous within a short time as well. The service call topology graph is therefore used to characterize the propagation of anomalies between services, and the PageRank algorithm scores the anomaly degree of each service to find the service that triggered the anomaly. Within the faulty service, the anomaly degree of each interface is scored using the regression models between interface calls and metrics, which finally locates the faulty interface.

The invention has the following advantages over the prior art:

(1) Service-transparent monitoring: service calls are monitored using agent technology, so monitoring is transparent to the services. Business developers need not make any modification, and the impact of call monitoring on application performance is minimized.

(2) Automated anomalous service detection: a regression model between metrics and interface calls is built with Lasso regression. While the service runs, the system automatically predicts metric values with the regression model; if the absolute value of the residual exceeds the threshold, an anomaly is considered to have occurred, so anomalous services are discovered automatically.

(3) Root cause localization: a fault subgraph is built from the detected anomalous services and the service call topology graph. The fault subgraph reflects the anomaly propagation process well, and the PageRank algorithm is then used to score the anomaly degree of the services in the graph. Because PageRank reflects the influence of the nodes in a graph, it can identify the service most likely to have triggered the anomaly.

Brief description of the drawings

Fig. 1 is the implementation flowchart of the method of the invention;

Fig. 2 shows the environment in which the example method of the invention is used.

Specific embodiment

The invention is described in detail below with reference to a specific implementation example and the drawings.

As shown in Fig. 1, the microservice fault diagnosis method oriented to anomaly propagation proposed by the invention comprises the following steps: (1) an agent is deployed in each service instance to collect the service call relationships and the monitoring metric data of the service, and the data are persisted to a database; (2) in the cold-start phase, the service call topology graph is built from the collected service invocation information, and the Lasso regression model is built from the collected metric data and service interface call counts; (3) while the service runs, the Lasso regression model is used to monitor whether the service is anomalous; (4) when a service becomes anomalous, the PageRank algorithm finds the service most likely to have caused the anomaly, and the anomalous interface call is located within that service.

Fig. 2 shows the environment in which the embodiment method is used. The target microservice application is Sock-Shop, with Kubernetes as the underlying runtime environment and service instances deployed in pods. Each of the 10 core services has one instance, the MongoDB service has three instances, and MySQL has one instance. An agent is deployed in each pod to monitor service invocation information and the changes of the metrics within the service. The workload generator simulates user requests to generate load; the fault injector injects faults into the system through preset scripts in order to test the diagnostic effectiveness of the fault diagnosis system; the fault diagnosis system performs fault diagnosis on the collected data. The method proposed by the invention is implemented in the fault diagnosis system.

Flow of the embodiment method:

(1) Agents deployed in the service instances collect the monitoring metric values of each service instance, including CPU utilization, memory utilization, disk I/O rate, requests per second, and the call counts of the interfaces within the service, together with the service request invocation information;

(2) In the cold-start phase, the workload generator generates load; the service request invocation information is collected, recorded as tuples $N_i$ = (requestUID, serviceUID, spanUID, parentUID, info), and added to the set S;

(3) In set S, group the tuples by requestUID; among the tuples with the same requestUID, find the call relationships between services within the same request and add the services that have a call relationship to the topology graph G. The nodes of the graph are service instances and the edges represent call relationships between services; a node or edge that already exists is not added again. Repeat this process until S is empty;

(4) Collect the monitoring metric values of the service and the call counts of its interfaces, which serve as the response variable and the explanatory variables of the Lasso regression model, respectively. $Y_t$ denotes the monitored value of metric m at time t and is the response variable; the vector $X_t^i = (x_{t1}^i, x_{t2}^i, \ldots, x_{tq}^i)$ denotes the numbers of calls made by service i at time t to the q interfaces of a given service and is the explanatory variable, where $x_{t1}^i$ is the number of calls made by service i at time t to the interface numbered 1 in that service. Finally, the data are standardized;

(5) Build the Lasso regression model from the above data, with the expression $Y_t = \sum_{i=1}^{p}\sum_{j=1}^{q} \omega_{ij} x_{tj}^i + \alpha$, where $Y_t$ is the monitored value of metric m at time t, p is the number of services that call this service, q is the number of interfaces in this service, $\omega_{ij}$ are the regression coefficients, $x_{tj}^i$ is the number of calls made by service i at time t to the interface numbered j in this service, and $\alpha$ is the random error term. Under the constraint $\sum_{i=1}^{p}\sum_{j=1}^{q} |\omega_{ij}| \le c$, coordinate descent minimizes $\sum_{t}\left(Y_t - \sum_{i=1}^{p}\sum_{j=1}^{q} \omega_{ij} x_{tj}^i - \alpha\right)^2$, where c is the tuning parameter;

(6) Select the tuning parameter c by generalized cross-validation, which takes the form $GCV(c) = \frac{1}{N}\,\frac{RSS(c)}{\left(1 - p(c)/N\right)^2}$, where RSS(c) is the residual sum of squares $RSS(c) = \sum_{t=1}^{N}(Y_t - \hat{Y}_t)^2$, $Y_t$ is the monitored value of metric m at time t, p(c) is the number of effective regression coefficients in the Lasso regression, and N is the number of monitored values of the metric;

(7) While the service is running, predict the metric value with the Lasso regression model and compute the residual $R_t = Y_t - \hat{Y}_t$, where $Y_t$ is the monitored value of metric m at time t. When the absolute value of the residual exceeds the set threshold, the metric is judged anomalous, and the service is in turn considered anomalous;

(8) Build the anomaly propagation graph from the service call topology graph obtained in (3) and the set of anomalous services obtained in (7); the PageRank algorithm is used below to locate the faulty service;

(9) Initially, use the proportion of anomalous metrics in each service as that service's initial PR value. $P = [p_0, p_1, \ldots, p_n]^T$ is the column vector of the initial PR values of the services, where $p_i$ is the proportion of anomalous metrics in service i;

(10) Compute the PR value of each service iteratively by the PageRank update formula, in which q is the damping coefficient, $I(p_i)$ is the set of nodes that point to $p_i$, $O(p_j)$ is the set of nodes that $p_j$ points to, and $P^k(p_i)$ is the score of service $p_i$ at the k-th iteration;

(11) After multiple iterations, the iteration terminates once $|P^k - P^{k-1}| < \delta$;

(12) Rank the anomaly degree of the services by their anomaly scores; the service with the highest score is considered the most likely to have caused the anomaly. Within the anomalous service, score the anomaly degree of its interfaces using the Lasso model built in (5);

(13) For the j-th interface, normalize the parameter $\omega_i$ in the Lasso model of each anomalous metric related to it and the prediction residual of that anomalous metric, obtaining new values $a_i$ and $b_i$;

(14) The anomaly score of the j-th interface is then computed from $a_i$ and $b_i$ over its related anomalous metrics, where n is the number of anomalous metrics related to the j-th interface;

(15) Rank the anomaly degree of the interfaces according to the anomaly scores obtained in (14). This finally identifies the root cause of the anomaly: the faulty service and the anomalous interface within it.
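Steps (13) to (15) can be illustrated with the sketch below. The score formula and the normalization scheme are not legible in this text, so both are assumptions made only for illustration: values are normalized by dividing their absolute value by the sum of absolute values, and the score of an interface is taken to be the sum of $a_i b_i$ over its related anomalous metrics. The function names and the example interface paths are hypothetical.

```python
def normalize(values):
    """Scale absolute values so they sum to 1 (assumed normalization scheme)."""
    total = sum(abs(v) for v in values)
    return [abs(v) / total if total else 0.0 for v in values]


def interface_scores(coefs, residuals):
    """Score the interfaces of a faulty service.

    coefs[m][j]  -- Lasso coefficient of interface j in the model of anomalous metric m
    residuals[m] -- prediction residual of anomalous metric m
    """
    metrics = list(coefs)
    # b: normalized prediction residual of each anomalous metric.
    b = dict(zip(metrics, normalize([residuals[m] for m in metrics])))
    scores = {}
    for m in metrics:
        interfaces = list(coefs[m])
        # a: normalized coefficient of each interface within the model of metric m.
        a = normalize([coefs[m][j] for j in interfaces])
        for j, a_mj in zip(interfaces, a):
            if a_mj > 0:  # an interface is related to m only if Lasso kept it
                scores[j] = scores.get(j, 0.0) + a_mj * b[m]  # assumed: sum of a*b
    return scores


coefs = {
    "cpu": {"/orders": 3.0, "/cart": 1.0},
    "memory": {"/orders": 0.0, "/cart": 2.0},
}
residuals = {"cpu": 6.0, "memory": 2.0}
scores = interface_scores(coefs, residuals)
ranked = sorted(scores, key=scores.get, reverse=True)  # most suspicious interface first
```

Under these assumptions an interface scores highly when it has a large coefficient in the model of a metric whose residual is large, which matches the intent stated in the text of weighting interfaces by both their influence on a metric and how anomalous that metric is.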

In short, the invention monitors service invocation information using agent technology and builds a microservice call topology graph to characterize the anomaly propagation relationships between microservices; it uses Lasso regression to model the association between interface calls and metrics and detects anomalous microservices by monitoring changes in the association model; and it evaluates the anomaly degree of microservices and of their called interfaces with the PageRank algorithm. The invention achieves transparent service monitoring, automatic prediction of metric values to discover anomalous services, and intelligent evaluation of the anomaly degree of the nodes in the graph to detect the root cause of a problem.

Claims

1. An intelligent microservice monitoring method oriented to anomaly propagation, characterized by comprising the following steps:

First step, service call monitoring: service invocation information is monitored using agent technology, and service call relationships are recorded as tuples N = (requestUID, serviceUID, spanUID, parentUID, info), where requestUID is the request identifier, generated at the request entry point; serviceUID is the service identifier; a span denotes one service call, and spanUID is the identifier of the service call span; parentUID is the identifier of the parent span, and a value of -1 indicates that the current span is the root span; info contains other related information, info = (serviceUID, startTime, endTime, duration), where startTime and endTime are the start and end times of the service call and duration is the execution time of the service call; from the monitored service invocation information, a service call topology graph is constructed to characterize anomaly propagation;

Second step, anomalous service detection: build an association model between the call counts of service interfaces and the service's monitoring metrics, and detect the services in which anomalies have occurred, as follows:

(1) Service interface call monitoring: a vector records the call counts of the q service interfaces in service i at time t, each component being the number of calls at time t to one numbered service interface in service i;

(2) Establish a Lasso regression model of the resource based on Lasso regression: the independent variables of the regression model are the vector of service interface call counts obtained in step (1), and the dependent variable is the monitored value of a metric m at time t; the regression model comprises regression coefficients and a random error term α; under a constraint governed by the tuning parameter c, the regression coefficients and error term that minimize the regression objective are solved by coordinate descent;

(3) Anomalous resource detection: while the service runs, the resource metric values of the service are predicted with the Lasso regression model built in step (2), and the residual is computed as $R_i(t) = Y_i(t) - \hat{Y}_i(t)$, where $Y_i(t)$ is the monitored value of the metric and $\hat{Y}_i(t)$ is the value predicted for the metric by the Lasso model; when the absolute value of the residual exceeds the set threshold, the metric is judged anomalous and the service it belongs to is detected as anomalous, which finally yields the services in which anomalies have occurred;

Third step, faulty service diagnosis: build a fault propagation subgraph from the anomalous services detected in the second step and the service call topology graph obtained in the first step, and evaluate the anomaly degree of each service with the PageRank algorithm;

Fourth step: within the faulty service, use the parameters of the Lasso regression model built and the prediction residual $R_i(t)$ to further find the interface call that caused the anomaly.

2. The intelligent microservice monitoring method oriented to anomaly propagation according to claim 1, characterized in that in the fourth step, within the faulty service, the interface call that caused the anomaly is found from the parameters of the Lasso regression model built and the prediction residual, specifically as follows:

(41) For the j-th interface, normalize the parameter $\omega_i$ in the Lasso model of each anomalous metric related to it and the prediction residual $R_i(t)$ of that anomalous metric, obtaining new values $a_i$ and $b_i$;

(42) The anomaly score of the j-th interface is then computed from $a_i$ and $b_i$ over its related anomalous metrics, where n is the number of anomalous metrics related to the j-th interface;

(43) Rank the anomaly degree of the interfaces according to the anomaly scores computed in step (42), so as to find the interface call that caused the anomaly.