CN114201326A

CN114201326A - Micro-service abnormity diagnosis method based on attribute relation graph

Info

Publication number: CN114201326A
Application number: CN202111461547.8A
Authority: CN
Inventors: 何明栋; 曹阳; 王宝会
Original assignee: China Shenhua International Engineering Co ltd
Current assignee: China Shenhua International Engineering Co ltd
Priority date: 2021-12-02
Filing date: 2021-12-02
Publication date: 2022-03-18

Abstract

The invention relates to a microservice anomaly diagnosis method based on an attribute relationship graph, which addresses the poor robustness and poor diagnostic accuracy of existing algorithms when a microservice system becomes abnormal. First, anomalies are detected in real time from the service call trace information collected by a monitoring agent service. Once an anomaly is found, a microservice call topology graph is built from the call and deployment information at the time the anomaly occurred, to depict how the anomaly propagates among the microservices in real time. Comprehensive monitoring information from before and after the anomaly is then collected, personalized weight attributes of the nodes and edges of the graph are calculated with custom formulas, and a microservice attribute relationship graph is established. Finally, the nodes of the graph are evaluated with a PageRank-based algorithm to infer the most probable root-cause nodes. The invention thus detects microservice anomalies in real time, builds the attribute relationship graph automatically, and infers the degree of abnormality of each service node, thereby diagnosing microservice anomalies.

Description

Micro-service abnormity diagnosis method based on attribute relation graph

Technical Field

The invention relates to an abnormality diagnosis method of a micro-service software system, belonging to the technical field of software.

Background

Monolithic and SOA architectures have long been the forms commonly adopted by software companies. After more than a decade of development, such software systems have become extremely complex, are hard to extend and maintain, and carry heavy technical debt. Competition on today's Internet is fierce, and user requirements and market conditions change rapidly; under these conditions the extensibility and flexibility of traditional software architectures are clearly insufficient, and the costs of design, development, testing, operation and maintenance rise markedly. The concept of microservices was therefore proposed: a software architecture that treats a single application as a suite of small services, each running in its own process and communicating with the others through lightweight protocols. The microservice architecture is well suited to agile development and continuous integration, relieves the pain points of traditional software architectures, and has attracted wide attention and research from academia and industry.

Once a software system has been split into microservices, its maintainability and flexibility improve, but the dependencies between services become more complicated, and both the probability of faults and the losses they cause increase. For example, on a high-traffic website, a delay in a single service component may exhaust all application resources and trigger a so-called avalanche effect, which in severe cases brings down the whole system. An effective monitoring system and rapid location of the cause of a fault are therefore key technologies for guaranteeing the reliability and performance of microservices.

Existing work on microservice fault diagnosis falls mainly into the following categories: (1) Diagnosis based on metric monitoring. This approach collects system operating metrics such as CPU, memory and network to reflect the current state of the application and its trend over a period of time. If a metric exceeds a preset threshold, the system is considered to have a problem and an alarm is triggered; the problem is then resolved on the basis of the monitoring data combined with the administrator's experience. (2) Monitoring and analysis based on logs. Logs record the running state of the system clearly, are easy to persist and easy to search, and are an effective means of finding the cause of a fault and of supporting further business goals. (3) Diagnosis based on distributed call trace information. This approach builds a service call topology graph from the call traces and infers the root cause with a search algorithm.

Fault diagnosis based on metric and log monitoring is simple to implement, but it cannot reflect the overall state of the system or trace the business flow; faults are usually located only at the level of a service component, and with complex microservice interactions an administrator spends a great deal of time searching for and locating problems. In algorithms based on distributed call traces, server nodes that have only a deployment relationship and no call relationship are missing from the topology graph, so the graph is incomplete; and because only trace information is used, the algorithms are not robust and their diagnoses are inaccurate.

Disclosure of Invention

The technical problem solved by the invention: to overcome the shortcomings of the prior art, an efficient microservice-oriented anomaly diagnosis system is provided, in which the topology graph is built automatically by parsing call trace information, the overall state of the system is reflected in real time, and the extensibility of the system is improved. Personalized weights are calculated by a mixed weighting of three anomaly scores, which improves the robustness of the algorithm and makes the diagnosis more accurate.

The invention monitors service calls transparently, processes the call trace information in real time and detects anomalies online. Based on the detection result it automatically parses the call relationships, combines them with the deployment relationships, and builds a mixed-weight attribute relationship graph from multiple monitoring metrics, including the call failure ratio, so that service nodes having only a deployment relationship can also be represented in the graph. This improves the extensibility of the system, and inference on the attribute graph locates the root cause of an anomaly at the service level.

The technical solution adopted by the invention comprises the following steps:

Step one, real-time anomaly detection on the data stream: call trace information is acquired through the monitoring agent service and processed, anomaly detection is performed on the resulting time series with an online clustering algorithm, and the anomaly occurrence time point is determined.

The call trace information mainly comprises the following fields:

(callType, startTime, elapsedTime, success, traceId, id, pid, cmdb_id, serviceName), where callType is the call type, startTime is the call start time, elapsedTime is the time consumed by this call, success indicates whether the call succeeded, traceId is the call id of one complete request, id is the id of the current call, pid is the id of the parent node, cmdb_id is the id of the network element to which the service belongs, and serviceName is the service name. The specific steps are as follows:

Step 101, acquiring all call trace data of a single request by its call id (traceId), and subtracting the time consumed by the child nodes from the time of the parent node to obtain the execution time of the node itself;

Step 102, applying a 30-second time window to the call times between two microservice child nodes and taking the median of the call times in each window as a single data value of a time series, thereby obtaining a call-time series;

Step 103, performing online real-time anomaly detection on each call-time series obtained, using the BIRCH online clustering algorithm, to determine the anomaly occurrence time point;
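Steps 101 and 102 can be illustrated with a minimal sketch. It assumes that spans are plain dictionaries carrying the `id`, `pid` and `elapsedTime` fields listed above, that times are in milliseconds, that child calls do not overlap in time, and that the 30-second window is a fixed (non-overlapping) bucket; the helper names `self_times` and `median_series` are hypothetical. The clustering of step 103 is not shown.

```python
from collections import defaultdict
from statistics import median


def self_times(spans):
    """Step 101: a span's own time is its elapsedTime minus that of its direct children."""
    child_total = defaultdict(float)
    for s in spans:
        if s["pid"] is not None:
            child_total[s["pid"]] += s["elapsedTime"]
    return {s["id"]: s["elapsedTime"] - child_total[s["id"]] for s in spans}


def median_series(samples, window_ms=30_000):
    """Step 102: bucket (startTime, call_time) samples into windows and keep each median."""
    buckets = defaultdict(list)
    for start, value in samples:
        buckets[start // window_ms].append(value)
    return [(w * window_ms, median(vals)) for w, vals in sorted(buckets.items())]


# Hypothetical trace of one request: span "a" calls "b" and "c".
spans = [
    {"id": "a", "pid": None, "elapsedTime": 120.0},
    {"id": "b", "pid": "a", "elapsedTime": 70.0},
    {"id": "c", "pid": "a", "elapsedTime": 20.0},
]
own = self_times(spans)

# Hypothetical call times (ms) between two services, keyed by start time (ms).
series = median_series([(0, 10.0), (5_000, 50.0), (29_000, 12.0), (31_000, 8.0)])
```

Taking the median rather than the mean keeps a single slow call (the 50 ms sample here) from distorting the value of its window.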

Step two, building the real-time topology graph: for the anomaly occurrence time point determined in step one, the parent and child service nodes involved in calls during that period are determined by parsing the call trace data at the time of the anomaly, and, combined with the deployment information of the services, a real-time topology graph comprising servers, containers and databases is constructed;

Step 201, parsing the call-trace time series at the anomaly occurrence time point determined in step one. The series are stored as key-value pairs of the form "parent node-child node: call time"; the parent and child service nodes are split out of the key, the deployment configuration file is read, and the deployment server information corresponding to the parent and child service nodes is obtained, giving all the nodes used to construct the topology graph.

Step 202, for the nodes of the topology graph obtained in step 201: if two nodes are in a call relationship, the parent node is the out-degree (source) node of the edge and the child node is the in-degree (target) node, so the edge points from the parent node to the child node; if two nodes are in a deployment relationship, the node where the service (its container) is located is the out-degree node of the edge and the deployment server node is the in-degree node, so the edge points from the service node to the deployment server.
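The construction in steps 201 and 202 can be sketched as follows. The sketch assumes that the keys have the form "parent-child" with no hyphen inside a service name, and that the deployment configuration has already been read into a dictionary mapping each service to its server; `build_topology` and the sample names are hypothetical.

```python
def build_topology(call_keys, deployment):
    """Return (nodes, edges); each edge is a (source, target, kind) triple."""
    edges = set()
    for key in call_keys:
        parent, child = key.split("-", 1)
        edges.add((parent, child, "call"))        # edge points parent -> child
    services = {n for p, c, _ in edges for n in (p, c)}
    for service in sorted(services):
        host = deployment.get(service)
        if host is not None:
            edges.add((service, host, "deploy"))  # edge points service -> server
    nodes = {n for s, t, _ in edges for n in (s, t)}
    return nodes, edges


# Hypothetical call-time series keyed by "parent-child", and a deployment map.
call_series = {"frontend-orders": [12.0, 14.0], "orders-db": [3.0, 40.0]}
deployment = {"frontend": "server1", "orders": "server1", "db": "server2"}

nodes, edges = build_topology(call_series, deployment)
```

Because the deployment edges are added for every service that appears in a call, a server that is never called directly still becomes a node of the graph.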

Step three, calculating personalized weights and establishing the attribute relationship graph: once the anomaly occurrence time point has been determined in step one, monitoring data for a period before and after the anomaly are acquired, the personalized weights are calculated according to the formulas, and the attribute relationship graph is established;

Step 301, constructing the node weights of the topology graph: a feature vector N = (node_on-off, node_s-connect, node_network, node_CPU, node_memory) is constructed for each node of the topology graph, where node_on-off is the database monitoring switch, node_s-connect is the maximum number of database connections, node_network is the network traffic, node_CPU is the CPU usage of the container or server, and node_memory is the memory usage of the container or server; an index weight calculation formula is also constructed:

where λ_i represents the similarity of the index feature vectors before and after the anomaly occurs: the cosine similarity is calculated between the mean of the feature vectors over the 5 minutes before the anomaly and the mean over the 5 minutes after it, and the greater the similarity, the lower the degree of abnormality. T_i represents the attenuation coefficient. Because a single anomaly, such as a network failure, may be reflected successively in the metrics of several service nodes, a sequence of attenuation parameters is defined so that the earlier the abnormal behaviour appears, the higher the anomaly score.
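The cosine similarity between the feature vectors before and after the anomaly can be sketched as below. The helper names and the sample metric values are hypothetical, and the five components follow the order of the feature vector N in step 301.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a list of equally long feature vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (|u| |v|); returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical samples: (on-off, s-connect, network, CPU, memory).
before = [[1, 50, 0.2, 0.30, 0.40], [1, 50, 0.2, 0.30, 0.40]]
after = [[1, 50, 0.9, 0.95, 0.90], [1, 50, 0.9, 0.95, 0.90]]

m_before = mean_vector(before)
m_after = mean_vector(after)
similarity = cosine_similarity(m_before, m_after)
```

In this sample the connection count dominates the vector, so the similarity stays close to 1 even though CPU and memory changed sharply; rescaling the components to comparable ranges before comparing them is one possible refinement, which the text does not specify.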

Step 302, weights of the edges connecting the nodes of the topology graph: a mixed weighting of the service call time over a period after the anomaly occurs, the resource utilization information of the container or server to which the service belongs, and the call failure rate over a period of time is used to establish the weight calculation formula W_ij = c₁S_t + c₂S_m + c₃S_f, where:

① S_t represents the call delay anomaly score, S_t = max(-log P_x), where P_x is a kernel density estimation function. The probability density function P_x of the call-time series is obtained from normal data, and the delay anomaly score is then obtained by evaluating the kernel density estimate at the call times of the abnormal period;

② S_m represents the server resource utilization index,

whose parameters have the same meaning and are calculated in the same way as in step 301;

③ S_f represents the failed-call anomaly score, whose calculation formula is defined in terms of n, the number of failed calls, and m, the total number of calls;
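The call delay score S_t = max(-log P_x) of item ① can be sketched with a Gaussian kernel density estimate. The Gaussian kernel, the bandwidth of 5 ms and the floor applied before taking the logarithm are all assumptions made for this illustration; the text does not specify them.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a density function P_x estimated from normal-period call times."""
    norm = len(samples) * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples) / norm

    return density


def delay_score(normal_times, abnormal_times, bandwidth=5.0, floor=1e-12):
    """S_t = max(-log P_x(x)) over the call times x seen in the abnormal period."""
    p = gaussian_kde(normal_times, bandwidth)
    # The floor avoids log(0) for call times far outside the normal range.
    return max(-math.log(max(p(x), floor)) for x in abnormal_times)


normal = [48.0, 50.0, 52.0, 49.0, 51.0]   # hypothetical normal call times (ms)
low = delay_score(normal, [50.0, 51.0])   # abnormal window that looks normal
high = delay_score(normal, [50.0, 120.0]) # abnormal window with one very slow call
```

Taking the maximum means that a single very unlikely call time in the abnormal period is enough to produce a high score.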

For the topology graph built in step two, the coefficients c₁, c₂ and c₃ control how strongly each index influences the different types of nodes, such as server nodes, container nodes and database nodes. For servers that appear only through deployment, the resource utilization feature is more important; for database nodes, the call failure ratio feature is more important.
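The mixed weighting W_ij = c₁S_t + c₂S_m + c₃S_f with coefficients chosen per node type can be sketched as below. The coefficient values in `COEFFS` are placeholders invented for this illustration and are not taken from the text; they only mirror the stated emphasis (resource utilization for servers, failure ratio for databases).

```python
def edge_weight(s_t, s_m, s_f, coeffs):
    """W_ij = c1*S_t + c2*S_m + c3*S_f for one edge of the topology graph."""
    c1, c2, c3 = coeffs
    return c1 * s_t + c2 * s_m + c3 * s_f


# Placeholder coefficients (c1, c2, c3) keyed by the type of the target node.
COEFFS = {
    "service": (0.4, 0.3, 0.3),
    "server": (0.2, 0.6, 0.2),    # resource utilization emphasized
    "database": (0.2, 0.2, 0.6),  # call failure ratio emphasized
}

# Hypothetical scores for an edge that ends at a database node.
w = edge_weight(s_t=2.0, s_m=0.5, s_f=0.1, coeffs=COEFFS["database"])
```

Keeping the coefficients in a table keyed by node type makes it straightforward to tune each type separately.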

Step four, diagnosing the root cause of the anomaly: for the attribute relationship graph established in step three, the degree of abnormality of each service on the attribute graph is evaluated with the PageRank algorithm, a ranked list of the most probable root-cause nodes is produced, and the root cause of the anomaly is diagnosed.

The principle of the invention is as follows. In view of the real-time requirements of an operation and maintenance system, anomalies in the data stream are detected with unsupervised online clustering. Because a microservice system is multi-language, multi-node and dynamic, once an anomaly occurs the call relationships being executed by the current services are determined by parsing the parent-child relationships among the call traces, and a real-time system topology graph is built in combination with the deployment information of the services, so that the propagation of the anomaly through the system can be described accurately. Since the characteristics of an anomaly may be reflected in several monitoring metrics, the invention evaluates the degree of abnormality on the graph by mixed weighting, establishes the attribute relationship graph, and finally scores the degree of abnormality of the services with the PageRank algorithm to find the service node that caused the anomaly.

Compared with the prior art, the method has the following advantages. First, a new anomaly evaluation index is proposed, the failure ratio anomaly feature, which makes the characterization of an anomaly richer and more accurate. Second, the anomaly scores between adjacent nodes on the topology graph are evaluated by mixed weighting. Compared with the original approach, which calculates the anomaly score from call time alone, servers that have only a deployment relationship and no direct interface call can now be represented on the topology graph, which makes topology-based diagnosis more intelligent; and evaluating several metrics by mixed weighting makes the diagnosis more accurate and more robust.

Drawings

FIG. 1 is a general flowchart of the microservice anomaly diagnosis method based on an attribute relationship graph according to the present invention;

FIG. 2 shows the environment in which an embodiment of the method of the invention is used.

Detailed Description

The present invention will be described in detail below with reference to specific embodiments and the accompanying drawings.

As shown in FIG. 2, in the environment in which the embodiment is used, the target microservice application is Sock-Shop, and Kubernetes is used as the underlying runtime to deploy service instances in pods: each of the 10 core services has one instance, the MongoDB service has three instances, and MySQL has one instance. A monitoring Agent is deployed on each pod to monitor service call information and metric changes within the service. The load generator simulates user requests and generates load; the fault injector injects faults into the system through preset scripts to test the effectiveness of the fault diagnosis system; and the fault diagnosis system diagnoses faults from the collected data. The method provided by the invention is implemented in the fault diagnosis system.

As shown in FIG. 1, the method of the embodiment of the present invention proceeds as follows:

Step one: call trace information among all child nodes of the microservice is collected through the deployed monitoring Agents. The call trace information mainly comprises (callType, startTime, elapsedTime, success, traceId, id, pid, cmdb_id, serviceName), where callType is the call type, startTime is the call start time, elapsedTime is the time consumed by this call, success indicates whether the call succeeded, traceId is the call id of one complete request, id is the id of the current call, pid is the id of the parent node, cmdb_id is the id of the network element to which the service belongs, and serviceName is the service name. The call trace information is processed: the call times of all child nodes are subtracted from the call time of the parent node, and data noise is reduced by taking the median over a 30-second sliding window. Real-time anomaly detection is then performed on the processed time series as a data stream with an online clustering algorithm, and the anomaly occurrence time point is determined;

Step two: based on the anomaly occurrence time point detected in step one, the call relationships among the sub-service nodes of the microservice and the deployment relationships between the sub-service nodes and the servers are determined by parsing the call trace information in combination with the deployment information of the microservice. The sub-service nodes are represented by nodes on the topology graph, the call or deployment relationships are represented by edges between the nodes, and the real-time topology graph of the microservice application system is constructed automatically.

Step three: based on the anomaly occurrence time point detected in step one, monitoring data for a period before and after the anomaly are acquired. Monitoring data from the 5 minutes before and the 5 minutes after the anomaly are queried, including the resource utilization information of the servers or containers and the service call traces. The weight attributes of the nodes on the topology graph are then calculated: a feature vector N = (node_on-off, node_s-connect, node_network, node_CPU, node_memory) is constructed, where node_on-off is the database monitoring switch, node_s-connect is the maximum number of database connections, node_network is the network traffic, node_CPU is the CPU usage of the container or server, and node_memory is the memory usage of the container or server, and the similarity of the index feature vectors before and after the anomaly is calculated,

where r_i denotes the feature vector of the i-th data point after the anomaly occurs and the reference vector is

the mean of the feature vectors of the 5 data points in the 5 minutes before the anomaly; the cosine similarity is taken between the mean of the feature vectors in the 5 minutes before the anomaly and the mean in the 5 minutes after it. The index weight calculation formula is then applied

to calculate the weight attribute of each node, where T_i represents the attenuation coefficient of the i-th data point after the anomaly occurs and n represents the number of data points. The attenuation coefficients are determined by the data acquisition frequency and the number of vectors after processing; here [0.95, 0.85, 0.75, 0.65, 0.55] is used. Next, the edge weights of the topology graph are calculated with the weight formula W_ij = c₁S_t + c₂S_m + c₃S_f, where S_t represents the call trace anomaly score, S_m represents the resource utilization anomaly score of the container or server, and S_f represents the failed-call ratio anomaly score. The parameters c₁, c₂ and c₃ are weight coefficients determined by the type of node: for general server, container and database nodes, c₁ = c₂ = c₃ = 0.33; for a deployment server node, since there is no direct service call, c₁ = 1 and c₂ = c₃ = 0.
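The role of the attenuation coefficients can be illustrated with a short sketch. The index weight formula itself is not shown in this text, so the aggregation used below, the mean of T_i(1 - λ_i) over the n data points, is an assumed form chosen only to respect the stated behaviour: higher similarity λ_i lowers the score, and earlier data points weigh more. The function name `node_weight` is hypothetical.

```python
# Attenuation coefficients T_i given in the embodiment, earliest data point first.
DECAY = [0.95, 0.85, 0.75, 0.65, 0.55]


def node_weight(similarities, decay=DECAY):
    """Assumed form: (1/n) * sum_i T_i * (1 - lambda_i)."""
    if len(similarities) != len(decay):
        raise ValueError("need one similarity per attenuation coefficient")
    n = len(similarities)
    return sum(t * (1.0 - lam) for t, lam in zip(decay, similarities)) / n


# Hypothetical similarities lambda_i for the 5 data points after the anomaly.
early = node_weight([0.5, 1.0, 1.0, 1.0, 1.0])  # deviation at the first point
late = node_weight([1.0, 1.0, 1.0, 1.0, 0.5])   # same deviation at the last point
```

Under this assumed form, a node whose metrics deviate immediately after the anomaly scores higher than one that deviates later, which matches the intent of the attenuation sequence.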

Step four: for the attribute relationship graph established in step three, the abnormal service node is located with the PageRank algorithm. In the initial stage, the anomaly weight of each service node is used as the initial PR value of that service, and P = [P₀, P₁, ..., P_n]^T is the column vector of the initial PR values of the services. By the formula

the PR value of each service is calculated, where q is the damping coefficient, typically taken as 0.85; I(p_j) is the set of microservice child nodes that point to p_j; O(p_j) is the set of microservice child nodes that p_j points to; and P^k(p_i) is the score of service p_i at the k-th iteration. After a number of iterations, when |P^k - P^(k-1)| < δ, that is, when P^k(p_i) has converged, the iteration ends. The services are then ranked by their anomaly scores, and the service with the highest score is the one most likely to have caused the anomaly.
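The iteration and its stopping rule can be sketched with the textbook, unweighted PageRank update P^k(p_i) = (1 - q)/N + q Σ P^(k-1)(p_j)/|O(p_j)|, summed over the nodes p_j that point to p_i. This is an assumption: the formula used in this text is not shown and may incorporate the edge weights W_ij. Spreading the score of nodes without outgoing edges uniformly, normalizing the initial vector, and the graph and scores in the example are likewise choices made for this illustration.

```python
def pagerank(edges, initial, q=0.85, delta=1e-8, max_iter=1000):
    """Iterate the textbook PageRank update until the L1 change drops below delta."""
    nodes = sorted(initial)
    n = len(nodes)
    out = {v: [] for v in nodes}
    for src, dst in edges:
        out[src].append(dst)
    total = sum(initial.values())
    rank = {v: initial[v] / total for v in nodes}  # normalized starting vector
    for _ in range(max_iter):
        # Score held by nodes with no outgoing edge is shared by all nodes.
        dangling = sum(rank[v] for v in nodes if not out[v])
        nxt = {v: (1 - q) / n + q * dangling / n for v in nodes}
        for src in nodes:
            for dst in out[src]:
                nxt[dst] += q * rank[src] / len(out[src])
        diff = sum(abs(nxt[v] - rank[v]) for v in nodes)
        rank = nxt
        if diff < delta:
            break
    return rank


# Hypothetical graph: edges point from caller to callee.
edges = [("frontend", "orders"), ("orders", "db"), ("frontend", "db")]
initial = {"frontend": 0.1, "orders": 0.3, "db": 0.6}  # hypothetical anomaly weights

rank = pagerank(edges, initial)
ranking = sorted(rank, key=rank.get, reverse=True)
```

With edges pointing from caller to callee, score accumulates on the services that many anomalous call paths lead to, so the first entry of `ranking` is the candidate root cause.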

In short, the method detects service anomalies in real time from the stream of service call trace data, constructs the service topology graph from the service call relationships and service deployment information of the abnormal period, calculates mixed weights for the nodes and edges of the topology graph from the call trace information, resource utilization information and failure ratio monitored over a period of time, establishes the attribute relationship graph of anomaly propagation among services, and finally infers the most probable root-cause node with the PageRank algorithm. The invention thus detects microservice anomalies in real time, builds the attribute relationship graph automatically, and infers the degree of abnormality of each service node, thereby diagnosing microservice anomalies.

Claims

1. A microservice anomaly diagnosis method based on an attribute relationship graph, characterized by comprising the following steps:

step one, acquiring call trace information through a monitoring agent service, processing the call trace information, performing real-time anomaly detection on the processed time series as a data stream with an online clustering algorithm, and determining the anomaly occurrence time point;

step two, based on the anomaly occurrence time point detected in step one, determining the call relationships among the sub-service nodes of the microservice and the deployment relationships between the sub-service nodes and the servers by parsing the call trace information in combination with the deployment information of the microservice, representing the sub-service nodes by nodes on the topology graph, representing the call or deployment relationships by edges between the nodes, and automatically constructing a real-time topology graph of the microservice application system;

step three, based on the anomaly occurrence time point detected in step one, acquiring monitoring data for a period before and after the anomaly, calculating personalized anomaly weights, and establishing an attribute relationship graph;

step four, for the attribute relationship graph established in step three, evaluating the degree of abnormality of each microservice child node on the attribute relationship graph with the PageRank algorithm, obtaining a ranked list of the most probable root-cause nodes, and diagnosing the root cause of the anomaly.

2. The microservice anomaly diagnosis method based on an attribute relationship graph according to claim 1, characterized in that: in step one, when the call trace information is processed, a 30-second time window is applied to the call times between two microservice child nodes and the median of the call times in the window is taken as a single data value of a time series; taking the median reduces noise in the data and improves data quality.

3. The microservice anomaly diagnosis method based on an attribute relationship graph according to claim 1, characterized in that the topology graph of step two is established as follows:

step 201, parsing the call-trace time series at the anomaly occurrence time point determined in step one, the series being stored as key-value pairs of the form "parent node-child node: call time", splitting out the parent and child service nodes, reading the configuration file, and obtaining the deployment server information corresponding to the parent and child service nodes, thereby obtaining all the nodes used to construct the topology graph;

4. The microservice anomaly diagnosis method based on an attribute relationship graph according to claim 1, characterized in that: in step three, weights are calculated for the sub-service nodes and edges of the microservice, the weights being divided into node weights and edge weights, with the following calculation steps:

step 301, calculating the node weights of the topology graph: a feature vector N = (node_on-off, node_s-connect, node_network, node_CPU, node_memory) is constructed for each node of the topology graph, where node_on-off is the database monitoring switch, node_s-connect is the maximum number of database connections, node_network is the network traffic, node_CPU is the CPU usage of the container or server, and node_memory is the memory usage of the container or server, and a custom index weight calculation formula is defined:

where λ_i represents the similarity of the index feature vectors before and after the anomaly occurs and T_i represents the attenuation coefficient, the anomaly degree attribute of the node being calculated by the formula;

step 302, calculating the weights of the edges connecting the nodes of the topology graph: the personalized weight is calculated by a mixed weighting of three anomaly scores, namely the service call time over a period before and after the anomaly, the resource utilization information of the container or server to which the service belongs, and the call failure rate over a period of time, giving the weight calculation formula W_ij = c₁S_t + c₂S_m + c₃S_f, calculated as follows:

S_t represents the call delay anomaly score, calculated as S_t = max(-log P_x), where P_x is a kernel density estimation function;

S_m represents the server resource utilization index, calculated by the formula

S_f represents the failed-call anomaly score, a new anomaly evaluation index proposed by the invention; in its calculation formula, n is the number of failed calls and m is the total number of calls;