CN111737033A

CN111737033A - Micro-service fault positioning method based on runtime map analysis

Info

Publication number: CN111737033A
Application number: CN202010457981.8A
Authority: CN
Inventors: 彭鑫; 冀超
Original assignee: Fudan University
Current assignee: Fudan University
Priority date: 2020-05-26
Filing date: 2020-05-26
Publication date: 2020-10-02
Anticipated expiration: 2040-05-26
Also published as: CN111737033B

Abstract

The invention belongs to the technical field of software engineering and cloud computing, and particularly relates to a microservice fault localization method based on runtime map analysis. The method automatically updates and maintains a microservice runtime map based on the runtime data of the microservice system; when a request fault occurs, it evaluates the abnormality degree of each system component using the data in the map, analyzes how the abnormality degree propagates, and finally obtains a fault localization result. The method comprises two parts: real-time construction and dynamic updating of the microservice runtime map, and fault localization based on the runtime map. The method uses data such as service deployment, service calls and monitoring indicators of the microservice system to construct a runtime map that describes the running state of the system; after a fault occurs, each component of the system is analyzed according to the map data and the most probable fault locations are provided to developers, which speeds up fault localization and reduces manual workload.

Description

Micro-service fault positioning method based on runtime map analysis

Technical Field

The invention belongs to the technical field of software engineering and cloud computing, and particularly relates to a micro-service fault positioning method.

Background

The microservice architecture is an architectural style that decomposes an entire application into several decoupled functional modules. Each functional module has its own process and execution environment, and the modules exchange information through lightweight communication protocols (such as RPC or HTTP); such functional modules are called microservices. Fine-grained service partitioning and isolated runtime environments allow applications based on a microservice architecture to be developed and deployed independently and scaled flexibly on demand. The microservice architecture has become a key cloud-native technology and is widely used in many enterprises.

Fault localization is an important part of system operation and maintenance. When a monolithic or distributed system fails, developers can search for the fault source using approaches such as incremental debugging, execution trace comparison and fault feature learning. A microservice system is more complex and dynamic than a typical monolithic or distributed system, which makes these approaches less effective. When a microservice system fails, developers have to deal with complex service interactions, diverse runtime environments and dynamically created and destroyed instances, which makes fault localization difficult and inefficient.

Disclosure of Invention

The invention aims to provide a micro-service fault positioning method based on runtime map analysis, which can accelerate fault positioning speed and reduce manual workload.

The method uses data such as service deployment, service calling and monitoring indexes of the micro-service system to construct a runtime map for describing the running state of the micro-service system; after a fault occurs, the invention analyzes each component of the system according to the map data and provides the most probable fault position for developers.

The method automatically updates and maintains the microservice runtime map based on the runtime data of the microservice system; when a request fault occurs, it evaluates the abnormality degree of each system component using the data in the map, analyzes how the abnormality degree propagates, and finally obtains a fault localization result. The method consists of two parts: real-time construction and dynamic updating of the microservice runtime map, and fault localization based on the runtime map. The former continuously updates the map data as the system runs, while the latter reads the map data and analyzes the location of the fault source when a fault occurs.

The runtime map continuously collects and associates runtime data from multiple aspects of the microservice system in order to describe its running state. The raw data of the map includes service deployment, service calls, monitoring indicators and the like. The high-level structure of the microservice runtime map constructed by the invention is shown in FIG. 1. Virtual machines, microservices, microservice instances, APIs and containers are the running components of a microservice system and are also called nodes. There are three kinds of monitoring indicator nodes: container monitoring indicators, microservice instance monitoring indicators and API monitoring indicators, which correspond to containers, microservice instances and APIs respectively and store the monitoring indicator data of the corresponding node.

The relationships between the running components are as follows. The "contained in", "deployed in" and "belongs to" relationships describe the deployment architecture of the microservice system: a container is "contained in" a microservice instance, a microservice instance is "deployed in" a virtual machine, and a microservice instance "belongs to" a microservice. The "calls" and "belongs to" relationships between microservices and APIs describe call statistics between microservices and APIs. The "calls in a request" and "completes in a request" relationships between microservice instances and APIs record the calls made within an individual request. Three "monitoring" relationships mark the map nodes (microservice instances, APIs and containers) that are monitored by the monitoring system.

The real-time construction and dynamic updating of the runtime map is executed in a loop at a fixed time interval to keep the data up to date. The process comprises the following steps:

(1) Extracting the deployment architecture. The deployment architecture refers to the deployment location of each microservice instance on a virtual machine, the logical relationship between microservice instances and microservices, and the containers that make up each microservice instance. This data is mainly provided by the container orchestration platform. The step comprises the following sub-steps:

1) acquiring the data of containers, microservice instances, microservices and virtual machines, creating a new map and adding the nodes;

2) acquiring the deployment location of each microservice instance from the microservice instance, and adding the "deployed in" relationship of the instance to the map;

3) acquiring the microservice to which each microservice instance belongs from the microservice instance, and adding the "belongs to" relationship of the instance to the map;

4) acquiring the microservice instance to which each container belongs from the container, and adding the "contained in" relationship of the container to the map.
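The four sub-steps above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the `RuntimeMap` class, the relationship names (`deployed_in`, `belongs_to`, `contained_in`) and the input record fields (`name`, `vm`, `service`, `instance`) are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class RuntimeMap:
    """Hypothetical in-memory stand-in for the runtime map."""
    nodes: dict = field(default_factory=dict)  # node id -> node type
    edges: set = field(default_factory=set)    # (source id, relation, target id)

    def add_node(self, node_id, node_type):
        self.nodes[node_id] = node_type

    def add_edge(self, source, relation, target):
        self.edges.add((source, relation, target))


def extract_deployment(vms, services, instances, containers):
    """Build the deployment part of the map from already-fetched records."""
    runtime_map = RuntimeMap()
    # Sub-step 1: add a node for every component.
    for vm in vms:
        runtime_map.add_node(vm, "VirtualMachine")
    for service in services:
        runtime_map.add_node(service, "Microservice")
    for inst in instances:
        runtime_map.add_node(inst["name"], "MicroserviceInstance")
    for container in containers:
        runtime_map.add_node(container["name"], "Container")
    # Sub-steps 2 and 3: relationships read from each microservice instance.
    for inst in instances:
        runtime_map.add_edge(inst["name"], "deployed_in", inst["vm"])
        runtime_map.add_edge(inst["name"], "belongs_to", inst["service"])
    # Sub-step 4: relationship read from each container.
    for container in containers:
        runtime_map.add_edge(container["name"], "contained_in", container["instance"])
    return runtime_map


demo_map = extract_deployment(
    vms=["vm-1"],
    services=["order"],
    instances=[{"name": "order-abc", "vm": "vm-1", "service": "order"}],
    containers=[{"name": "order-abc-c0", "instance": "order-abc"}],
)
print(sorted(demo_map.edges))
```

Keeping edges as `(source, relation, target)` triples makes the later comparison between a newly built map and a stored one a simple set or dictionary operation.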

(2) Parsing the call relationships. At the macro level, a call relationship is a call between one microservice and another; because each request is actually served by microservice instances, at the micro level it is a call from one microservice instance to an API of another microservice instance within a particular request. The step comprises the following sub-steps:

1) parsing each cross-service call in each call chain, and adding the API node and the "calls in a request" and "completes in a request" relationships to the map;

2) for each "calls in a request" and "completes in a request" relationship, adding the corresponding "calls" and "belongs to" relationships to the map.
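A rough sketch of these two sub-steps follows, assuming each cross-service call has already been reduced to a record with the hypothetical fields `trace_id`, `caller`, `callee` and `api`, and that a mapping from instance to microservice is available from the deployment step. The relationship names are illustrative only.

```python
from collections import Counter


def parse_calls(calls, instance_to_service):
    """Derive per-request and aggregated call relationships."""
    request_edges = []       # (instance, relation, api, trace id)
    call_counts = Counter()  # (caller microservice, api) -> number of calls
    api_owner = {}           # api -> microservice the API belongs to
    for call in calls:
        api = call["api"]
        # Sub-step 1: relationships recorded for this individual request.
        request_edges.append((call["caller"], "calls_in_request", api, call["trace_id"]))
        request_edges.append((call["callee"], "completes_in_request", api, call["trace_id"]))
        # Sub-step 2: service-level "calls" statistics and API ownership.
        call_counts[(instance_to_service[call["caller"]], api)] += 1
        api_owner[api] = instance_to_service[call["callee"]]
    return request_edges, call_counts, api_owner


demo_calls = [
    {"trace_id": "t1", "caller": "ui-1", "callee": "order-1", "api": "POST /orders"},
    {"trace_id": "t2", "caller": "ui-1", "callee": "order-2", "api": "POST /orders"},
]
demo_instances = {"ui-1": "ui", "order-1": "order", "order-2": "order"}
request_edges, call_counts, api_owner = parse_calls(demo_calls, demo_instances)
print(call_counts, api_owner)
```

Two requests served by different `order` instances collapse into a single service-level "calls" relationship with a count of two, which is the statistical view described above.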

(3) Collecting monitoring indicators. Monitoring indicator data refers to the resource usage and certain performance indicators of the system components at different moments; for containers and microservice instances, the monitoring indicators are mainly CPU usage and memory usage; for APIs, the indicator is the request response time. This data is used to evaluate the running state of a component and to judge whether the component is running normally. The step comprises the following sub-steps:

1) obtaining the name of each monitored component and the name of each monitoring indicator;

2) acquiring the monitoring data of the corresponding component from the monitoring platform;

3) adding monitoring nodes to the map and storing the data.

(4) Updating the map data. The newly constructed runtime map is compared with the old map in the database; new data is added to the database and changed data is modified. After a fixed time interval, the process returns to step (1) and is executed again.
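The comparison in step (4) can be illustrated with plain dictionaries. This is only a sketch: the store is modeled as a dictionary keyed by node name, and the function names `diff_maps` and `apply_update` are hypothetical. Deletion of stale entries is not shown because the text only mentions adding and modifying data.

```python
def diff_maps(old, new):
    """Return entries that are new and entries whose attributes changed."""
    added = {key: value for key, value in new.items() if key not in old}
    changed = {key: value for key, value in new.items()
               if key in old and old[key] != value}
    return added, changed


def apply_update(store, new):
    """Write only the added and changed entries into the store."""
    added, changed = diff_maps(store, new)
    store.update(added)
    store.update(changed)
    return added, changed


demo_store = {"pod-a": {"vm": "vm-1"}, "pod-b": {"vm": "vm-1"}}
demo_new = {"pod-a": {"vm": "vm-2"}, "pod-b": {"vm": "vm-1"}, "pod-c": {"vm": "vm-2"}}
demo_added, demo_changed = apply_update(demo_store, demo_new)
print(demo_added, demo_changed)
```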

When a fault occurs, the location of the fault source is analyzed using the runtime map data. The main idea is to use the monitoring data of each node to calculate how far the node deviates from its normal running state (called the abnormality degree), to use the relationships among nodes to analyze the common cause among nodes with a high abnormality degree, and finally to output the result. Fault localization comprises the following four steps:

(1) Triggering the fault localization process. The process can be triggered when an explicit request error occurs in the system; explicit request errors include an incorrect request result, a request response time significantly outside the normal range, and the like.

(2) Calculating the abnormality degree of each node of the map. The abnormality degree of a node measures how far the running state of that node deviates from the normal state. Each monitoring node in the map stores the monitoring indicator data of its corresponding node at past moments. The invention uses, as the abnormality degree of a component, the ratio of the difference between the indicator value at a given moment and the mean of the indicator values over a preceding period to the standard deviation over that period. Define A(t) as the abnormality degree of a monitored node at time t, where t is the monitoring data collection time closest to the time the fault occurred; v_t is the value of the node's monitoring indicator at time t, v_{t-1} is the first indicator value before time t, v_{t-2} is the second indicator value before time t, and so on; \mu_t is the mean of the n indicator values before time t; and \sigma_t is the standard deviation of the n indicator values before time t. Then, for a monitored node, the abnormality degree A(t) at time t is calculated as:

$$A(t) = \frac{v_t - \mu_t}{\sigma_t}$$

The step comprises the following sub-steps:

1) acquiring the monitoring indicator data from several collection times before the collection time closest to the fault;

2) calculating the mean and the standard deviation of the data collected before the fault occurred;

3) calculating the ratio of the difference between the monitoring indicator value at the collection time closest to the fault and the mean to the standard deviation, and taking this ratio as the abnormality degree of the node.
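The three sub-steps amount to a standard score computed over a trailing window. The sketch below assumes a population standard deviation (the text does not say whether a population or sample estimate is used) and returns 0.0 when the window has zero variance, which is a placeholder choice that the source does not specify. The function name `abnormality_degree` is hypothetical.

```python
from statistics import mean, pstdev


def abnormality_degree(history, current):
    """Ratio of (current - mean of history) to the standard deviation of history.

    history: the n indicator values collected before the fault (sub-step 1).
    current: the indicator value at the collection time closest to the fault.
    """
    mu = mean(history)       # sub-step 2: mean of the preceding values
    sigma = pstdev(history)  # sub-step 2: population standard deviation (assumption)
    if sigma == 0:
        # Assumption: the source does not define this case; 0.0 is a placeholder.
        return 0.0
    return (current - mu) / sigma  # sub-step 3


# The window below has mean 5 and population standard deviation 2.
demo_score = abnormality_degree([2, 4, 4, 4, 5, 5, 7, 9], 11)
print(demo_score)
```

A response time of 11 against a window with mean 5 and standard deviation 2 lies three standard deviations above normal, so the node would be treated as strongly abnormal.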

(3) Analyzing the propagation of the abnormality degree. The location where a system fault directly manifests is often not the root location of the fault, and several abnormal locations may share a common cause. In this step, the topology of the map is combined with the abnormality degree of each component, and the propagation of the abnormality degree through the map is analyzed in order to find the common cause among several abnormal components and thereby determine the final fault localization result. The pseudocode for the abnormality degree propagation analysis is given in the appendix. The step comprises the following sub-steps:

1) taking each monitoring node as a starting point, propagating its abnormality degree outward layer by layer in breadth-first order, and multiplying the abnormality degree by a damping coefficient for each layer it propagates;

2) after the propagation is finished, each node has received several abnormality degree values; these values are summed to obtain the total abnormality degree value of each node, which is used as the final accumulated abnormality degree of that node.
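As a worked illustration of the two sub-steps above, the accumulated value can be written as a sum over starting nodes. The symbols \gamma (damping coefficient), d(c, v) (number of layers between starting node c and node v), C (set of starting nodes) and S(v) (accumulated value of node v) are notation introduced here rather than taken from the source, and the expression assumes that each node receives exactly one value from each starting node that reaches it.

```latex
S(v) = \sum_{c \in C} \gamma^{\,d(c,v)} \, A_c

% Example with \gamma = 0.7: a node that is one layer away from c_1 (A_{c_1} = 4)
% and two layers away from c_2 (A_{c_2} = 2) accumulates
S(v) = 0.7 \cdot 4 + 0.7^{2} \cdot 2 = 2.8 + 0.98 = 3.78
```

A node that sits close to several abnormal monitored nodes therefore accumulates a larger value than a node that is close to only one, which is how a shared cause rises in the ranking.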

(4) Sorting and outputting the results. The results are sorted and output. The developer then examines the candidate fault locations in the order given in the results and determines the final fault location.

The invention has three main advantages.

First, the invention provides developers with an ordered list of suspected fault locations and narrows the search range of fault localization, which speeds up fault localization, reduces the time it takes, and spares developers from searching for the fault among the large number of running components of a microservice system.

Second, the invention is deployed non-invasively and does not interfere with the normal operation of the microservice system.

Third, the data collection and result output of the invention are performed in real time and do not require excessive system resources or time.

The method of the invention can greatly speed up fault localization and reduce the manual workload required. Three common faults of different types were injected into the open-source microservice benchmark system TrainTicket and a comparative fault localization experiment was carried out; compared with fault localization based on manual analysis of system logs, the method reduced fault localization time by 64% on average.

Drawings

FIG. 1 is a diagram of a high-level structure of a micro-service runtime graph constructed by the present invention.

Detailed Description

The following describes an embodiment of runtime map construction and runtime-map-based fault localization for a microservice system that uses Docker and Kubernetes to deploy and orchestrate containers, and Prometheus and Zipkin to collect monitoring data and call chain data.

For the real-time construction and dynamic update of the micro-service runtime map, the implementation method comprises the following steps:

(1) Extracting the deployment architecture. The state and attribute data of the virtual machines, microservice instances and microservices in the cluster are obtained from the Kubernetes platform interface; the state and attribute data of each container are obtained from the interface provided by the Docker daemon on each virtual machine. The io.kubernetes.pod.name attribute of a container indicates the microservice instance to which it belongs; the nodeName attribute of a microservice instance specifies its deployment location; the labels of a microservice instance and the selector attribute of a microservice together determine which microservice the instance belongs to. The deployment architecture is constructed from this data, and the node states and attributes are stored.
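The attribute lookups described above can be sketched against plain dictionaries shaped like Kubernetes Pod and Service objects and Docker container records. No client library is called here; the input dictionaries are assumed to have been fetched already, the helper names `service_of` and `deployment_edges` are hypothetical, and treating a Service without a selector as matching no instance is an assumption of this sketch.

```python
def service_of(pod, services):
    """Return the name of the first Service whose selector matches the Pod's labels."""
    labels = pod["metadata"].get("labels", {})
    for service in services:
        selector = service["spec"].get("selector") or {}
        # Every selector key/value pair must appear in the Pod's labels.
        if selector and all(labels.get(k) == v for k, v in selector.items()):
            return service["metadata"]["name"]
    return None


def deployment_edges(pods, services, containers):
    """Derive deployment relationships from the three attribute lookups."""
    edges = []
    for pod in pods:
        name = pod["metadata"]["name"]
        edges.append((name, "deployed_in", pod["spec"]["nodeName"]))
        owner = service_of(pod, services)
        if owner is not None:
            edges.append((name, "belongs_to", owner))
    for container in containers:
        pod_name = container["Labels"].get("io.kubernetes.pod.name")
        if pod_name:
            edges.append((container["Id"], "contained_in", pod_name))
    return edges


demo_pods = [{"metadata": {"name": "order-7d9f", "labels": {"app": "order"}},
              "spec": {"nodeName": "worker-1"}}]
demo_services = [{"metadata": {"name": "order"}, "spec": {"selector": {"app": "order"}}}]
demo_containers = [{"Id": "c0ffee", "Labels": {"io.kubernetes.pod.name": "order-7d9f"}}]
demo_edges = deployment_edges(demo_pods, demo_services, demo_containers)
print(demo_edges)
```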

(2) Parsing the call relationships. Trace data for the most recent period is obtained from the Zipkin platform interface, and each Span in each Trace is analyzed one by one. The Url attribute of a Span specifies the called API, and the node_id attribute specifies the microservice instance that initiated or received the call. These attributes are used to add call relationship data to the map, and the attributes of the API nodes are stored.

(3) Collecting monitoring indicators. CPU usage and memory usage monitoring data for microservice instances and containers are obtained from the Prometheus platform and stored in the corresponding monitoring nodes; the duration attribute of each Span in each Zipkin Trace from the most recent period is read as the response time data of the API and stored in the monitoring node corresponding to that API.
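Turning Span durations into a per-API response time series might look like the sketch below. It assumes Span records shaped like Zipkin v2 JSON, where `timestamp` and `duration` are expressed in microseconds; the `url` key used to identify the API is a hypothetical placement of the Url attribute mentioned in the text, and `collect_response_times` is an illustrative name.

```python
from collections import defaultdict


def collect_response_times(spans):
    """Group Span durations by API as (timestamp, milliseconds) points."""
    series = defaultdict(list)
    for span in spans:
        api = span.get("url")  # hypothetical key for the called API
        if api is None or "duration" not in span:
            continue  # skip Spans that cannot be attributed or timed
        series[api].append((span["timestamp"], span["duration"] / 1000.0))
    for points in series.values():
        points.sort()  # chronological order for later windowed statistics
    return dict(series)


demo_spans = [
    {"url": "/orders", "timestamp": 2000, "duration": 1500},
    {"url": "/orders", "timestamp": 1000, "duration": 500},
    {"timestamp": 3000, "duration": 700},
]
demo_series = collect_response_times(demo_spans)
print(demo_series)
```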

(4) Updating the map data. The newly acquired data is compared with the old map, and the data is updated in the Neo4j database. After a fixed time interval of 5 seconds, the process returns to step (1) and is executed again.

For the fault positioning method, the implementation mode is as follows:

(1) Triggering the fault localization process. When a request fails, a developer can enter the call chain ID to trigger the fault localization process.

(2) Calculating the abnormality degree of each node of the map. The time series data of the monitored nodes is read from the runtime map, the mean and standard deviation of the 20 data points before the most recent moment of the fault are calculated, and the difference between the indicator value at that moment and the mean is expressed as a multiple of the standard deviation to obtain the abnormality degree of each node.

(3) Analyzing the propagation of the abnormality degree. Taking each monitoring node as a starting point, the abnormality degree is propagated outward layer by layer along the topology of the map, with a damping coefficient of 0.7 for each layer. The sum of the abnormality degrees received by each node in the map is then calculated.

(4) Sorting and outputting the results. The map nodes are grouped by node type and sorted by their total abnormality degree. Nodes whose abnormality degree is below the mean for their type are removed from the results, and the rest are output. The developer can then troubleshoot in the order of the output and determine the fault source.
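The grouping, filtering and sorting in step (4) can be sketched as follows. The input format, a list of `(name, node type, accumulated abnormality degree)` tuples, and the function name `rank_suspects` are assumptions made for the example; descending order is assumed so that the most suspicious node of each type comes first.

```python
from collections import defaultdict
from statistics import mean


def rank_suspects(nodes):
    """Group nodes by type, drop those below the type mean, sort the rest."""
    by_type = defaultdict(list)
    for name, node_type, score in nodes:
        by_type[node_type].append((name, score))
    ranked = {}
    for node_type, items in by_type.items():
        threshold = mean([score for _, score in items])
        kept = [(name, score) for name, score in items if score >= threshold]
        ranked[node_type] = sorted(kept, key=lambda item: item[1], reverse=True)
    return ranked


demo_nodes = [
    ("api-a", "API", 3.0),
    ("api-b", "API", 1.0),
    ("c1", "Container", 2.0),
    ("c2", "Container", 4.0),
    ("c3", "Container", 6.0),
]
demo_ranking = rank_suspects(demo_nodes)
print(demo_ranking)
```

Filtering against the mean of each type, rather than a global mean, keeps indicators with different scales (for example CPU usage and response time) from crowding each other out.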

Three common faults of different types were injected into the open-source microservice benchmark system TrainTicket and a comparative fault localization experiment was carried out; compared with fault localization based on manual analysis of system logs, the method reduced fault localization time by 64% on average.

Appendix

Propagation analysis algorithm for the abnormality degree on the map:

Input: the set C of map nodes in the runtime map that have monitoring indicators, and the set V of all map nodes in the runtime map.

Output: V // the node set V, with the accumulated abnormality degree calculated for each node.

function FaultAnalysis(C, V)
    for c in C:
        dfsQueue.offer(c)    // FIFO queue, so the traversal is breadth-first despite the name
        baseAbnormality = c.abnormality
        while dfsQueue ≠ ∅:
            layerSize = dfsQueue.size()
            baseAbnormality = baseAbnormality * 0.7    // damping applied once per layer
            for (i = 0; i < layerSize; i++)
                currNode <- dfsQueue.poll()
                for neighborNode in currNode.nextNeighbors
                    neighborNode.scoreList.add(baseAbnormality)
                    dfsQueue.offer(neighborNode)
                end for
            end for
        end while
    end for
    for v in V:
        v.abnormality = average(v.scoreList) * log₂(size(v.scoreList) + 1)
    end for
    return V
end function
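A runnable Python rendering of the appendix algorithm is sketched below. It is an interpretation rather than the authors' code: the adjacency dictionary `neighbors`, the function name `fault_analysis` and the `visited` set are assumptions. In particular, the pseudocode does not show how `nextNeighbors` avoids revisiting nodes, so this sketch visits each node at most once per starting node, which also guarantees termination on cyclic graphs. Nodes that receive no values are given 0.0.

```python
import math
from collections import deque


def fault_analysis(monitored, neighbors, damping=0.7):
    """monitored: node -> abnormality degree; neighbors: node -> adjacent nodes."""
    score_lists = {node: [] for node in neighbors}
    for start, abnormality in monitored.items():
        queue = deque([start])
        visited = {start}  # assumption: one visit per node per starting node
        base = abnormality
        while queue:
            base *= damping  # damp once for every layer moved outward
            for _ in range(len(queue)):  # process exactly one layer
                current = queue.popleft()
                for nxt in neighbors.get(current, ()):
                    if nxt in visited:
                        continue
                    visited.add(nxt)
                    score_lists.setdefault(nxt, []).append(base)
                    queue.append(nxt)
    result = {}
    for node, received in score_lists.items():
        if received:
            average = sum(received) / len(received)
            result[node] = average * math.log2(len(received) + 1)
        else:
            result[node] = 0.0  # assumption: no received values means no score
    return result


# Chain a - b - c with a single monitored node "a" of abnormality degree 10.
demo_neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
demo_result = fault_analysis({"a": 10.0}, demo_neighbors)
print(demo_result)
```

Note that the aggregation here follows the appendix (average of the received values multiplied by log₂ of the count plus one), which rewards nodes reached from many starting points without letting the count dominate the score.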

Claims

1. A microservice fault localization method based on runtime map analysis, characterized in that a microservice runtime map is automatically updated and maintained based on runtime data of a microservice system; when a request fault occurs, the abnormality degree of each system component is evaluated using the data in the map, the propagation of the abnormality degree is analyzed, and a fault localization result is finally obtained; the method comprises two stages: (I) real-time construction and dynamic updating of the microservice runtime map, and (II) fault localization based on the runtime map; the former continuously updates the map data as the system runs, and the latter reads the map data and analyzes the location of the fault source when a fault occurs;

the raw data of the microservice runtime map comprises service deployment, service calls and monitoring indicators; the microservice runtime map comprises virtual machines, microservices, microservice instances, APIs (application programming interfaces) and containers, which are the running components of a microservice system and are also called nodes; there are three kinds of monitoring indicator nodes: container monitoring indicators, microservice instance monitoring indicators and API monitoring indicators, which correspond to containers, microservice instances and APIs respectively and store the monitoring indicator data of the corresponding nodes; the relationships between the running components include: the "contained in", "deployed in" and "belongs to" relationships, which describe the deployment architecture of the microservice system, wherein a container is contained in a microservice instance, a microservice instance is deployed in a virtual machine, and a microservice instance belongs to a microservice; the "calls" and "belongs to" relationships between microservices and APIs, which describe call statistics between microservices and APIs; the "calls in a request" and "completes in a request" relationships between microservice instances and APIs, which record the calls made within an individual request; and three monitoring relationships, which indicate that the map nodes, namely microservice instances, APIs and containers, are monitored by the monitoring system;

analyzing the fault root cause position by means of the runtime map data: and calculating the degree of each node deviating from the normal operation state by using the monitoring data of the nodes, namely the abnormal degree, analyzing the common cause among the nodes with high abnormal degree by using the relationship among the nodes, and finally obtaining a result.

2. The micro-service fault location method based on runtime graph analysis according to claim 1, wherein the detailed flow of the real-time construction and dynamic update stage of the micro-service runtime graph is as follows:

(1) extracting the deployment architecture

The deployment architecture refers to the deployment position of the micro service instance on the virtual machine, the logical relationship between the micro service instance and the micro service and the container composition in the micro service instance; the part of data is mainly provided by a container arrangement platform; the method specifically comprises the following substeps:

4) acquiring the micro service instance to which the container belongs from the container, and adding the 'contained' relationship of the container in the figure;

(2) resolving call relationships

The calling relation macroscopically refers to the calling relation between the micro service and the micro service; because each request is finished by a micro-service instance, the calling relationship refers to the calling relationship of a micro-service instance to other micro-service instance APIs in a certain request microscopically; the method specifically comprises the following substeps:

2) for each relation of 'call in a certain request' and 'completion in a certain request', adding the relation of 'call' and 'belonging' to the graph;

(3) collecting monitoring indicators

The monitoring indicator data refers to the resource usage and certain performance indicators of the system components at different moments; for containers and microservice instances, the monitoring indicators are mainly CPU usage and memory usage; for APIs, the indicator is the request response time; this data is used to evaluate the running state of a component and to judge whether the component is running normally; the step specifically comprises the following sub-steps:

3) adding monitoring nodes in the graph and storing data;

(4) updating map data

Comparing the newly constructed runtime atlas with an old atlas in a database, adding new data into the database and modifying changed data; after a fixed time interval, the step (1) is returned again, and the process is executed circularly.

3. The microservice fault location method based on runtime atlas analysis of claim 1, wherein the fault location phase based on runtime atlas is as follows:

(1) triggering fault location procedures

When the system has an explicit request error, the fault positioning process can be triggered; explicit request errors include request result error, request response time significantly out of normal range;

(2) calculating the abnormal degree of each node of the map

The abnormality degree of a node measures how far the running state of that node deviates from the normal state; each monitoring node in the map stores the monitoring indicator data of its corresponding node at past moments; the ratio of the difference between the indicator value at a given moment and the mean of the indicator values over a preceding period to the standard deviation over that period is used as the abnormality degree of the component; A(t) is defined as the abnormality degree of a monitored node at time t; t is the monitoring data collection time closest to the time the fault occurred; v_t is the value of the node's monitoring indicator at time t, v_{t-1} is the first indicator value before time t, v_{t-2} is the second indicator value before time t, and so on; \mu_t is the mean of the n indicator values before time t; \sigma_t is the standard deviation of the n indicator values before time t; then, for a monitored node, the abnormality degree A(t) at time t is calculated as:

$$A(t) = \frac{v_t - \mu_t}{\sigma_t}$$

the method specifically comprises the following substeps:

3) calculating the ratio of the difference between the monitoring index value and the mean value to the standard deviation at the latest moment of the fault occurrence, and taking the ratio result as the abnormal degree of the node;

(3) analyzing outlier propagation relationships

Combining the topological structure of the map and the abnormality degree of each component, analyzing the propagation relation of the abnormality degree in the map, searching the common cause among the abnormal components, and further determining the final fault positioning result; the method specifically comprises the following substeps:

2) after the propagation is finished, each node receives a plurality of abnormal degree values, and the abnormal degree values are summed to obtain a total abnormal degree value of each node, which is used as the final accumulated abnormal degree of each node;

(4) sorting and outputting the results

The results are sorted and output; the developer examines the fault locations in the order given in the results and determines the final fault location.