CN112181759A - Method for monitoring micro-service performance and diagnosing abnormity

Info

Publication number: CN112181759A
Application number: CN202010919828.2A
Authority: CN
Inventors: 郑杰生; 赖蔚蔚; 谢彬瑜; 吴广财; 陈非; 叶杭
Original assignee: Guangdong Electric Power Information Technology Co Ltd
Current assignee: Guangdong Electric Power Information Technology Co Ltd
Priority date: 2020-09-04
Filing date: 2020-09-04
Publication date: 2021-01-05

Abstract

The invention discloses a method for monitoring microservice performance and diagnosing anomalies. The method uses a resource call chain abstraction to describe the levels of the resources in a microservice application, the interactions among those resources, and the associated performance information. It constructs the resource call chain by monitoring request processing, uses the chain to analyze and diagnose request delays and request failures, and locates the specific resource at fault and the level to which it belongs. Because the resource call chain describes resource interactions, resource levels, and associated performance information in one abstraction, the faulty resource and its level can be pinpointed. The method helps users understand the running state and behavior of an application promptly and accurately, diagnoses and locates common request delay and failure problems, and makes it more efficient for users to resolve operational problems.

Description

Method for monitoring micro-service performance and diagnosing abnormity

Technical Field

The invention relates to a software maintenance method for microservice applications, in particular to a method for microservice performance monitoring and anomaly diagnosis based on a resource call chain, and belongs to the technical field of software.

Background

With the development of information technology, software plays an increasingly important role in production and daily life. Accordingly, higher demands are placed on software quality attributes such as reliability, availability, and manageability. Software quality assurance runs through the whole software life cycle: quality must be controlled and assured both in the design and implementation stage and in the online operation stage. The principles and methods of software engineering provide important support for quality assurance during design and implementation. Monitoring and diagnosis technology supports the operation stage: it lets operators learn the state and behavior of the running system promptly and accurately, and discover and locate performance problems in it.

IT systems and applications are becoming more complex, which makes it harder to understand their runtime behavior and manage their performance. The cost of managing and maintaining enterprise IT is rising rapidly, and user satisfaction depends directly on how quickly problems are found and resolved. A sudden outage affects a large number of customers, halts the enterprise's business, shakes customer confidence, and ultimately damages the enterprise's reputation and revenue. With technologies such as multi-tier architectures and grid computing, distributed systems are increasingly complex. The key challenge is to understand the runtime behavior of highly distributed network applications and systems accurately, and to find and precisely locate problems in time. For users and microservice developers, the platform environment is transparent when everything works normally; it is effectively a black box. When a problem occurs at runtime, however, that black box makes the problem extremely difficult to locate and diagnose. Solving the problem correctly requires a clear understanding of the microservice server platform's internal architecture, services, and complex configuration, as well as of the components and services involved and the interactions between them.

The performance of multi-tier microservices is a complex problem. Monitoring service components alone is not sufficient: the key services of the microservice server must be monitored at the same time, the dependencies and relationships among tiers such as the front end, middleware, and back end must be taken into account, and performance must be analyzed across the whole microservice system. As software grows more complex and users' QoS requirements rise, enterprises invest more and more in software management and raise two important requirements. First, enough runtime state information must be available to grasp the current state of the system accurately. Second, performance problems that arise at runtime must be discovered promptly and located accurately so that corrective measures can be taken.

Disclosure of Invention

The invention aims to provide a method for microservice performance monitoring and anomaly diagnosis that diagnoses delays and failures in request processing and locates the stage, the specific resource, and the level at which the problem occurs.

To achieve the above object, the technical solution of the invention is a method for microservice performance monitoring and anomaly diagnosis, characterized in that: a resource call chain abstraction is used to describe the resource levels in a microservice application, the interactions among the resources, and the associated performance information; the resource call chain is constructed by monitoring request processing; request delays and request failures are analyzed and diagnosed using the resource call chain; and the specific resource at fault and the level to which it belongs are located.

The method for microservice performance monitoring and anomaly diagnosis further comprises the following steps.

Step one, resource call chain description: the resource call chain is the set of all resources involved in processing a request together with their interactions. It is formally represented as a directed tree T(R, V, E), where R is the tree root and represents the entry resource of the request, V is the set of nodes representing all non-entry resources of the request, and E is the set of edges representing the interactions among resources, that is, source and destination pairs.

Step two, resource call chain construction: before a monitored resource begins to execute, monitoring code is executed that looks up or creates the resource record node and passes in the start time; before the monitored resource finishes executing and returns, monitoring code is executed that at least records the end time on the current node.

Step three, resource call chain diagnosis: based on the monitoring information for each resource, let S denote the average service time a resource in the system needs to execute a task, R the average delay time, and U the utilization of the system resource, so that S = R(1 - U). In this step R denotes the delay time, not the tree root of step one. The delay time of each type of request, the corresponding system resource utilization, and the number of times the request was executed are measured and collected as a triple (R, U, Count). The service time of the request under different CPU utilizations is then calculated, performance anomalies in the system are detected by comparing changes in service time, and the resource responsible for the performance bottleneck is located.

In the method, the resource call chain comprises the resources involved in processing the request, the interactions among those resources, the levels of the resources, and associated performance characteristics, where the performance characteristics include at least time, count, and ratio.

In the method, the resource call chain further has a granularity that, based on the request and the scopes composed from it, corresponds to request, session, application, host, or cluster, and the resource call chain supports merging and splitting as required by the granularity.

The monitoring and anomaly diagnosis method of the invention has prominent substantive features and represents notable progress: because the resource call chain describes the interactions among resources, the resource levels, and the associated performance information in one abstraction, the specific resource at fault and its level can be located; the method helps users understand the running state and behavior of the application promptly and accurately; and common request delay and failure problems are diagnosed and located, which makes it more efficient for users to resolve operational problems.

Drawings

FIG. 1 is a schematic diagram of a monitoring system for micro-service performance anomaly according to the present invention.

Detailed Description

The embodiments of the invention are described in detail below with reference to the accompanying drawing, so that the technical solution can be understood and grasped more easily and the scope of protection of the invention can be defined more clearly.

IT systems and application services are growing more complex, so understanding their runtime behavior and managing their performance is increasingly difficult, especially for multi-tier microservices. A monitoring and diagnosis tool is therefore needed for the microservice server. By examining state information or capturing exception information while the system runs, the tool discovers and locates performance problems in the system. This reduces the administrator's burden and the enterprise's investment in software management, improves the reliability and manageability of the system, and keeps the system running normally, all of which supports the success of the enterprise.

The innovative principle of the microservice performance monitoring and diagnosis is as follows. A resource call chain abstraction describes the complex interactions among resources in a multi-tier application, the levels of those resources, and the associated performance information. The resource call chain is obtained by monitoring request processing. With the information the chain provides, request delays and request failures can be diagnosed effectively, and the specific resource at fault and the level to which it belongs can be located.

The method is implemented mainly through the following steps. Step one describes the resource call chain. In an enterprise system, each client request belongs to one or more service classes, which are defined from the request type and the client ID. A request is a broad concept: it is any demand from an external entity that the system perform some operation, and it may result in a response being returned, as with an HTTP request and response. Processing different requests follows different resource call paths and invokes different software components and service resources to produce the response. Requests that belong to the same service class have the same resource requirements and follow the same resource call path. The set of all resources involved in processing one type of client request, together with the interactions among those resources, is called a resource call chain. Each resource belongs to a logical level, and the chain records that level. Other performance information can also be attached to a resource as needed. A resource call chain can be understood as a model of a resource call path, and different kinds of request processing correspond to different resource call chains. Each resource call chain describes the set of dynamic dependencies that form among resources while a request is served; a dependency arises when a resource uses another resource or is used by one.

The resource call chain includes the resources involved in processing the request, the interactions among them, the levels of the resources, and the associated performance characteristics (such as time, count, and ratio). The chain therefore describes both how resources interact and where they sit in the hierarchy, while the performance information attached to it describes the cost incurred by each individual resource. Resource call chains have different scopes, and the scope is determined by what the chain is meant to describe. A chain that describes the resources and interactions involved in a single request has request scope. The chains of several requests can be grouped into a session, for example when a user issues multiple requests to achieve one higher-level goal, and the resulting chain has session scope. In the same way, a chain can have application scope, host scope, cluster scope, and so on. The scope of a resource call chain is called its granularity; that is, resource call chains are divided into granularities such as request, session, application, host, and cluster. Because chains exist at different granularities, they must support a merge operation in which fine-grained chains are combined into coarse-grained ones. Dividing resource call chains by granularity provides views of request behavior and performance at different scopes.
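The merge operation can be illustrated with a minimal Python sketch. The nested-dictionary layout, the function names `empty`, `fold`, and `merge_chains`, and the choice to match resources by name are all assumptions made for illustration; the description does not prescribe a data layout or a matching rule.

```python
def empty(level):
    """Create an empty aggregate node (layout is illustrative)."""
    return {"level": level, "time": 0, "count": 0, "children": {}}


def fold(target, node):
    """Add one fine-grained node, and its subtree, into an aggregate node."""
    target["time"] += node["time"]
    target["count"] += 1
    for child in node.get("children", []):
        slot = target["children"].setdefault(child["name"], empty(child["level"]))
        fold(slot, child)


def merge_chains(scope, chains):
    """Merge request-scope chains into one coarser chain, e.g. a session."""
    merged = empty(scope)
    for chain in chains:
        slot = merged["children"].setdefault(chain["name"], empty(chain["level"]))
        fold(slot, chain)
        merged["time"] += chain["time"]
        merged["count"] += 1
    return merged


# Two requests of the same service class, times in milliseconds (made-up values).
request_1 = {"name": "OrderServlet", "level": "Web", "time": 120, "children": [
    {"name": "executeQuery", "level": "JDBCStatement", "time": 90}]}
request_2 = {"name": "OrderServlet", "level": "Web", "time": 80, "children": [
    {"name": "executeQuery", "level": "JDBCStatement", "time": 50}]}

session = merge_chains("session", [request_1, request_2])
servlet = session["children"]["OrderServlet"]
query = servlet["children"]["executeQuery"]
# Share of the servlet's total time spent in the query (a "ratio" characteristic).
query_ratio = query["time"] / servlet["time"]
```

Applying `merge_chains` again to several session chains would give an application-scope chain in the same way, which is one possible reading of how the granularities nest.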

While a request is served, there is a unique entry resource at which processing starts. In a microservice request, the entry resource typically corresponds to a JSP/Servlet component in the microservice layer. The resource call chain can therefore be formally represented as a directed tree T(R, V, E), where R is the tree root and represents the entry resource of the request, V is the set of nodes representing all non-entry resources of the request, and E is the set of edges representing the interactions among resources, that is, source and destination pairs.

A dynamic call tree records resource interactions most precisely, but it is also the tree structure that consumes the most storage. Each tree node represents one resource instance involved in processing a request, and each edge represents a call between resource instances. A dynamic call tree can accurately record the different instances of the same resource within a request; its space overhead is proportional to the number of resource instances involved in serving the request.
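As a rough illustration of the directed tree T(R, V, E), the following Python sketch stores one node per resource instance and derives the three components from the root. The class name `ResourceNode`, its fields, and the helper `as_tuple` are hypothetical names chosen here, not part of the described method.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class ResourceNode:
    """One resource instance in a call chain (field names are illustrative)."""
    name: str                       # e.g. "OrderServlet"
    level: str                      # logical level, e.g. "Web"
    parent: Optional["ResourceNode"] = field(default=None, repr=False)
    children: List["ResourceNode"] = field(default_factory=list)

    def add_child(self, child: "ResourceNode") -> "ResourceNode":
        """Record that this resource called `child`."""
        child.parent = self
        self.children.append(child)
        return child


def as_tuple(root):
    """Return (R, V, E): the root, the non-entry nodes, and caller/callee edges."""
    nodes, edges, todo = [], [], [root]
    while todo:
        node = todo.pop()
        for child in node.children:
            nodes.append(child)
            edges.append((node.name, child.name))
            todo.append(child)
    return root, nodes, edges


# A request whose entry resource obtains a connection that runs a query.
root = ResourceNode("OrderServlet", "Web")
conn = root.add_child(ResourceNode("getConnection", "JDBCConnection"))
conn.add_child(ResourceNode("executeQuery", "JDBCStatement"))
r, v, e = as_tuple(root)
```

Because every call adds a node, the number of nodes grows with the number of resource instances, which matches the space overhead noted above.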

Step two constructs the resource call chain. The chain can be built while the service request executes, and the construction principle is simple. Let CR denote the node record of resource C. When resource C calls resource D, the existing record is used if DR is already among the children of CR. Otherwise, DR is searched for among the ancestor nodes of C. If a record DR is found there, the call to D is nested (that is, recursive); a reference from CR to DR is created and DR becomes the current record. If D is not an ancestor of C, a new resource record DR is created for D and added as a child of CR.
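The lookup rule above can be sketched in Python as follows. The `Record` class, the `back_refs` list used to hold references to ancestor records, and the `enter` function are names assumed for illustration. The sketch also assumes that the ancestor search includes the caller itself, so that direct recursion is treated as a nested call; the description does not say whether it does.

```python
class Record:
    """Node record for one resource (illustrative structure)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = {}    # callee name -> Record
        self.back_refs = []   # references to ancestor records (nested calls)
        self.calls = 0


def enter(current, callee_name):
    """Return the record that becomes current when `current` calls `callee_name`."""
    # Case 1: the callee already has a record among the caller's children.
    record = current.children.get(callee_name)
    if record is None:
        # Case 2: search the caller and its ancestors (assumption: caller included).
        node = current
        while node is not None and node.name != callee_name:
            node = node.parent
        if node is not None:
            if node not in current.back_refs:
                current.back_refs.append(node)
            record = node
        else:
            # Case 3: no existing record, so create one as a child of the caller.
            record = Record(callee_name, parent=current)
            current.children[callee_name] = record
    record.calls += 1
    return record


a = Record("A")
b_first = enter(a, "B")     # creates BR under AR
b_second = enter(a, "B")    # reuses BR
c = enter(b_first, "C")     # creates CR under BR
a_nested = enter(c, "A")    # A is an ancestor of C: nested call, no new record
```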

With this approach, monitoring code can be inserted into each resource independently, without knowing in advance which resources call it or which resources it calls. Each resource creates and initializes its own resource record node. To build the resource call chain, monitoring code must be inserted at the following execution points of a monitored resource.

(1) At the start of resource execution. Before the monitored resource starts to execute, the monitoring code runs. It looks up the resource record and creates one if none is found. It obtains the parent resource record node through some mechanism, for example a stack, establishes the parent-child relationship of the call, and passes in information such as the start time, the resource execution context, and the resource level.

(2) At the end of resource execution. After the monitored resource finishes executing and before it returns, the monitoring code runs again, for example to record the end time on the current node. The start and end of an entry resource need special attention, because they also mark the start and end of the request. Merging the resource call chains of individual requests yields chains of coarser granularity, such as session, application, host, and cluster chains.
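The two instrumentation points can be sketched with a Python context manager that keeps a per-thread stack of open nodes. This is only one possible mechanism: the names `monitored`, `Node`, and `finished_chains`, and the use of `time.perf_counter` and thread-local storage, are assumptions for illustration. For brevity the sketch always creates a new node and omits the record lookup.

```python
import threading
import time
from contextlib import contextmanager

_state = threading.local()   # one stack of open nodes per thread
finished_chains = []         # completed request-scope chains


class Node:
    def __init__(self, name, level):
        self.name, self.level = name, level
        self.start = self.end = None
        self.children = []


@contextmanager
def monitored(name, level):
    """Run monitoring code before and after a resource executes."""
    stack = getattr(_state, "stack", None)
    if stack is None:
        stack = _state.stack = []
    node = Node(name, level)
    if stack:                          # top of the stack is the parent record
        stack[-1].children.append(node)
    stack.append(node)
    node.start = time.perf_counter()   # point (1): execution starts
    try:
        yield node
    finally:
        node.end = time.perf_counter() # point (2): execution ends
        stack.pop()
        if not stack:                  # the entry resource ended, so the request ended
            finished_chains.append(node)


with monitored("OrderServlet", "Web"):
    with monitored("getConnection", "JDBCConnection"):
        pass
    with monitored("executeQuery", "JDBCStatement"):
        pass
```

The `finally` clause records the end time even when the resource raises, so a failed request still produces a complete chain.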

Step three diagnoses the resource call chain. Two problems are very common in applications: execution takes too long, or too many errors occur. The former is called the delay problem and the latter the failure problem. As applications grow more complex, finding and locating delay and failure problems becomes increasingly difficult. In complex multi-tier microservices in particular, request processing involves many components and services whose interactions are complex and change dynamically, which makes it hard to pinpoint the level and the component where the problem lies. Because the resource call chain describes the call relationships, levels, and performance information of resources, its information can be used to find and locate the specific component, and the level of that component, where a delay or failure problem occurs.

The configuration of the application server or changes to the application (for example, version upgrades) can also cause performance problems. Because application server configuration is complex and modifications to the application tend to be overlooked as time passes, performance bottlenecks with these causes are hard to discover. Using the resource call chain information together with the anomaly detection method based on the service time stamp, performance problems caused by application server configuration or application changes can be found and located effectively.
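The relation S = R(1 - U) given in step three of the disclosure can be worked through in a short Python sketch. The samples are the (R, U, Count) triples described there. The count-weighted mean, the 20% tolerance, and the function names are assumptions made for illustration, and all numbers are invented.

```python
def service_time(delay, utilization):
    """S = R * (1 - U): service time from measured delay and utilization."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return delay * (1.0 - utilization)


def mean_service_time(samples):
    """Count-weighted mean service time over (R, U, Count) samples."""
    total = sum(count for _, _, count in samples)
    return sum(service_time(r, u) * count for r, u, count in samples) / total


def locate_bottlenecks(baseline, current, tolerance=0.2):
    """Return resources whose service time moved by more than `tolerance`."""
    suspects = []
    for resource, samples in current.items():
        before = mean_service_time(baseline[resource])
        after = mean_service_time(samples)
        if abs(after - before) / before > tolerance:
            suspects.append((resource, before, after))
    return suspects


# Delay in milliseconds, utilization as a fraction, count of executions.
baseline = {
    "executeQuery": [(100.0, 0.5, 10), (200.0, 0.75, 10)],
    "getConnection": [(8.0, 0.5, 20)],
}
current = {
    "executeQuery": [(300.0, 0.5, 20)],
    "getConnection": [(16.0, 0.75, 20)],
}
suspects = locate_bottlenecks(baseline, current)
```

In this example the delay of `getConnection` doubles, but so does the load, so its service time stays at 4 ms and it is not flagged. The service time of `executeQuery` rises from 50 ms to 150 ms at unchanged utilization, which points to a change in the resource itself rather than in the load.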

The delay problem can be diagnosed with a threshold-based method: a delay is identified when the execution time of a resource exceeds a preset threshold. Analyzing the resource call chain then locates the specific component where the delay occurs and the level to which it belongs. Because diagnosis is based on the resource call chain, time thresholds can be set for individual stages of the chain. This makes threshold setting more flexible, allows the constraints of various SLAs to be met, and adapts better to changes as the enterprise develops.
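A minimal Python sketch of threshold-based delay diagnosis over a call chain follows. Keying the thresholds by resource level is one way to set limits per stage and is an assumption here, as are the dictionary layout, the function name `find_delays`, and the example values.

```python
def find_delays(node, thresholds, path=""):
    """Return (path, level, elapsed_ms, limit_ms) for each node over its limit."""
    here = f"{path}/{node['name']}" if path else node["name"]
    found = []
    limit = thresholds.get(node["level"])   # no entry means no limit for that level
    if limit is not None and node["elapsed_ms"] > limit:
        found.append((here, node["level"], node["elapsed_ms"], limit))
    for child in node.get("children", []):
        found.extend(find_delays(child, thresholds, here))
    return found


chain = {"name": "OrderServlet", "level": "Web", "elapsed_ms": 900, "children": [
    {"name": "getConnection", "level": "JDBCConnection", "elapsed_ms": 20},
    {"name": "executeQuery", "level": "JDBCStatement", "elapsed_ms": 850},
]}
limits = {"Web": 1000, "JDBCConnection": 100, "JDBCStatement": 500}
delays = find_delays(chain, limits)
```

Here the request as a whole stays within its 1000 ms limit, yet the walk still reports the statement level, which shows why per-stage thresholds locate a problem more precisely than a single end-to-end limit.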

The failure problem is also treated as threshold-based diagnosis. An exception raised during execution is understood as the execution state exceeding a preset normal threshold; in other words, normal behavior is regarded as a built-in threshold that cannot be changed. By capturing the exceptions thrown in the program, failure problems can be found and detailed information about each exception can be obtained. To detect problems caused by exceeding a threshold, an appropriate threshold must be set: the execution time of the resource must never exceed the threshold X, where X is set by the user.
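Capturing exceptions at each monitored resource can be sketched with a Python decorator. The decorator name `capture_failures`, the `failures` list, and the record fields are hypothetical; the exception is re-raised so that the application's own error handling is unaffected.

```python
import functools

failures = []   # captured failure records, innermost resource first


def capture_failures(name, level):
    """Wrap a resource so that any exception it raises is recorded, then re-raised."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                failures.append({
                    "resource": name,
                    "level": level,
                    "error": type(exc).__name__,
                    "detail": str(exc),
                })
                raise
        return wrapper
    return decorate


@capture_failures("executeQuery", "JDBCStatement")
def run_query():
    raise TimeoutError("query timed out")   # simulated failure


@capture_failures("OrderServlet", "Web")
def handle_request():
    return run_query()


try:
    handle_request()
except TimeoutError:
    pass
```

Because the exception propagates outward through each wrapper, the first record identifies the resource and level where the failure originated, and later records show which callers it passed through.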

FIG. 1 is a schematic diagram of the microservice performance anomaly monitoring system of the invention; its implementation is described in detail below.

The system collects performance data and context relationships while the application server processes a user request, constructs and stores the resource call chain for the request, and produces a diagnosis for the request by analyzing the monitoring data. The monitoring and diagnosis framework consists mainly of a monitor module, a monitoring data processing center module, and a diagnosis and analysis module. The monitor module provides a set of monitors that cooperate to observe how the resources involved in processing a user request are used and how those resources work together. It then sends the collected monitoring data to the monitoring data processing center for further processing. The monitoring data processing center module receives data from the monitor module, filters and merges it, constructs the resource call chain for the request, and stores the chain in the information repository for later display and analysis. The diagnosis and analysis module analyzes the target performance data in the repository and produces the problem diagnosis.

(1) The monitor module collects the raw information about resource usage during request processing. It comprises a controller, monitors, and an information collector. The controller can switch monitors on and off at runtime, so users can enable or disable monitors according to their needs. This makes the monitors more manageable and customizable, and because monitors need not run continuously, it helps reduce the system overhead of continuous monitoring. The collector temporarily stores the raw monitoring data from each monitor and may preprocess the data before sending it to the data processing center.
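The interplay of the controller and the collector might look like the following Python sketch. The class names, the batch-based hand-off, and the `send` callback standing in for the data processing center are assumptions for illustration only.

```python
import threading


class MonitorController:
    """Runtime on/off switches for monitors, keyed by monitor name."""

    def __init__(self):
        self._lock = threading.Lock()
        self._enabled = set()

    def enable(self, monitor):
        with self._lock:
            self._enabled.add(monitor)

    def disable(self, monitor):
        with self._lock:
            self._enabled.discard(monitor)

    def is_enabled(self, monitor):
        with self._lock:
            return monitor in self._enabled


class Collector:
    """Buffers raw monitoring records and hands them on in batches."""

    def __init__(self, controller, batch_size, send):
        self.controller = controller
        self.batch_size = batch_size
        self.send = send            # hypothetical hand-off to the processing center
        self.buffer = []

    def record(self, monitor, data):
        """Store one record; return False if the monitor is switched off."""
        if not self.controller.is_enabled(monitor):
            return False
        self.buffer.append((monitor, data))
        if len(self.buffer) >= self.batch_size:
            self.flush()
        return True

    def flush(self):
        if self.buffer:
            batch, self.buffer = self.buffer, []
            self.send(batch)


sent = []
controller = MonitorController()
collector = Collector(controller, batch_size=2, send=sent.append)
controller.enable("JDBCStatement")
dropped = collector.record("JSP", {"path": "/orders"})          # monitor is off
collector.record("JDBCStatement", {"sql": "SELECT 1"})
collector.record("JDBCStatement", {"sql": "SELECT 2"})          # fills the batch
```

A disabled monitor costs only one set lookup per call in this sketch, which reflects the stated goal of avoiding the overhead of continuous monitoring.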

Different monitors are provided for different types of monitored resources, so each monitor can collect information specific to its resource: a JSP monitor can obtain the request path of the JSP, and a SQL Statement monitor can obtain the database connection pool and the query statement string. To make the monitoring and diagnosis framework highly portable, the monitors are programmed against the interfaces defined in the relevant specifications rather than against a specific implementation of a resource. The framework can therefore monitor any resource that implements the interfaces defined by the specification. Currently, a JSP/Servlet monitor, a JDBCConnection monitor, a JDBCStatement monitor, and others are provided for the standard components and services defined by the J2EE specification, and each monitors the corresponding resources involved in request processing. Consider a user request in which a Web component obtains a database connection, executes a database query statement, and returns the result: the JSP/Servlet monitor, the JDBCConnection monitor, and the JDBCStatement monitor cooperate to monitor the processing of that request.
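The idea of programming a monitor against a specification interface rather than a specific implementation can be illustrated outside J2EE. The Python sketch below is an analogy only: it wraps the `cursor()` and `execute()` methods of Python's DB-API 2.0 (PEP 249) instead of the JDBC interfaces the description names, and uses the standard-library `sqlite3` driver purely as a stand-in. The wrapper class names and the log record fields are hypothetical.

```python
import sqlite3
import time


class MonitoredCursor:
    """Wraps any DB-API cursor and records each executed statement."""

    def __init__(self, cursor, log):
        self._cursor = cursor
        self._log = log

    def execute(self, sql, params=()):
        start = time.perf_counter()
        try:
            self._cursor.execute(sql, params)
            return self
        finally:
            self._log.append({
                "level": "Statement",
                "sql": sql,
                "elapsed": time.perf_counter() - start,
            })

    def __getattr__(self, name):
        # Everything not monitored is delegated to the real cursor.
        return getattr(self._cursor, name)


class MonitoredConnection:
    """Wraps any DB-API connection so that its cursors are monitored."""

    def __init__(self, connection, log):
        self._connection = connection
        self._log = log

    def cursor(self):
        return MonitoredCursor(self._connection.cursor(), self._log)

    def __getattr__(self, name):
        return getattr(self._connection, name)


log = []
connection = MonitoredConnection(sqlite3.connect(":memory:"), log)
row = connection.cursor().execute("SELECT 1 + 1").fetchone()
connection.close()
```

Since the wrapper touches only methods defined by the interface, the same monitor would work with another conforming driver, which is the portability argument made above.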

(2) The data processing center processes and stores the raw monitoring data. It consists mainly of a data collector, a data processor, a data transmitter, and a data repository. The data collector module gathers raw monitoring data from the monitor module, extracts the resource level and resource interaction context from it, and constructs resource call chains at request granularity. The data collector contains several submodules: the monitoring data collection submodule receives the raw data from the monitors and forwards it to the other submodules, each of which extracts the data it is interested in.

The data processor module filters, merges, and pre-analyzes the fine-grained resource call chains to obtain coarse-grained resource call chain information. It comprises a filter, a merger, and a pre-analyzer. The filter discards resource data that does not satisfy the configured conditions, which reduces the load on subsequent analysis and merging. The pre-analyzer checks whether the execution time of each resource in the chain exceeds its preset threshold and whether an exception occurred during execution, and records the result in the performance information of the chain. The merger combines fine-grained resource call chains into coarse-grained ones.

The data transmitter sends the merged coarse-grained resource call chain information to the information repository. The information repository stores the data and provides data access services to external consumers, so that the resource call chain information can be displayed or used for further diagnostic analysis.

(3) The diagnosis and analysis module analyzes the collected monitoring data and diagnoses delay and failure problems in the running application, as well as performance problems caused by misconfigured pooled resources of the application server or by changes to the application. The diagnostic analyzer and the knowledge base work together to analyze and diagnose the delay and failure problems that arise during request processing. The service time stamp-based diagnoser implements the service time stamp-based anomaly detection method described above and diagnoses performance problems caused by misconfigured pooled resources of the application server or by application changes.

The diagnostic analyzer applies the problems and rules defined in the knowledge base to the collected monitoring data in order to diagnose delay and failure problems that occur at runtime. The knowledge base is preloaded with knowledge of the delay and failure problems common in multi-tier application requests, including those of Web, JDBCConnection, and JDBCStatement resources. The knowledge base also provides rules for determining when a request delay problem or a request failure problem has occurred.

Based on the problem descriptions and judgment rules defined in the knowledge base, the diagnostic analyzer analyzes the target monitoring data, diagnoses request delays and request failures at runtime, and locates the resource that caused the problem and the level to which it belongs, such as Web, JDBCConnection, or JDBCStatement. The service time stamp-based diagnosis module uses the service time stamp-based anomaly detection method described above to diagnose performance bottlenecks caused by improper application server configuration or changes to application logic.
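One way to picture the knowledge base is as a list of rules, each pairing a problem type and a resource level with a judgment predicate. The Python sketch below is a guess at such a structure: the `Rule` class, the rule contents, and the thresholds are all invented for illustration, and the description does not specify how the knowledge base is stored.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Rule:
    problem: str                      # "delay" or "failure"
    level: str                        # resource level, or "*" for any level
    matches: Callable[[Dict], bool]   # judgment rule applied to one record
    description: str


# Hypothetical knowledge base; thresholds are in milliseconds.
KNOWLEDGE_BASE = [
    Rule("failure", "*", lambda r: r.get("error") is not None,
         "exception thrown during execution"),
    Rule("delay", "JDBCConnection", lambda r: r["elapsed_ms"] > 100,
         "slow connection acquisition"),
    Rule("delay", "JDBCStatement", lambda r: r["elapsed_ms"] > 500,
         "slow statement execution"),
    Rule("delay", "Web", lambda r: r["elapsed_ms"] > 1000,
         "slow web component"),
]


def diagnose(records, rules=KNOWLEDGE_BASE):
    """Return (resource, level, problem, description) for every rule that fires."""
    findings = []
    for record in records:
        for rule in rules:
            if rule.level in ("*", record["level"]) and rule.matches(record):
                findings.append((record["resource"], record["level"],
                                 rule.problem, rule.description))
    return findings


records = [
    {"resource": "OrderServlet", "level": "Web", "elapsed_ms": 900},
    {"resource": "getConnection", "level": "JDBCConnection", "elapsed_ms": 150},
    {"resource": "executeQuery", "level": "JDBCStatement", "elapsed_ms": 40,
     "error": "SQLException"},
]
findings = diagnose(records)
```

Keeping the rules as data lets new problem descriptions be added without changing the analyzer, which seems to be the purpose of separating the knowledge base from the diagnostic analyzer.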

In summary, the detailed description of the embodiments shows that the microservice performance monitoring and anomaly diagnosis method of the invention has prominent substantive features and represents notable progress: because the resource call chain describes the interactions among resources, the resource levels, and the associated performance information in one abstraction, the specific resource at fault and its level can be located; the method helps users understand the running state and behavior of the application promptly and accurately; and common request delay and failure problems are diagnosed and located, which makes it more efficient for users to resolve operational problems.

In addition to the above embodiments, the invention may have other embodiments; any technical solution formed by equivalent substitution or equivalent transformation falls within the scope of protection claimed by the invention.

Claims

1. A method of microservice performance monitoring and anomaly diagnosis, characterized in that: a resource call chain abstraction is used to describe the resource levels in a microservice application, the interactions among the resources, and the associated performance information; the resource call chain is constructed by monitoring request processing; request delays and request failures are analyzed and diagnosed using the resource call chain; and the specific resource at fault and the level to which it belongs are located.

2. The microservice performance monitoring and anomaly diagnosis method of claim 1, comprising:

step one, resource call chain description: the resource call chain is the set of all resources involved in processing a request together with their interactions, and is formally represented as a directed tree T(R, V, E), wherein R is the tree root and represents the entry resource of the request, V is the set of nodes representing all non-entry resources of the request, and E is the set of edges representing the interactions among resources, that is, source and destination pairs;

step two, resource call chain construction: before a monitored resource begins to execute, monitoring code is executed that looks up or creates the resource record node and passes in the start time; before the monitored resource finishes executing and returns, monitoring code is executed that at least records the end time on the current node;

step three, resource call chain diagnosis: based on the monitoring information for each resource, S denotes the average service time a resource in the system needs to execute a task, R denotes the average delay time, and U denotes the utilization of the system resource, so that S = R(1 - U); the delay time of each type of request, the corresponding system resource utilization, and the number of times the request was executed are measured and collected as (R, U, Count); the service time of the request under different CPU utilizations is calculated; performance anomalies in the system are detected by comparing changes in service time; and the resource responsible for the performance bottleneck is located.

3. The microservice performance monitoring and anomaly diagnosis method of claim 1, wherein: the resource call chain comprises the resources involved in processing the request, the interactions among them, their levels, and associated performance characteristics including at least time, count, and ratio.

4. The microservice performance monitoring and anomaly diagnosis method of claim 1, wherein: based on the request and the scopes composed from it, the resource call chain has a granularity corresponding to request, session, application, host, or cluster, and the resource call chain supports merging and splitting as required by the granularity.

5. The microservice performance monitoring and anomaly diagnosis method of claim 2, wherein: in step two, creating the resource record node comprises obtaining the parent resource record node by using a stack, establishing the parent-child relationship of the resource call, and passing in the resource execution context and resource level information at the same time.

6. The microservice performance monitoring and anomaly diagnosis method of claim 2, wherein: step three includes delay diagnosis, in which time thresholds are set for nodes in the resource call chain, and the specific resource with a delay problem during request processing, together with its level, is located.

7. The microservice performance monitoring and anomaly diagnosis method of claim 2, wherein: step three includes failure diagnosis, in which a built-in threshold representing normal behavior is set for nodes in the resource call chain, exception signals are captured and exception performance information is obtained during request processing, and the specific resource with a failure problem, together with its level, is located.