CN112149975A

CN112149975A - APM monitoring system and method based on artificial intelligence

Info

Publication number: CN112149975A
Application number: CN202010956247.6A
Authority: CN
Inventors: 朱桂芝; 杨克伟; 康俊健; 林小莎; 伍闵; 许宜斌; 李雅辉
Original assignee: Hangzhou Eastcom Software Technology Co ltd
Current assignee: Hangzhou Eastcom Software Technology Co ltd
Priority date: 2020-09-11
Filing date: 2020-09-11
Publication date: 2020-12-29
Anticipated expiration: 2040-09-11
Also published as: CN112149975B

Abstract

The invention provides an APM monitoring system based on artificial intelligence. In one embodiment, an index collection unit collects the application performance indexes of an application program microservice operation platform and the relationships between applications; a data analysis unit analyzes the performance indexes through an artificial intelligence analysis model; an alarm unit raises performance index alarms in real time and tracks and locates abnormal indexes according to the data analysis unit's analysis results; an automatic operation and maintenance unit automatically triggers capacity expansion and contraction of the virtualization equipment of the application program microservice operation platform according to those analysis results, and recovers the service; and an APM call chain topology display unit displays the relationships between the applications as a topology graph. From the collection of application performance indexes to the triggering, based on an automatic operation and maintenance model, of automatic capacity expansion and contraction of the virtualization equipment to recover the service, an automatic operation and maintenance closed loop is formed.

Description

APM monitoring system and method based on artificial intelligence

Technical Field

The invention relates to the technical field of automatic operation and maintenance, in particular to an APM monitoring system and method based on artificial intelligence.

Background

APM (Application Performance Management) belongs to IT operation and maintenance management. It mainly aims at monitoring and optimizing the performance and user experience of the IT applications that carry an enterprise's key business, improving the reliability and quality of enterprise IT applications, ensuring that users obtain good service, and reducing the total cost of ownership (TCO) of IT.

In the prior art, IT devices and the application software running on them are monitored by layer. As shown in fig. 1, the hierarchy is divided into an IAAS layer, a PAAS layer, and an SAAS layer, covering infrastructure (such as the network, hosts, and the CPU, memory, and disks of virtual machines), systems (operating systems, middleware, and databases), applications (subsystems, modules, and the application volume, transaction volume, success rate, and failure rate of functions), and the front end (user application pages, actions, and the like). The monitoring process is as follows:

Collecting monitoring indexes: maintenance personnel issue collection tasks, periodically or automatically, to collect monitoring indexes such as resources, performance and alarms from the IAAS layer to the SAAS layer;

Resource management: building a resource model, presenting resource data and carrying out simple statistical analysis of the data;

Topology management: manually constructing a topology from the IAAS layer to the SAAS layer;

Performance management: formulating a monitoring strategy, presenting performance data, and sampling and statistically analyzing historical data over a certain period through a baseline algorithm;

Alarm display: setting thresholds and automatically alarming on service performance and on the performance indexes of the IT equipment on which the service runs.

The existing technical scheme mainly addresses monitoring and alarming at the IAAS and PAAS layers and pays little attention to the performance and management of applications. In addition, technology has moved on. On the hardware side, network function virtualization (NFV) and cloud adoption allow dynamic on-demand scaling and automated deployment; on the software side, software is being converted to microservices and distributed architectures. As a result, application performance, alarm monitoring, and automated operation and maintenance face more and more problems. For example, one request may involve multiple services, and each service may itself depend on other services, so the whole request path forms a mesh call chain; once an exception occurs at any node, the stability of the whole call chain is affected.
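The effect of a single failing node on a mesh call chain can be illustrated with a small sketch. The service names and the `impacted_services` helper are hypothetical and are not taken from the patent; the sketch only assumes that the call relations are available as (caller, callee) pairs.

```python
from collections import defaultdict, deque

def impacted_services(calls, failed):
    """Return every service whose requests can reach `failed` through the call graph."""
    # Reverse the edges so we can walk from the failing service back to its callers.
    callers = defaultdict(set)
    for caller, callee in calls:
        callers[callee].add(caller)

    impacted, queue = set(), deque([failed])
    while queue:
        node = queue.popleft()
        for caller in callers[node]:
            if caller not in impacted and caller != failed:
                impacted.add(caller)
                queue.append(caller)
    return impacted

# Hypothetical mesh call chain: gateway -> order -> {inventory, payment}, payment -> bank
CALLS = [("gateway", "order"), ("order", "inventory"),
         ("order", "payment"), ("payment", "bank")]

print(sorted(impacted_services(CALLS, "bank")))
```

In this example an exception in the innermost service reaches every caller on the path back to the entry point, which is why per-layer monitoring alone does not reveal the scope of a fault.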

In addition, in the conventional APM monitoring system, artificial intelligence is only partially applied, and an automatic operation and maintenance closed loop from index acquisition to application performance self-healing cannot be formed. The existing APM monitoring system has the following disadvantages:

Index acquisition: maintenance personnel must issue acquisition tasks regularly or automatically, which is laborious and error-prone;

APM topology: the topological graph of the service system and the IT equipment on which it runs cannot be updated automatically, so timeliness is poor;

Alarm display: alarms on APM service performance and on the performance indexes of the IT equipment on which the service runs cannot be raised automatically;

Fault self-healing: application deployment modes have changed greatly, and rapid self-healing of service faults is difficult to achieve with the traditional approach to automated operation and maintenance.

Disclosure of Invention

In view of this, the embodiment of the present application provides an APM monitoring system and a monitoring method based on artificial intelligence.

In a first aspect, the present application provides an artificial intelligence-based APM monitoring system, including:

the index acquisition unit is used for acquiring the application program performance indexes of the application program microservice operation platform and the relation between the applications;

the data analysis unit analyzes the performance index through an artificial intelligence analysis model;

the alarm unit gives performance index alarm in real time and tracks and positions abnormal indexes according to the analysis result of the data analysis unit on the performance indexes;

the automatic operation and maintenance unit automatically triggers the automatic expansion and contraction of the virtualization equipment of the application program microservice operation platform according to the analysis result of the data analysis unit on the performance index, and recovers the service;

and the APM call chain topology display unit, which is used for displaying the relationships between the applications as a topology graph.

Optionally, the system further comprises: a data storage unit;

the data storage unit is used for storing the performance indexes and the relationships between applications acquired by the index acquisition unit, and for storing the statistical analysis results of the data analysis unit.

Optionally, the system further comprises: a data query unit;

and the data query unit is used for enabling a user to query the performance index, the relation among the applications and the analysis result of the performance index.

Optionally, the system further comprises: a data display unit;

the data display unit is used for displaying the applied performance index data according to the query result of the user;

and the APM call chain topology display unit is used for displaying hardware and software components related to the application program according to the query result of the user, displaying the interaction among the software components and graphically displaying the path of the business real-time transaction.

Optionally, the index collecting unit is specifically configured to: collect the performance indexes of the application programs through log instrumentation points (tracking points embedded in the application logs), and automatically discover the relationships between the applications by deploying an Agent.

Optionally, the performance index of the application includes one or more of: a network monitoring index, a host index, a storage index, a middleware index, a virtual machine index, an application and module index, and an inter-service invocation index.

In a second aspect, the present application provides an artificial intelligence-based APM monitoring method, including:

collecting performance indexes of an application program;

monitoring and analyzing the performance index of the application program in real time, and storing the performance index and the analysis result of the performance index;

and, according to the analysis result of the performance index, triggering an alarm for any performance index that exceeds the upper threshold or falls below the lower threshold, or triggering automatic capacity expansion and contraction of the virtualization equipment.

Optionally, collecting the performance indexes of the application program and the relationships between the applications includes: collecting the performance indexes of the application program through log instrumentation points, and automatically discovering the application relationships by deploying an Agent.

Optionally, after the performing real-time monitoring and data analysis on the performance index of the application program and storing the performance index and the analysis result of the performance index, the method further includes:

responding to the query operation of a user, and displaying the collected performance indexes of the application programs and the analysis results of the performance indexes;

or responding to the query operation of the user to display the related hardware and software components of the application program, the interaction between the software components and the path of the business real-time transaction in a graphical mode.

Optionally, the triggering of automatic scaling of the virtualization device includes:

based on the automatic operation and maintenance model, when the virtual memory of the application program is insufficient, the automatic capacity expansion and contraction is triggered, and the service is recovered.

In one embodiment, the performance indexes of the application program are collected from the application program microservice operation platform by the index collection unit, the collected performance indexes are sent by the performance monitoring unit to the data analysis unit for statistical analysis, and the statistical analysis results are stored in the data storage unit. The automatic operation and maintenance unit, based on the automatic operation and maintenance model and according to the analysis result of the data analysis unit, triggers automatic capacity expansion and contraction of the virtualization equipment and recovers the service. In the embodiment of the invention, the path from the acquisition of application performance indexes to the model-triggered automatic capacity expansion and contraction of the virtualization equipment that recovers the service forms an automatic operation and maintenance closed loop. Furthermore, the performance monitoring unit monitors the performance indexes of the application programs in real time; the APM call chain topology display unit displays the call relationships among the applications as a topology graph, so that those relationships can be monitored more intuitively; and the alarm unit raises alarms on abnormal indexes and rapidly locates performance faults, so that application performance is optimized comprehensively.

Drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic diagram of a monitoring hierarchy in the prior art;

FIG. 2 is a schematic structural diagram of an artificial intelligence based APM monitoring system according to the present invention;

FIG. 3 is a flowchart of an artificial intelligence-based APM monitoring method according to the present invention.

Detailed Description

The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.

In the embodiment of the invention, the APM monitoring system based on artificial intelligence is provided, and mainly focuses on performance and management of an application, and a closed loop is formed by application service index acquisition, performance index monitoring, performance alarm, topology display, intelligent operation and maintenance and application service self-healing.

Fig. 2 is a structural diagram of an artificial intelligence based APM monitoring system according to the present invention. As shown in fig. 2, the system according to an embodiment of the present invention includes: an application program microservice operation platform 201, an index acquisition unit 202, a performance monitoring unit 203, a data analysis unit 204, a data storage unit 205, an automation operation and maintenance unit 206, an alarm unit 207, a data query unit 208, a data display unit 209 and an APM call chain topology display unit 210.

The application program microservice operation platform 201 includes at least one microservice, together with the infrastructure and containers involved in running the microservice.

The index collection unit 202 is used for collecting performance data of applications and automatically discovering relationships between the applications.

In one possible embodiment, the index collection unit 202 collects the performance indexes of the application through log instrumentation points, and automatically discovers the relationships between the applications by deploying agents.

The indexes collected through log instrumentation points include: network monitoring indexes, host indexes, storage indexes, middleware indexes, virtual machine indexes, application and module indexes, inter-service invocation indexes and the like.

The network monitoring indexes include: port outflow utilization, port inflow utilization, CPU utilization, and memory utilization.

The host indexes include: CPU utilization, memory utilization, disk utilization, network card MAC address, and server resource information (running time, number of CPU cores, CPU type, memory size, operating system identification).

The storage indexes include SQL Server indexes: total number of available pages of the database, number of transactions started per second, rate of distribution transactions for the instance, log size, etc.

The middleware indexes include Nginx monitoring indexes: connectivity, number of requests processed, number of currently active connections, average number of connections per second, etc.

The virtual machine indexes include: CPU utilization, memory utilization, disk utilization, network card input (bps), network card output (bps), etc.

The application and module indexes include: application volume, success rate, failure rate and the like of subsystems, modules and functions.

The inter-service invocation indexes include: availability, exceptions, response time, current number of waiting transactions, number of threads, number of service calls, access volume, service availability, etc.
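One way to represent the index categories listed above, and to turn an instrumentation log line into a typed sample, is sketched below. The `key=value` log format, the `MetricSample` type and the `parse_log_line` function are assumptions made for illustration; the patent does not specify a log format.

```python
from dataclasses import dataclass
from enum import Enum

class Category(Enum):
    NETWORK = "network"
    HOST = "host"
    STORAGE = "storage"
    MIDDLEWARE = "middleware"
    VIRTUAL_MACHINE = "vm"
    APPLICATION = "application"
    SERVICE_CALL = "service_call"

@dataclass(frozen=True)
class MetricSample:
    category: Category
    source: str   # device, virtual machine or service that produced the sample
    name: str     # index name, e.g. cpu_utilization
    value: float

def parse_log_line(line: str) -> MetricSample:
    """Parse one log line of the assumed form
    'category=vm source=vm-01 name=cpu_utilization value=0.9'."""
    fields = dict(part.split("=", 1) for part in line.split())
    return MetricSample(
        category=Category(fields["category"]),
        source=fields["source"],
        name=fields["name"],
        value=float(fields["value"]),
    )

sample = parse_log_line("category=vm source=vm-01 name=cpu_utilization value=0.9")
print(sample)
```

Keeping the category as an enumeration means an unknown category in a log line raises an error at collection time instead of silently entering the analysis stage.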

Automatically discovering relationships between applications by deploying agents includes: the server scans the network periodically, and once the agents are deployed the application relationships are discovered automatically.
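As a rough sketch of how agent data could yield application relationships, the following merges per-agent call records into a weighted edge set. The report structure (`app`, `calls`) and the `discover_relations` function are hypothetical; the patent does not describe the agent's report format.

```python
from collections import Counter

def discover_relations(agent_reports):
    """Merge per-agent call records into a weighted set of application relations."""
    edges = Counter()
    for report in agent_reports:
        for callee in report["calls"]:
            edges[(report["app"], callee)] += 1
    return edges

# Hypothetical reports, one per deployed agent
REPORTS = [
    {"app": "gateway", "calls": ["order", "order", "user"]},
    {"app": "order", "calls": ["payment"]},
]

relations = discover_relations(REPORTS)
for (caller, callee), count in sorted(relations.items()):
    print(f"{caller} -> {callee} ({count} calls)")
```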

The performance monitoring unit 203 analyzes the log data through the ELK platform and monitors the application performance indexes acquired by the index acquisition unit 202.
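For a monitoring unit built on ELK, a typical building block is an Elasticsearch search body that averages one index over a recent time window. The sketch below only constructs such a body in the standard query DSL; the field names (`host.keyword`, `cpu_utilization`) and the `avg_metric_query` helper are assumptions about how the logs might be indexed, not details from the patent.

```python
import json

def avg_metric_query(metric_field: str, host: str, window: str = "5m") -> dict:
    """Build an Elasticsearch search body averaging one metric over a recent window."""
    return {
        "size": 0,  # only the aggregation is needed, not the matching log documents
        "query": {
            "bool": {
                "filter": [
                    {"term": {"host.keyword": host}},
                    {"range": {"@timestamp": {"gte": f"now-{window}"}}},
                ]
            }
        },
        "aggs": {"avg_value": {"avg": {"field": metric_field}}},
    }

body = avg_metric_query("cpu_utilization", "vm-01")
print(json.dumps(body, indent=2))
```

The resulting dictionary would be sent as the body of a search request against whichever index holds the parsed logs.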

The data analysis unit 204 analyzes the performance index of the application acquired by the index acquisition unit 202 through an artificial intelligence analysis model.
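The patent does not disclose the internals of the artificial intelligence analysis model. As a stand-in only, the sketch below uses a simple z-score test against historical samples to show where such a model sits in the data flow; the `is_anomalous` function, the threshold of three standard deviations and the sample values are all assumptions.

```python
from statistics import mean, stdev

def is_anomalous(history, current, z_limit=3.0):
    """Flag `current` when it deviates from the historical mean
    by more than `z_limit` standard deviations."""
    if len(history) < 2:
        return False  # not enough data to estimate the spread
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma > z_limit

# Hypothetical response times (ms) previously collected for one service
RESPONSE_TIMES_MS = [102, 98, 105, 99, 101, 97, 103]

print(is_anomalous(RESPONSE_TIMES_MS, 100))
print(is_anomalous(RESPONSE_TIMES_MS, 250))
```

A learned model would replace the body of `is_anomalous` while keeping the same role: it receives collected indexes and returns a result that the alarm and operation and maintenance units act on.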

The data storage unit 205 is used for storing the performance indexes of the applications collected by the index collection unit 202 and storing the statistical analysis results of the data analysis unit 204.

The automation operation and maintenance unit 206 triggers automatic capacity expansion and contraction to recover the service when the application performance is deteriorated, such as insufficient virtual memory, based on the automation operation and maintenance model.

In a possible embodiment, the automation operation and maintenance unit 206 uses the collected memory usage rate of the application program: when the memory usage rate is lower than a certain lower threshold, the system prompts that the virtual memory is insufficient, and the automation operation and maintenance unit 206 creates a new copy according to a pre-configured elastic capacity expansion policy and the configuration type of the current container, and automatically adds the new copy to the existing cluster of the application program.
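The scale-out action described above can be sketched as cloning the current container configuration into new copies, bounded by a policy. The `Container`, `Cluster` and `ScalingPolicy` types and their fields are hypothetical, and the trigger condition is passed in as a flag because the patent defines it against a configured threshold.

```python
from dataclasses import dataclass, field, replace

@dataclass(frozen=True)
class Container:
    name: str
    cpu_cores: int
    memory_mb: int

@dataclass
class Cluster:
    app: str
    replicas: list = field(default_factory=list)

@dataclass(frozen=True)
class ScalingPolicy:
    step: int = 1          # copies added per scale-out (assumed policy field)
    max_replicas: int = 5  # upper bound on cluster size (assumed policy field)

def scale_out(cluster: Cluster, policy: ScalingPolicy) -> int:
    """Add copies cloned from the current container configuration; return how many were added."""
    template = cluster.replicas[-1]
    added = 0
    while added < policy.step and len(cluster.replicas) < policy.max_replicas:
        index = len(cluster.replicas) + 1
        cluster.replicas.append(replace(template, name=f"{cluster.app}-{index}"))
        added += 1
    return added

cluster = Cluster("order", [Container("order-1", cpu_cores=2, memory_mb=2048)])
memory_insufficient = True  # outcome of the analysis step, assumed here
if memory_insufficient:
    added = scale_out(cluster, ScalingPolicy(step=2, max_replicas=3))
print([c.name for c in cluster.replicas])
```

Cloning from the existing container keeps every copy on the same configuration type, and the `max_replicas` bound prevents a persistent alarm from growing the cluster without limit.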

The alarm unit 207 is configured to alarm on application performance indexes that exceed a threshold and, based on an alarm analysis model, to raise real-time alarms on the performance indexes of the APM service and of the IT equipment on which it runs. Using non-intrusive instrumentation points and an application performance monitoring system based on distributed tracing, it tracks and locates faults at code level.

In one possible embodiment, the upper threshold of the CPU utilization of the virtual machine is set to 70%, and when the obtained current index exceeds the upper threshold, an alarm is triggered.
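The threshold rule can be written as a small check that supports both an upper and a lower bound. The 70% upper threshold comes from the embodiment above; the function name, the message format and the sample readings are illustrative assumptions.

```python
def check_thresholds(name, value, upper=None, lower=None):
    """Return an alarm message if `value` is outside [lower, upper], else None."""
    if upper is not None and value > upper:
        return f"ALARM {name}={value:.0%} exceeds upper threshold {upper:.0%}"
    if lower is not None and value < lower:
        return f"ALARM {name}={value:.0%} is below lower threshold {lower:.0%}"
    return None

print(check_thresholds("vm_cpu_utilization", 0.90, upper=0.70))
print(check_thresholds("vm_cpu_utilization", 0.55, upper=0.70))
```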

The data query unit 208 is used for querying the performance index and the analysis statistic result of the application.

The data display unit 209 is used to display performance index data of the application.

The APM call chain topology display unit 210 is used for presenting the hardware and software components related to an application program, presenting the interaction among the components, and clearly and graphically presenting the path of a real-time business transaction. Specifically, topology nodes can be traced to the corresponding service module, so that the call relation chain of the application can be monitored more intuitively.

The data query unit 208 is specific to a user, and when the user needs to query the performance index of the currently acquired application or the analysis statistical result of the performance index of the acquired application, the user can query by triggering the data query unit 208. The result of the inquiry is displayed through the data display unit 209.

Further, by triggering the data query unit 208, the user may also query the relationships between the application programs, the related hardware and software components of the application programs, the interaction between these components, and the path of a real-time business transaction; the results are displayed as a topology graph by the APM call chain topology display unit 210, so that the call relation chain of the application can be monitored more intuitively.
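Querying the path of a transaction through the call chain topology amounts to a path search over the discovered relationships. The sketch below uses a depth-first search over an adjacency mapping; the topology contents and the `find_call_path` function are hypothetical examples, not the patent's data model.

```python
def find_call_path(topology, start, target, path=None):
    """Depth-first search for one call path from `start` to `target`."""
    path = (path or []) + [start]
    if start == target:
        return path
    for callee in topology.get(start, []):
        if callee not in path:  # skip nodes already on the path to avoid cycles
            found = find_call_path(topology, callee, target, path)
            if found:
                return found
    return None

# Hypothetical topology: caller -> list of callees
TOPOLOGY = {
    "web-frontend": ["gateway"],
    "gateway": ["user", "order"],
    "order": ["inventory", "payment"],
    "payment": ["mysql"],
}

path = find_call_path(TOPOLOGY, "web-frontend", "mysql")
print(" -> ".join(path))
```

Each element of the returned path corresponds to a topology node, which is what allows a display unit to link a node back to its service module.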

In the embodiment of the present invention, the performance indexes of the application are collected from the application program microservice operation platform 201 by the index collecting unit 202, the collected performance indexes are sent by the performance monitoring unit 203 to the data analyzing unit 204 for statistical analysis, and the statistical analysis result is stored in the data storage unit 205. The automation operation and maintenance unit 206, based on the automation operation and maintenance model and according to the analysis result of the data analysis unit 204, triggers automatic capacity expansion and contraction of the virtualization equipment and recovers the service. In the embodiment of the invention, the path from the acquisition of application performance indexes to the model-triggered automatic capacity expansion and contraction of the virtualization equipment that recovers the service forms an automatic operation and maintenance closed loop.

FIG. 3 is a flowchart of an APM monitoring method based on artificial intelligence according to the present invention. As shown in FIG. 3, the method includes steps S301-S303.

Step S301: acquiring performance indexes and application relations of application programs;

Application performance indexes are acquired through log instrumentation points and application relationships are automatically discovered by deploying agents; the acquired performance indexes and application relationships are stored in the data storage unit.

Step S302: monitoring and data analysis are carried out on the collected performance indexes in real time, and the analysis result is stored;

The log data of the application program is analyzed through the ELK platform, and the acquired performance indexes of the application program are monitored in real time.

The data analysis unit analyzes the acquired application performance indexes through the artificial intelligence model and stores the analysis result in the data storage unit.

Step S303: according to the analysis result of the performance indexes of the application program, alarming on any performance index that exceeds the upper threshold or falls below the lower threshold, or expanding the capacity of the infrastructure and containers of the application program;

The analysis result of the performance indexes of the application program is evaluated by the automatic operation and maintenance unit and the alarm unit. When a performance index of the application program is found to exceed the preset upper threshold or fall below the preset lower threshold, the alarm unit is triggered to alarm on that performance index, or the automatic operation and maintenance unit is triggered to perform automatic capacity expansion and contraction and restore the service.

In one possible embodiment, the upper threshold of the virtual machine CPU utilization is set to 70%; when the collected performance indexes of the application program indicate that the virtual machine CPU utilization is 90%, the alarm unit is triggered to raise an alarm. When the collected performance indexes indicate that the memory utilization of the application program is lower than the preset lower threshold, the alarm unit prompts that the virtual memory is insufficient, and the automatic operation and maintenance unit is triggered to create a new copy, according to the pre-configured elastic capacity expansion policy and the configuration type of the current container, and to automatically add the new copy to the existing cluster of the application program.
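Steps S301 to S303 can be strung together as a minimal pipeline sketch. The 70% CPU threshold and the 90% reading follow the embodiment above; the memory values, the metric names and the three function names are assumptions, and the analysis step uses plain threshold comparison in place of the undisclosed artificial intelligence model.

```python
def collect(platform):
    """S301: take a snapshot of the platform's current index readings."""
    return dict(platform)

def analyze(metrics, limits):
    """S302: compare each index with its (lower, upper) limits."""
    findings = []
    for name, value in metrics.items():
        lower, upper = limits.get(name, (None, None))
        if upper is not None and value > upper:
            findings.append((name, "above_upper"))
        elif lower is not None and value < lower:
            findings.append((name, "below_lower"))
    return findings

def act(findings, scale_on):
    """S303: alarm on every finding and scale out for the configured ones."""
    actions = []
    for name, kind in findings:
        actions.append(f"alarm:{name}:{kind}")
        if (name, kind) in scale_on:
            actions.append(f"scale_out:{name}")
    return actions

PLATFORM = {"vm_cpu_utilization": 0.90, "app_memory_utilization": 0.10}
LIMITS = {"vm_cpu_utilization": (None, 0.70), "app_memory_utilization": (0.20, None)}
SCALE_ON = {("app_memory_utilization", "below_lower")}

actions = act(analyze(collect(PLATFORM), LIMITS), SCALE_ON)
print(actions)
```

Which findings lead to scaling is kept in configuration (`SCALE_ON`) so that the alarm path and the self-healing path can be adjusted independently.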

In one possible embodiment, after the collected performance indexes of the application program and their analysis results are stored in the data storage unit, the user may query the collected performance indexes and the analysis results through the data query unit, and the query results are displayed through the data display unit.

Furthermore, through the data query unit the user can also query the hardware and software components related to the application program and the interaction between those components, or query the path of a real-time business transaction. The query result is displayed as a topology chain. The APM call chain topology display unit presents the application service call chain of the APM, and a service module can be located through its topology node in the call chain, so that the system can monitor the call relationships of the application more intuitively.

Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in this invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.

The above embodiments describe the objects, technical solutions and advantages of the present invention in further detail. It should be understood that they are only exemplary embodiments of the present invention and are not intended to limit its scope; any modifications, equivalent substitutions, improvements and the like made on the basis of the technical solutions of the present invention should be included in the scope of the present invention.

Claims

1. An artificial intelligence based APM monitoring system comprising:

2. The system of claim 1, further comprising: data storage unit

3. The system of claims 1-2, further comprising: a data query unit;

4. The system of claim 3, further comprising: a data display unit;

5. The system of claim 1, wherein the metric acquisition unit is specifically configured to: the performance indexes of the application programs are collected in a log point burying mode, and the relation between the applications is automatically discovered through deploying the Agent.

6. The system of claim 1, wherein the performance metrics of the application include: one or more of a monitoring index, a host index, a storage index, a middleware index, a virtual machine index, an application and module index, and an inter-service invocation index of the network.

7. An APM monitoring method based on artificial intelligence comprises the following steps:

collecting performance indexes of an application program;

8. The method of claim 7, wherein collecting the performance indexes of the application program and the relationships between the applications comprises: collecting the performance indexes of the application program through log instrumentation points, and automatically discovering the application relationships by deploying an Agent.

9. The method of claim 7, wherein after the monitoring and data analysis of the performance indicators of the application in real time and storing the performance indicators and the analysis results of the performance indicators, the method further comprises:

10. The method of claim 7, wherein triggering automatic scaling of the virtualized device comprises: