CN113760646A - System health monitoring method, device, equipment and storage medium

Info

Publication number: CN113760646A
Application number: CN202110363225.3A
Authority: CN
Inventors: 王文彬
Original assignee: Beijing Jingdong Tuoxian Technology Co Ltd
Current assignee: Beijing Jingdong Tuoxian Technology Co Ltd
Priority date: 2021-04-02
Filing date: 2021-04-02
Publication date: 2021-12-07

Abstract

The embodiments of the present application provide a system health monitoring method, device, equipment and storage medium. A health monitoring request sent by a user is acquired, the health monitoring request including an identifier of the system to be monitored; a monitoring task corresponding to the system to be monitored is started according to that identifier; and the health condition of the system to be monitored is determined according to the monitoring configuration information of the monitoring task. With this technical solution, the monitoring task is started when the monitoring request is received and the health condition is obtained automatically from the monitoring configuration information of the monitoring task, without manual participation, which reduces subjective human influence and improves real-time performance and monitoring accuracy.

Description

System health monitoring method, device, equipment and storage medium

Technical Field

The embodiment of the application relates to the technical field of computers, in particular to a system health monitoring method, a system health monitoring device, system health monitoring equipment and a storage medium.

Background

With the continuous development of computer technology and internet application, various software systems appear in different fields, and in order to ensure the availability and reliability of each software system (hereinafter referred to as system) and support high-performance, high-concurrency and other service scenarios, health condition monitoring needs to be performed on the different software systems.

In the related art, many factors affect the normal operation of a system, for example hardware deployment, the Java Virtual Machine (JVM), system container ports, interface performance, interface call volume, code quality, and Remote Procedure Call (RPC) services. For a given system to be monitored, a separate monitoring platform is therefore built for each factor, and each factor is monitored in isolation. Monitoring personnel then log in to each platform manually to obtain the monitoring result of that factor, and finally combine the results by hand to obtain the health condition of the system to be monitored.

In the process of implementing the invention, the inventor finds that at least the following problems exist in the prior art: the monitoring results of all factors are determined by monitoring personnel and are influenced by human subjectivity, so that the problems of poor real-time performance, time and labor consumption and low accuracy of the monitoring results exist.

Disclosure of Invention

The embodiment of the application provides a system health monitoring method, a system health monitoring device, system health monitoring equipment and a storage medium, which are used for solving the problems of poor real-time performance, time and labor consumption and low monitoring result accuracy in the existing system monitoring scheme.

According to a first aspect of the present application, there is provided a system health monitoring method, comprising:

acquiring a health monitoring request sent by a user, wherein the health monitoring request comprises: identification of a system to be monitored;

starting a monitoring task corresponding to the system to be monitored according to the identifier of the system to be monitored;

and determining the health condition of the system to be monitored according to the monitoring configuration information of the monitoring task.

In one possible design, the monitoring configuration information includes: a basic index set and a weight distribution scheme corresponding to the basic index set;

correspondingly, the determining the health condition of the system to be monitored according to the monitoring configuration information of the monitoring task includes:

acquiring monitoring data of each basic index of the system to be monitored according to each basic index included in the basic index set;

determining scores of all basic indexes according to all basic index monitoring data of the system to be monitored and preset threshold data of all basic indexes;

obtaining the health score of the system to be monitored according to the score of each basic index and the weight distribution scheme corresponding to the basic index set;

and determining the health condition of the system to be monitored according to the health score of the system to be monitored and a preset health threshold value.
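The weighting and threshold steps above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the implementation of the present application: the function names `health_score` and `health_status`, the index names, the 0 to 100 score scale, the example weights, and the health threshold of 80 are all hypothetical.

```python
def health_score(scores, weights):
    """Combine per-index scores into one health score using a weight scheme.

    scores  -- mapping of basic index name -> score (assumed 0..100 scale)
    weights -- mapping of basic index name -> weight (hypothetical values)
    """
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same basic indexes")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    # Dividing by the total keeps the result on the score scale even when
    # the weights do not sum exactly to 1.
    return sum(scores[name] * weights[name] for name in scores) / total_weight


def health_status(score, health_threshold=80.0):
    """Compare the health score with a preset health threshold (assumed value)."""
    return "healthy" if score >= health_threshold else "unhealthy"


# Hypothetical scores and weights for three basic indexes.
example_scores = {"code_quality": 90.0, "container_survival": 100.0, "hardware_usage": 60.0}
example_weights = {"code_quality": 0.3, "container_survival": 0.5, "hardware_usage": 0.2}
example_total = health_score(example_scores, example_weights)
example_status = health_status(example_total)
```

With these hypothetical inputs the weighted sum is about 89, which is above the assumed threshold, so the example system would be reported as healthy.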

In a possible design, the obtaining, according to each basic index included in the basic index set, monitoring data of each basic index of the system to be monitored includes:

determining a monitoring platform related to each basic index according to each basic index included in the basic index set;

and respectively acquiring monitoring data of each basic index from the monitoring platform associated with each basic index.

In a possible design, after the obtaining, according to each basic index included in the basic index set, monitoring data of each basic index of the system to be monitored, the method further includes:

for each basic index, packaging the basic index monitoring data to obtain a target data body corresponding to the basic index, wherein the name of the target data body includes: the identifier of the system to be monitored, the type of the basic index, and the identifier of the basic index;

storing the target data body corresponding to the basic index into a cache space;

and periodically persisting the data in the cache space to a database.
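The packaging, caching, and persisting steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the colon-separated name format, the class `MetricCache`, the table name `metric_history`, and the use of an in-memory dictionary and `sqlite3` as stand-ins for the cache space and the database are all hypothetical choices made for the example.

```python
import json
import sqlite3


def target_body_name(system_id, index_type, index_id):
    """Build the name of a target data body from its three identifying parts."""
    # The colon separator is an assumption; the text only lists the parts.
    return f"{system_id}:{index_type}:{index_id}"


class MetricCache:
    """In-memory stand-in for the cache space, flushed to a database on demand."""

    def __init__(self, connection):
        self._entries = {}
        self._db = connection
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS metric_history (name TEXT, payload TEXT)"
        )

    def put(self, system_id, index_type, index_id, data):
        # Package the monitoring data under its composite name.
        self._entries[target_body_name(system_id, index_type, index_id)] = data

    def get(self, system_id, index_type, index_id):
        return self._entries.get(target_body_name(system_id, index_type, index_id))

    def persist(self):
        """Write every cached data body to the database and return the row count."""
        rows = [(name, json.dumps(data)) for name, data in self._entries.items()]
        self._db.executemany("INSERT INTO metric_history VALUES (?, ?)", rows)
        self._db.commit()
        return len(rows)
```

In a deployment, `persist` would be invoked on a timer so that the cached data is written to the database periodically; the interval is left open here because the text does not specify one.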

In a possible design, the determining the score of each basic index according to the monitoring data of each basic index of the system to be monitored and the preset threshold data of each basic index includes:

for each basic index, calculating the ratio between the monitoring data of the basic index and the threshold data of the basic index;

and when the ratio and a preset threshold value meet a preset relation, calculating the score of the basic index according to the ratio and a preset health full score.
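One way the ratio-based scoring above might look is sketched below. The text does not specify the preset relation or the scoring formula, so everything concrete here is an assumption made for illustration: the index is treated as "lower is better", a ratio at or above `ratio_limit` scores zero, and below that limit the score falls linearly from `full_score`.

```python
def index_score(monitored_value, threshold_value, full_score=100.0, ratio_limit=1.0):
    """Score one basic index from the ratio of its monitored value to its threshold.

    Assumed relation (not taken from the text): lower values are better,
    a ratio >= ratio_limit scores 0, and otherwise the score decreases
    linearly from full_score as the ratio approaches ratio_limit.
    """
    if threshold_value <= 0:
        raise ValueError("threshold_value must be positive")
    if monitored_value < 0:
        raise ValueError("monitored_value must not be negative")
    ratio = monitored_value / threshold_value
    if ratio >= ratio_limit:
        return 0.0
    return full_score * (1.0 - ratio / ratio_limit)
```

For an index where higher values are better (for example, the number of live instances), the relation would have to be inverted; that choice would belong to the monitoring configuration of the task.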

In one possible design, the set of base metrics includes at least one of: code quality, service function, container survival, hardware usage information, startup quality of service.

In one possible design, before the obtaining the health monitoring request issued by the user, the method further includes:

acquiring a task establishment request of the system to be monitored, wherein the task establishment request comprises: the identity of the system to be monitored;

establishing a monitoring task for the system to be monitored according to the task establishing request;

and acquiring and storing the monitoring configuration information of the monitoring task.

According to a second aspect of the present application, there is provided a system health monitoring device comprising:

an obtaining module, configured to obtain a health monitoring request sent by a user, where the health monitoring request includes: identification of a system to be monitored;

the starting module is used for starting a monitoring task corresponding to the system to be monitored according to the identifier of the system to be monitored;

and the processing module is used for determining the health condition of the system to be monitored according to the monitoring configuration information of the monitoring task.

correspondingly, the processing module is specifically configured to:

In a possible design, the processing module is configured to obtain, according to each basic index included in the basic index set, monitoring data of each basic index of the system to be monitored, specifically:

the processing module is specifically configured to:

In one possible design, the processing module is further configured to:

and periodically persisting the data in the cache space to a database.

In one possible design, the processing module is configured to determine a score of each basic index according to each basic index monitoring data of the system to be monitored and preset each basic index threshold data, and specifically includes:

the processing module is specifically configured to:

In a possible design, the obtaining module is further configured to obtain a task establishment request of the system to be monitored, where the task establishment request includes: the identity of the system to be monitored;

the processing module is further configured to:

According to a third aspect of the present application, there is provided a monitoring processing device comprising a processor, a memory and a computer program stored on the memory and executable on the processor, the processor implementing the method according to the first aspect and possible designs when executing the computer program.

According to a fourth aspect of the present application, there is provided a computer-readable storage medium having stored thereon computer-executable instructions for performing the method of the first aspect and possible designs as described above when executed by a processor.

According to a fifth aspect of the present application, there is provided a computer program product comprising: a computer program stored in a readable storage medium from which at least one processor of a monitoring processing device can read the computer program, execution of the computer program by the at least one processor causing the monitoring processing device to perform the method of the first aspect.

In the system health monitoring method, device, equipment and storage medium provided by the embodiments of the present application, a health monitoring request sent by a user is acquired, the health monitoring request including an identifier of the system to be monitored; a monitoring task corresponding to the system to be monitored is started according to that identifier; and the health condition of the system to be monitored is determined according to the monitoring configuration information of the monitoring task. With this technical solution, the monitoring task is started when the monitoring request is received and the health condition is obtained automatically from the monitoring configuration information of the monitoring task, without manual participation, which reduces subjective human influence and improves real-time performance and monitoring accuracy.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.

Fig. 1 is a schematic view of an application scenario of a system health monitoring method according to an embodiment of the present application;

Fig. 2 is a schematic flow chart of a first embodiment of a system health monitoring method provided in the present application;

Fig. 3 is a schematic flow chart of a second embodiment of a system health monitoring method provided in the present application;

Fig. 4 is a schematic diagram of collecting basic index monitoring data provided in an embodiment of the present application;

Fig. 5 is a schematic diagram of the distribution of each basic index in the basic index set;

Fig. 6 is a schematic diagram of health scores of a plurality of systems to be monitored shown in an embodiment of the present application;

Fig. 7 is a schematic diagram of determining a transaction system health score in an embodiment of the present application;

Fig. 8 is a schematic flow chart of a third embodiment of a system health monitoring method provided in the present application;

Fig. 9 is a schematic diagram of the process of establishing a monitoring task for a system to be monitored according to an embodiment of the present application;

Fig. 10 is a schematic structural diagram of an embodiment of a system health monitoring device provided in the present application;

Fig. 11 is a schematic structural diagram of an embodiment of a monitoring processing device provided in the present application.

The above drawings show specific embodiments of the present disclosure, which are described in more detail below. These drawings and the written description are not intended to limit the scope of the disclosed concepts in any way, but rather to illustrate the concepts of the disclosure to those skilled in the art by reference to specific embodiments.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.

First, terms related to embodiments of the present application will be explained:

Java Virtual Machine (JVM): a specification for a computing device. It is an abstract computer, realized by emulating various computer functions on an actual computer.

Remote Procedure Call (RPC): can be understood as one node requesting a service provided by another node.

Cache: a buffer area for data exchange. When a browser executes a request, it first looks in the cache and, if the data is present, obtains it from there; otherwise it accesses the database. Cached data can be read quickly.

Redis: a cache database belonging to the class of non-relational databases. It stores frequently used data, which reduces the number of database accesses and improves operating efficiency; the storage time of the data is limited.

MySQL: a relational database used for persistent storage. It mainly stores persistent data on hard disk, and its read speed is comparatively slow.

As described in the Background, health monitoring in the prior art mainly relies on personnel monitoring each factor that influences system health in isolation and then summarizing the monitoring results to determine the health condition of the system to be monitored.

Further, many factors determine whether a system is healthy, such as the hardware on which the system is deployed, the system JVM, system container ports, system interface performance, system interface call volume, system code quality, and system RPC services. Each of these factors may affect the normal operation of the system and therefore its health condition. Moreover, because different systems belong to different fields, each system places different emphasis on each factor, so a single fixed set of health monitoring rules cannot be applied to systems in every field.

In view of the above technical problems, the technical solution of the present application was conceived as follows. The inventor found in practice that enterprise performance management uses Key Performance Indicator (KPI) assessment schemes. A KPI is a goal-oriented quantitative management index that measures process performance by setting, sampling, calculating, and analyzing key parameters at the input and output ends of an organization's internal processes. It is a tool for decomposing the strategic goals of an enterprise into actionable work goals and is a basis of enterprise performance management. The inventor therefore applied the KPI idea to system health monitoring: by setting basic assessment indexes for each system to be monitored and assessing each index, the health condition of the system to be monitored can be determined.

Based on the above conception process, the embodiments of the present application provide a system health monitoring method. A health monitoring request sent by a user is acquired, the health monitoring request including an identifier of the system to be monitored; a monitoring task corresponding to the system to be monitored is started according to that identifier; and the health condition of the system to be monitored is determined according to the monitoring configuration information of the monitoring task. With this technical solution, the monitoring task is started when the monitoring request is received and the health condition is obtained automatically from the monitoring configuration information of the monitoring task, without manual participation, which reduces subjective human influence and improves real-time performance and monitoring accuracy.

Fig. 1 is a schematic view of an application scenario of a system health monitoring method according to an embodiment of the present application. As shown in fig. 1, the application scenario may include: at least one system to be monitored 11 and a monitoring processing device 12. The monitoring processing device 12 and the service device 110 of the system to be monitored 11 may perform information communication, for example, the monitoring processing device 12 may obtain basic index monitoring data corresponding to each basic index of the system to be monitored 11.

For example, in practical application, at least one monitoring platform 13 is further disposed between each to-be-monitored system 11 and the monitoring processing device 12, and each monitoring platform may monitor at least one basic index of the at least one to-be-monitored system, determine each basic index monitoring data corresponding to each basic index, and transmit the basic index monitoring data to the monitoring processing device 12.

Correspondingly, the monitoring processing device 12 may receive a health monitoring request triggered by the outside, start a monitoring task corresponding to the system to be monitored according to an identifier of the system to be monitored carried in the health monitoring request, and determine the health condition of the system to be monitored based on monitoring configuration information of the monitoring task.

Specifically, the monitoring processing device 12 stores in advance monitoring configuration information of a monitoring task corresponding to each system to be monitored, for example, a monitored basic index set and a weight distribution scheme corresponding to the basic index set, and then, after receiving a health monitoring request, can automatically execute the technical scheme of the present application, so that the health condition of the system to be monitored is determined without human participation in monitoring.

In a possible design of the present application, the monitoring processing device 12 may have an operation interactive interface 120, so that when the monitoring processing device 12 determines the health condition of the system to be monitored, the health condition of the system to be monitored can be displayed through the operation interactive interface 120, so that the monitoring personnel can obtain the health condition of each system to be monitored in time.

In a possible design of the present application, the application scenario may further include a display device, which is connected to the monitoring processing device 12, and is capable of receiving and displaying a processing result, such as a health condition, of the monitoring processing device 12.

It can be understood that the embodiment of the present application is explained using an application scenario that includes one system to be monitored. In practical application, the number of systems to be monitored connected to the monitoring processing device 12 is not limited and may be determined according to actual requirements, which is not described herein again.

It should be noted that fig. 1 is only a schematic diagram of an application scenario provided by an embodiment of the present application, and the embodiment of the present application does not limit the devices included in fig. 1, nor the positional relationship between the devices in fig. 1, for example, in fig. 1, the display device may be an external device with respect to the monitoring processing device 12, or in other cases, may be an integrated device, and the embodiment of the present application does not limit the same.

In the embodiment of the present application, the monitoring processing device 12 may be implemented by a terminal device or by a server. Any device may be used as long as it can execute a computer program corresponding to the system health monitoring method provided by the present application, and the present application is not limited in this respect.

The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.

Fig. 2 is a schematic flow chart of a first embodiment of a system health monitoring method provided in the present application. The embodiment of the present application is explained with the monitoring processing device shown in fig. 1 described above as an execution subject. As shown in fig. 2, the system health monitoring method may include the steps of:

S201, acquiring a health monitoring request sent by a user, wherein the health monitoring request includes: an identifier of the system to be monitored.

In the embodiment of the present application, when a user (generally, the person responsible for monitoring the system) wants to know the health condition of a certain system, the user may issue a health monitoring request, so that the monitoring processing device executing the system health monitoring method acquires the identifier of the system to be monitored and executes the subsequent health monitoring processing flow.

In a possible design, the monitoring processing device has an operation interactive interface, and at this time, the user inputs the identifier of the system to be monitored on the operation interactive interface and triggers the monitoring processing device to start working, so that the monitoring processing device can acquire the health monitoring request.

In another possible design, the monitoring processing device is communicatively connected to a user terminal. When the user enters the identifier of the system to be monitored on the operation interactive interface of the user terminal and clicks send, the monitoring processing device acquires the health monitoring request and, from it, the identifier of the system to be monitored, and then executes the subsequent health monitoring processing flow.

It can be understood that, in practical application, the monitoring processing device is a system health monitoring device which is uniformly built for certain software systems, and the health condition of the software systems can be monitored in real time by using the system health monitoring device.

S202, starting a monitoring task corresponding to the system to be monitored according to the identification of the system to be monitored.

For example, in this embodiment, one or more system monitoring tasks may be deployed on the monitoring processing device, so that after the identifier of the system to be monitored is obtained, the monitoring processing device may query all constructed monitoring tasks, locate the monitoring task of the system to be monitored, and start the monitoring task, thereby executing the calculation process of the health condition.
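The lookup-and-start behaviour described in this step can be sketched as follows. This is only an illustrative sketch: the class names `MonitoringTask` and `TaskRegistry`, the method names, and the shape of the configuration dictionary are hypothetical and are not taken from the present application.

```python
class MonitoringTask:
    """A monitoring task bound to one system to be monitored."""

    def __init__(self, system_id, config):
        self.system_id = system_id
        self.config = config      # monitoring configuration information
        self.running = False

    def start(self):
        self.running = True


class TaskRegistry:
    """Holds the monitoring tasks that have been built, keyed by system identifier."""

    def __init__(self):
        self._tasks = {}

    def register(self, system_id, config):
        """Build and store a monitoring task for the given system."""
        task = MonitoringTask(system_id, config)
        self._tasks[system_id] = task
        return task

    def start_for(self, system_id):
        """Locate the task for system_id among all built tasks and start it."""
        task = self._tasks.get(system_id)
        if task is None:
            raise KeyError(f"no monitoring task built for system {system_id!r}")
        task.start()
        return task
```

Keying the registry by the system identifier is a design assumption that makes the query in S202 a single dictionary lookup; a request for a system with no built task is rejected rather than silently ignored.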

S203, determining the health condition of the system to be monitored according to the monitoring configuration information of the monitoring task.

In this step, since the monitoring processing device can process health monitoring requests of a plurality of systems at the same time, monitoring tasks of a plurality of systems may be running at the same time. In practical application, the indexes of concern differ from system to system, so the pre-stored monitoring configuration information of each monitoring task also differs. Therefore, after the monitoring task corresponding to the system to be monitored is started, the stored monitoring configuration information is first queried to determine the monitoring configuration information of that monitoring task; the related index data of the system to be monitored is then obtained based on the monitoring configuration information, and the health condition of the system to be monitored is calculated.

Optionally, the monitoring processing device to which the system health monitoring method provided in the embodiment of the present application is applied may monitor multiple systems in multiple fields in a unified manner by building a separate monitoring task for each system and configuring corresponding monitoring configuration information for each task, for example the basic assessment indexes of each system and the monitoring rules for those indexes, so that system health monitoring is ultimately achieved.

In the system health monitoring method provided by the embodiment of the present application, a health monitoring request sent by a user is acquired, the health monitoring request including an identifier of the system to be monitored; a monitoring task corresponding to the system to be monitored is started according to that identifier; and the health condition of the system to be monitored is determined according to the monitoring configuration information of the monitoring task. With this technical solution, the monitoring task is started when the monitoring request is received and the health condition is obtained automatically from the monitoring configuration information of the monitoring task, without manual participation, which reduces subjective human influence and improves real-time performance and monitoring accuracy.

For example, in the above embodiment, if the monitoring configuration information includes a basic index set and a weight distribution scheme corresponding to the basic index set, S203 may be implemented by the embodiment shown in fig. 3 below.

Fig. 3 is a schematic flow chart of a second embodiment of a system health monitoring method provided in the present application. As shown in fig. 3, S203 may be implemented by:

S301, acquiring monitoring data of each basic index of the system to be monitored according to each basic index included in the basic index set.

Illustratively, the basic index set includes at least one of: code quality, service function, container survival, hardware usage information, startup quality of service. The basic index sets of different systems may differ and may be configured by the person responsible for monitoring each system according to the focus of that system, which is not limited in this embodiment.

Specifically, the monitoring processing device may determine the monitoring platform associated with each basic index according to each basic index included in the basic index set, and further obtain the monitoring data of each basic index from the monitoring platform associated with each basic index.
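The two-step lookup described above (resolve the platform for each basic index, then fetch from it) can be sketched as follows. This is a minimal sketch under stated assumptions: the function `collect_monitoring_data`, its parameters, and the fetcher call signature are hypothetical, and real platform APIs are represented only by caller-supplied callables.

```python
def collect_monitoring_data(system_id, index_set, platform_of, fetchers):
    """Fetch monitoring data for each basic index from its associated platform.

    index_set   -- iterable of basic index names configured for the task
    platform_of -- mapping of basic index name -> platform name
    fetchers    -- mapping of platform name -> callable(system_id, index_name)
                   (hypothetical signature standing in for a platform API call)
    """
    data = {}
    for index_name in index_set:
        # Step 1: determine the monitoring platform associated with the index.
        platform = platform_of[index_name]
        # Step 2: obtain the monitoring data of the index from that platform.
        data[index_name] = fetchers[platform](system_id, index_name)
    return data
```

A caller would pass one fetcher per monitoring platform; keeping the index-to-platform mapping as data rather than code lets each system's monitoring configuration choose its own index set without changing the collection logic.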

Optionally, after the monitoring processing device starts the monitoring task corresponding to the system to be monitored, according to the basic index set of the monitoring task, the monitoring processing device may start a plurality of timing tasks, and collect relevant data in real time according to each basic index included in the basic index set.

Illustratively, when the basic index set includes indexes such as code quality, service function, container survival, hardware usage information, and startup quality of service, the monitoring processing device may start the timing tasks corresponding to the system to be monitored and collect related data such as system code scans, service Message Queue (MQ) messages, service interface calls, system container port survival, the system container virtual machine, information on the hardware where the system is located, and RPC service survival.

For example, fig. 4 is a schematic diagram of collecting basic index monitoring data provided in the embodiment of the present application, and fig. 5 is a schematic diagram of the distribution of each basic index in the basic index set. As shown in fig. 4 and fig. 5, assume that the basic index set of the system to be monitored includes: code quality monitoring, MQ service function monitoring, container monitoring, hardware monitoring, RPC monitoring, and the like. Container monitoring may in turn include method call monitoring, system survival monitoring, virtual machine monitoring, and the like; hardware monitoring includes disk utilization monitoring and system memory monitoring; and RPC monitoring refers to RPC service monitoring and the like.

Optionally, code quality monitoring mainly monitors quality problems in the code of the system to be monitored, for example blocking problems (i.e. very serious problems) and serious problems, and generates code monitoring data. Optionally, the code monitoring data is mainly obtained from the Enterprise Operating System (EOS) monitoring platform associated with the code.

It can be understood that blocking problems and serious problems are categories obtained by grading the severity of problems found in the code according to preset information. This embodiment does not limit the specific boundary between the categories, which may be determined according to actual needs and is not described herein again.

Service function monitoring mainly monitors the service call volume per unit time; the received service MQ message bodies are also referred to as service monitoring data. Optionally, the MQ message bodies may include both predefined MQ message bodies and custom message bodies.

Container monitoring (method call monitoring) mainly monitors the call volume, availability, and TP99 performance of the service interfaces of the system to be monitored per unit time, thereby obtaining function monitoring data. Here, TP99 is the minimum time within which ninety-nine percent of network requests are completed.
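As an illustration of the TP99 definition, the following sketch computes a percentile latency with the nearest-rank method. It is an illustrative sketch only: the function name `tp_percentile` and the choice of the nearest-rank method are assumptions, and a monitoring platform may compute the value differently (for example, from histograms).

```python
import math


def tp_percentile(latencies_ms, percentile=99):
    """Return the nearest-rank percentile of a collection of latencies.

    For percentile=99 this is the smallest observed latency such that at
    least 99% of the requests took that long or less.
    """
    if not latencies_ms:
        raise ValueError("latencies_ms must not be empty")
    if not 0 < percentile <= 100:
        raise ValueError("percentile must be in (0, 100]")
    ordered = sorted(latencies_ms)
    # 1-based nearest rank: ceil(p/100 * n), computed as ceil(p * n / 100)
    # so that exact multiples are not pushed up by floating-point error.
    rank = math.ceil(percentile * len(ordered) / 100)
    return ordered[rank - 1]
```

For 100 requests with latencies of 1 ms through 100 ms, the nearest rank for TP99 is 99, so the function returns 99 ms: ninety-nine of the hundred requests completed within that time.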

Container monitoring (system survival monitoring) mainly monitors the port survival and heartbeat survival of the container in which the system to be monitored is located, obtaining survival monitoring data.

The container monitoring-virtual machine monitoring mainly monitors the number of Full GCs of the JVM in the container where the system to be monitored is located and the total number of threads of the current system, so as to obtain virtual machine monitoring data. A Full GC cleans up the entire heap space (including the young generation and the permanent generation), and GC refers to the garbage collection mechanism of Java.

In an embodiment of the present application, the monitoring platform associated with the container is an integrated database platform (UMP), and therefore, the function monitoring data, the survival monitoring data, and the virtual machine monitoring data are all obtained from the UMP.

The hardware monitoring mainly monitors the disk utilization rate and the CPU utilization rate of the Docker machine on which the system to be monitored is deployed, and the resulting hardware monitoring data is mainly obtained from a manufacturing data collection and management system (MDC).

The RPC service monitoring mainly monitors the number of RPC service live instances in a system to be monitored, and the obtained micro-service monitoring data mainly come from Java Service Frameworks (JSF), namely a standard framework for constructing Java Web application programs.

Thus, as can be seen from the above analysis, the monitoring processing device may periodically (e.g., every minute) call the Application Programming Interfaces (APIs) of monitoring platforms such as EOS, UMP, JSF and MDC, and listen to the MQ messages in real time, thereby acquiring the monitoring data of each basic index.
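The periodic collection described above can be sketched as follows. This is only an illustrative sketch: the fetcher functions stand in for the APIs of the EOS, UMP, MDC and JSF platforms, whose real interfaces are not disclosed here, and all field names and values are hypothetical.

```python
import time


def collect_once(fetchers):
    """Call every platform fetcher once and return {platform name: data}."""
    results = {}
    for platform, fetch in fetchers.items():
        try:
            results[platform] = fetch()
        except Exception as exc:  # one failing platform should not block the others
            results[platform] = {"error": str(exc)}
    return results


def poll(fetchers, interval_seconds, rounds, sleep=time.sleep):
    """Collect `rounds` snapshots, waiting `interval_seconds` between two snapshots."""
    snapshots = []
    for i in range(rounds):
        snapshots.append(collect_once(fetchers))
        if i < rounds - 1:
            sleep(interval_seconds)
    return snapshots


# Hypothetical stand-ins for the monitoring platform APIs.
fetchers = {
    "EOS": lambda: {"blocking_problems": 2, "serious_problems": 5},
    "UMP": lambda: {"calls_per_minute": 1200, "full_gc_count": 1},
    "MDC": lambda: {"disk_usage": 42, "cpu_usage": 31},
    "JSF": lambda: {"live_instances": 4},
}

# A no-op sleep is injected so that the example finishes immediately.
snapshots = poll(fetchers, interval_seconds=60, rounds=2, sleep=lambda s: None)
print(len(snapshots), sorted(snapshots[0]))
```

Passing the sleep function as a parameter keeps the polling interval configurable while allowing the loop to be exercised without waiting.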

Optionally, in fig. 4, after the monitoring processing device obtains each piece of basic index monitoring data, it first stores the data in Redis as cache data, and then stores the data in MySQL as historical data.

Further, in the embodiment of the present application, after the monitoring processing device obtains each basic index monitoring data of the system to be monitored according to each basic index included in the basic index set, the following steps may also be performed:

a1, packaging the basic index monitoring data aiming at each basic index to obtain a target data body corresponding to the basic index, wherein the name of the target data body comprises: the identification of the system to be monitored, the type of the basic index and the identification of the basic index;

a2, storing a target data body corresponding to the basic index into a cache space;

a3, periodically persisting the data in the cache space to the database.

For example, in order to facilitate subsequent reading and storage, after acquiring the basic index monitoring data of each basic index, the monitoring processing device may package the basic index monitoring data into a JSON data body, use the identifier of the system to be monitored + the basic index type + the basic index id as a list name, and finally store the assembled data uniformly in a list data structure in the cache space (Redis). Then, for example, every morning the basic index monitoring data of all systems to be monitored generated on the same day is stored into MySQL, so that the basic index monitoring data can be retrieved with Elasticsearch for subsequent use.

JSON (JavaScript Object Notation) is a lightweight data exchange format. JSON uses a text format that is completely language independent, which makes it an ideal data exchange language that is easy for humans to read and write and easy for machines to parse and generate. Elasticsearch is a search server based on Lucene; it provides a distributed, multi-user-capable full-text search engine with a RESTful web interface.

In this embodiment, each piece of basic index monitoring data is firstly stored in the cache space and then periodically persisted to the database, so that not only can the stability of the data be ensured, but also the reading efficiency of the data can be ensured.
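Steps a1 to a3 can be illustrated with the following minimal sketch. It uses an in-memory dictionary as a stand-in for the Redis lists and a Python list as a stand-in for the MySQL table, so no external service is required. The class name, the ":" separator in the list name, and the choice to empty the cache after persisting are assumptions made for illustration only.

```python
import json
from collections import defaultdict


def list_name(system_id, index_type, index_id):
    """Build the list name: system identifier + basic index type + basic index id."""
    return f"{system_id}:{index_type}:{index_id}"  # the ":" separator is an assumption


class MonitoringStore:
    def __init__(self):
        self.cache = defaultdict(list)  # stand-in for Redis lists
        self.database = []              # stand-in for a MySQL table

    def save(self, system_id, index_type, index_id, data):
        """Steps a1 and a2: package the data as a JSON body and append it to the cache list."""
        body = json.dumps(data, sort_keys=True)
        self.cache[list_name(system_id, index_type, index_id)].append(body)

    def persist(self):
        """Step a3: copy all cached bodies to the database, then empty the cache."""
        moved = 0
        for name, bodies in self.cache.items():
            for body in bodies:
                self.database.append((name, body))
                moved += 1
        self.cache.clear()
        return moved


store = MonitoringStore()
store.save("trade", "hardware", "disk", {"usage_percent": 42})
store.save("trade", "hardware", "disk", {"usage_percent": 45})
print(store.persist(), store.database[0])
```

In a deployment, `persist` would be triggered by a scheduler, for example once every morning as described above.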

S302, determining scores of all basic indexes according to all basic index monitoring data of the system to be monitored and preset threshold data of all basic indexes.

For example, in an embodiment of the application, after acquiring the monitoring data of each basic index of the system to be monitored, the monitoring processing device may acquire preset threshold data of each basic index, determine the monitoring condition of each basic index by using the monitoring data of each basic index and the threshold data of the basic index, and further calculate the score of each basic index according to a weight distribution scheme corresponding to a preset basic index set.

In practical application, the monitoring processing device may first calculate, for each basic index, a ratio between basic index monitoring data corresponding to the basic index and basic index threshold data, and then calculate a score of the basic index according to the ratio and a preset health score when the ratio and a preset threshold satisfy a preset relationship.

In this step, it is assumed that the preset monitoring full score of each basic index is 100, and the preset threshold is 1, which are taken as examples for explanation, and the scores of the basic indexes are calculated respectively as follows.

For code quality monitoring, the basic index monitoring data may include the number of actual blocking problems and the number of actual serious problems, and correspondingly, the basic index threshold data may include: a blocking problem quantity threshold and a serious problem quantity threshold. Thus, for the blocking problem, a first ratio of the number of actual blocking problems to the blocking problem quantity threshold is first determined; when the first ratio is less than 1, the score of the blocking problem is equal to the first ratio multiplied by 100 points, and when the first ratio is greater than or equal to 1, the score of the blocking problem is equal to 0 points. For the serious problem, a second ratio of the number of actual serious problems to the serious problem quantity threshold is first determined; when the second ratio is less than 1, the score of the serious problem is equal to the second ratio multiplied by 100 points, and when the second ratio is greater than or equal to 1, the score of the serious problem is equal to 0 points.

For service function monitoring, the basic index monitoring data is the actual number of the MQ messages in 1 minute, and correspondingly, the basic index threshold data is the MQ message number threshold in 1 minute. Thus, for the service function monitoring, a third ratio of the actual number of the MQ messages in 1 minute to the threshold number of the MQ messages in 1 minute is first determined, when the third ratio is greater than or equal to 1, the score of the service function monitoring is equal to the third ratio x 100 points, and when the third ratio is less than 1, the score of the service function monitoring is equal to 0 points.

For container monitoring-method call monitoring, the basic index monitoring data may include: the actual number of calls of the method within 1 minute, the actual availability rate of the method within 1 minute, and the actual TP99 performance of the method within 1 minute; accordingly, the basic index threshold data may include: a call number threshold of the method within 1 minute, an availability rate threshold of the method within 1 minute, and a TP99 performance threshold of the method within 1 minute.

Therefore, for the number of calls of the method within 1 minute, a fourth ratio of the actual number of calls of the method within 1 minute to the call number threshold of the method within 1 minute is first determined; when the fourth ratio is less than 1, the score of the number of calls of the method within 1 minute is equal to the fourth ratio multiplied by 100 points, and when the fourth ratio is greater than or equal to 1, the score of the number of calls of the method within 1 minute is equal to 0 points.

For the availability rate of the method within 1 minute, a fifth ratio of the actual availability rate of the method within 1 minute to the availability rate threshold of the method within 1 minute is first determined; when the fifth ratio is greater than or equal to 1, the score of the availability rate of the method within 1 minute is equal to the fifth ratio multiplied by 100 points, and when the fifth ratio is less than 1, the score of the availability rate of the method within 1 minute is equal to 0 points.

For the TP99 performance of the method within 1 minute, a sixth ratio of the actual TP99 performance of the method within 1 minute to the TP99 performance threshold of the method within 1 minute is first determined; when the sixth ratio is less than 1, the score of the TP99 performance of the method within 1 minute is equal to the sixth ratio multiplied by 100 points, and when the sixth ratio is greater than or equal to 1, the score of the TP99 performance of the method within 1 minute is equal to 0 points.

For container monitoring-survival monitoring, the basic index monitoring data is the actual survival number of the system containers, and the basic index threshold data is the survival number threshold of the system containers. Thus, for system survival monitoring, a seventh ratio of the actual survival number of the system containers to the survival number threshold of the system containers is first determined; when the seventh ratio is greater than or equal to 1, the score of the container monitoring-survival monitoring is equal to the seventh ratio multiplied by 100 points, and when the seventh ratio is less than 1, the score of the container monitoring-survival monitoring is equal to 0 points.

For container monitoring-virtual machine monitoring, the basic index monitoring data may include: the actual number of garbage collections of the system virtual machine and the actual number of system threads, and the basic index threshold data may include: a garbage collection number threshold of the system virtual machine and a system thread number threshold.

Therefore, for the number of garbage collections of the system virtual machine, an eighth ratio of the actual number of garbage collections of the system virtual machine to the garbage collection number threshold of the system virtual machine is first determined; when the eighth ratio is less than 1, the score of the number of garbage collections of the system virtual machine is equal to the eighth ratio multiplied by 100 points, and when the eighth ratio is greater than or equal to 1, the score of the number of garbage collections of the system virtual machine is equal to 0 points.

In terms of the number of the system threads, a ninth ratio of the actual number of the system threads to the threshold value of the number of the system threads is determined, when the ninth ratio is less than 1, the score of the number of the system threads is equal to the ninth ratio multiplied by 100, and when the ninth ratio is greater than or equal to 1, the score of the number of the system threads is equal to 0.

For hardware monitoring, the base indicator monitoring data may include: the actual utilization rate of a system disk, the actual utilization rate of a system CPU, and the actual utilization rate of a system memory, and the basic index threshold data may include: a system disk utilization threshold, a system CPU utilization threshold and a system memory utilization threshold.

Therefore, for the system disk utilization rate, a tenth ratio of the actual system disk utilization rate to the system disk utilization rate threshold is first determined; when the tenth ratio is less than 1, the score of the system disk utilization rate is equal to the tenth ratio multiplied by 100 points, and when the tenth ratio is greater than or equal to 1, the score of the system disk utilization rate is equal to 0 points.

For the utilization rate of the system CPU, an eleventh ratio of the actual utilization rate of the system CPU to the utilization rate threshold of the system CPU is determined, when the eleventh ratio is less than 1, the score of the utilization rate of the system CPU is equal to the eleventh ratio multiplied by 100, and when the eleventh ratio is more than or equal to 1, the score of the utilization rate of the system CPU is equal to 0.

For the system memory utilization rate, firstly, a twelfth ratio of the system memory actual utilization rate to the system memory utilization rate threshold is determined, when the twelfth ratio is less than 1, the score of the system memory utilization rate is equal to the twelfth ratio multiplied by 100, and when the twelfth ratio is greater than or equal to 1, the score of the system memory utilization rate is equal to 0.

For RPC service monitoring, the basic index monitoring data is the actual survival number of the RPC services of the system, and the basic index threshold data is the survival number threshold of the RPC services of the system. Therefore, for RPC service monitoring, a thirteenth ratio of the actual survival number of the RPC services of the system to the survival number threshold of the RPC services of the system is first determined; when the thirteenth ratio is greater than or equal to 1, the score of the RPC service monitoring is equal to the thirteenth ratio multiplied by 100 points, and when the thirteenth ratio is less than 1, the score of the RPC service monitoring is equal to 0 points.
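The scoring rules enumerated in this step can be summarized by the following minimal sketch, which follows the rules as literally stated: each basic index earns the ratio multiplied by 100 points on one side of the preset threshold 1, and 0 points on the other side. The index names are hypothetical labels; the disk utilization rate is treated like the CPU and memory utilization rates, which is an assumption; and the text does not state whether a score above 100 is capped, so the sketch leaves it uncapped.

```python
FULL_SCORE = 100  # preset full score of each basic index, as in this step

BELOW = "below"        # points are earned when ratio < 1
AT_LEAST = "at_least"  # points are earned when ratio >= 1

# Side of the preset threshold (ratio = 1) that earns points, per the text.
RULES = {
    "blocking_problems": BELOW,
    "serious_problems": BELOW,
    "mq_messages_per_minute": AT_LEAST,
    "method_calls_per_minute": BELOW,
    "method_availability": AT_LEAST,
    "method_tp99": BELOW,
    "container_survival": AT_LEAST,
    "full_gc_count": BELOW,
    "thread_count": BELOW,
    "disk_usage": BELOW,
    "cpu_usage": BELOW,
    "memory_usage": BELOW,
    "rpc_live_instances": AT_LEAST,
}


def score_index(name, actual, threshold_value):
    """Return ratio * 100 when the ratio is on the scoring side of 1, otherwise 0."""
    ratio = actual / threshold_value
    earns_points = ratio < 1 if RULES[name] == BELOW else ratio >= 1
    return actual * FULL_SCORE / threshold_value if earns_points else 0.0


def score_all(observations):
    """observations: {index name: (actual value, threshold value)} -> {index name: score}."""
    return {name: score_index(name, a, t) for name, (a, t) in observations.items()}


# Hypothetical monitoring data and thresholds.
print(score_all({
    "cpu_usage": (40, 80),
    "full_gc_count": (5, 5),
    "rpc_live_instances": (4, 4),
}))
```

Keeping the direction of each rule in one table makes it easy to review the configuration against the description and to adjust a rule without touching the scoring function.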

It can be understood that, in this embodiment, when the ratio satisfies the preset relationship, the score of the basic index is calculated according to the ratio and the preset health score. In practical applications, in order to simplify the calculation, the score of the basic index may instead be taken directly as the preset health score when the ratio satisfies the preset relationship, and the health score of the system to be monitored is then determined according to the weight value of each basic index. Specific implementation details are not limited herein.

And S303, obtaining the health score of the system to be monitored according to the score of each basic index and the weight distribution scheme corresponding to the basic index set.

In the embodiment of the present application, in the weight distribution scheme corresponding to the basic index set, the sum of the weights of the basic indexes is equal to 100%, and thus, the health score of each basic index is equal to the product of the score of the basic index and the weight of the basic index, and accordingly, the health score of the system to be monitored is equal to the sum of the health scores of the basic indexes.
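The weighted summation described above can be sketched as follows. The function name, the index names, the scores and the weights are hypothetical; weights are expressed as integer percentages so that the check that they sum to 100% is exact.

```python
def health_score(scores, weights_percent):
    """Return (system health score, per-index health scores).

    scores: {index name: score of the basic index}
    weights_percent: {index name: weight in percent}; the weights must sum to 100.
    """
    if set(scores) != set(weights_percent):
        raise ValueError("scores and weights must cover the same basic indexes")
    if sum(weights_percent.values()) != 100:
        raise ValueError("the weights of the basic indexes must sum to 100%")
    contributions = {
        name: scores[name] * weights_percent[name] / 100 for name in scores
    }
    return sum(contributions.values()), contributions


total, parts = health_score(
    {"code": 80, "disk": 50, "rpc": 100},
    {"code": 30, "disk": 50, "rpc": 20},
)
print(total, parts)  # prints 69.0 {'code': 24.0, 'disk': 25.0, 'rpc': 20.0}
```

Returning the per-index contributions alongside the total makes it straightforward to derive the deduction of each index from its weighted full score.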

In this step, the embodiment of the present application examines each index of the system to be monitored based on the monitoring configuration information of the monitoring task configured by the system to be monitored in real time and by combining the collected monitoring data of each basic index, and finally obtains the health score of the system to be monitored.

For example, fig. 6 is a schematic diagram of health scores of a plurality of systems to be monitored shown in the embodiment of the present application. As shown in fig. 6, as seen from the operation interface of the monitoring processing device, the system management directory may include a system score showing a health score of the monitoring system corresponding to the current monitoring task and a system list, which may be a list of systems that can be monitored by the monitoring processing device.

For example, the system score interface shown in fig. 6 shows the health scores of the monitoring systems corresponding to 6 monitoring tasks, namely monitoring task 1 to monitoring task 6. Each monitoring task has a monitoring task name, a monitoring system name, an administrator, and a health score of the monitoring system. Some of the deduction items may also be shown on the interface, for example in descending order of points deducted (this embodiment does not limit the order). In this embodiment, the full score of the monitoring system is taken as 100 by way of example.

For example, the monitoring task name of monitoring task 1 is monitoring task 1, the administrator is M1, the health score of the monitoring system is 39.51 points, and the deduction items include: 10 points for N1, 6.04 points for N2, 5 points for N3, and 5 points for N4. Similarly, the monitoring task name of monitoring task 2 is monitoring task 2, the administrator is M2, the health score of the monitoring system is 35.69 points, and the deduction items include: 20 points for N4, 20 points for N1, 11.6 points for N2, and 10 points for N3. The monitoring task name of monitoring task 3 is monitoring task 3, the administrator is M1, the health score of the monitoring system is 61 points, and the deduction items include: 10 points for N3, 5 points for N4, 3.97 points for N1, and 0.02 points for N2. The monitoring task name of monitoring task 4 is monitoring task 4, the administrator is M1, the health score of the monitoring system is 53.37 points, and the deduction items include: 10 points for N3, 10 points for N4, 10 points for N5, and 6.7 points for N6. The monitoring task name of monitoring task 5 is monitoring task 5, the administrator is M3, the health score of the monitoring system is 0 points, and the deduction items include: 10 points for N3. The monitoring task name of monitoring task 6 is monitoring task 6, the administrator is M4, the health score of the monitoring system is 78.37 points, and the deduction items include: 10.02 points for N6, 5 points for N3, 5 points for N4, and 1.6 points for N2.

It can be understood that, in the embodiment of the present application, the number of the deduction items of each system and the health score of each system are not limited, the specific name of each deduction item is also not limited, and the specific score of each deduction item is also not limited, which may be determined according to an actual scenario, and is not described herein again.

Further, fig. 7 is a schematic diagram of determining a trading system health score according to an embodiment of the present application. As shown in fig. 7, in practical applications, it is assumed that the health monitoring basic indexes of the system may include: [01] code monitoring, [02] service MQ monitoring, [03] function monitoring, [04] system survival monitoring, [05] system hardware monitoring, [06] virtual machine monitoring, and [07] RPC service monitoring.

Therefore, when the system to be monitored is a trading system, the basic index set of the trading system corresponding to the monitoring task may include: [01] code errors, [02] order volume MQ, [03] search function, [03] add-to-cart function, [06] whole heap space cleaning, [04] port survival, [04] system heartbeat, [05] disk monitoring, and the like. Correspondingly, the configured index monitoring threshold and the configured index weight of each basic index are as follows:

[01] code errors: the monitoring threshold is a maximum error quantity threshold, and the weight is 10%;

[02] order volume MQ: the monitoring threshold is the planned order volume threshold within 1 minute, and the weight is 10%;

[03] search function: the monitoring threshold is the planned number of successful searches within 1 minute and the planned number of failed searches within 1 minute, and the weight is 10%;

[03] add-to-cart function: the monitoring threshold is the number of successful cart additions within 1 minute and the number of failed cart additions within 1 minute, and the weight is 10%;

[06] whole heap space cleaning: the monitoring threshold is the number of whole heap space cleanings within 10 minutes, and the weight is 10%;

[04] port survival: the monitoring threshold is a port death number threshold, and the weight is 10%;

[04] system heartbeat: the monitoring threshold is a system heartbeat failure number threshold, and the weight is 10%;

[05] disk monitoring: the monitoring threshold is a disk utilization rate threshold, and the weight is 30%.
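The configuration of fig. 7 described above can be written down as data, as in the following minimal sketch. The class name `IndexConfig`, the field names and the English index names are illustrative renderings and are not prescribed by the embodiment; only the category codes and the weights are taken from the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexConfig:
    category: str        # category code from fig. 7, e.g. "01" for code monitoring
    name: str            # basic index name
    threshold: str       # description of the configured monitoring threshold
    weight_percent: int  # configured index weight, in percent


TRADING_SYSTEM_INDEXES = [
    IndexConfig("01", "code errors", "maximum error quantity", 10),
    IndexConfig("02", "order volume MQ", "planned order volume within 1 minute", 10),
    IndexConfig("03", "search function", "planned search successes and failures within 1 minute", 10),
    IndexConfig("03", "add-to-cart function", "cart addition successes and failures within 1 minute", 10),
    IndexConfig("06", "whole heap space cleaning", "cleanings within 10 minutes", 10),
    IndexConfig("04", "port survival", "port death number", 10),
    IndexConfig("04", "system heartbeat", "heartbeat failure number", 10),
    IndexConfig("05", "disk monitoring", "disk utilization rate", 30),
]


def total_weight(indexes):
    """Return the sum of the weights, raising an error unless it is exactly 100%."""
    total = sum(index.weight_percent for index in indexes)
    if total != 100:
        raise ValueError(f"weights sum to {total}%, expected 100%")
    return total


print(total_weight(TRADING_SYSTEM_INDEXES))  # prints 100
```

Validating the weight sum when the configuration is saved prevents a misconfigured monitoring task from producing a health score on a scale other than 0 to 100.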

Optionally, by executing S301 to S303 of this embodiment on the indexes shown in fig. 7, the health score of the trading system may be obtained as 90 points.

It is to be understood that the specific values and the configured basic indicators in fig. 7 are exemplary illustrations, and the present embodiment does not limit the specific values and the configured basic indicators.

S304, determining the health condition of the system to be monitored according to the health score of the system to be monitored and a preset health threshold value.

In the embodiment of the application, after the health score of the system to be monitored is determined, the monitoring processing device may query the preset health threshold, and determine the health condition of the system to be monitored according to the size relationship between the health score of the system to be monitored and the preset health threshold.

For example, if the preset health threshold of the system to be monitored is 50 points, then when the health score of the system to be monitored is greater than or equal to 50 points, it is determined that the system to be monitored is in a healthy state, and when the health score of the system to be monitored is less than 50 points, it is determined that the system to be monitored has a health problem.

It can be understood that, according to actual needs, the health status of the system to be monitored can be further classified into different statuses such as excellent, good, qualified, and the like according to the score of the system to be monitored, and the embodiments of the present application do not limit this.
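Step S304 and the optional grading can be sketched as follows. The health threshold of 50 points follows the example above, whereas the grade boundaries of 90, 75 and 50 points and the label returned below the lowest boundary are purely hypothetical, since the embodiment does not limit them.

```python
def health_condition(score, health_threshold=50):
    """Return 'healthy' when the score reaches the preset health threshold."""
    return "healthy" if score >= health_threshold else "health problem"


# Hypothetical grade boundaries, listed from the highest lower bound to the lowest.
GRADES = [(90, "excellent"), (75, "good"), (50, "qualified")]


def health_grade(score, grades=GRADES):
    """Return the first grade whose lower bound the score reaches."""
    for lower_bound, label in grades:
        if score >= lower_bound:
            return label
    return "health problem"


for value in (90, 61, 39.51):
    print(value, health_condition(value), health_grade(value))
```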

The system health monitoring method provided by the embodiment of the application obtains each basic index monitoring data of the system to be monitored according to each basic index included in the basic index set, then determines scores of each basic index according to each basic index monitoring data of the system to be monitored and each preset basic index threshold value data, obtains health scores of the system to be monitored according to the scores of each basic index and the weight distribution scheme corresponding to the basic index set, and finally determines the health condition of the system to be monitored according to the health scores of the system to be monitored and the preset health threshold value. According to the technical scheme, the health score of the system to be monitored can be automatically calculated according to the basic index monitoring data of the system to be monitored and the weight distribution scheme corresponding to the basic index set, so that the health condition of the system to be monitored can be determined, manual participation is not needed, and the monitoring accuracy is improved.

For example, on the basis of the foregoing embodiments, fig. 8 is a schematic flow chart of a third embodiment of the system health monitoring method provided by the present application. As shown in fig. 8, before S201, the system health monitoring method may further include the following steps:

S801, acquiring a task establishment request of a system to be monitored, wherein the task establishment request comprises: an identifier of the system to be monitored.

In the embodiment of the application, when there is a need to monitor whether a system is healthy, the person responsible for the system may send a task establishment request including the identifier of the system to be monitored to the monitoring processing device, so that the monitoring processing device can perform operations targeted at that system.

S802, according to the task establishment request, establishing a monitoring task for the system to be monitored.

Optionally, the monitoring processing device may obtain the identifier of the system to be monitored by analyzing the obtained task establishment request, and then may establish the monitoring task for the system to be monitored, so as to monitor the system in a targeted manner in the following.

And S803, acquiring and storing the monitoring configuration information of the monitoring task.

For example, in order to enable the monitoring processing device to automatically implement health monitoring of the system to be monitored, the system responsible person first needs to configure monitoring configuration information for the system to be monitored, including: the basic index set and the weight distribution scheme corresponding to the basic index set.

Specifically, the system administrator determines, according to the characteristics of the system, which basic indexes the system to be monitored has, configures the weight value of each basic index according to the degree of influence of that basic index on the system to be monitored, and writes the weight values into the monitoring processing device.

Fig. 9 is a schematic diagram of the process of constructing a monitoring task for a system to be monitored according to an embodiment of the present application. As shown in fig. 9, when a task establishment request is received, a monitoring task is first created for the system to be monitored, and then the monitoring configuration information of the monitoring task is acquired and stored. Finally, in the specific monitoring process, when the monitoring task of the system to be monitored is started, the acquired basic index monitoring data is first stored in Redis as cache data, which is fast to read but easy to lose, and is then periodically stored in MySQL as backing data, which is not easy to lose but slower to read.

The system health monitoring method provided by the embodiment of the application obtains a task establishment request of a system to be monitored, wherein the task establishment request comprises the identifier of the system to be monitored, establishes a monitoring task for the system to be monitored according to the task establishment request, and finally acquires and stores the monitoring configuration information of the monitoring task. In this technical scheme, a monitoring task is established for the system to be monitored and its monitoring configuration information is configured, which provides the precondition for subsequently determining the health of the system to be monitored automatically.
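Steps S801 to S803 can be illustrated with the following minimal in-memory sketch. The class name, the request field `system_id`, and the validation rules are assumptions made for illustration; the embodiment does not prescribe a data model for monitoring tasks.

```python
class MonitoringTaskRegistry:
    """In-memory stand-in for the task storage of the monitoring processing device."""

    def __init__(self):
        self._tasks = {}

    def create_task(self, request):
        """S801 and S802: read the system identifier from the request and create a task."""
        system_id = request["system_id"]
        if system_id in self._tasks:
            raise ValueError(f"a monitoring task already exists for {system_id}")
        self._tasks[system_id] = {"system_id": system_id, "config": None}
        return self._tasks[system_id]

    def save_config(self, system_id, weights_percent):
        """S803: store the basic index set and its weight distribution scheme."""
        if sum(weights_percent.values()) != 100:
            raise ValueError("the weights of the basic indexes must sum to 100%")
        self._tasks[system_id]["config"] = {
            "index_set": sorted(weights_percent),
            "weights_percent": dict(weights_percent),
        }

    def get_task(self, system_id):
        return self._tasks[system_id]


registry = MonitoringTaskRegistry()
registry.create_task({"system_id": "trading-system"})
registry.save_config("trading-system", {"code errors": 10, "disk monitoring": 30, "other": 60})
print(registry.get_task("trading-system")["config"]["index_set"])
```

A later health monitoring request carrying the same identifier can then look up the stored task and its configuration before the monitoring task is started.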

In the above, a specific implementation process of the system health monitoring method provided by the present application is introduced, and the following is an embodiment of the apparatus of the present application, which may be used to implement the embodiment of the method of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method of the present application.

Fig. 10 is a schematic structural diagram of an embodiment of a system health monitoring apparatus according to an embodiment of the present application. Referring to fig. 10, the system health monitoring apparatus may include:

an obtaining module 1001, configured to obtain a health monitoring request sent by a user, where the health monitoring request includes: identification of a system to be monitored;

a starting module 1002, configured to start a monitoring task corresponding to the system to be monitored according to the identifier of the system to be monitored;

the processing module 1003 is configured to determine a health condition of the system to be monitored according to the monitoring configuration information of the monitoring task.

correspondingly, the processing module 1003 is specifically configured to:

In a possible design, the processing module 1003 is configured to obtain, according to each basic index included in the basic index set, monitoring data of each basic index of the system to be monitored, specifically:

the processing module 1003 is specifically configured to:

In one possible design, the processing module 1003 is further configured to:

and periodically persisting the data in the cache space to a database.

In a possible design, the processing module 1003 is configured to determine scores of the basic indexes according to the monitoring data of the basic indexes of the system to be monitored and preset threshold data of the basic indexes, and specifically includes:

the processing module 1003 is specifically configured to:

In a possible design, the obtaining module 1001 is further configured to obtain a task establishment request of the system to be monitored, where the task establishment request includes: the identity of the system to be monitored;

the processing module 1003 is further configured to:

The apparatus provided in the embodiment of the present application may be used to implement the technical solution described in the embodiment of the method, and the implementation principle and the technical effect are similar, which are not described herein again.

It should be noted that the division of the modules of the above apparatus is only a logical division, and the actual implementation may be wholly or partially integrated into one physical entity, or may be physically separated. And these modules can be realized in the form of software called by processing element; or may be implemented entirely in hardware; and part of the modules can be realized in the form of calling software by the processing element, and part of the modules can be realized in the form of hardware. For example, the processing module may be a processing element separately set up, or may be implemented by being integrated in a chip of the apparatus, or may be stored in a memory of the apparatus in the form of program code, and a function of the processing module may be called and executed by a processing element of the apparatus. Other modules are implemented similarly. In addition, all or part of the modules can be integrated together or can be independently realized. The processing element described herein may be an integrated circuit having signal processing capabilities. In implementation, each step of the above method or each module above may be implemented by an integrated logic circuit of hardware in a processor element or an instruction in the form of software.

Fig. 11 is a schematic structural diagram of an embodiment of a monitoring processing device provided in the present application. As shown in fig. 11, the monitoring processing device may include: the device comprises a processor 1101, a memory 1102, a communication interface 1103 and a system bus 1104, wherein the memory 1102 and the communication interface 1103 are connected with the processor 1101 through the system bus 1104 and complete mutual communication, the memory 1102 is used for storing computer programs, the communication interface 1103 is used for communicating with other devices, and the processor 1101 implements the technical scheme of the above method embodiment when executing the computer programs.

Optionally, in an embodiment of the present application, the monitoring processing device may further include an operation interaction interface 1105, where the operation interaction interface 1105 is configured to receive an indication of a user and/or display health score information of a system to be monitored. The present embodiment does not limit it.

In fig. 11, the processor 1101 may be a general-purpose processor, including a central processing unit CPU, a Network Processor (NP), and the like; but also a digital signal processor DSP, an application specific integrated circuit ASIC, a field programmable gate array FPGA or other programmable logic device, discrete gate or transistor logic, discrete hardware components.

The memory 1102 may include random access memory (RAM) and read-only memory (ROM), and may further include non-volatile memory, such as at least one disk memory.

The communication interface 1103 is used to implement communication between the monitoring processing device and other devices (e.g., client, read-write library, and read-only library).

The system bus 1104 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The system bus may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.

Optionally, an embodiment of the present application further provides a computer-readable storage medium in which computer-executable instructions are stored; when the computer-executable instructions run on a computer, the computer is caused to execute the technical solution described in the foregoing method embodiments.

Optionally, an embodiment of the present application further provides a chip for running instructions, where the chip is configured to execute the technical solution described in the foregoing method embodiments.

According to an embodiment of the present application, there is also provided a computer program product comprising a computer program stored in a readable storage medium. At least one processor of the monitoring processing device can read the computer program from the readable storage medium, and execution of the computer program by the at least one processor causes the monitoring processing device to perform the solution provided by any of the embodiments described above.

Other embodiments of the present disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the application disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims

1. A system health monitoring method, comprising:

2. The method of claim 1, wherein the monitoring configuration information comprises: a basic index set and a weight distribution scheme corresponding to the basic index set;

3. The method according to claim 2, wherein the obtaining of the monitoring data of the basic indexes of the system to be monitored according to the basic indexes included in the basic index set comprises:

4. The method according to claim 2, wherein after the obtaining of the monitoring data of the base indexes of the system to be monitored according to the base indexes included in the base index set, the method further comprises:

and periodically persisting the data in the cache space to a database.

5. The method according to claim 2, wherein the determining the score of each basic index according to each basic index monitoring data of the system to be monitored and each preset basic index threshold data comprises:

6. The method according to any of claims 2-5, wherein the set of base metrics comprises at least one of: code quality, service function, container survival, hardware usage information, startup quality of service.

7. The method of any of claims 1-5, wherein prior to the obtaining the health monitoring request from the user, the method further comprises:

8. A system health monitoring device, comprising:

9. A monitoring processing device comprising a processor, a memory and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of the preceding claims 1-7 when executing the computer program.

10. A computer-readable storage medium having computer-executable instructions stored thereon, which when executed by a processor, perform the method of any one of claims 1-7.

11. A computer program product, comprising a computer program, characterized in that the computer program implements the method of any of claims 1-7 when executed by a processor.
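
By way of non-limiting illustration only, and not forming part of the claims, the following Python sketch shows one way the per-index scoring against preset threshold data (claim 5) and the weight distribution scheme over a basic index set (claim 2) might be combined into a single health score. The claims as reproduced here do not specify the scoring rule, so the linear interpolation between thresholds, the weighted average, and every name and value below (`Threshold`, `index_score`, `health_score`, the index names `hardware_usage` and `container_failure_rate`, and the sample data) are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Hypothetical threshold data for one basic index (higher readings are worse)."""
    warn: float      # readings at or below this value score 100
    critical: float  # readings at or above this value score 0


def index_score(value: float, threshold: Threshold) -> float:
    """Map one monitored reading to a 0-100 score (assumed rule: linear interpolation)."""
    if value <= threshold.warn:
        return 100.0
    if value >= threshold.critical:
        return 0.0
    span = threshold.critical - threshold.warn
    return 100.0 * (threshold.critical - value) / span


def health_score(readings: dict, thresholds: dict, weights: dict) -> float:
    """Weighted average of per-index scores; weights are normalised by their sum."""
    total_weight = sum(weights[name] for name in readings)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(
        weights[name] * index_score(value, thresholds[name])
        for name, value in readings.items()
    )
    return weighted / total_weight


# Hypothetical monitoring data, threshold data and weight distribution scheme.
readings = {"hardware_usage": 70.0, "container_failure_rate": 0.25}
thresholds = {
    "hardware_usage": Threshold(warn=60.0, critical=100.0),
    "container_failure_rate": Threshold(warn=0.0, critical=0.5),
}
weights = {"hardware_usage": 3.0, "container_failure_rate": 1.0}

print(health_score(readings, thresholds, weights))  # 68.75
```

In this sketch the hardware usage reading scores 75 and the container failure rate scores 50, so with weights 3 and 1 the overall score is (3 × 75 + 1 × 50) / 4 = 68.75. Normalising by the sum of the weights means the weight distribution scheme does not have to sum to 1.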