CN115766417A

CN115766417A - Unified monitoring management platform

Info

Publication number: CN115766417A
Application number: CN202211450255.9A
Authority: CN
Inventors: 陈剑锋; 吴晔凯
Original assignee: Zhuhai Huafa New Technology Investment Holdings Co ltd
Current assignee: Zhuhai Huafa New Technology Investment Holdings Co ltd
Priority date: 2022-11-19
Filing date: 2022-11-19
Publication date: 2023-03-07

Abstract

The invention relates to the technical field of the Internet and discloses a unified monitoring management platform, comprising: an acquisition module for collecting the logs of all applications, performing format conversion, storing the logs in a unified manner, and also storing a troubleshooting table; an identification module for identifying abnormal log information and the running state of each system, obtaining abnormal applications and abnormal events, and generating alarm events that carry a topological graph and a time; a transceiver module for receiving the alarm events and sending alarm messages to operation and maintenance personnel; and a statistics module for compiling statistics on the alarm messages to obtain the operation index data of each application system and displaying the data comprehensively. The platform improves capabilities such as application monitoring, alarming, fault location and fault resolution, and realizes a multi-layered monitoring system from the underlying network resources up to the upper-layer applications.

Description

Unified monitoring management platform

Technical Field

The invention relates to the technical field of the Internet, and in particular to a unified monitoring management platform.

Background

At present, the teams involved in operation and maintenance are mainly divided into those responsible for the network hardware facilities, the IaaS cloud platform, and the general PaaS platform. The network hardware facilities use a log auditing platform provided by a third party to monitor and manage logs and performance. The IaaS cloud platform comprises a VMware part and an OpenStack part: VMware uses its own vRealize Suite cloud management platform to monitor and manage logs and performance, while OpenStack uses its own Ceilometer together with the open-source Zabbix for performance monitoring and has not yet achieved unified log management.

The monitoring platforms and log management systems of these teams are independent of one another, so log, monitoring and alarm management is scattered and information is asymmetric. Once an application fails, operation and maintenance personnel can hardly obtain comprehensive and up-to-date running information or locate the specific cause of the failure effectively, which leads to problems such as excessively long application fault recovery time. To improve operation and maintenance efficiency and service quality, and to locate the root cause of application faults quickly and effectively, a unified monitoring platform needs to be built.

Disclosure of Invention

The invention aims to provide a unified monitoring management platform that improves capabilities such as application monitoring, alarming, fault location and fault resolution, and realizes a multi-layered monitoring system from the underlying network resources up to the upper-layer applications.

In order to achieve the purpose, the invention adopts the following technical scheme: a unified monitoring management platform, comprising:

the acquisition module is used for acquiring logs of all applications, performing format conversion processing, performing unified storage and simultaneously storing fault troubleshooting tables;

the identification module is used for identifying abnormal log information and the running state of each system, acquiring abnormal application and abnormal events and generating alarm events with topological graphs and time;

the receiving and sending module is used for receiving the alarm event and sending an alarm message to operation and maintenance personnel;

and the statistical module is used for carrying out statistics on the alarm message to obtain the operation index data of each application system and carrying out comprehensive display.

The principle and advantages of this solution are as follows: in practice the applications are independent of one another, so the logs of all applications must first be collected, converted into a common format to make subsequent processing fast and convenient, and then stored in a unified manner. Abnormal events that have occurred before, together with the corresponding fault-removal operations, are stored in the troubleshooting table, so operation and maintenance personnel can handle a recurring abnormal event quickly by consulting the table, which improves processing efficiency.

After the logs of each application are collected, anomaly identification is performed on the log information and on the running state of each system to obtain the abnormal applications and abnormal events. Each abnormal application carries the topological relations of the application, so fault points can be checked against those relations. Once the abnormal events are obtained, alarm events carrying a topological graph and a time are generated and alarm messages are sent to operation and maintenance personnel, who can then trace the fault points and handle the faults promptly. This shortens application fault recovery time and improves operation and maintenance efficiency and service quality.

Finally, statistics are compiled on the alarm messages to obtain the operation index data of each application system, and the data are displayed comprehensively. Operation and maintenance personnel thereby obtain comprehensive and up-to-date running information, which facilitates global control over all applications and improves management efficiency.

Preferably, as an improvement, the acquisition module further comprises:

the frequency setting module is used for setting acquisition frequencies of different applications;

and the timing log obtaining module is used for obtaining the log information of the specified applications from the ELK unified log platform at regular intervals according to the acquisition frequency.

The technical effects are as follows: depending on actual conditions, a higher acquisition frequency can be set for applications that are more important or in which abnormal conditions occur more often and at shorter intervals, which ensures that anomalies are captured in time.
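As a minimal sketch of how per-application acquisition frequencies could drive timed log retrieval, the following Python example keeps one interval per application and reports which applications are due at a given moment. The class name `CollectionSchedule`, the application names and the interval values are hypothetical, and the actual retrieval from the ELK unified log platform is deliberately left out.

```python
from dataclasses import dataclass, field


@dataclass
class CollectionSchedule:
    """Tracks a per-application acquisition interval in seconds (hypothetical design)."""
    intervals: dict[str, float] = field(default_factory=dict)
    last_run: dict[str, float] = field(default_factory=dict)

    def set_frequency(self, app: str, interval_s: float) -> None:
        if interval_s <= 0:
            raise ValueError("interval must be positive")
        self.intervals[app] = interval_s

    def due(self, now: float) -> list[str]:
        """Return the applications whose interval has elapsed and mark them as collected."""
        ready = []
        for app, interval in self.intervals.items():
            last = self.last_run.get(app)
            if last is None or now - last >= interval:
                ready.append(app)
                self.last_run[app] = now
        return ready


schedule = CollectionSchedule()
schedule.set_frequency("payment", 10)    # important application: collect often
schedule.set_frequency("reporting", 60)  # less critical application: collect rarely

first = schedule.due(now=0)    # nothing collected yet, so both are due
second = schedule.due(now=10)  # only the 10-second application is due again
third = schedule.due(now=60)   # both intervals have elapsed
```

Passing the current time in as an argument, rather than reading a clock inside `due`, keeps the sketch deterministic and easy to check.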

Preferably, as an improvement, the identification module further comprises:

the alarm rule configuration module is used for configuring log alarm rules, wherein the log alarm rules comprise application alarm rules and abnormal event alarm rules;

the timing query module is used for periodically querying the application state of each application and, at the same time, periodically querying the abnormal logs of the applications;

and the abnormal log processing module is used for merging, filtering and upgrading the abnormal logs.

The technical effects are as follows: different companies and different projects care about different abnormal events, so configurable alarm rules allow the platform to be tailored to each use. Querying the application state at regular intervals lets managers grasp the current running condition in time and adjust anything unreasonable promptly, and querying the abnormal logs at regular intervals keeps them under monitoring. Abnormal logs may contain coinciding events, and merging them keeps the operation flow concise. Filtering out abnormal events of low importance and low urgency, and upgrading those of high importance and high urgency, helps allocate resources efficiently.

Preferably, as an improvement, the application state includes a log alarm rule, a topological graph of the application, and a current latest state of the application.

The technical effects are as follows: the nodes related to the application can be identified from its topological graph, the source of an abnormal event can be traced back through those nodes so that the problem is solved at its root, and the log alarm rules can be set more appropriately.

Preferably, as an improvement, the log alarm rule includes the index value configuration of the application, the parameters and parameter values used to identify abnormal logs, and the log refresh frequency.

The technical effects are as follows: alarms for abnormal events are raised according to the log alarm rule.
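To make the three parts of a log alarm rule concrete, the following Python sketch models a rule as a small data structure with an index value configuration, the parameters and values that mark a log as abnormal, and a refresh frequency. The field names, the "value exceeds limit" comparison and the sample values are assumptions made for illustration; the source does not specify how a rule is encoded or evaluated.

```python
from dataclasses import dataclass


@dataclass
class LogAlarmRule:
    """Hypothetical shape of one log alarm rule."""
    app: str
    index_thresholds: dict[str, float]  # index name -> upper limit (assumed semantics)
    log_match: dict[str, str]           # log field -> value that marks a log as abnormal
    refresh_seconds: int                # how often the logs are re-queried

    def violated_indexes(self, indexes: dict[str, float]) -> list[str]:
        """Names of the indexes whose current value exceeds the configured limit."""
        return sorted(
            name for name, limit in self.index_thresholds.items()
            if indexes.get(name, 0.0) > limit
        )

    def is_abnormal_log(self, record: dict[str, str]) -> bool:
        """A log is abnormal when every configured parameter has the configured value."""
        return all(record.get(key) == value for key, value in self.log_match.items())


rule = LogAlarmRule(
    app="order-service",
    index_thresholds={"cpu_percent": 90.0, "error_rate": 0.05},
    log_match={"level": "ERROR"},
    refresh_seconds=30,
)
violations = rule.violated_indexes({"cpu_percent": 95.0, "error_rate": 0.01})
abnormal = rule.is_abnormal_log({"level": "ERROR", "message": "timeout"})
normal = rule.is_abnormal_log({"level": "INFO", "message": "started"})
```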

Preferably, as an improvement, the categories of the alarm message include an event alarm of the application, an application abnormal state notification, and an abnormal log alarm.

The technical effects are as follows: when an application meets the condition of any one of these categories, a message is sent to the notification recipient. Abnormal events are thus considered from three aspects, namely the application, the application's events and the log content, which makes the coverage more comprehensive.

Preferably, as an improvement, the transceiver module further includes:

the self-defining module is used for self-defining the alarm grade and the alarm message template;

the fault troubleshooting module is used for locating fault points according to the topological graph, checking the cause of the anomaly against the troubleshooting table, and matching the operation manual for removing the anomaly;

the anomaly analysis module is used for locating the fault point of a new abnormal event according to the topological graph and performing correlation analysis on historical operations to obtain an operation suggestion for removing the anomaly;

and the operation suggestion sending module is used for sending the fault point, the operation manual for removing the anomaly, and the operation suggestion to the operation and maintenance personnel.

The technical effects are as follows: the alarm level can be customized for each alarm event and the alarm template can be chosen freely, which allows personalized use and improves the user experience; at the same time, fault points and the corresponding fault-removal operations are located, so operation and maintenance personnel can remove faults efficiently and restore normal operation.

Preferably, as an improvement, the transceiver module further includes:

the new anomaly-removal operation storage module is used for generating a new operation manual from the actual operations that the operation and maintenance personnel perform on a new abnormal event, and storing the new operation manual in the troubleshooting table.

The technical effects are as follows: as more anomalies are handled, the troubleshooting table covers more types of events, which helps when the same abnormal event is handled again later.

Preferably, as an improvement, the statistic module further includes:

and the operation index data generation module is used for counting the alarm events, the levels of the alarm events, the total number of alarms, the most common alarm event names, the alarm processing conditions and the alarm information sending conditions of each application according to the time period to generate operation index data.

The technical effects are as follows: the running conditions of all the applications are counted, so that the running conditions can be clearly and quickly mastered.

Preferably, as an improvement, the system further comprises a display frequency setting module, configured to set an update frequency of the integrated display; the display frequency setting module further includes:

the importance setting module is used for giving different weights to the running index data of each application, weighting to obtain an importance index, and setting the updating frequency of the display according to the importance index;

and the sorting module is used for sorting and displaying the latest alarm messages according to the importance indexes from large to small.

Drawings

FIG. 1 is a schematic flow chart of an embodiment of the present invention.

Detailed Description

The embodiment is substantially as shown in FIG. 1:

a unified monitoring management platform, comprising:

the acquisition module is used for collecting the logs of all applications, performing format conversion, storing the logs in a unified manner, and also storing a troubleshooting table. For logs related to the infrastructure, the system automatically collects them from the infrastructure log auditing platform and, after format conversion, stores them on the ELK unified log platform so that operation and maintenance personnel can query them conveniently. The applications are independent of one another, so their logs must first be collected, converted into a common format to make subsequent processing fast and convenient, and then stored in a unified manner. Abnormal events that have occurred before, together with the corresponding fault-removal operations, are stored in the troubleshooting table, so operation and maintenance personnel can handle a recurring abnormal event quickly by consulting the table, which improves processing efficiency.
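The format conversion step can be illustrated with a short Python sketch that turns log lines from two differently formatted sources into one unified record. Both input formats (a JSON line with `time`, `severity` and `msg` fields, and a space-separated text line) and the unified schema are assumptions chosen for the example; the source does not describe the actual formats.

```python
import json


def to_unified(app: str, raw: str) -> dict[str, str]:
    """Convert one raw log line into a unified record (hypothetical schema)."""
    raw = raw.strip()
    if raw.startswith("{"):
        # JSON-formatted source: pick out the fields of interest.
        data = json.loads(raw)
        ts, level, msg = data["time"], data["severity"], data["msg"]
    else:
        # Space-separated source: "<timestamp> <level> <message...>"
        ts, level, msg = raw.split(" ", 2)
    return {"app": app, "timestamp": ts, "level": level.upper(), "message": msg}


json_record = to_unified(
    "billing",
    '{"time": "2022-11-19T08:00:00", "severity": "error", "msg": "db timeout"}',
)
text_record = to_unified("gateway", "2022-11-19T08:00:05 warn upstream slow")
```

After conversion both records share the same keys, so later steps such as anomaly identification can treat every source alike.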

The frequency setting module is used for setting the acquisition frequencies of different applications. Where an application is a third-party system, the required data and information are obtained through the API (application programming interface) provided by the third party. The running states of application components and middleware are obtained from the Prometheus monitoring platform and the PaaS platform, the running states of host resources and infrastructure are obtained from the IaaS platform, and log information is obtained from the ELK unified log platform. Different applications update their logs at different rates, so in practice the acquisition frequency of each application is set according to actual requirements, which avoids unnecessary work and improves operating efficiency.

The timing log obtaining module is used for obtaining the log information of the specified applications from the ELK unified log platform at regular intervals according to the acquisition frequency. The ELK unified log platform collects, parses and stores log information in various formats and provides a visual display function.

The identification module is used for identifying abnormal log information and the running state of each system, acquiring abnormal application and abnormal events and generating alarm events with topological graphs and time; the identification module further comprises:

the alarm rule configuration module is used for configuring log alarm rules, wherein the log alarm rules comprise application alarm rules and abnormal event alarm rules; the log alarm rule comprises the index value configuration of the application, the parameters and parameter values used to identify abnormal logs, and the log refresh frequency. Different companies and different projects care about different abnormal events, so configurable alarm rules allow the platform to be tailored to each use.

The timing query module is used for periodically querying the application state of each application and, at the same time, periodically querying the abnormal logs of the applications; the application state comprises the log alarm rules, the topological graph of the application and the current latest state of the application. Querying the application state at regular intervals lets managers grasp the current running condition in time and adjust anything unreasonable promptly, and querying the abnormal logs at regular intervals keeps them under monitoring.

The abnormal log processing module is used for merging, filtering and upgrading the abnormal logs. Merging combines identical events of the same application within the same batch; in addition, events addressed to the same recipient are automatically aggregated and sent as a single alarm message. Filtering means that, within a specified period, identical events from the same application are compared by event level, and alarm events of the same or a lower level are automatically filtered out. Upgrading means that when an unprocessed event reaches the upgrade threshold, the system automatically upgrades the event and sends an alarm message to a higher-level recipient. Merging, filtering and upgrading the abnormal logs keeps the operation flow concise and helps allocate resources efficiently.
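The merging, filtering and upgrading steps described above can be sketched in Python as follows. The `Event` fields, the convention that a larger level number is more severe, and the use of elapsed time as the upgrade threshold are all assumptions; the source states only the behavior, not a data model.

```python
from dataclasses import dataclass


@dataclass
class Event:
    app: str
    name: str
    level: int         # larger number = more severe (assumed convention)
    receiver: str
    created_at: float  # seconds
    handled: bool = False


def merge(batch: list[Event]) -> list[Event]:
    """Keep one event per (app, name) within a batch, preserving the first occurrence."""
    seen: dict[tuple[str, str], Event] = {}
    for ev in batch:
        seen.setdefault((ev.app, ev.name), ev)
    return list(seen.values())


def group_by_receiver(events: list[Event]) -> dict[str, list[Event]]:
    """Aggregate events so that each receiver gets a single alarm message."""
    grouped: dict[str, list[Event]] = {}
    for ev in events:
        grouped.setdefault(ev.receiver, []).append(ev)
    return grouped


def filter_repeats(events: list[Event], period: float) -> list[Event]:
    """Drop an event when the same (app, name) was kept at the same or a higher level within `period`."""
    kept: list[Event] = []
    last: dict[tuple[str, str], Event] = {}
    for ev in sorted(events, key=lambda e: e.created_at):
        prev = last.get((ev.app, ev.name))
        if (prev is not None
                and ev.created_at - prev.created_at <= period
                and ev.level <= prev.level):
            continue
        kept.append(ev)
        last[(ev.app, ev.name)] = ev
    return kept


def needs_upgrade(ev: Event, now: float, threshold: float) -> bool:
    """An unhandled event older than the threshold goes to a higher-level receiver."""
    return not ev.handled and now - ev.created_at >= threshold
```

For example, three `disk_full` events from one application at levels 2, 1 and 3 within one period are reduced by `filter_repeats` to the level 2 and level 3 events, because only the lower-level repeat is suppressed.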

The receiving and sending module is used for receiving the alarm event and sending an alarm message to operation and maintenance personnel; the transceiver module further comprises:

the self-defining module is used for customizing the alarm level and the alarm message template; the alarm level can be customized for each alarm event and the alarm template can be chosen freely, which allows personalized use and improves the user experience.
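A customizable alarm level and message template could look like the following Python sketch, which uses the standard library's `string.Template`. The level names, the template wording and the function name `render_alarm` are hypothetical.

```python
from string import Template

# Hypothetical custom levels and per-level message templates.
LEVELS = {"low": 1, "medium": 2, "high": 3}
TEMPLATES = {
    "high": Template("[URGENT] $app: $event at $time, act immediately"),
    "default": Template("[$level] $app: $event at $time"),
}


def render_alarm(app: str, event: str, level: str, time: str) -> str:
    """Fill in the template configured for the level, falling back to the default one."""
    if level not in LEVELS:
        raise ValueError(f"unknown alarm level: {level}")
    template = TEMPLATES.get(level, TEMPLATES["default"])
    return template.substitute(app=app, event=event, level=level.upper(), time=time)


urgent = render_alarm("billing", "db timeout", "high", "08:00")
routine = render_alarm("billing", "slow query", "low", "08:05")
```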

The fault troubleshooting module is used for locating fault points according to the topological graph, checking the cause of the anomaly against the troubleshooting table, and matching the operation manual for removing the anomaly; the topological graph shows each instance node related to the application component, and fault points can be checked against those instance nodes. Abnormal events that have occurred before, together with the corresponding fault-removal operations, are stored in the troubleshooting table, so operation and maintenance personnel can handle a recurring abnormal event quickly by consulting the table, which improves processing efficiency.

The anomaly analysis module is used for locating the fault point of a new abnormal event according to the topological graph and performing correlation analysis on historical operations to obtain an operation suggestion for removing the anomaly; when a new abnormal event occurs and no matching entry can be found in the troubleshooting table, historical operations are analyzed and suggestions are offered to the operation and maintenance personnel, which improves their efficiency in handling new abnormal events.

The new anomaly-removal operation storage module is used for generating a new operation manual from the actual operations that the operation and maintenance personnel perform on a new abnormal event, and storing the new operation manual in the troubleshooting table. As more anomalies are handled, the troubleshooting table covers more types of events, which helps when the same abnormal event is handled again later.
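One possible way to check fault points against a topological graph is sketched below in Python. It assumes the graph is given as a mapping from each node to the nodes it depends on, and it treats an unhealthy node as a fault point when none of its own dependencies is unhealthy. This heuristic and the node names are assumptions for illustration, not the method stated in the source.

```python
def fault_points(topology: dict[str, list[str]], unhealthy: set[str], start: str) -> list[str]:
    """Walk the dependencies of `start` and return unhealthy nodes that have no unhealthy dependency."""
    roots: list[str] = []
    visited: set[str] = set()
    stack = [start]
    while stack:
        node = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        deps = topology.get(node, [])
        if node in unhealthy and not any(dep in unhealthy for dep in deps):
            roots.append(node)
        stack.extend(deps)
    return sorted(roots)


# Hypothetical topology: web depends on api, which depends on db and cache.
topology = {"web": ["api"], "api": ["db", "cache"], "db": [], "cache": []}
points = fault_points(topology, unhealthy={"web", "api", "db"}, start="web")
```

Here `web` and `api` are unhealthy only because something beneath them is, so the walk reports `db` as the single fault point.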
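The interplay between the troubleshooting table, the suggestion for new abnormal events, and the storing of new operation manuals can be sketched in Python as follows. The class `TroubleshootingTable` is a hypothetical in-memory store, and the "most frequent past operation on the same fault point" rule is a deliberately simple stand-in for the correlation analysis of historical operations, which the source does not detail.

```python
from collections import Counter


class TroubleshootingTable:
    """Maps a known abnormal event to its operation manual (hypothetical in-memory store)."""

    def __init__(self) -> None:
        self.manuals: dict[str, str] = {}
        self.history: list[tuple[str, str]] = []  # (fault point, operation performed)

    def advise(self, event: str, fault_point: str) -> tuple[str, str]:
        """Return ("manual", text) for a known event, otherwise ("suggestion", text) from history."""
        if event in self.manuals:
            return "manual", self.manuals[event]
        ops = Counter(op for point, op in self.history if point == fault_point)
        if ops:
            return "suggestion", ops.most_common(1)[0][0]
        return "suggestion", "no related history; investigate manually"

    def record(self, event: str, fault_point: str, operation: str) -> None:
        """Store what the operator actually did so the same event is matched next time."""
        self.manuals[event] = operation
        self.history.append((fault_point, operation))


table = TroubleshootingTable()
table.record("db connection refused", "db", "restart db service")
known = table.advise("db connection refused", "db")
new_event = table.advise("db replication lag", "db")
no_history = table.advise("cache eviction storm", "cache")
```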

And the counting module is used for counting the alarm messages to obtain the operation index data of each application system and comprehensively displaying the operation index data. The statistics module further comprises:

the operation index data generation module is used for counting, by time period, the alarm events of each application, the levels of the alarm events, the total number of alarms, the most common alarm event names, the alarm processing status and the alarm message sending status, so as to generate the operation index data; compiling statistics on the running condition of every application allows it to be grasped clearly and quickly. For example, the alarm event statistics include the proportions of processed and unprocessed alarm events; the level statistics of the alarm events include the distribution of alarm events across high, medium and low levels; the statistics of the total number of alarms include the total number of alarms of each application and the total number of each alarm type; the alarm processing status is either processed or unprocessed; and the alarm message sending status is sent, upgraded or unsent.

In addition, the abnormal states of the applications are counted, with a summary of the applications that are in a failed or stopped state.
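The statistics listed above can be sketched with the standard library's `collections.Counter`. The alarm record layout (keys `app`, `event`, `level`, `handled`, `send_status`, `time`) and the half-open time window are assumptions made for this example.

```python
from collections import Counter


def operation_indexes(alarms: list[dict], start: float, end: float) -> dict:
    """Summarize the alarms whose `time` falls in [start, end) (hypothetical record layout)."""
    window = [a for a in alarms if start <= a["time"] < end]
    names = Counter(a["event"] for a in window)
    handled = sum(1 for a in window if a["handled"])
    return {
        "total": len(window),
        "per_app": dict(Counter(a["app"] for a in window)),
        "per_level": dict(Counter(a["level"] for a in window)),
        "most_common_event": names.most_common(1)[0][0] if names else None,
        "handled_ratio": handled / len(window) if window else 0.0,
        "send_status": dict(Counter(a["send_status"] for a in window)),
    }


alarms = [
    {"app": "billing", "event": "db timeout", "level": "high",
     "handled": True, "send_status": "sent", "time": 10},
    {"app": "billing", "event": "db timeout", "level": "high",
     "handled": False, "send_status": "upgraded", "time": 20},
    {"app": "gateway", "event": "slow upstream", "level": "low",
     "handled": True, "send_status": "sent", "time": 30},
    {"app": "gateway", "event": "slow upstream", "level": "low",
     "handled": False, "send_status": "unsent", "time": 500},
]
summary = operation_indexes(alarms, start=0, end=100)
```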

The display frequency setting module is used for setting the updating frequency of the comprehensive display; the display frequency setting module further includes:

the importance setting module is used for assigning different weights to the operation index data of each application, computing a weighted importance index, and setting the update frequency of the display according to the importance index; the greater the importance index, the higher the importance and the higher the corresponding update frequency, which ensures that operation and maintenance personnel notice important items in time;

and the sorting module is used for sorting and displaying the latest alarm messages in descending order of importance index, and for displaying the latest alarm events and their alarm durations. When there are many abnormal conditions, alarm messages of high importance are displayed at high frequency and the latest alarm events are sorted, so the responsible personnel can handle the corresponding events efficiently.
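The weighting and sorting can be sketched in Python as follows. The weights, the index names and the formula that maps an importance index to a refresh interval are assumptions; the sketch also assumes that more important applications are refreshed more often, in line with the high-frequency display of important alarm messages described above.

```python
def importance_index(indexes: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of an application's operation index data."""
    return sum(weights.get(name, 0.0) * value for name, value in indexes.items())


def refresh_interval(importance: float, base_s: float = 60.0, min_s: float = 5.0) -> float:
    """Higher importance gives a shorter refresh interval, i.e. a higher update frequency."""
    return max(min_s, base_s / (1.0 + importance))


def sort_alarms(alarms: list[dict], importance: dict[str, float]) -> list[dict]:
    """Order alarm messages by the importance index of their application, largest first."""
    return sorted(alarms, key=lambda a: importance.get(a["app"], 0.0), reverse=True)


weights = {"total_alarms": 0.5, "unhandled": 2.0}  # hypothetical weights
importance = {
    "billing": importance_index({"total_alarms": 4, "unhandled": 3}, weights),
    "gateway": importance_index({"total_alarms": 2, "unhandled": 0}, weights),
}
ordered = sort_alarms(
    [{"app": "gateway", "event": "slow upstream"}, {"app": "billing", "event": "db timeout"}],
    importance,
)
```

With these sample values the importance index of `billing` is 8.0 and that of `gateway` is 1.0, so the `billing` alarm is listed first and its display is refreshed at a shorter interval.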

The foregoing is merely an embodiment of the present invention, and common general knowledge such as well-known specific structures and/or characteristics is not described in detail here. It should be noted that those skilled in the art can make several variations and modifications without departing from the technical solution of the present invention; these also fall within the protection scope of the present invention and do not affect the effect of implementing the invention or the practicability of the patent. The scope of protection claimed by the present application is determined by the content of the claims, and the detailed description and other parts of the specification may be used to interpret the claims.

Claims

1. A unified monitoring and management platform, comprising:

and the counting module is used for counting the alarm messages to obtain the operation index data of each application system and comprehensively displaying the operation index data.

2. The unified monitoring and management platform according to claim 1, wherein the collection module further comprises:

3. The unified monitoring and management platform according to claim 1, wherein said identification module further comprises:

4. The unified monitoring and management platform according to claim 3, wherein: the application state comprises log alarm rules, a topological graph of the application and the current latest state of the application.

5. The unified monitoring and management platform according to claim 3, wherein: the log alarm rule comprises applied index value configuration, abnormal log identification parameters and parameter values, and log refreshing frequency.

6. The unified monitoring and management platform according to claim 1, wherein: the types of the alarm messages comprise event alarms of the application, application abnormal state notifications and abnormal log alarms.

7. The unified monitoring and management platform according to claim 1, wherein said transceiver module further comprises:

the self-defining module is used for self-defining the alarm level and the alarm message template;

8. The unified monitoring and management platform according to claim 7, wherein the transceiver module further comprises:

9. The unified monitoring and management platform according to claim 1, wherein the statistics module further comprises:

10. The unified monitoring and management platform according to claim 1, wherein: the platform further comprises a display frequency setting module for setting the update frequency of the comprehensive display; the display frequency setting module further includes:

and the sequencing module is used for sequencing and displaying the latest alarm message according to the importance index from large to small.