CN115766417A - Unified monitoring management platform - Google Patents

Info

Publication number
CN115766417A
Authority
CN
China
Prior art keywords
alarm
module
application
abnormal
log
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211450255.9A
Other languages
Chinese (zh)
Inventor
陈剑锋
吴晔凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhuhai Huafa New Technology Investment Holdings Co ltd
Original Assignee
Zhuhai Huafa New Technology Investment Holdings Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhuhai Huafa New Technology Investment Holdings Co ltd
Priority to CN202211450255.9A
Publication of CN115766417A
Legal status: Pending

Landscapes

  • Debugging And Monitoring (AREA)

Abstract

The invention relates to the field of Internet technology and discloses a unified monitoring management platform comprising: an acquisition module for collecting the logs of all applications, converting their formats, storing them in a unified store, and also storing a troubleshooting table; an identification module for identifying abnormal log information and the running state of each system, obtaining the abnormal applications and abnormal events, and generating alarm events that carry a topology graph and a timestamp; a transceiver module for receiving the alarm events and sending alarm messages to operation and maintenance personnel; and a statistics module for aggregating the alarm messages into operation index data for each application system and displaying the data in a consolidated view. The platform improves application monitoring, alarming, fault location, and fault resolution, and provides a multi-layer monitoring system that spans from the underlying network resources to the upper-layer applications.

Description

Unified monitoring management platform
Technical Field
The invention relates to the field of Internet technology, and in particular to a unified monitoring management platform.
Background
At present, operation and maintenance teams are mainly divided among network hardware facilities, the IaaS cloud platform, and the general PaaS platform. The network hardware team monitors and manages logs and performance with a log auditing platform provided by a third party. The IaaS cloud platform comprises a VMware part and an OpenStack part: VMware uses its bundled vRealize Suite cloud management platform to monitor and manage logs and performance, while OpenStack uses its bundled Ceilometer together with the open-source Zabbix for performance monitoring, and log management has not yet been unified.
The monitoring platforms and log management systems of these teams are independent of one another, so logs, monitoring, and alarm management are scattered and information is unevenly shared. Once an application fails, operation and maintenance personnel can hardly obtain complete and up-to-date running information or locate the specific cause of the failure, which leads to problems such as excessively long recovery times. To improve operation and maintenance efficiency and service quality, and to locate the root cause of application faults quickly and effectively, a unified monitoring platform needs to be built.
Disclosure of Invention
The invention aims to provide a unified monitoring management platform that improves application monitoring, alarming, fault location, and fault resolution, and that provides a multi-layer monitoring system spanning from the underlying network resources to the upper-layer applications.
To achieve this purpose, the invention adopts the following technical solution: a unified monitoring management platform, comprising:
an acquisition module, used for collecting the logs of all applications, converting their formats, storing them in a unified store, and also storing a troubleshooting table;
an identification module, used for identifying abnormal log information and the running state of each system, obtaining the abnormal applications and abnormal events, and generating alarm events that carry a topology graph and a timestamp;
a transceiver module, used for receiving the alarm events and sending alarm messages to operation and maintenance personnel;
and a statistics module, used for aggregating the alarm messages into operation index data for each application system and displaying the data in a consolidated view.
The principle and advantages of this solution are as follows. In practice the applications are independent of one another, so their logs must first be collected, converted to a common format so that later processing is quick and convenient, and then stored in a unified store. The troubleshooting table stores abnormal events that have occurred before together with the corresponding fault-removal operations, so operation and maintenance personnel can handle a recurring abnormal event quickly by consulting the table, which improves processing efficiency.
After the logs of each application are collected, anomaly identification is performed on the log information and on the running state of each system to obtain the abnormal applications and abnormal events. The information about an abnormal application includes the topological relationships of the application, and fault points can be located by following those relationships. Once the abnormal events are obtained, alarm events carrying a topology graph and a timestamp are generated and alarm messages are sent to operation and maintenance personnel, who can then track down the fault points and handle the faults promptly. This shortens application recovery time and improves operation and maintenance efficiency and service quality.
Finally, the alarm messages are aggregated into operation index data for each application system and displayed in a consolidated view, so operation and maintenance personnel obtain complete and up-to-date running information. This gives them a global view of every application and improves management efficiency.
Preferably, as an improvement, the acquisition module further comprises:
a frequency setting module, used for setting the acquisition frequency of each application;
and a timed log acquisition module, used for periodically obtaining the log information of a specified application from the ELK unified log platform according to the acquisition frequency.
The technical effect is as follows: depending on actual conditions, a higher acquisition frequency can be set for applications that are more important or whose anomalies occur more often and at shorter intervals, which keeps anomaly collection timely.
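The per-application acquisition frequency can be illustrated with a minimal sketch. The `AcquisitionSchedule` structure, the application names, and the interval values are assumptions made for illustration; the patent does not specify how frequencies are stored or how the timer is driven.

```python
from dataclasses import dataclass


@dataclass
class AcquisitionSchedule:
    """Hypothetical record holding the acquisition frequency of one application."""
    app: str
    interval_s: int                   # seconds between two acquisitions
    last_run: float = float("-inf")   # time of the previous acquisition


def due_applications(schedules, now):
    """Return the applications whose interval has elapsed and mark them as run."""
    due = []
    for s in schedules:
        if now - s.last_run >= s.interval_s:
            due.append(s.app)
            s.last_run = now
    return due


schedules = [
    AcquisitionSchedule("payment", interval_s=10),    # important: acquire often
    AcquisitionSchedule("reporting", interval_s=60),  # less important: acquire rarely
]
first = due_applications(schedules, now=0)    # every application is due on the first tick
second = due_applications(schedules, now=10)  # only the high-frequency application is due
third = due_applications(schedules, now=60)   # both are due again
```

A real deployment would call `due_applications` from a timer and then query the log platform for each returned application; that query is left out here because its shape depends on the installation.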
Preferably, as an improvement, the identification module further comprises:
an alarm rule configuration module, used for configuring log alarm rules, wherein the log alarm rules comprise application alarm rules and abnormal-event alarm rules;
a timed query module, used for periodically querying the application state of each application and periodically querying the abnormal logs of the applications;
and an abnormal log processing module, used for merging, filtering, and escalating the abnormal logs.
The technical effect is as follows: different companies and different projects care about different abnormal events, so configurable alarm rules allow the platform to be tailored to each case. Querying the application state periodically lets managers keep track of the current running condition and correct problems promptly, and querying the abnormal logs periodically keeps those logs under watch. Abnormal logs may describe the same event more than once, so merging abnormal events keeps the workflow concise. Filtering out abnormal events of low importance and low urgency, and escalating those of high importance and high urgency, allows resources to be allocated efficiently.
Preferably, as an improvement, the application state includes the log alarm rules, the topology graph of the application, and the current latest state of the application.
The technical effect is as follows: the topology graph of an application shows the nodes related to that application, so the source of an abnormal event can be traced back through those nodes and the problem solved at its root; it also helps in setting better log alarm rules.
Preferably, as an improvement, the log alarm rule includes the index value configuration of the application, the parameters and parameter values used to identify abnormal logs, and the log refresh frequency.
The technical effect is as follows: alarms for abnormal events are raised according to the log alarm rules.
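As a rough illustration of the three parts of a log alarm rule (index value configuration, parameters and values for identifying abnormal logs, and refresh frequency), the sketch below models a rule and evaluates it. The class name, field names, threshold semantics ("greater than"), and sample values are all assumptions; the patent does not define a rule format.

```python
from dataclasses import dataclass, field


@dataclass
class LogAlarmRule:
    """Hypothetical shape of one log alarm rule."""
    app: str
    metric_thresholds: dict = field(default_factory=dict)  # index value configuration
    log_patterns: dict = field(default_factory=dict)       # parameter -> value marking a log as abnormal
    refresh_interval_s: int = 60                            # log refresh frequency


def evaluate(rule, metrics, log_record):
    """Return the reasons for which the rule fires (empty list if it does not)."""
    reasons = []
    for name, limit in rule.metric_thresholds.items():
        if metrics.get(name, 0) > limit:          # assumed: alarm when the index exceeds its limit
            reasons.append(f"metric:{name}")
    for param, value in rule.log_patterns.items():
        if log_record.get(param) == value:        # assumed: exact match on the parameter value
            reasons.append(f"log:{param}={value}")
    return reasons


rule = LogAlarmRule("payment", {"cpu_percent": 90}, {"level": "ERROR"}, 30)
hits = evaluate(rule, {"cpu_percent": 95}, {"level": "ERROR", "msg": "timeout"})
quiet = evaluate(rule, {"cpu_percent": 50}, {"level": "INFO", "msg": "ok"})
```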
Preferably, as an improvement, the categories of alarm message include application event alarms, application abnormal-state notifications, and abnormal log alarms.
The technical effect is as follows: when an application meets the condition of any one category, a message is sent to the notification recipient. Abnormal events are thus considered from three angles, namely the application, its events, and its log content, which makes the coverage more complete.
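The three categories and the "any one condition is enough" behaviour can be sketched as follows. The snapshot keys (`events`, `state`, `abnormal_logs`) and the state names are hypothetical; only the three category names come from the text.

```python
from enum import Enum


class AlarmCategory(Enum):
    APP_EVENT = "application event alarm"
    APP_STATE = "application abnormal-state notification"
    ABNORMAL_LOG = "abnormal log alarm"


def categories_for(snapshot):
    """Return every category whose condition the application snapshot meets."""
    matched = []
    if snapshot.get("events"):
        matched.append(AlarmCategory.APP_EVENT)
    if snapshot.get("state") in {"failed", "stopped"}:  # assumed abnormal states
        matched.append(AlarmCategory.APP_STATE)
    if snapshot.get("abnormal_logs"):
        matched.append(AlarmCategory.ABNORMAL_LOG)
    return matched


snapshot = {"state": "failed", "events": [], "abnormal_logs": ["OOM in worker"]}
matched = categories_for(snapshot)  # one message would be sent per matched category
```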
Preferably, as an improvement, the transceiver module further comprises:
a customization module, used for customizing the alarm levels and the alarm message templates;
a troubleshooting module, used for locating fault points according to the topology graph, finding the cause of the anomaly according to the troubleshooting table, and matching the corresponding anomaly-removal operation manual;
an anomaly analysis module, used for locating the fault point of a new abnormal event according to the topology graph and performing correlation analysis on historical operations to obtain an operation suggestion for removing the anomaly;
and an operation suggestion sending module, used for sending the fault point, the anomaly-removal operation manual, and the operation suggestion to operation and maintenance personnel.
The technical effect is as follows: alarm levels can be customized per alarm event and alarm templates can be chosen freely, which allows the platform to be tailored to each user and improves usability. At the same time, locating the fault points and the corresponding fault-removal operations lets operation and maintenance personnel remove faults efficiently and restore normal operation.
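A minimal sketch of the troubleshooting step, under stated assumptions: the topology graph is taken to be an adjacency map from a node to the nodes it depends on, node health is a simple status string, and the troubleshooting table is a dictionary keyed by an event description. None of these representations is specified in the patent.

```python
def locate_fault_points(topology, app, node_status):
    """Walk the application's topology and return the unhealthy nodes it depends on."""
    seen, stack, faults = set(), [app], []
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        if node_status.get(node) != "ok":
            faults.append(node)
        stack.extend(topology.get(node, []))
    return sorted(faults)


# Hypothetical topology: "shop" depends on "api" and "db"; "api" depends on "cache".
topology = {"shop": ["api", "db"], "api": ["cache"], "db": [], "cache": []}
node_status = {"shop": "ok", "api": "ok", "db": "ok", "cache": "down"}

# Hypothetical troubleshooting table: known abnormal event -> operation manual.
troubleshooting_table = {
    "cache connection refused": "Restart the cache node, then verify with a test read.",
}

faults = locate_fault_points(topology, "shop", node_status)
manual = troubleshooting_table.get("cache connection refused")
unknown = troubleshooting_table.get("never seen before")  # None: a new abnormal event
```

A lookup that returns `None` corresponds to the case handled by the anomaly analysis module, where no known manual exists.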
Preferably, as an improvement, the transceiver module further comprises:
a new anomaly-removal operation storage module, used for generating a new operation manual from the actual operations that operation and maintenance personnel performed on a new abnormal event, and storing the new operation manual in the troubleshooting table.
The technical effect is as follows: as more anomalies are handled, the troubleshooting table covers more kinds of events, which helps when the same abnormal event occurs again.
Preferably, as an improvement, the statistics module further comprises:
an operation index data generation module, used for counting, per time period, the alarm events of each application, the levels of the alarm events, the total number of alarms, the most common alarm event names, the alarm handling status, and the alarm message sending status, in order to generate the operation index data.
The technical effect is as follows: compiling statistics on the running condition of every application makes that condition clear and quick to grasp.
Preferably, as an improvement, the platform further comprises a display frequency setting module, used for setting the update frequency of the consolidated display; the display frequency setting module further comprises:
an importance setting module, used for assigning different weights to the operation index data of each application, computing a weighted importance index, and setting the update frequency of the display according to the importance index;
and a sorting module, used for sorting and displaying the latest alarm messages in descending order of importance index.
Drawings
FIG. 1 is a schematic flow chart of an embodiment of the present invention.
Detailed Description
The embodiment is substantially as shown in FIG. 1:
a unified monitoring management platform, comprising:
The acquisition module is used for collecting the logs of all applications, converting their formats, storing them in a unified store, and also storing a troubleshooting table. Logs related to the infrastructure are collected automatically from the infrastructure log auditing platform and, after format conversion, stored on the ELK unified log platform, where operation and maintenance personnel can query them conveniently. Because the applications are independent of one another, their logs must first be collected, converted to a common format so that later processing is quick and convenient, and then stored in a unified store. The troubleshooting table stores abnormal events that have occurred before together with the corresponding fault-removal operations, so operation and maintenance personnel can handle a recurring abnormal event quickly by consulting the table, which improves processing efficiency.
The frequency setting module is used for setting the acquisition frequency of each application. The applications are third-party systems, and the required data and information are obtained through the APIs (application programming interfaces) provided by the third parties; the running states of application components and middleware are obtained from the Prometheus monitoring platform and the PaaS platform; the running states of host resources and infrastructure are obtained from the IaaS platform; and log information is obtained from the ELK unified log platform. Different application logs are updated at different rates, so in practice the acquisition frequency of each application is set according to actual requirements, which avoids unnecessary work and improves operating efficiency.
The timed log acquisition module is used for periodically obtaining the log information of a specified application from the ELK unified log platform according to the acquisition frequency; the ELK unified log platform collects, parses, and stores log information in various formats and provides visual display.
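The format conversion step can be pictured with the sketch below, which maps two assumed input formats (a JSON line and a space-separated text line) onto one unified record. The input field names (`time`, `severity`, `message`) and the unified keys (`app`, `ts`, `level`, `msg`) are hypothetical; the patent only states that logs in various formats are converted and stored uniformly.

```python
import json
import re

# Assumed plain-text layout: "<timestamp> <LEVEL> <message>"
TEXT_LINE = re.compile(r"^(?P<ts>\S+) (?P<level>[A-Z]+) (?P<msg>.*)$")


def normalize(app, raw):
    """Convert one raw log line into the unified record used for storage."""
    try:
        data = json.loads(raw)
        if isinstance(data, dict):
            return {
                "app": app,
                "ts": data.get("time"),
                "level": str(data.get("severity", "")).upper(),
                "msg": data.get("message"),
            }
    except json.JSONDecodeError:
        pass  # not JSON: fall through to the plain-text layout
    match = TEXT_LINE.match(raw)
    if match:
        return {"app": app, **match.groupdict()}
    return {"app": app, "ts": None, "level": "UNKNOWN", "msg": raw}


from_json = normalize("shop", '{"time": "2022-11-19T10:00:00Z", "severity": "error", "message": "disk full"}')
from_text = normalize("billing", "2022-11-19T10:00:01Z WARN slow query")
fallback = normalize("billing", "garbled")
```

Keeping an explicit fallback record means that an unrecognized line is still stored and searchable rather than silently dropped.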
The identification module is used for identifying abnormal log information and the running state of each system, obtaining the abnormal applications and abnormal events, and generating alarm events that carry a topology graph and a timestamp; the identification module further comprises:
The alarm rule configuration module is used for configuring log alarm rules, wherein the log alarm rules comprise application alarm rules and abnormal-event alarm rules. A log alarm rule includes the index value configuration of the application, the parameters and parameter values used to identify abnormal logs, and the log refresh frequency. Different companies and different projects care about different abnormal events, so configurable alarm rules allow the platform to be tailored to each case.
The timed query module is used for periodically querying the application state of each application and periodically querying the abnormal logs of the applications. The application state includes the log alarm rules, the topology graph of the application, and the current latest state of the application. Querying the application state periodically lets managers keep track of the current running condition and correct problems promptly, and querying the abnormal logs periodically keeps those logs under watch.
The abnormal log processing module is used for merging, filtering, and escalating the abnormal logs. Merging combines identical events of the same application within the same batch; in addition, events addressed to the same recipient are aggregated automatically and sent as a single alarm message. Filtering means that, within a specified period, identical events from the same application are compared by event level, and alarm events of the same or a lower level are filtered out automatically. Escalation means that, when an unhandled event reaches the escalation threshold, the system escalates it automatically and sends an alarm message to a higher-level recipient. Merging, filtering, and escalating the abnormal logs keeps the workflow concise and allows resources to be allocated efficiently.
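The merge, filter, and escalate rules described above can be sketched as three small functions. The event fields, the numeric ranking of levels, the time-based escalation threshold, and the two-tier recipient list are assumptions for illustration; the patent does not say how levels are ordered or what the threshold measures.

```python
LEVELS = {"low": 1, "medium": 2, "high": 3}  # assumed ordering of event levels


def merge(events):
    """Merge identical events of the same application within one batch."""
    merged = {}
    for e in events:
        key = (e["app"], e["name"], e["level"])
        merged.setdefault(key, {**e, "count": 0})["count"] += 1
    return list(merged.values())


def filter_events(events, already_sent):
    """Drop events whose level is not higher than what was already sent this period."""
    kept = []
    for e in events:
        key = (e["app"], e["name"])
        if LEVELS[e["level"]] > already_sent.get(key, 0):
            kept.append(e)
            already_sent[key] = LEVELS[e["level"]]
    return kept


def escalate(unhandled_s, threshold_s, receivers):
    """Pick the higher-level recipient once an event stays unhandled past the threshold."""
    tier = 1 if unhandled_s >= threshold_s else 0
    return receivers[min(tier, len(receivers) - 1)]


batch = [
    {"app": "shop", "name": "timeout", "level": "medium"},
    {"app": "shop", "name": "timeout", "level": "medium"},
    {"app": "shop", "name": "timeout", "level": "high"},
]
merged = merge(batch)
already_sent = {("shop", "timeout"): LEVELS["medium"]}  # a medium alarm went out earlier
kept = filter_events(merged, already_sent)
receiver = escalate(unhandled_s=1800, threshold_s=900, receivers=["on-call engineer", "team lead"])
```

In this run the two medium events collapse into one record with `count` 2, that record is then filtered out because a medium alarm was already sent, and only the high-level event remains.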
The transceiver module is used for receiving the alarm events and sending alarm messages to operation and maintenance personnel; the transceiver module further comprises:
The customization module is used for customizing the alarm levels and the alarm message templates. Alarm levels can be customized per alarm event and alarm templates can be chosen freely, which allows the platform to be tailored to each user and improves usability.
The troubleshooting module is used for locating fault points according to the topology graph, finding the cause of the anomaly according to the troubleshooting table, and matching the corresponding anomaly-removal operation manual. The topology graph shows every instance node related to an application component, and fault points can be located by checking those instance nodes. The troubleshooting table stores abnormal events that have occurred before together with the corresponding fault-removal operations, so operation and maintenance personnel can handle a recurring abnormal event quickly by consulting the table, which improves processing efficiency.
The anomaly analysis module is used for locating the fault point of a new abnormal event according to the topology graph and performing correlation analysis on historical operations to obtain an operation suggestion for removing the anomaly. When a new abnormal event occurs and no matching entry can be found in the troubleshooting table, the module analyzes historical operations and offers suggestions, which makes operation and maintenance personnel more efficient at handling new abnormal events.
The operation suggestion sending module is used for sending the fault point, the anomaly-removal operation manual, and the operation suggestion to operation and maintenance personnel.
The new anomaly-removal operation storage module is used for generating a new operation manual from the actual operations that operation and maintenance personnel performed on a new abnormal event, and storing the new operation manual in the troubleshooting table. As more anomalies are handled, the troubleshooting table covers more kinds of events, which helps when the same abnormal event occurs again.
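The patent does not say how the correlation analysis of historical operations is carried out. The sketch below substitutes a deliberately simple stand-in, namely counting which operations most often resolved past events on the same fault node, and then shows a new manual being written back to the troubleshooting table. The history fields and the table layout are assumptions.

```python
from collections import Counter


def suggest_operations(history, fault_node, top_n=2):
    """Rank the operations that most often resolved past events on the same node.

    Frequency counting is an assumed stand-in for the unspecified correlation analysis.
    """
    counts = Counter(
        h["operation"] for h in history if h["node"] == fault_node and h["resolved"]
    )
    return [op for op, _ in counts.most_common(top_n)]


def record_new_manual(table, event_name, steps):
    """Store the steps actually taken for a new abnormal event as its operation manual."""
    table[event_name] = list(steps)


history = [
    {"node": "cache", "operation": "restart cache", "resolved": True},
    {"node": "cache", "operation": "restart cache", "resolved": True},
    {"node": "cache", "operation": "flush keys", "resolved": True},
    {"node": "cache", "operation": "reboot host", "resolved": False},
    {"node": "db", "operation": "failover", "resolved": True},
]
suggestions = suggest_operations(history, "cache")

troubleshooting_table = {}
record_new_manual(troubleshooting_table, "cache eviction storm", ["restart cache", "raise memory limit"])
```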
The statistics module is used for aggregating the alarm messages into operation index data for each application system and displaying the data in a consolidated view. The statistics module further comprises:
The operation index data generation module is used for counting, per time period, the alarm events of each application, the levels of the alarm events, the total number of alarms, the most common alarm event names, the alarm handling status, and the alarm message sending status, in order to generate the operation index data. Compiling statistics on the running condition of every application makes that condition clear and quick to grasp. For example, the alarm event statistics include the proportions of handled and unhandled alarm events; the level statistics include the distribution of alarm events across the high, medium, and low levels; the total alarm statistics include the total number of alarms for each application and the total for each alarm type; the alarm handling status is either handled or unhandled; and the alarm message sending status is sent, escalated, or unsent.
In addition, the abnormal states of the applications are counted, and the applications in the failed and stopped states are summarized.
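The statistics listed above map naturally onto a few counters. The following sketch assumes that each alarm is a dictionary with the hypothetical fields `app`, `name`, `level`, `handled`, and `send_status`, and that the alarms passed in already belong to one time period.

```python
from collections import Counter


def operation_index(alarms):
    """Aggregate the alarms of one time period into operation index data."""
    total = len(alarms)
    handled = sum(1 for a in alarms if a["handled"])
    names = Counter(a["name"] for a in alarms)
    return {
        "total": total,
        "per_app": dict(Counter(a["app"] for a in alarms)),
        "by_level": dict(Counter(a["level"] for a in alarms)),
        "handled_share": handled / total if total else 0.0,
        "most_common_event": names.most_common(1)[0][0] if names else None,
        "by_send_status": dict(Counter(a["send_status"] for a in alarms)),
    }


alarms = [
    {"app": "shop", "name": "timeout", "level": "high", "handled": True, "send_status": "sent"},
    {"app": "shop", "name": "timeout", "level": "medium", "handled": False, "send_status": "escalated"},
    {"app": "billing", "name": "disk full", "level": "low", "handled": True, "send_status": "sent"},
    {"app": "billing", "name": "timeout", "level": "high", "handled": False, "send_status": "unsent"},
]
index = operation_index(alarms)
```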
The display frequency setting module is used for setting the update frequency of the consolidated display; the display frequency setting module further comprises:
The importance setting module is used for assigning different weights to the operation index data of each application, computing a weighted importance index, and setting the update frequency of the display according to the importance index. The greater the importance index, the more important the application and the higher the corresponding update frequency (that is, the shorter the refresh interval), which ensures that operation and maintenance personnel notice it in time.
The sorting module is used for sorting and displaying the latest alarm messages in descending order of importance index, and for displaying the latest alarm events and their durations. When there are many anomalies, alarm messages of high importance are displayed at high frequency and the latest alarm events are sorted, so that the responsible personnel can handle the corresponding events efficiently.
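The weighting and sorting can be sketched as a weighted sum followed by a descending sort. The weights, the metric names, and the mapping from importance index to refresh interval are assumptions; the sketch takes higher importance to mean more frequent refresh, in line with the statement that alarm messages of high importance are displayed at high frequency.

```python
def importance_index(metrics, weights):
    """Weighted sum of an application's operation index data."""
    return sum(weights[k] * metrics.get(k, 0) for k in weights)


def refresh_interval_s(index, base_s=60, min_s=5):
    """Assumed mapping: the higher the importance index, the shorter the refresh interval."""
    return max(min_s, int(base_s / (1 + index)))


weights = {"high_alarms": 3, "unhandled": 2, "total": 1}  # hypothetical weights
apps = {
    "shop": {"high_alarms": 2, "unhandled": 1, "total": 4},
    "billing": {"high_alarms": 0, "unhandled": 1, "total": 2},
}

scores = {name: importance_index(m, weights) for name, m in apps.items()}
ranked = sorted(apps, key=lambda name: scores[name], reverse=True)  # most important first
intervals = {name: refresh_interval_s(scores[name]) for name in apps}
```

Here `shop` scores higher than `billing`, so it is listed first and its panel refreshes at the minimum interval.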
The foregoing is merely an embodiment of the present invention; common general knowledge, such as well-known specific structures and characteristics, is not described here in further detail. It should be noted that those skilled in the art can make several variations and modifications without departing from the technical solution of the present invention; these also fall within the protection scope of the present invention and do not affect the effect of implementing the invention or the practicability of the patent. The scope of protection claimed by the present application is determined by the content of the claims, and the description of the embodiments and other parts of the specification may be used to interpret the content of the claims.

Claims (10)

1. A unified monitoring management platform, comprising:
an acquisition module, used for collecting the logs of all applications, converting their formats, storing them in a unified store, and also storing a troubleshooting table;
an identification module, used for identifying abnormal log information and the running state of each system, obtaining the abnormal applications and abnormal events, and generating alarm events that carry a topology graph and a timestamp;
a transceiver module, used for receiving the alarm events and sending alarm messages to operation and maintenance personnel;
and a statistics module, used for aggregating the alarm messages into operation index data for each application system and displaying the data in a consolidated view.
2. The unified monitoring management platform according to claim 1, wherein the acquisition module further comprises:
a frequency setting module, used for setting the acquisition frequency of each application;
and a timed log acquisition module, used for periodically obtaining the log information of a specified application from the ELK unified log platform according to the acquisition frequency.
3. The unified monitoring management platform according to claim 1, wherein the identification module further comprises:
an alarm rule configuration module, used for configuring log alarm rules, wherein the log alarm rules comprise application alarm rules and abnormal-event alarm rules;
a timed query module, used for periodically querying the application state of each application and periodically querying the abnormal logs of the applications;
and an abnormal log processing module, used for merging, filtering, and escalating the abnormal logs.
4. The unified monitoring management platform according to claim 3, wherein: the application state comprises the log alarm rules, the topology graph of the application, and the current latest state of the application.
5. The unified monitoring management platform according to claim 3, wherein: the log alarm rule comprises the index value configuration of the application, the parameters and parameter values used to identify abnormal logs, and the log refresh frequency.
6. The unified monitoring management platform according to claim 1, wherein: the categories of alarm message comprise application event alarms, application abnormal-state notifications, and abnormal log alarms.
7. The unified monitoring management platform according to claim 1, wherein the transceiver module further comprises:
a customization module, used for customizing the alarm levels and the alarm message templates;
a troubleshooting module, used for locating fault points according to the topology graph, finding the cause of the anomaly according to the troubleshooting table, and matching the corresponding anomaly-removal operation manual;
an anomaly analysis module, used for locating the fault point of a new abnormal event according to the topology graph and performing correlation analysis on historical operations to obtain an operation suggestion for removing the anomaly;
and an operation suggestion sending module, used for sending the fault point, the anomaly-removal operation manual, and the operation suggestion to operation and maintenance personnel.
8. The unified monitoring management platform according to claim 7, wherein the transceiver module further comprises:
a new anomaly-removal operation storage module, used for generating a new operation manual from the actual operations that operation and maintenance personnel performed on a new abnormal event, and storing the new operation manual in the troubleshooting table.
9. The unified monitoring management platform according to claim 1, wherein the statistics module further comprises:
an operation index data generation module, used for counting, per time period, the alarm events of each application, the levels of the alarm events, the total number of alarms, the most common alarm event names, the alarm handling status, and the alarm message sending status, in order to generate the operation index data.
10. The unified monitoring management platform according to claim 1, wherein: the platform further comprises a display frequency setting module, used for setting the update frequency of the consolidated display; the display frequency setting module further comprises:
an importance setting module, used for assigning different weights to the operation index data of each application, computing a weighted importance index, and setting the update frequency of the display according to the importance index;
and a sorting module, used for sorting and displaying the latest alarm messages in descending order of importance index.
CN202211450255.9A 2022-11-19 2022-11-19 Unified monitoring management platform Pending CN115766417A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211450255.9A CN115766417A (en) 2022-11-19 2022-11-19 Unified monitoring management platform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211450255.9A CN115766417A (en) 2022-11-19 2022-11-19 Unified monitoring management platform

Publications (1)

Publication Number Publication Date
CN115766417A true CN115766417A (en) 2023-03-07

Family

ID=85373747

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211450255.9A Pending CN115766417A (en) 2022-11-19 2022-11-19 Unified monitoring management platform

Country Status (1)

Country Link
CN (1) CN115766417A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116192612A (en) * 2023-04-23 2023-05-30 成都新西旺自动化科技有限公司 System fault monitoring and early warning system and method based on log analysis
CN116599822A (en) * 2023-07-18 2023-08-15 云筑信息科技(成都)有限公司 Fault alarm treatment method based on log acquisition event

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116192612A (en) * 2023-04-23 2023-05-30 成都新西旺自动化科技有限公司 System fault monitoring and early warning system and method based on log analysis
CN116599822A (en) * 2023-07-18 2023-08-15 云筑信息科技(成都)有限公司 Fault alarm treatment method based on log acquisition event
CN116599822B (en) * 2023-07-18 2023-10-20 云筑信息科技(成都)有限公司 Fault alarm treatment method based on log acquisition event

Similar Documents

Publication Publication Date Title
CN115766417A (en) Unified monitoring management platform
CN109408347B (en) A kind of index real-time analyzer and index real-time computing technique
CN107423198B (en) EAM platform monitoring management method and system
CN104407964B (en) A kind of centralized monitoring system and method based on data center
CN106649040A (en) Automatic monitoring method and device for performance of Weblogic middleware
CN112416724B (en) Alarm processing method, system, computer device and storage medium
CN106940677A (en) One kind application daily record data alarm method and device
CN109218102A (en) A kind of alarm monitoring method and system
CN111459763A (en) Cross-kubernets cluster monitoring system and method
CN101989931A (en) Operation alarm processing method and device
CN106533792A (en) Method and device for monitoring and configuring resources
CN108880842A (en) A kind of fault rootstock analyzing and positioning system and analysis method automating operation platform
CN112559376A (en) Automatic positioning method and device for database fault and electronic equipment
CN113377559A (en) Big data based exception handling method, device, equipment and storage medium
CN112199394A (en) Alarm information pushing method and system, intelligent terminal and storage medium
CN114356499A (en) Kubernetes cluster alarm root cause analysis method and device
CN103986607A (en) Voice-sound-light alarm monitoring system for intelligent data center
CN116205396A (en) Data panoramic monitoring method and system based on data center
CN106951360B (en) Data statistical integrity calculation method and system
CN116010456A (en) Equipment processing method, server and rail transit system
CN111198902B (en) Metadata management method and device, storage medium and electronic equipment
CN109522349B (en) Cross-type data calculation and sharing method, system and equipment
CN113434366A (en) Event processing method and system
CN111371574A (en) Operation and maintenance management system platform for monitoring machine room
CN114374600A (en) Network operation and maintenance method, device, equipment and product based on big data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination