CN113381884B - Full link monitoring method and device for monitoring alarm system - Google Patents
- Publication number
- CN113381884B (CN202110612028.0A)
- Authority
- CN
- China
- Prior art keywords
- alarm
- monitoring
- self
- trigger
- index
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0604—Management of faults, events, alarms or notifications using filtering, e.g. reduction of information by using priority, element types, position or time
- H04L41/0622—Management of faults, events, alarms or notifications using filtering, e.g. reduction of information by using priority, element types, position or time based on time
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0631—Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/08—Configuration management of networks or network elements
- H04L41/0803—Configuration setting
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Debugging And Monitoring (AREA)
- Telephonic Communication Services (AREA)
- Alarm Systems (AREA)
Abstract
The invention discloses a full-link monitoring method and device for a monitoring alarm system, relating to the technical field of internet operation and maintenance. It achieves continuous monitoring of the full alarm link while improving the reliability of both the monitoring alarm system and the self-monitoring system. The method comprises: deploying the monitoring alarm system and the self-monitoring system in different environments; configuring the alarm indexes of the monitoring alarm system and the trigger frequency of the self-monitoring system; the self-monitoring system periodically acquiring the alarm indexes from the monitoring alarm system according to the trigger frequency, and judging whether the alarm trigger time of the monitoring alarm system is overtime according to the timestamp at which the service alarm index among the alarm indexes was generated and the current timestamp; the monitoring alarm system periodically pulling monitoring indexes from the self-monitoring system according to the trigger frequency and judging whether the alarm trigger state of the self-monitoring system is normal; and, based on the judgment results for the alarm trigger time and the alarm trigger state, periodically sending an alarm trigger result to a set receiver according to the trigger frequency.
Description
Technical Field
The invention relates to the technical field of internet operation and maintenance, in particular to a full link monitoring method and device for a monitoring alarm system.
Background
Monitoring and alarming is the most important link in operation and maintenance: problems are discovered by continuously collecting, aggregating and analyzing information, so that faults can be warned of and detected early, and detailed data is available for tracing and locating problems afterwards. Monitoring of the monitoring alarm system itself, also called self-monitoring of the monitoring alarm system, is intended to detect faults of the monitoring alarm system in time and to notify the responsible persons so that they can handle and repair them promptly. The prior-art schemes for monitoring a monitoring alarm system are as follows:
1. An additional third-party monitoring component is deployed, such as the self-monitoring service (dead eye) used by open-falcon, and after deployment a process-survival monitoring alarm is added for that component, so that the component and the monitoring alarm system can monitor each other and either side can detect a problem in time;
2. Several monitoring systems are deployed to monitor one another. For example, at least two independent Prometheus instances cross-monitor each other: each Prometheus pulls the indexes of all the other instances, so that once one Prometheus goes down, the others detect it and raise an alarm.
However, in most existing self-monitoring schemes the self-monitoring component or self-monitoring system is deployed on the same infrastructure as the monitoring alarm system; once that infrastructure fails, the self-monitoring scheme fails with it and its purpose is defeated. In addition, the monitoring alarm link is long, running from the collection of index information to the notification of the responsible person, and a problem at any step disables the alarm; yet existing self-monitoring schemes only monitor the state of the system and do not monitor the whole link.
Disclosure of Invention
The invention aims to provide a full-link monitoring method and a full-link monitoring device for a monitoring alarm system, which can realize continuous monitoring of a full link while improving the reliability of the monitoring alarm system and a self-monitoring system.
A first aspect of the present invention provides a full link monitoring method for monitoring an alarm system, including:
respectively deploying a monitoring alarm system and a self-monitoring system in different environments to enable the monitoring alarm system and the self-monitoring system to monitor each other;
configuring an alarm index of a monitoring alarm system and a trigger frequency of a self-monitoring system;
the self-monitoring system periodically acquires alarm indexes from the monitoring alarm system according to the trigger frequency, and judges whether the alarm trigger time of the monitoring alarm system is overtime according to the timestamp at which the service alarm index among the alarm indexes was generated and the current timestamp;
the monitoring alarm system periodically pulls a monitoring index from the self-monitoring system according to the trigger frequency, compares the monitoring index with a preset threshold value and judges whether the alarm trigger state of the self-monitoring system is normal or not;
and sending alarm trigger results to a set receiver periodically according to the trigger frequency based on the alarm trigger time and the judgment result of the alarm trigger state.
Preferably, the method for respectively deploying the monitoring alarm system and the self-monitoring system in different environments comprises the following steps:
the monitoring alarm system is deployed on a target machine, and the self-monitoring system is deployed on a serverless architecture of a cloud service.
Preferably, the method for configuring the alarm index of the monitoring alarm system and the trigger frequency of the self-monitoring system includes:
the alarm indexes of the configured monitoring alarm system comprise hardware alarm indexes, service alarm indexes and application program alarm indexes;
the configured trigger frequency of the self-monitoring system is the frequency at which the hardware alarm index, the service alarm index and the application program alarm index are periodically obtained from the monitoring alarm system.
Further, the method for the self-monitoring system to periodically obtain the alarm index from the monitoring alarm system according to the trigger frequency and judge whether the alarm trigger time of the monitoring alarm system is overtime according to the timestamp and the current timestamp when the service alarm index in the alarm index is generated comprises the following steps:
the self-monitoring system checks whether the acquired alarm indexes include the hardware alarm index, the service alarm index and the application program alarm index at the same time, and, when all three are present, calculates the time difference between the timestamp at which the service alarm index was generated and the current timestamp at which the self-monitoring system acquired the alarm indexes;
and when the time difference does not exceed the first threshold, judging that the alarm triggering time of the monitoring alarm system is not overtime; otherwise, judging that the alarm triggering time of the monitoring alarm system is overtime.
Further, the method for the self-monitoring system to periodically obtain the alarm index from the monitoring alarm system according to the trigger frequency and judge whether the alarm trigger time of the monitoring alarm system is overtime according to the timestamp at which the service alarm index in the alarm index was generated and the current timestamp further comprises the following steps:
the self-monitoring system checks whether alarm triggering records of the hardware alarm index, the service alarm index and the application program alarm index exist in a database of the monitoring alarm system, and calculates the time difference between the timestamp at which the service alarm index in the database was generated and the timestamp at which the alarm triggering record was written into the database;
and when the time difference does not exceed the second threshold, judging that the alarm triggering time of the monitoring alarm system is not overtime, otherwise, judging that the alarm triggering time of the monitoring alarm system is overtime.
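The database check described above can be illustrated with a minimal Python sketch. The table name `alarm_records`, its columns, and the function name `check_db_records` are hypothetical; the source does not specify a schema, and the sketch assumes the table holds the latest trigger record per alarm category.

```python
import sqlite3

# Hypothetical category labels for the three alarm indexes.
CATEGORIES = ("hardware", "service", "application")


def check_db_records(conn, second_threshold_s):
    """Return 'abnormal', 'overtime' or 'ok' for the stored trigger records."""
    rows = {
        category: (index_ts, inserted_ts)
        for category, index_ts, inserted_ts in conn.execute(
            "SELECT category, index_ts, inserted_ts FROM alarm_records"
        )
    }
    # All three trigger records must exist.
    if any(category not in rows for category in CATEGORIES):
        return "abnormal"
    # Delay between generation of the service alarm index and the database write.
    index_ts, inserted_ts = rows["service"]
    write_delay = inserted_ts - index_ts
    return "ok" if write_delay <= second_threshold_s else "overtime"


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE alarm_records (category TEXT, index_ts REAL, inserted_ts REAL)"
    )
    conn.executemany(
        "INSERT INTO alarm_records VALUES (?, ?, ?)",
        [("hardware", 100, 101), ("service", 100, 103), ("application", 100, 102)],
    )
    print(check_db_records(conn, second_threshold_s=5))
```

The second threshold is passed in as a parameter because the source leaves its value open.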
Further, the method for the monitoring alarm system to periodically pull the monitoring index from the self-monitoring system according to the trigger frequency and compare the monitoring index with the preset threshold value to judge whether the alarm trigger state of the self-monitoring system is normal or not includes the following steps:
the monitoring alarm system periodically pulls monitoring indexes from the self-monitoring system according to the trigger frequency, wherein the monitoring indexes comprise one or more of the total number of calls to the self-monitoring system within one trigger frequency period, the number of failed calls, the call processing time and the memory consumed by a call;
the monitoring alarm system compares and judges the pulled monitoring indexes with preset threshold values in a one-to-one correspondence mode, when any index exceeds the preset threshold value, the alarm trigger state of the self-monitoring system is considered to be abnormal, and otherwise, the alarm trigger state of the self-monitoring system is considered to be normal.
Optionally, after the alarm triggering result is periodically sent to the set receiver according to the triggering frequency based on the alarm triggering time and the judgment result of the alarm triggering state, the method further includes:
the self-monitoring system periodically checks a receipt record table of the notification system and judges whether the alarm triggering result is successfully sent to a specified receiver;
and if the judgment result is that the alarm trigger result is not successfully sent to the preset receiver, the alarm trigger result is sent to a responsible person configured in advance in the notification system again.
Illustratively, the monitoring alarm system comprises a Prometheus system, and the serverless architecture of the cloud service is a Serverless service.
Compared with the prior art, the full link monitoring method for monitoring the alarm system provided by the invention has the following beneficial effects:
The invention provides a full link monitoring method for monitoring an alarm system, in which the monitoring alarm system and the self-monitoring system are deployed in different environments, so that the monitoring alarm system not only monitors the target machine but also takes part in mutual monitoring with the self-monitoring system. Specifically, the alarm indexes of the monitoring alarm system and the trigger frequency of the self-monitoring system are configured in advance. The self-monitoring system then periodically obtains the alarm indexes from the monitoring alarm system according to the trigger frequency and judges whether the alarm trigger time of the monitoring alarm system is overtime according to the timestamp at which the service alarm index was generated and the current timestamp. At the same time, the monitoring alarm system periodically pulls the monitoring indexes from the self-monitoring system according to the trigger frequency and compares them with preset thresholds to judge whether the alarm trigger state of the self-monitoring system is normal. Finally, based on the judgment results for the alarm trigger time and the alarm trigger state, an alarm trigger result is periodically sent to a set receiver according to the trigger frequency.
Therefore, the monitoring alarm system and the self-monitoring system are not deployed on the same infrastructure, so that the risk of the failure of the monitoring alarm system and the self-monitoring system can be effectively reduced, and the tracking monitoring can be performed on whether the receiver successfully receives the alarm triggering result, thereby realizing the full-link monitoring.
A second aspect of the present invention provides a full link monitoring apparatus for monitoring an alarm system, which is applied in the full link monitoring method for monitoring an alarm system in the foregoing technical solution, and the apparatus includes:
the deployment unit is used for respectively deploying the monitoring alarm system and the self-monitoring system in different environments so as to enable the monitoring alarm system and the self-monitoring system to monitor each other;
the configuration unit is used for configuring the alarm index of the monitoring alarm system and the trigger frequency of the self-monitoring system;
the first monitoring unit is used for periodically acquiring an alarm index from the monitoring alarm system through the self-monitoring system according to the trigger frequency, and judging whether the alarm trigger time of the monitoring alarm system is overtime or not according to a timestamp and a current timestamp when a service alarm index in the alarm index is generated;
the second monitoring unit is used for pulling a monitoring index from the self-monitoring system periodically according to the trigger frequency through the monitoring alarm system, and comparing the monitoring index with a preset threshold value to judge whether the alarm trigger state of the self-monitoring system is normal or not;
and the sending detection unit is used for sending the alarm triggering result to a set receiver periodically according to the triggering frequency based on the alarm triggering time and the judgment result of the alarm triggering state.
Compared with the prior art, the beneficial effects of the full link monitoring device for monitoring the alarm system provided by the invention are the same as the beneficial effects of the full link monitoring method for monitoring the alarm system provided by the technical scheme, and the details are not repeated herein.
A third aspect of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when being executed by a processor, performs the steps of the above-mentioned full link monitoring method for monitoring an alarm system.
Compared with the prior art, the beneficial effects of the computer-readable storage medium provided by the invention are the same as the beneficial effects of the full-link monitoring method for monitoring the alarm system provided by the technical scheme, and the detailed description is omitted here.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and not to limit the invention. In the drawings:
FIG. 1 is a schematic flow chart of a full link monitoring method for monitoring an alarm system according to an embodiment of the present invention;
FIG. 2 is a task flow chart of the self-monitoring system monitoring the monitoring alarm system in an embodiment of the present invention;
FIG. 3 is a task flow chart of the monitoring alarm system monitoring the self-monitoring system in an embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example one
Referring to fig. 1, the present embodiment provides a full link monitoring method for monitoring an alarm system, including:
respectively deploying a monitoring alarm system and a self-monitoring system in different environments to enable the monitoring alarm system and the self-monitoring system to monitor each other;
configuring an alarm index of the monitoring alarm system and a trigger frequency of the self-monitoring system;
the self-monitoring system periodically acquires alarm indexes from the monitoring alarm system according to the trigger frequency, and judges whether the alarm trigger time of the monitoring alarm system is overtime according to the timestamp at which the service alarm index among the alarm indexes was generated and the current timestamp;
the monitoring alarm system periodically pulls a monitoring index from the self-monitoring system according to the trigger frequency, compares the monitoring index with a preset threshold value and judges whether the alarm trigger state of the self-monitoring system is normal;
and based on the judgment results for the alarm trigger time and the alarm trigger state, periodically sending an alarm trigger result to a set receiver according to the trigger frequency.
In the full-link monitoring method for monitoring an alarm system provided by this embodiment, the monitoring alarm system and the self-monitoring system are deployed in different environments, so that the monitoring alarm system not only monitors the target machine but also takes part in mutual monitoring with the self-monitoring system. Specifically, the alarm indexes of the monitoring alarm system and the trigger frequency of the self-monitoring system are configured in advance. The self-monitoring system then periodically obtains the alarm indexes from the monitoring alarm system according to the trigger frequency and judges whether the alarm trigger time of the monitoring alarm system is overtime according to the timestamp at which the service alarm index was generated and the current timestamp. Meanwhile, the monitoring alarm system periodically pulls the monitoring indexes from the self-monitoring system according to the trigger frequency and compares them with preset thresholds to judge whether the alarm trigger state of the self-monitoring system is normal. Finally, based on the judgment results for the alarm trigger time and the alarm trigger state, an alarm trigger result is periodically sent to a set receiver according to the trigger frequency.
As can be seen, in the embodiment, since the monitoring alarm system and the self-monitoring system are not deployed on the same infrastructure, the risk of complete failure of the monitoring alarm system and the self-monitoring system can be effectively reduced, and meanwhile, whether the receiver successfully receives the alarm trigger result can be tracked and monitored, so that full link monitoring is realized.
In the above embodiment, the method for respectively deploying the monitoring alarm system and the self-monitoring system in different environments includes:
the monitoring alarm system is deployed on a target machine, and the self-monitoring system is deployed on a serverless architecture of a cloud service.
Illustratively, the monitoring alarm system comprises a Prometheus system, and the serverless architecture of the cloud service is a Serverless service.
In specific implementation, the monitoring alarm system comprises Prometheus and Alertmanager, which are deployed on the target machine, connected to it, and built on the same infrastructure as the target machine; the monitoring alarm system continuously monitors the target machine, raises alarms for it and collects the alarm indexes. The self-monitoring system is deployed on a serverless architecture of a cloud service, namely a Serverless service, and is connected to Prometheus, so that it can periodically acquire the alarm indexes through Prometheus and thereby monitor the alarm triggering time of the monitoring alarm system. Preferably, the self-monitoring system is a self-monitoring program.
It can be seen that, in this embodiment, deploying the monitoring alarm system and the self-monitoring system in different environments, with the self-monitoring system on the Serverless service, effectively reduces the risk of both systems failing at the same time.
In the above embodiment, the method for configuring the alarm index of the monitoring alarm system and the trigger frequency of the self-monitoring system includes:
the configured alarm indexes of the monitoring alarm system comprise a hardware alarm index, a service alarm index and an application program alarm index; the configured trigger frequency of the self-monitoring system is the frequency at which the hardware alarm index, the service alarm index and the application program alarm index are periodically obtained from the monitoring alarm system.
In specific implementation, three alarm categories are configured in the monitoring alarm system: a hardware alarm, a Tomcat application alarm and a service alarm. The hardware alarm monitors the resource consumption of the target machine's server, such as CPU, memory and network traffic, and fires when consumption is greater than a threshold. The application program alarm monitors the JVM state of the target machine and checks whether the application health check interface is normal; it fires when the state is normal. The service alarm is based on an index actively reported by the Tomcat application, whose service value is the current timestamp; it fires when the index is greater than 0.
Generally, when the target machine works normally, the hardware alarm index, the service alarm index and the application program alarm index are all greater than the threshold 0, so all three alarms are triggered continuously. The self-monitoring system can therefore periodically check whether the three alarms are delivered normally to the receiver's terminal; delivery modes include short message alarm reminders, mail alarm reminders, communication software alarm reminders and the like.
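As an illustration of the continuously triggered service alarm, the following minimal Python sketch renders a heartbeat index whose value is the current timestamp and applies the "greater than 0" rule. The metric name `service_heartbeat_timestamp_seconds` and both function names are hypothetical; the source only states that the application reports the current timestamp as the service value.

```python
import time

# Hypothetical metric name, chosen for illustration only.
METRIC_NAME = "service_heartbeat_timestamp_seconds"


def render_heartbeat(now=None):
    """Return one sample line in the Prometheus text exposition format."""
    ts = time.time() if now is None else now
    return f"{METRIC_NAME} {int(ts)}"


def is_firing(value, threshold=0.0):
    """Alarm rule from the text: fire whenever the index is greater than the threshold (0)."""
    return value > threshold


if __name__ == "__main__":
    line = render_heartbeat()
    print(line)
    print(is_firing(float(line.split()[1])))
```

Because a real timestamp is always greater than 0, the alarm fires on every evaluation, which is what lets the self-monitoring system treat a missing or stale alarm as a fault in the alarm link.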
In the above embodiment, the method for periodically obtaining the alarm indicator from the monitoring alarm system by the self-monitoring system according to the trigger frequency, and determining whether the alarm trigger time of the monitoring alarm system is overtime according to the timestamp and the current timestamp when the service alarm indicator in the alarm indicator is generated includes:
the self-monitoring system checks whether the acquired alarm indexes include the hardware alarm index, the service alarm index and the application program alarm index at the same time, and, when all three are present, calculates the time difference between the timestamp at which the service alarm index was generated and the current timestamp at which the self-monitoring system acquired the alarm indexes; when the time difference does not exceed the first threshold, the alarm triggering time of the monitoring alarm system is judged not to be overtime; otherwise, it is judged to be overtime.
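A minimal Python sketch of this first-threshold check is given below. The category keys, the function name `check_trigger_time` and the returned strings are hypothetical, and the first threshold is a parameter because the source does not give its value.

```python
# Hypothetical keys for the three alarm indexes that must be present together.
REQUIRED = {"hardware", "service", "application"}


def check_trigger_time(indexes, now, first_threshold_s):
    """indexes maps an alarm category to the timestamp (seconds) at which its index was generated."""
    missing = REQUIRED - set(indexes)
    if missing:
        # At least one of the three alarm indexes is absent: the alarm is abnormal.
        return "abnormal: missing " + ", ".join(sorted(missing))
    # Time difference between generation of the service alarm index and acquisition time.
    delay = now - indexes["service"]
    return "ok" if delay <= first_threshold_s else "overtime"


if __name__ == "__main__":
    sample = {"hardware": 1.0, "service": 100.0, "application": 1.0}
    print(check_trigger_time(sample, now=130.0, first_threshold_s=60.0))
    print(check_trigger_time(sample, now=200.0, first_threshold_s=60.0))
```

In the sample run the first call sees a 30-second delay against a 60-second threshold, and the second sees a 100-second delay against the same threshold.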
Referring to FIG. 2, in specific implementation, the self-monitoring system is deployed on a Serverless service and is configured to be triggered once every n minutes. The task flow of the self-monitoring system monitoring the monitoring alarm system is as follows:
1. Checking whether the three alarm indexes exist on Prometheus at the same time. If not, the alarm is abnormal. If so, the timestamp of the service alarm index is compared with the current time to obtain a difference; if the difference does not exceed a first threshold, the alarm trigger is not overtime, otherwise it is overtime;
2. Checking whether the database of the monitoring alarm system holds trigger records for the three alarms, and comparing the timestamp at which the record was inserted into the database with the timestamp of the service alarm index to obtain a difference; if the difference does not exceed a second threshold, the alarm trigger is not overtime, otherwise it is overtime;
3. Checking whether the notification system successfully sent the three alarms through each channel;
4. Checking the receipt record table of the notification system: for notification modes that provide receipts (short messages and telephone calls), the receipt records are checked to judge whether the notification was successfully delivered to the receiver's terminal;
5. If any of the above steps is abnormal, the relevant responsible person is notified through the notification mode of the cloud monitoring service, thereby realizing full link monitoring of the alarm process.
The detection process works as a pipeline: it breaks the whole monitoring alarm link into several stages and polls the state of each one, so that the whole link is checked. It can thus not only confirm whether the alarm notification reached the receiver but also locate the specific step that failed.
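The pipeline idea can be sketched in Python as an ordered list of stage checks that reports the first failing stage. The stage names, `run_pipeline` and `notify_owner` are hypothetical, and the lambdas stand in for the real checks against Prometheus, the database and the notification system.

```python
from typing import Callable, List, Optional, Tuple

# A stage is a (name, check) pair; the check returns True when the stage is healthy.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> Optional[str]:
    """Run the checks in link order and return the first failing stage name, or None."""
    for name, check in stages:
        try:
            healthy = check()
        except Exception:
            # A check that cannot run at all is treated as a failed stage.
            healthy = False
        if not healthy:
            return name
    return None


def notify_owner(failed_stage: str, send: Callable[[str], None]) -> None:
    """Report the failing stage through an out-of-band channel supplied by the caller."""
    send(f"monitoring alarm link broken at stage: {failed_stage}")


if __name__ == "__main__":
    stages = [
        ("indexes_present", lambda: True),
        ("db_records", lambda: False),  # simulated failure
        ("notification_sent", lambda: True),
        ("receipt_confirmed", lambda: True),
    ]
    failed = run_pipeline(stages)
    if failed is not None:
        notify_owner(failed, print)
```

Stopping at the first failure is a design choice of this sketch: stages later in the link depend on earlier ones, so the first failing stage is the most useful one to report.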
In the above embodiment, the method for the monitoring alarm system to periodically pull the monitoring index from the self-monitoring system according to the trigger frequency, and compare the monitoring index with the preset threshold to determine whether the alarm trigger state of the self-monitoring system is normal includes:
the monitoring alarm system periodically pulls monitoring indexes from the self-monitoring system according to the trigger frequency, wherein the monitoring indexes comprise one or more of the total number of calls to the self-monitoring system within one trigger frequency period, the number of failed calls, the call processing time and the memory consumed by a call; the monitoring alarm system compares each pulled monitoring index with its corresponding preset threshold, and when any index exceeds its preset threshold, the alarm trigger state of the self-monitoring system is considered abnormal; otherwise, the alarm trigger state of the self-monitoring system is considered normal.
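A minimal Python sketch of the one-to-one comparison follows. The dictionary keys and function names are hypothetical; the conditions use the example values this embodiment quotes for the Prometheus alarm configuration (no calls in the period, any failed call, execution time above 30 s, memory above 500M, read here as megabytes).

```python
# One rule per monitoring index; each rule returns True when the index is out of bounds.
RULES = {
    "total_calls": lambda v: v == 0,        # the self-monitoring system was never invoked
    "error_calls": lambda v: v > 0,         # at least one failed call
    "execution_seconds": lambda v: v > 30,  # function execution took too long
    "memory_mb": lambda v: v > 500,         # a call consumed too much memory
}


def violated_rules(metrics):
    """Return the names of the pulled indexes that breach their rule."""
    return [
        name
        for name, breached in RULES.items()
        if name in metrics and breached(metrics[name])
    ]


def trigger_state(metrics):
    """'abnormal' when any pulled index breaches its rule, otherwise 'normal'."""
    return "abnormal" if violated_rules(metrics) else "normal"


if __name__ == "__main__":
    pulled = {"total_calls": 5, "error_calls": 0, "execution_seconds": 31, "memory_mb": 100}
    print(violated_rules(pulled), trigger_state(pulled))
```

Indexes that were not pulled are skipped, which matches the "one or more of" wording in the text.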
Referring to FIG. 3, in specific implementation, the task flow of the monitoring alarm system monitoring the self-monitoring system is as follows:
1. Prometheus pulls monitoring indexes from the self-monitoring system, including: the total number of calls within n minutes, the number of call errors within n minutes, the call processing time (function execution time) and the memory consumption;
2. Prometheus is configured with an alarm for the self-monitoring system, which fires when any one or more of the following conditions occur: the total number of calls within n minutes is 0, the number of call errors within n minutes is greater than 0, the function execution time is greater than 30 s, or the memory consumption is greater than 500 MB;
3. Prometheus triggers the alarm and notifies Alertmanager, which deduplicates, groups, inhibits and silences alarms to reduce how often users are disturbed by alarm notifications, and then calls the monitoring alarm system. Grouping merges the alarms of several database instances into one alarm; inhibition means that if a machine goes down, the applications deployed on it do not alarm and only the machine alarm is sent; silencing means that alarms are sent only on working days and not on non-working days;
4. The monitoring alarm system looks up the corresponding receiver and selects a notification mode according to the alarm level;
5. The notification system is called to send the alarm notification to the receiver's terminal in the selected notification mode.
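The grouping, inhibition and silencing behaviors described in step 3 can be mimicked with a small Python sketch. This is a toy model for illustration, not Alertmanager's configuration or API; the alert names, the dictionary layout and the assumption that Saturday and Sunday are the non-working days are all hypothetical.

```python
from collections import defaultdict
from datetime import date


def silence(alerts, today):
    """Drop every alert on non-working days (assumed here to be Saturday and Sunday)."""
    return alerts if today.weekday() < 5 else []


def inhibit(alerts):
    """If a machine is down, keep only the machine alert and drop alerts from its applications."""
    down = {a["machine"] for a in alerts if a["name"] == "MachineDown"}
    return [a for a in alerts if a["name"] == "MachineDown" or a["machine"] not in down]


def group(alerts):
    """Merge alerts that share a name into one notification listing the affected machines."""
    groups = defaultdict(list)
    for a in alerts:
        groups[a["name"]].append(a["machine"])
    return dict(groups)


if __name__ == "__main__":
    alerts = [
        {"name": "MachineDown", "machine": "m1"},
        {"name": "AppDown", "machine": "m1"},
        {"name": "DbSlow", "machine": "m2"},
        {"name": "DbSlow", "machine": "m3"},
    ]
    print(group(inhibit(silence(alerts, date(2024, 1, 8)))))
```

In the sample, the application alert on `m1` is inhibited by the machine alert, and the two database alerts are grouped into a single notification.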
In the above embodiment, after sending the alarm trigger result to the set receiver periodically according to the trigger frequency based on the alarm trigger time and the judgment result of the alarm trigger state, the method further includes:
the self-monitoring system periodically checks a receipt record table of the notification system and judges whether the alarm triggering result is successfully sent to a specified receiver; and if the judgment result is that the alarm trigger result is not successfully sent to the preset receiver, the alarm trigger result is sent to a responsible person configured in advance in the notification system again.
When the alarm triggering result cannot be delivered normally to the preset receiver, the abnormality is promptly reported to the responsible person, so that the responsible person can troubleshoot and maintain the related system.
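The receipt check and re-sending step can be sketched in Python as follows. The receipt record layout (`alarm`, `receiver`, `delivered`), the function names and the `send` callback are hypothetical; the source does not describe the structure of the receipt record table.

```python
def undelivered(receipts, expected):
    """Return the (alarm, receiver) pairs from `expected` that have no successful receipt."""
    delivered = {(r["alarm"], r["receiver"]) for r in receipts if r["delivered"]}
    return [pair for pair in expected if pair not in delivered]


def escalate(receipts, expected, owner, send):
    """Re-send every undelivered alarm result to the pre-configured responsible person."""
    missing = undelivered(receipts, expected)
    for alarm, receiver in missing:
        send(owner, f"alarm '{alarm}' was not delivered to {receiver}")
    return len(missing)


if __name__ == "__main__":
    receipts = [
        {"alarm": "hardware", "receiver": "alice", "delivered": True},
        {"alarm": "service", "receiver": "alice", "delivered": False},
    ]
    expected = [("hardware", "alice"), ("service", "alice"), ("application", "alice")]
    escalate(receipts, expected, "owner", lambda to, msg: print(to, msg))
```

Both a failed receipt and a missing receipt count as undelivered in this sketch, so an alarm that never reached the notification system is escalated as well.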
Example two
The embodiment provides a full link monitoring device for monitoring an alarm system, which includes:
the deployment unit is used for respectively deploying the monitoring alarm system and the self-monitoring system in different environments so as to enable the monitoring alarm system and the self-monitoring system to monitor each other;
the configuration unit is used for configuring the alarm indexes of the monitoring alarm system and the trigger frequency of the self-monitoring system;
the first monitoring unit is used for periodically acquiring an alarm index from the monitoring alarm system through the self-monitoring system according to the trigger frequency, and judging whether the alarm trigger time of the monitoring alarm system is overtime or not according to a timestamp and a current timestamp when a service alarm index in the alarm index is generated;
the second monitoring unit is used for pulling a monitoring index from the self-monitoring system periodically according to the trigger frequency through the monitoring alarm system, and comparing the monitoring index with a preset threshold value to judge whether the alarm trigger state of the self-monitoring system is normal or not;
and the sending detection unit is used for sending the alarm triggering result to a set receiver periodically according to the triggering frequency based on the alarm triggering time and the judgment result of the alarm triggering state.
Compared with the prior art, the beneficial effects of the full link monitoring device for monitoring the alarm system provided by the invention are the same as the beneficial effects of the full link monitoring method for monitoring the alarm system provided by the technical scheme, and the detailed description is omitted here.
In the above embodiment, respectively deploying the monitoring alarm system and the self-monitoring system in different environments includes:
the monitoring alarm system is deployed on a target machine, and the self-monitoring system is deployed on a serverless architecture of the cloud service.
Illustratively, the monitoring alarm system comprises a Prometheus system, and the serverless architecture of the cloud service is a Serverless service.
In specific implementation, the monitoring alarm system comprises Prometheus and Alertmanager, which are each deployed on the target machine and connected to it; Prometheus and Alertmanager are built on the same infrastructure as the target machine, so that the monitoring alarm system can continuously monitor the target machine, raise alarms and obtain alarm indexes. The self-monitoring system is deployed on the serverless architecture of the cloud service, namely the Serverless service, and is connected to Prometheus, so that it can periodically obtain the alarm indexes through Prometheus and thereby monitor the alarm trigger time of the monitoring alarm system. Preferably, the self-monitoring system is a self-monitoring program.
It can be seen that, in this embodiment, the monitoring alarm system and the self-monitoring system are deployed in different environments, with the self-monitoring system on the Serverless service, which effectively reduces the risk of both systems failing at the same time.
In the above embodiment, the method for configuring the alarm index of the monitoring alarm system and the trigger frequency of the self-monitoring system includes:
the configured alarm indexes of the monitoring alarm system comprise hardware alarm indexes, service alarm indexes and application program alarm indexes; the configured trigger frequency of the self-monitoring system is the frequency at which the self-monitoring system periodically obtains the hardware alarm index, the service alarm index and the application program alarm index from the monitoring alarm system.
In specific implementation, three alarm categories are configured through the monitoring alarm system: a hardware alarm, a Tomcat application alarm and a service alarm. The hardware alarm monitors the resource consumption of the target machine's server, such as CPU, memory and network traffic, and fires when the consumption is greater than a threshold. The application program alarm monitors the JVM state of the target machine and checks whether the application health check interface is normal; it fires when the state is normal. The service alarm is based on an index that the Tomcat application actively reports with the current timestamp as its value; it fires when the index is greater than 0.
Generally, when the target machine works normally, the hardware alarm index, the service alarm index and the application program alarm index are all greater than the threshold 0, so the three alarms are triggered continuously, and the self-monitoring system can periodically check whether the three alarms are normally delivered to the receiver's terminal; the delivery modes include short message reminders, mail reminders, communication software reminders and the like.
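As an illustration of a service alarm index that carries the current timestamp as its value, the sketch below renders such an index in the Prometheus text exposition format. The metric name `service_alarm_timestamp_seconds` is hypothetical, and the source describes the index as reported by a Tomcat application, so this Python version only shows the shape of the data.

```python
import time
from typing import Optional

METRIC_NAME = "service_alarm_timestamp_seconds"  # hypothetical metric name


def render_service_alarm_metric(now: Optional[float] = None) -> str:
    """Render the service alarm index in the Prometheus text exposition format."""
    ts = int(time.time() if now is None else now)
    return f"# TYPE {METRIC_NAME} gauge\n{METRIC_NAME} {ts}\n"


def service_alarm_fires(value: float) -> bool:
    # Alarm condition from the text: the index is greater than 0.
    return value > 0
```

Because the value is a timestamp, it is always greater than 0 while the application is reporting, which is why this alarm is expected to fire continuously.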
In the above embodiment, the method for the self-monitoring system to periodically obtain the alarm index from the monitoring alarm system according to the trigger frequency, and determine whether the alarm trigger time of the monitoring alarm system is overtime according to the timestamp and the current timestamp when the service alarm index in the alarm index is generated, includes:
the self-monitoring system checks whether the acquired alarm indexes include a hardware alarm index, a service alarm index and an application program alarm index at the same time; when all three exist, the self-monitoring system determines the time difference between the timestamp at which the service alarm index was generated and the current timestamp at which the self-monitoring system acquired the alarm indexes; when the time difference does not exceed the first threshold, the alarm trigger time of the monitoring alarm system is judged not to be overtime, and otherwise it is judged to be overtime.
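The check above can be sketched in a few lines of Python. The dictionary keys and the three result strings are assumptions made for illustration; the timestamps are taken to be in seconds.

```python
from typing import Dict

REQUIRED_INDEXES = ("hardware", "application", "service")  # hypothetical keys


def check_trigger_time(
    indexes: Dict[str, float], now: float, first_threshold: float
) -> str:
    """Return "abnormal", "timeout" or "ok" for one self-monitoring run.

    indexes maps an index name to its value; the value of the service
    alarm index is the timestamp (in seconds) at which it was generated.
    """
    # All three alarm indexes must be present at the same time.
    if any(name not in indexes for name in REQUIRED_INDEXES):
        return "abnormal"
    # Time difference between the service alarm timestamp and the current time.
    delay = now - indexes["service"]
    return "ok" if delay <= first_threshold else "timeout"
```

For example, with a first threshold of 60 seconds, a service alarm index generated 30 seconds before the check yields `"ok"`, and one generated 100 seconds before yields `"timeout"`.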
In specific implementation, the self-monitoring system is deployed on the Serverless service, and its trigger is configured to execute once every n minutes. The task flow by which the self-monitoring system monitors the monitoring alarm system is as follows:
1. checking whether the three alarm indexes exist on Prometheus at the same time; if not, the alarm is abnormal; if so, comparing the timestamp of the service alarm index with the current time and calculating the difference; if the difference does not exceed a first threshold, the alarm trigger is not overtime, and otherwise it is overtime;
2. checking whether the database of the monitoring alarm system contains trigger records of the three alarms, and comparing the timestamp at which each record was inserted into the database with the timestamp of the service alarm index to calculate the difference; if the difference does not exceed a second threshold, the alarm trigger is not overtime, and otherwise it is overtime;
3. checking whether every channel of the notification system has successfully sent the three alarms;
4. checking the receipt record table of the notification system: for notification modes that provide receipts (short message and telephone), checking the receipt records to judge whether the notification was successfully delivered to the receiver's terminal;
5. if any step is abnormal, notifying the relevant responsible person through the notification mode of the cloud monitoring service, thereby realizing full link monitoring of the alarm process.
The detection process uses a pipeline approach: the whole monitoring alarm link is split into several stages and the state of each stage is polled, so that the whole link is checked. This not only confirms whether the alarm notification reaches the receiver, but also locates the specific step that failed.
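A minimal sketch of this pipeline idea, assuming each stage can be expressed as a function that returns `True` when the stage is healthy. The stage names and the `notify` callback (standing in for the cloud monitoring service notification) are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

Stage = Tuple[str, Callable[[], bool]]


def run_link_check(
    stages: List[Stage], notify: Callable[[str], None]
) -> Optional[str]:
    """Check the stages in order and report the first one that fails.

    Returns the name of the failed stage, or None if every stage passed.
    """
    for name, check in stages:
        if not check():
            # Stop at the first abnormal step so the failure can be located.
            notify(name)
            return name
    return None
```

A run could pass stages such as `("prometheus_indexes", ...)`, `("database_records", ...)`, `("channel_send", ...)` and `("receipts", ...)`, in the same order as the task flow above.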
In the above embodiment, the method for the monitoring alarm system to periodically pull the monitoring index from the self-monitoring system according to the trigger frequency, and compare the monitoring index with the preset threshold to determine whether the alarm trigger state of the self-monitoring system is normal includes:
the monitoring alarm system periodically pulls monitoring indexes from the self-monitoring system according to the trigger frequency, wherein the monitoring indexes comprise one or more of the total calling times, wrong calling times, calling processing time and calling memory consumption of the self-monitoring system in a trigger frequency period; the monitoring alarm system compares and judges the pulled monitoring indexes with preset threshold values in a one-to-one correspondence mode, when any index exceeds the preset threshold value, the alarm trigger state of the self-monitoring system is considered to be abnormal, and otherwise, the alarm trigger state of the self-monitoring system is considered to be normal.
In specific implementation, the task flow by which the monitoring alarm system monitors the self-monitoring system is as follows:
1. Prometheus pulls metrics from the self-monitoring system, including: the total number of calls within n minutes, the number of call errors within n minutes, the call processing time (function execution time) and the memory consumption;
2. Prometheus is configured with alarm rules for the self-monitoring system and raises an alarm when any one or more of the following conditions occur: the total number of calls within n minutes is 0, the number of call errors within n minutes is greater than 0, the function execution time is greater than 30 s, or the memory consumption is greater than 500M;
3. Prometheus triggers the alarm and notifies Alertmanager, which groups, deduplicates, inhibits and silences the alarms to reduce how often users are disturbed by alarm notifications, and then calls the monitoring alarm system; illustratively, grouping means merging the alarms of multiple database instances into one alarm, inhibition means that if a machine goes down, the applications deployed on it do not alarm and only the machine-down alarm is sent, and silencing means that alarms are sent only on working days and not on non-working days.
4. The monitoring alarm system searches for a corresponding receiver and selects different notification modes according to the alarm level;
5. The notification system is called to send the alarm notification to the receiver's terminal in the selected notification mode.
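The alarm conditions in step 2 can be written as a small evaluation function. This is a sketch rather than a Prometheus rule file; the field names are hypothetical, and "500M" is assumed here to mean 500 megabytes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SelfMonitorMetrics:
    total_calls: int          # calls within the last n minutes
    error_calls: int          # failed calls within the last n minutes
    execution_seconds: float  # function execution time
    memory_mb: float          # memory consumption (assumed to be in MB)


def abnormal_conditions(
    m: SelfMonitorMetrics,
    max_seconds: float = 30.0,
    max_memory_mb: float = 500.0,
) -> List[str]:
    """Return the list of violated conditions; an empty list means normal."""
    reasons = []
    if m.total_calls == 0:
        reasons.append("no calls within n minutes")
    if m.error_calls > 0:
        reasons.append("call errors within n minutes")
    if m.execution_seconds > max_seconds:
        reasons.append("execution time too long")
    if m.memory_mb > max_memory_mb:
        reasons.append("memory consumption too high")
    return reasons
```

Note that the first condition fires on a value of 0 rather than on a value above a threshold: a self-monitoring function that was never called within the period is itself treated as a fault.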
In the above embodiment, after the sending the alarm triggering result to the set receiver according to the triggering frequency periodically based on the alarm triggering time and the judgment result of the alarm triggering state, the method further includes:
the self-monitoring system periodically checks the receipt record table of the notification system and judges whether the alarm trigger result has been successfully sent to the specified receiver; if not, the alarm trigger result is re-sent to a responsible person configured in advance in the notification system.
In specific implementation, the receiver is a user and the responsible person is the person in charge of the related system; when the alarm trigger result cannot be sent normally to the specified receiver, the abnormality is promptly reported to the responsible person so that the responsible person can troubleshoot and maintain the related system.
Example three
The present embodiment provides a computer-readable storage medium having stored thereon a computer program, which, when being executed by a processor, performs the steps of the above-mentioned full link monitoring method for monitoring an alarm system.
Compared with the prior art, the beneficial effects of the computer-readable storage medium provided by this embodiment are the same as the beneficial effects of the full-link monitoring method for monitoring an alarm system provided by the above technical solution, and are not described herein again.
It will be understood by those skilled in the art that all or part of the steps of the method of the invention may be implemented by a program instructing related hardware; the program may be stored in a computer-readable storage medium and, when executed, performs the steps of the method of the embodiment; the storage medium may be: ROM/RAM, a magnetic disk, an optical disk, a memory card, and the like.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (9)
1. A full link monitoring method for monitoring an alarm system, comprising:
respectively deploying a monitoring alarm system and a self-monitoring system in different environments to enable the monitoring alarm system and the self-monitoring system to monitor each other;
configuring an alarm index of a monitoring alarm system and a trigger frequency of a self-monitoring system;
the self-monitoring system periodically acquires alarm indexes from the monitoring alarm system according to the trigger frequency, and judges whether the alarm trigger time of the monitoring alarm system is overtime according to the timestamp at which the service alarm index among the alarm indexes was generated and the current timestamp;
the monitoring alarm system periodically pulls a monitoring index from the self-monitoring system according to the trigger frequency, compares the monitoring index with a preset threshold value and judges whether the alarm trigger state of the self-monitoring system is normal or not;
based on the alarm trigger time and the judgment result of the alarm trigger state, periodically sending alarm trigger results to a set receiver according to the trigger frequency;
wherein respectively deploying the monitoring alarm system and the self-monitoring system in different environments comprises:
the monitoring alarm system is deployed on a target machine, and the self-monitoring system is deployed on a serverless architecture of a cloud service.
2. The method of claim 1, wherein configuring the alarm index of the monitoring alarm system and the trigger frequency of the self-monitoring system comprises:
the configured alarm indexes of the monitoring alarm system comprise a hardware alarm index, a service alarm index and an application program alarm index;
the configured trigger frequency of the self-monitoring system is the frequency at which the hardware alarm index, the service alarm index and the application program alarm index are periodically obtained from the monitoring alarm system.
3. The method of claim 2, wherein the self-monitoring system periodically acquiring the alarm indexes from the monitoring alarm system according to the trigger frequency, and judging whether the alarm trigger time of the monitoring alarm system is overtime according to the timestamp at which the service alarm index among the alarm indexes was generated and the current timestamp, comprises:
the self-monitoring system checks whether the acquired alarm indexes include a hardware alarm index, a service alarm index and an application program alarm index at the same time, and, when all three exist, determines the time difference between the timestamp at which the service alarm index was generated and the current timestamp at which the self-monitoring system acquired the alarm indexes;
and when the time difference value does not exceed the first threshold value, judging that the alarm triggering time of the monitoring alarm system is not overtime, otherwise, judging that the alarm triggering time of the monitoring alarm system is overtime.
4. The method of claim 3, wherein the self-monitoring system periodically acquiring the alarm indexes from the monitoring alarm system according to the trigger frequency, and judging whether the alarm trigger time of the monitoring alarm system is overtime according to the timestamp at which the service alarm index among the alarm indexes was generated and the current timestamp, further comprises:
the self-monitoring system checks whether alarm trigger records of the hardware alarm index, the service alarm index and the application program alarm index exist in a database of the monitoring alarm system, and determines the time difference between the timestamp at which the service alarm index was generated and the timestamp at which the alarm trigger record was written into the database;
and when the time difference does not exceed the second threshold, judging that the alarm triggering time of the monitoring alarm system is not overtime, otherwise, judging that the alarm triggering time of the monitoring alarm system is overtime.
5. The method of claim 4, wherein the monitoring alarm system periodically pulls the monitoring index from the self-monitoring system according to the trigger frequency, and the method for comparing the monitoring index with the preset threshold to determine whether the alarm trigger state of the self-monitoring system is normal comprises:
the monitoring alarm system periodically pulls monitoring indexes from the self-monitoring system according to the trigger frequency, wherein the monitoring indexes comprise one or more of the total calling times, wrong calling times, calling processing time and calling memory consumption of the self-monitoring system in a trigger frequency period;
the monitoring alarm system compares and judges the pulled monitoring indexes with preset thresholds in a one-to-one correspondence mode, when any index exceeds the preset threshold, the alarm triggering state of the self-monitoring system is considered to be abnormal, and otherwise, the alarm triggering state of the self-monitoring system is considered to be normal.
6. The method according to any one of claims 1-5, wherein after periodically sending the alarm trigger result to the set receiver according to the trigger frequency based on the alarm trigger time and the judgment result of the alarm trigger state, the method further comprises:
the self-monitoring system periodically checks a receipt record table of the notification system and judges whether an alarm triggering result is successfully sent to a specified receiver;
and if the judgment result is that the alarm trigger result is not successfully sent to the preset receiver, the alarm trigger result is sent to a responsible person configured in advance in the notification system again.
7. The method according to any one of claims 1-5, wherein the monitoring alarm system comprises a Prometheus system, and the serverless architecture of the cloud service is a Serverless service.
8. A full link monitoring system for monitoring an alarm system, comprising:
the deployment unit is used for respectively deploying the monitoring alarm system and the self-monitoring system in different environments so as to enable the monitoring alarm system and the self-monitoring system to monitor each other;
the configuration unit is used for configuring the alarm index of the monitoring alarm system and the trigger frequency of the self-monitoring system;
the first monitoring unit is used for periodically acquiring an alarm index from the monitoring alarm system through the self-monitoring system according to the trigger frequency, and judging whether the alarm trigger time of the monitoring alarm system is overtime or not according to a timestamp and a current timestamp when a service alarm index in the alarm index is generated;
the second monitoring unit is used for pulling a monitoring index from the self-monitoring system periodically according to the trigger frequency through the monitoring alarm system, and comparing the monitoring index with a preset threshold value to judge whether the alarm trigger state of the self-monitoring system is normal or not;
the sending detection unit is used for periodically sending alarm trigger results to a set receiver according to the trigger frequency, based on the alarm trigger time and the judgment result of the alarm trigger state;
wherein respectively deploying the monitoring alarm system and the self-monitoring system in different environments comprises:
the monitoring alarm system is deployed on a target machine, and the self-monitoring system is deployed on a serverless architecture of the cloud service.
9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method according to any one of the claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110612028.0A CN113381884B (en) | 2021-06-02 | 2021-06-02 | Full link monitoring method and device for monitoring alarm system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110612028.0A CN113381884B (en) | 2021-06-02 | 2021-06-02 | Full link monitoring method and device for monitoring alarm system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113381884A (en) | 2021-09-10 |
CN113381884B (en) | 2023-01-31 |
Family
ID=77575325
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110612028.0A Active CN113381884B (en) | 2021-06-02 | 2021-06-02 | Full link monitoring method and device for monitoring alarm system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113381884B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115118581B (en) * | 2022-06-27 | 2024-04-12 | 广东长天思源环保科技股份有限公司 | Internet of things data all-link monitoring and intelligent guaranteeing system based on 5G |
CN115883326A (en) * | 2022-12-07 | 2023-03-31 | 中盈优创资讯科技有限公司 | Method and device for carrying out alarm pressure measurement based on JMeter simulation moving ring equipment |
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN203763556U (en) * | 2014-02-24 | 2014-08-13 | 广东宝莱特医用科技股份有限公司 | System with double monitoring functions |
CN105974906B (en) * | 2016-05-12 | 2019-12-17 | 深圳市中工巨能科技有限公司 | Double-monitoring-activating measurement and control device |
CN107733672A (en) * | 2016-08-12 | 2018-02-23 | 南京中兴软件有限责任公司 | Fault handling method, device and controller |
CN106776243B (en) * | 2016-12-30 | 2021-03-16 | 中国银联股份有限公司 | Monitoring method and device for monitoring software |
CN111083003A (en) * | 2018-10-22 | 2020-04-28 | 中兴通讯股份有限公司 | Monitoring system and method, storage medium and processor |
CN110855473B (en) * | 2019-10-16 | 2022-11-18 | 平安科技(深圳)有限公司 | Monitoring method, device, server and storage medium |
CN111581060B (en) * | 2020-05-11 | 2024-03-12 | 金蝶软件(中国)有限公司 | Prometheus-based log alarm system, method and related equipment |
CN111949483A (en) * | 2020-08-13 | 2020-11-17 | 星辰天合(北京)数据科技有限公司 | Monitoring device and monitoring system |
CN112732536B (en) * | 2020-12-30 | 2023-01-13 | 平安科技(深圳)有限公司 | Data monitoring and alarming method and device, computer equipment and storage medium |
2021-06-02: application CN202110612028.0A filed in CN (patent CN113381884B), status Active
Also Published As
Publication number | Publication date |
---|---|
CN113381884A (en) | 2021-09-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113381884B (en) | Full link monitoring method and device for monitoring alarm system | |
CN102932466B (en) | The distributed source method for supervising of content-based distributing network and system | |
CN109218102A (en) | A kind of alarm monitoring method and system | |
CN103810076B (en) | The monitoring method and device of data duplication | |
CN106487612A (en) | A kind of server node monitoring method, monitoring server and system | |
CN111510351B (en) | Anomaly detection method and device based on Prometheus monitoring system | |
CN111404740A (en) | Fault analysis method and device, electronic equipment and computer readable storage medium | |
CN114513400A (en) | Log aggregation system and method for improving availability of log aggregation system | |
CN113518020A (en) | Method, device and equipment for detecting disconnection return and readable storage medium | |
CN117789433A (en) | Alarm method and device in DVS external damage prevention platform and electronic equipment | |
CN109699041B (en) | RRU channel fault diagnosis processing method, device and computer storage medium | |
CN115002001B (en) | Method, device, equipment and medium for detecting sub-health of cluster network | |
CN108010559A (en) | A kind of storage device warning system and method | |
CN105991305A (en) | Method and device of identifying link abnormity | |
CN112181780A (en) | Detection and alarm method, device and equipment for containerized platform core component | |
CN114168371A (en) | Intelligent automatic fault alarm system | |
TW201409968A (en) | Information and communication service quality estimation and real-time alarming system and method | |
CN111918233B (en) | Anomaly detection method suitable for wireless aviation network | |
CN103368754B (en) | A kind of methods, devices and systems and equipment for detecting traffic failure | |
CN115118575B (en) | Monitoring method, monitoring device, electronic equipment and storage medium | |
CN114118991B (en) | Third party system monitoring system, method, device, equipment and storage medium | |
CN115913895B (en) | Method, device, equipment and medium for diagnosing and alarming server faults | |
CN115174356B (en) | Cluster alarm reporting method, device, equipment and medium | |
CN113300908B (en) | Link monitoring method and system based on unidirectional network boundary equipment | |
JP2011030094A (en) | Radio communication system for traveling body and failure processing method of the same |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||