CN115174350A - Operation and maintenance warning method, device, equipment and medium - Google Patents

Operation and maintenance warning method, device, equipment and medium

Info

Publication number
CN115174350A
CN115174350A CN202210764564.7A CN202210764564A CN115174350A CN 115174350 A CN115174350 A CN 115174350A CN 202210764564 A CN202210764564 A CN 202210764564A CN 115174350 A CN115174350 A CN 115174350A
Authority
CN
China
Prior art keywords
target
alarm
resource
monitoring data
resources
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210764564.7A
Other languages
Chinese (zh)
Inventor
路小敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jinan Inspur Data Technology Co Ltd
Original Assignee
Jinan Inspur Data Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jinan Inspur Data Technology Co Ltd
Priority to CN202210764564.7A
Publication of CN115174350A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0654 Management of faults, events, alarms or notifications using network fault recovery

Abstract

The application discloses an operation and maintenance warning method, device, equipment and medium, and relates to the field of information technology. The method comprises: collecting monitoring data corresponding to resources in a target service through a preset monitoring collector and, if the monitoring data is collected, comparing it with a preset alarm threshold; obtaining the target resources whose monitoring data exceeds the preset alarm threshold, and determining from the corresponding target monitoring data whether the target resources are of the same type; and, if the target resources are of the same type, using the target monitoring data to determine a target alarm root cause of the alarm event for the target resources according to the association relationships among the resources in the target service, so that the corresponding alarm and fault repair can be carried out according to the target alarm root cause. With this technical solution, alarms generated by a large-scale cluster can be handled quickly and the root cause of an alarm or fault can be analyzed, which makes operation and maintenance more convenient.

Description

Operation and maintenance warning method, device, equipment and medium
Technical Field
The present invention relates to the field of information technologies, and in particular, to an operation and maintenance warning method, apparatus, device, and medium.
Background
With the rapid development of cloud computing, cloud platforms have matured, system scale has become very large, and the performance requirements placed on these systems are high. Because the systems are large, with many modules and many dependencies among them, delimiting and analyzing a problem is increasingly difficult and complex. How to quickly handle the alarms generated by a large-scale cluster, and in particular the flood alarm produced when one problem triggers other problems, is an urgent problem to be solved. Existing alarm handling for monitoring data detects resource anomalies by setting thresholds and generates an alarm when a threshold is exceeded. Currently, in order to reduce flood alarms, multiple pieces of collected monitoring alarm information that have the same attribute are merged.
However, the prior art has several drawbacks. First, a fixed alarm threshold must be set manually, and alarms can only be raised against that manually set threshold, so the overall precision and recall are low. Second, the large number of alarms reduces their usefulness; in particular, a problem in one key component or resource can cause other problems that produce a flood alarm, which makes analysis and troubleshooting more difficult. Third, when a problem occurs, only an alarm is raised and the root cause of the problem is not found, so the cause of the alarm still has to be investigated afterwards. In summary, how to quickly handle the alarms generated by a large-scale cluster and analyze the root cause of an alarm or fault, so as to make operation and maintenance more convenient, remains to be solved.
Disclosure of Invention
In view of the above, an object of the present invention is to provide an operation and maintenance warning method, apparatus, device and medium, which can quickly handle the warning generated by a large-scale cluster and analyze the root cause of the warning or fault to increase the convenience of operation and maintenance. The specific scheme is as follows:
In a first aspect, the application discloses an operation and maintenance warning method, which includes:
collecting monitoring data corresponding to resources in a target service through a preset monitoring collector and, if the monitoring data is collected, comparing the monitoring data with a preset alarm threshold;
obtaining the target resources whose monitoring data exceeds the preset alarm threshold, and determining, from the target monitoring data corresponding to the target resources, whether the target resources are of the same type;
and, if the target resources are of the same type, using the target monitoring data to determine a target alarm root cause of the alarm event for the target resources according to the association relationships among the resources in the target service, so as to carry out the corresponding alarm and fault repair according to the target alarm root cause.
Optionally, after the monitoring data corresponding to the resources in the target service is collected through the preset monitoring collector, the method further includes:
if the monitoring data corresponding to the resources in the target service cannot be collected through the preset monitoring collector, finding and storing, through request call link analysis, the error information of the data call request that the preset monitoring collector sent to collect the data;
correspondingly, the using the target monitoring data to determine a target alarm root cause of the alarm event for the target resources according to the association relationships among the resources in the target service, so as to carry out the corresponding alarm and fault repair according to the target alarm root cause, includes:
using the error information to determine the target alarm root cause of the alarm event for the target resources according to the association relationships among the resources in the target service, so as to carry out the corresponding alarm and fault repair according to the target alarm root cause.
Optionally, the finding and storing, through request call link analysis, the error information of the data call request that the preset monitoring collector sent to collect the data includes:
determining the path of the data call request through the system according to the unique request identifier (ID) of the data call request that the preset monitoring collector sent to collect the data;
and finding, along the path, any call failure event that occurred during the calling process of the data call request, and storing the error information corresponding to the call failure event.
Optionally, after the obtaining the target resources whose monitoring data exceeds the preset alarm threshold and determining, from the target monitoring data corresponding to the target resources, whether the target resources are of the same type, the method further includes:
if the target resources are not of the same type, determining the target alarm importance level corresponding to the alarm event of the target resources according to the relationship between the target monitoring data and a preset alarm importance;
and raising an alarm for the alarm event of the target resources according to the target alarm importance level.
Optionally, after the target alarm root cause of the alarm event for the target resources is determined, when the target resources are of the same type, by using the target monitoring data according to the association relationships among the resources in the target service, the method further includes:
determining, based on the preset alarm importance and according to the target alarm root cause, the target alarm importance level corresponding to the alarm event for the target resources;
if the target alarm importance level is higher than a preset level threshold, raising the alarm for the alarm event of the target resources immediately;
and if the target alarm importance level is not higher than the preset level threshold, waiting until the alarm time point that follows the next preset alarm period and then raising the alarm for the alarm event of the target resources.
Optionally, the using the target monitoring data to determine a target alarm root cause of the alarm event for the target resources according to the association relationships among the resources in the target service includes:
determining the alarm event corresponding to the target monitoring data, and deriving the current state of the target resources from the alarm event;
and analyzing the cause of the alarm event according to the current state of the target resources, and determining the target alarm root cause of the alarm event for the target resources according to the association relationships among the resources in the target service.
Optionally, after the target alarm root cause of the alarm event for the target resources is determined, when the target resources are of the same type, by using the target monitoring data according to the association relationships among the resources in the target service, the method further includes:
determining the corresponding preset fault repair script according to the target alarm root cause, and repairing the fault corresponding to the target alarm root cause through the preset fault repair script;
if the fault corresponding to the target alarm root cause is repaired successfully, performing alarm recovery and pushing a corresponding fault-repair-success event;
and if the fault corresponding to the target alarm root cause is not repaired successfully, pushing the target alarm root cause to the relevant operation and maintenance personnel so that they can adjust the preset fault repair script or repair the fault manually.
In a second aspect, the present application discloses an operation and maintenance warning device, including:
a data acquisition module, configured to collect monitoring data corresponding to resources in a target service through a preset monitoring collector and, if the monitoring data is collected, compare the monitoring data with a preset alarm threshold;
a type determination module, configured to obtain the target resources whose monitoring data exceeds the preset alarm threshold and determine, from the target monitoring data corresponding to the target resources, whether the target resources are of the same type;
and a root cause determination module, configured to, if the target resources are of the same type, use the target monitoring data to determine a target alarm root cause of the alarm event for the target resources according to the association relationships among the resources in the target service, so as to carry out the corresponding alarm and fault repair according to the target alarm root cause.
In a third aspect, the present application discloses an electronic device, comprising:
a memory for storing a computer program;
and a processor for executing the computer program to implement the steps of the operation and maintenance warning method disclosed above.
In a fourth aspect, the present application discloses a computer-readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the operation and maintenance warning method disclosed above.
In the present application, when an operation and maintenance alarm is performed, monitoring data corresponding to the resources in a target service is first collected through a preset monitoring collector; if the monitoring data is collected, it is compared with a preset alarm threshold. The target resources whose monitoring data exceeds the preset alarm threshold are then obtained, and whether they are of the same type is determined from the corresponding target monitoring data. If the target resources are of the same type, the target monitoring data is used to determine the target alarm root cause of the alarm event for the target resources according to the association relationships among the resources in the target service, and the corresponding alarm and fault repair are carried out according to that root cause.
First, the target resources that caused the alarm are preliminarily identified through the preset alarm threshold, whether the alarm event is a flood alarm is then determined by checking whether the target resources are of the same type, and, if it is a flood alarm, the target alarm root cause of the alarm event for the target resources is determined by using the target monitoring data according to the association relationships among the resources in the target service. Second, checking whether the target resources are of the same type to decide whether the alarm event is a flood alarm addresses the situation in which a problem in one key component or resource causes other problems and produces a large number of flood alarms, which reduces the usefulness of the alarms and makes analysis and troubleshooting difficult. Third, because root cause analysis is carried out with the target monitoring data according to the association relationships among the resources in the target service, the root cause can be analyzed promptly when an alarm problem occurs, and alarm recovery and fault repair can subsequently be carried out according to the target alarm root cause, which speeds up alarm recovery and makes operation and maintenance more convenient. In conclusion, the application can quickly handle the alarms generated by a large-scale cluster and analyze the root cause of an alarm or fault so as to make operation and maintenance more convenient.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings in the following description are only embodiments of the present invention; those skilled in the art can obtain other drawings from the provided drawings without creative effort.
FIG. 1 is a flowchart of an operation and maintenance warning method provided in the present application;
FIG. 2 is a schematic diagram of target alarm root cause analysis provided in the present application;
FIG. 3 is a flowchart of a specific operation and maintenance warning method provided in the present application;
FIG. 4 is a flowchart of a specific operation and maintenance warning method provided in the present application;
FIG. 5 is a schematic diagram of an alarm process provided in the present application;
FIG. 6 is a schematic structural diagram of an operation and maintenance warning device provided in the present application;
FIG. 7 is a block diagram of an electronic device provided in the present application.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
Existing alarm handling for monitoring data detects resource anomalies by setting thresholds and generates an alarm when a threshold is exceeded. Currently, in order to reduce flood alarms, multiple pieces of collected monitoring alarm information that have the same attribute are merged. However, the prior art has several drawbacks. First, a fixed alarm threshold must be set manually, and alarms can only be raised against that manually set threshold, so the overall precision and recall are low. Second, the large number of alarms reduces their usefulness; in particular, a problem in one key component or resource can cause other problems that produce a flood alarm, which makes analysis and troubleshooting more difficult. Third, when a problem occurs, only an alarm is raised and the root cause of the problem is not found, so the cause of the alarm still has to be investigated afterwards. For this reason, the application provides an operation and maintenance warning method that can quickly handle the alarms generated by a large-scale cluster and analyze the root cause of an alarm or fault so as to make operation and maintenance more convenient.
The embodiment of the invention discloses an operation and maintenance warning method which, as shown in FIG. 1, comprises the following steps:
Step S11: collecting monitoring data corresponding to resources in a target service through a preset monitoring collector and, if the monitoring data is collected, comparing the monitoring data with a preset alarm threshold.
In this embodiment, the preset alarm threshold is an alarm threshold obtained in advance through a preset interface. The user can adjust the preset alarm threshold according to actual conditions: the threshold can be set manually based on experience, and the result of an automatically adjusted threshold can also be displayed. When the monitoring data does not exceed the preset alarm threshold, no alarm event is triggered. When the monitoring data exceeds the preset alarm threshold, the alarm event is not triggered immediately in this embodiment; a further determination is made first.
In this embodiment, a preset monitoring collector deployed in the target service in advance collects the monitoring data corresponding to the resources in the target service, and once the monitoring data is obtained it is compared with the preset alarm threshold. In a specific embodiment, taking as an example the monitoring of a virtual machine that is deployed by the OpenStack service and has a storage volume attached, the actual value of each metric of the virtual machine is collected through Telegraf, and if the monitoring data is collected, the monitoring data is classified by link type. With this technical solution, the monitoring data of the resources in the target service is collected and compared with the preset alarm threshold, so that the target resources that caused the alarm are preliminarily identified through the preset alarm threshold and can then be examined further.
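The comparison in step S11 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the application: the `Sample` record, the metric names, and the threshold values are hypothetical, and the sketch assumes that "exceeds" means strictly greater than a per-metric upper limit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One collected monitoring value (hypothetical record layout)."""
    resource_id: str
    resource_type: str
    metric: str
    value: float


def find_target_resources(samples, thresholds):
    """Return the samples whose value exceeds the preset threshold of their metric.

    Metrics without a configured threshold are ignored.
    """
    exceeded = []
    for sample in samples:
        limit = thresholds.get(sample.metric)
        if limit is not None and sample.value > limit:
            exceeded.append(sample)
    return exceeded


# Hypothetical data: two virtual machines and one attached storage volume.
samples = [
    Sample("vm-1", "virtual_machine", "cpu_percent", 97.0),
    Sample("vm-2", "virtual_machine", "cpu_percent", 35.0),
    Sample("vol-1", "storage_volume", "io_latency_ms", 250.0),
]
thresholds = {"cpu_percent": 90.0, "io_latency_ms": 200.0}
targets = find_target_resources(samples, thresholds)
```

The result only marks candidates; as described above, no alarm is raised at this point, because the type determination of step S12 comes first.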
Step S12: obtaining the target resources whose monitoring data exceeds the preset alarm threshold, and determining, from the target monitoring data corresponding to the target resources, whether the target resources are of the same type.
In this embodiment, the monitoring data is compared with the preset alarm threshold, the target resources whose monitoring data exceeds the threshold are obtained, and whether the target resources are of the same type is determined from the corresponding target monitoring data. Specifically, if the target resources that caused the alarm event are not of the same type, the alarm event caused by the current target resources is an ordinary alarm event, and the alarm can proceed. If the target resources that caused the alarm events are of the same type, the large number of alarm events they caused constitutes a flood alarm, and the root cause of those alarm events must be found and examined further. With this technical solution, checking whether the target resources are of the same type to decide whether the alarm event is a flood alarm addresses the situation in which a problem in one key component or resource causes other problems and produces a large number of flood alarms, which reduces the usefulness of the alarms and makes analysis and troubleshooting difficult.
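The type determination of step S12 reduces to checking whether every threshold-exceeding resource has the same type. The following sketch assumes that each target resource carries a type label; the labels and the function name are hypothetical, and treating an empty batch as "not a flood alarm" is an assumption made here, since the text does not cover that case.

```python
def is_flood_alarm(resource_types):
    """Return True when all threshold-exceeding resources share one type.

    An empty batch is treated as not a flood alarm (assumption).
    """
    return len(resource_types) > 0 and len(set(resource_types)) == 1


# Hypothetical batches of target resource types.
same_type_batch = ["storage_volume", "storage_volume", "storage_volume"]
mixed_batch = ["storage_volume", "virtual_machine"]

route_same = "root_cause_analysis" if is_flood_alarm(same_type_batch) else "ordinary_alarm"
route_mixed = "root_cause_analysis" if is_flood_alarm(mixed_batch) else "ordinary_alarm"
```

A same-type batch is routed to the root cause analysis of step S13, while a mixed batch continues as an ordinary alarm.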
Step S13: if the target resources are of the same type, using the target monitoring data to determine a target alarm root cause of the alarm event for the target resources according to the association relationships among the resources in the target service, so as to carry out the corresponding alarm and fault repair according to the target alarm root cause.
In this embodiment, using the target monitoring data to determine the target alarm root cause of the alarm event for the target resources according to the association relationships among the resources in the target service includes: determining the alarm event corresponding to the target monitoring data, and deriving the current state of the target resources from the alarm event; then analyzing the cause of the alarm event according to the current state of the target resources, and determining the target alarm root cause of the alarm event for the target resources according to the association relationships among the resources in the target service. Further, alarms or alarm items of high importance, and alarm items with a wide scope of impact, are stored in advance in a preset scenario evaluation database, and the association relationships among the resources in the target service are entered for the associated scenarios. Entries in the preset scenario evaluation database can be identified and recorded automatically, for example a monitoring item whose metric reaches a certain value and causes downtime, or a scenario with a wide scope of impact, such as a storage back end becoming inaccessible so that its storage volumes cannot be accessed; entries can also be identified manually, with the corresponding corrections fed back. Root cause analysis finds the root cause of an alarm or fault according to rules or algorithms; when the root-cause alarm clears, the other alarms it produced also clear. Specifically, as shown in the target alarm root cause analysis diagram of FIG. 2, if the target resources are of the same type, data from modules such as compute, storage, and network is received to generate resource entities, and data from the monitoring module yields alarm entities. The alarm entities are mounted on the corresponding resource entities according to their correspondence, forming a directed entity graph. A scenario evaluator then derives alarms and states from the preset scenario evaluation database according to the formulated rules, which realizes causal root cause analysis and associates the root causes, so that the target alarm root cause of the alarm event for the target resources is determined.
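One simple way to use the association relationships among resources is to look for upstream resources shared by all alarming resources. The sketch below is only an illustration of that idea and is not the scenario evaluator described above: the dependency map, the resource names, and the "shared upstream dependency" rule are all assumptions introduced for the example.

```python
def upstream(resource, depends_on):
    """Collect every resource reachable by following dependency edges."""
    seen = set()
    stack = list(depends_on.get(resource, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(depends_on.get(node, []))
    return seen


def shared_root_candidates(alarming, depends_on):
    """Return the resources that every alarming resource depends on,
    directly or indirectly. These are root cause candidates only."""
    upstream_sets = [upstream(resource, depends_on) for resource in alarming]
    return set.intersection(*upstream_sets) if upstream_sets else set()


# Hypothetical directed graph: three volumes on one storage back end.
depends_on = {
    "vol-1": ["backend-a"],
    "vol-2": ["backend-a"],
    "vol-3": ["backend-a"],
    "backend-a": ["storage-net"],
}
candidates = shared_root_candidates(["vol-1", "vol-2", "vol-3"], depends_on)
```

In this example the candidates are `backend-a` and `storage-net`; narrowing them to a single root cause would be the job of the rules in the scenario evaluation database, which this sketch does not model.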
In this embodiment, after the target alarm root cause of the alarm event for the target resources is determined by using the target monitoring data according to the association relationships among the resources in the target service, the method further includes: determining, based on the preset alarm importance and according to the target alarm root cause, the target alarm importance level corresponding to the alarm event for the target resources; if the target alarm importance level is higher than the preset level threshold, raising the alarm for the alarm event of the target resources immediately; and if the target alarm importance level is not higher than the preset level threshold, waiting until the alarm time point that follows the next preset alarm period and then raising the alarm for the alarm event of the target resources. Specifically, the monitoring item data is compared with the alarm threshold and combined with the data in label management, the corresponding alarm task is triggered to generate an alarm, and the alarm information is sent to the corresponding services, such as elastic scaling and hot capacity expansion. After root cause analysis, alarms with a wide scope of impact and a high importance level are pushed actively; the periods of the alarm tasks differ.
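The importance-based timing can be sketched as a function that returns when an alarm should be sent. This is a minimal sketch under assumptions: importance levels are modelled as integers, time as seconds, and the "alarm time point after the next preset alarm period" is interpreted as the next period boundary strictly after the current time. None of these details are fixed by the text.

```python
def dispatch_time(level, level_threshold, now, period_seconds):
    """Return the time (in seconds) at which the alarm should be raised.

    Levels above the threshold are sent immediately; all others wait for
    the next period boundary after `now` (interpretation, see lead-in).
    """
    if level > level_threshold:
        return now
    return (now // period_seconds + 1) * period_seconds


# Hypothetical values: 60-second alarm period, level threshold 2.
urgent_at = dispatch_time(level=3, level_threshold=2, now=125, period_seconds=60)
routine_at = dispatch_time(level=1, level_threshold=2, now=125, period_seconds=60)
```

With these values the level-3 alarm is sent at second 125, while the level-1 alarm waits until second 180.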
In this embodiment, after the target alarm root cause of the alarm event for the target resources is determined by using the target monitoring data according to the association relationships among the resources in the target service, the method further includes: determining the corresponding preset fault repair script according to the target alarm root cause, and repairing the fault corresponding to the target alarm root cause through the preset fault repair script; if the fault corresponding to the target alarm root cause is repaired successfully, performing alarm recovery and pushing a corresponding fault-repair-success event; and if the fault is not repaired successfully, pushing the target alarm root cause to the relevant operation and maintenance personnel so that they can adjust the preset fault repair script or repair the fault manually. Specifically, various fault repair scripts are injected in advance, and when a corresponding fault occurs, the relevant script is called to repair it. After a successful repair, a repair-success event or an alarm recovery is pushed. When the repair is unsuccessful, the result is pushed to the operation and maintenance personnel: an alarm, an event, or a repair-failure notice is generated and sent to the responsible person through configured channels such as SMS or email, for script adjustment or manual repair.
With this technical solution, root cause analysis is carried out with the target monitoring data according to the association relationships among the resources in the target service, so that the root cause can be analyzed promptly when an alarm problem occurs, and alarm recovery and fault repair can subsequently be carried out according to the target alarm root cause, which speeds up alarm recovery and makes operation and maintenance more convenient. In addition, alarm items with a high alarm level, important alarm items, and alarm items with a wide scope of impact are labeled, so that an alarm with a large scope of impact can be pushed in real time once it is generated, which reduces the delay of important alarms.
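The repair flow can be sketched as a lookup from root cause to a pre-registered repair callable, followed by a success or failure event. The registry contents, the event names, and the `push_event` callback are hypothetical; the sketch also assumes that a missing script or a script that raises an exception counts as a failed repair, which the text does not state.

```python
def attempt_repair(root_cause, repair_scripts, push_event):
    """Run the repair script registered for `root_cause` and push the outcome.

    A missing script or an exception is treated as a failed repair (assumption).
    """
    script = repair_scripts.get(root_cause)
    try:
        repaired = bool(script()) if script is not None else False
    except Exception:
        repaired = False
    # On success the alarm is recovered; on failure the root cause is
    # forwarded so that personnel can adjust the script or repair manually.
    push_event("repair_succeeded" if repaired else "repair_failed", root_cause)
    return repaired


# Hypothetical registry: each script returns True when the repair worked.
repair_scripts = {
    "backend_unreachable": lambda: True,
    "disk_full": lambda: False,
}
events = []


def record(kind, cause):
    events.append((kind, cause))


results = [
    attempt_repair("backend_unreachable", repair_scripts, record),
    attempt_repair("disk_full", repair_scripts, record),
    attempt_repair("unknown_cause", repair_scripts, record),
]
```

In a deployment, `push_event` would be wired to the configured notification channels; here it only records the events in a list so the flow can be inspected.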
It can be seen that, in this embodiment, when an operation and maintenance alarm is performed, monitoring data corresponding to the resources in a target service is first collected by a preset monitoring collector and compared with a preset alarm threshold. The target resources whose monitoring data exceeds the preset alarm threshold are then acquired, and it is judged whether the target resources are resources of the same type. If they are, the target alarm root cause of the alarm event is analyzed, and the corresponding alarm and fault repair are performed according to that root cause. In other words, the preset alarm threshold gives a preliminary judgment of which target resources caused the alarm, and the same-type judgment then determines whether the alarm event is a flood alarm; if it is, the target alarm root cause of the alarm event for the target resources is determined by using the target monitoring data according to the association relationship among the resources in the target service. On the one hand, judging whether the alarm event is a flood alarm addresses the situation in which a problem in one key component or resource triggers other problems and thereby a large number of flood alarms, which reduces the effectiveness of alarming and makes analysis and troubleshooting difficult. On the other hand, because root cause analysis is performed on the target monitoring data according to the association relationship among the resources in the target service, the root cause can be analyzed promptly when an alarm occurs, and alarm recovery and fault repair can subsequently be carried out according to the target alarm root cause, which speeds up alarm recovery and makes operation and maintenance more convenient. In summary, the present application can rapidly handle the alarms generated by a large-scale cluster and analyze the root cause of an alarm or fault, thereby making operation and maintenance more convenient.
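As an illustration of the threshold comparison and same-type judgment summarized above, the following is a minimal sketch only. The `Sample` fields, the per-metric threshold table, and the rule that a flood alarm means "more than one target resource, all of one type" are assumptions made for the example; the application does not specify how the same-type judgment is implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One monitoring data point (all field names are hypothetical)."""
    resource_id: str
    resource_type: str  # e.g. "storage_volume"
    metric: str
    value: float


def exceeding(samples, thresholds):
    """Return the samples whose value exceeds the preset threshold of their metric."""
    return [
        s for s in samples
        if s.metric in thresholds and s.value > thresholds[s.metric]
    ]


def is_flood_alarm(targets):
    """Assumed rule: several target resources that all share one resource type."""
    types = {t.resource_type for t in targets}
    return len(targets) > 1 and len(types) == 1
```

With this rule, two storage volumes that both exceed a latency threshold would be handled as one flood alarm and passed to root cause analysis, while a single resource or a mix of types would be handled as an ordinary alarm.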
Referring to Fig. 3, an embodiment of the present invention discloses a specific operation and maintenance alarm method. Compared with the previous embodiment, this embodiment further explains and optimizes the technical solution.
Step S21: collecting, through a preset monitoring collector, monitoring data corresponding to the resources in the target service.
Step S22: if the monitoring data corresponding to the resources in the target service cannot be collected through the preset monitoring collector, searching for and storing, through request call link analysis, the error information of the data call request that the preset monitoring collector sent to collect the data.
In this embodiment, searching for and storing, through request call link analysis, the error information of the data call request sent by the preset monitoring collector to collect the data includes: determining the path of the data call request in the system according to the unique request identifier (ID) of that data call request; and searching the path for a call failure event that occurred while the data call request was being processed, and storing the error information corresponding to the call failure event. Specifically, every call on the link of a request is recorded. Each call request generates a globally unique ID that identifies the request; the ID does not change during the calling process and is passed along with the call at each layer, so that the path of the request through the system can be concatenated by means of this ID. If the monitoring data corresponding to the resources in the target service cannot be collected by the preset monitoring collector, the situation is classified as a link-unreachable type; if a call fails with an error at some point in the calling process, the error information is stored and root cause analysis is performed on it.
In a specific embodiment, if the storage network is unreachable, it is inferred that the network of the storage volumes on that storage is also unreachable, and the unreachable storage network is taken as the optimal solution; if the path of a storage volume is unreachable, the unreachable storage volume path is taken as the suboptimal solution; if both the storage volume path and the storage are unreachable, the unreachable storage path is taken as the root cause. Root causes are correlated according to the order of occurrence, the hierarchical relationship, and the like. With this technical solution, analyzing the call link of the failed data request removes the difficulty of troubleshooting, so that root cause analysis and troubleshooting can still be carried out when the monitoring data corresponding to the resources in the target service cannot be collected.
Step S23: determining, by using the error information and according to the association relationship among the resources in the target service, a target alarm root cause of the alarm event for the target resource, so as to perform the corresponding alarm and fault repair according to the target alarm root cause.
It can be seen that, in this embodiment, the error information of the data call request sent by the preset monitoring collector to collect the data is searched for and stored through request call link analysis, and the error information is then used to determine the target alarm root cause of the alarm event for the target resource according to the association relationship among the resources in the target service.
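The request call link analysis of steps S21 to S23 can be sketched as follows. This is an illustrative sketch under stated assumptions: the `CallRecord` structure, its `order` field, and the layer names are hypothetical, and the application does not prescribe how call records are stored.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CallRecord:
    """One recorded call on the link of a request (hypothetical structure)."""
    request_id: str              # globally unique ID, unchanged across layers
    layer: str                   # name of the layer that handled the call
    order: int                   # position of the call on the link
    error: Optional[str] = None  # error information if the call failed


def request_path(records, request_id):
    """Concatenate the path of one request by filtering on its unique ID."""
    hops = [r for r in records if r.request_id == request_id]
    return sorted(hops, key=lambda r: r.order)


def find_call_failures(records, request_id):
    """Return (layer, error) for every failed call on the path of the request."""
    return [
        (r.layer, r.error)
        for r in request_path(records, request_id)
        if r.error is not None
    ]
```

The stored error information returned by `find_call_failures` would then be the input to root cause determination in step S23.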
Referring to Fig. 4, an embodiment of the present invention discloses a specific operation and maintenance alarm method. Compared with the previous embodiment, this embodiment further explains and optimizes the technical solution.
Step S31: collecting, through a preset monitoring collector, monitoring data corresponding to the resources in a target service, and if the monitoring data is collected, comparing the monitoring data with a preset alarm threshold.
Step S32: acquiring the target resource whose monitoring data exceeds the preset alarm threshold, and judging, according to the target monitoring data corresponding to the target resource, whether the target resource is a resource of the same type.
Step S33: if the target resource is not a resource of the same type, judging the target alarm importance level corresponding to the alarm event of the target resource according to the relationship between the target monitoring data and a preset alarm importance.
Specifically, whether the target resource is a resource of the same type is judged from the target monitoring data corresponding to the target resource. If the target resource that caused the alarm event is not a resource of the same type, the alarm event caused by the current target resource is an ordinary alarm event, and alarming can proceed.
Step S34: alarming for the alarm event of the target resource according to the target alarm importance level.
Specifically, if the target alarm importance level is higher than a preset level threshold, the alarm event of the target resource is alarmed immediately; if the target alarm importance level is not higher than the preset level threshold, the alarm event of the target resource is alarmed at the alarm time point that follows the next preset alarm period. Further, the alarm flow is shown in Fig. 5. Monitoring data corresponding to the resources in the target service is collected through the preset monitoring collector. If the monitoring data is collected, it is compared with the preset alarm threshold, the target resource whose monitoring data exceeds the preset alarm threshold is acquired, and whether the target resource is a resource of the same type is judged from the target monitoring data corresponding to the target resource. If the target resource is a resource of the same type, the target alarm root cause of the alarm event for the target resource is determined by using the target monitoring data according to the association relationship among the resources in the target service. If the monitoring data corresponding to the resources in the target service cannot be collected through the preset monitoring collector, the error information of the data call request sent by the preset monitoring collector to collect the data is searched for and stored through request call link analysis, and the target alarm root cause of the alarm event for the target resource is determined by using the error information according to the association relationship among the resources in the target service. If the target resource is not a resource of the same type, the target alarm importance level corresponding to the alarm event of the target resource is judged according to the relationship between the target monitoring data and the preset alarm importance, and the alarm for the alarm event of the target resource is issued according to the target alarm importance level.
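The three branches of the alarm flow described for Fig. 5 can be sketched as a small dispatch function. This is a sketch under assumptions, not the claimed implementation: samples are modeled as `(resource_type, metric, value)` tuples, a failed collection is modeled as `None`, the same-type rule is taken to mean "more than one target resource, all of one type", and the `NO_ALARM` branch for data that stays within the threshold is added only to make the example complete.

```python
from enum import Enum


class Branch(Enum):
    LINK_ANALYSIS = "link_analysis"  # monitoring data could not be collected
    NO_ALARM = "no_alarm"            # nothing exceeds the threshold (assumed branch)
    ROOT_CAUSE = "root_cause"        # same-type targets: treat as flood alarm
    LEVEL_ALARM = "level_alarm"      # mixed targets: alarm by importance level


def choose_branch(samples, thresholds):
    """Pick the processing branch for one collection round.

    samples: None if collection failed, otherwise a list of
             (resource_type, metric, value) tuples.
    thresholds: preset alarm threshold per metric.
    """
    if samples is None:
        return Branch.LINK_ANALYSIS
    # Metrics without a configured threshold never trigger an alarm here.
    targets = [s for s in samples if s[2] > thresholds.get(s[1], float("inf"))]
    if not targets:
        return Branch.NO_ALARM
    same_type = len(targets) > 1 and len({s[0] for s in targets}) == 1
    return Branch.ROOT_CAUSE if same_type else Branch.LEVEL_ALARM
```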
Referring to Fig. 6, an embodiment of the present application discloses an operation and maintenance alarm device, including:
a data acquisition module 11, configured to collect, through a preset monitoring collector, monitoring data corresponding to the resources in a target service, and if the monitoring data is collected, compare the monitoring data with a preset alarm threshold;
a type judgment module 12, configured to acquire a target resource whose monitoring data exceeds the preset alarm threshold, and judge, according to target monitoring data corresponding to the target resource, whether the target resource is a resource of the same type;
a root cause determining module 13, configured to determine, if the target resource is a resource of the same type, a target alarm root cause of an alarm event for the target resource by using the target monitoring data according to the association relationship among the resources in the target service, so as to perform corresponding alarm and fault repair according to the target alarm root cause.
It can be seen that the device of this embodiment provides the same effects as the method embodiment. Monitoring data corresponding to the resources in a target service is first collected by a preset monitoring collector and compared with a preset alarm threshold; the target resources whose monitoring data exceeds the threshold are acquired, and it is judged whether they are resources of the same type, that is, whether the alarm event is a flood alarm. If it is, the target alarm root cause of the alarm event is determined by using the target monitoring data according to the association relationship among the resources in the target service, and the corresponding alarm and fault repair are performed according to that root cause. Judging whether the alarm event is a flood alarm addresses the situation in which a problem in one key component or resource triggers other problems and thereby a large number of flood alarms, which reduces the effectiveness of alarming and makes analysis and troubleshooting difficult. Performing root cause analysis according to the association relationship among the resources allows the root cause to be analyzed promptly when an alarm occurs, so that alarm recovery and fault repair can subsequently be carried out according to the target alarm root cause, which speeds up alarm recovery and makes operation and maintenance more convenient. In summary, the present application can rapidly handle the alarms generated by a large-scale cluster and analyze the root cause of an alarm or fault, thereby making operation and maintenance more convenient.
In some specific embodiments, the operation and maintenance warning device further includes:
a link analysis module, configured to search for and store, through request call link analysis, the error information of the data call request sent by the preset monitoring collector to collect the data, if the monitoring data corresponding to the resources in the target service cannot be collected by the preset monitoring collector;
correspondingly, the root cause determining module 13 is specifically configured to: determine, by using the error information and according to the association relationship among the resources in the target service, a target alarm root cause of the alarm event for the target resource, so as to perform corresponding alarm and fault repair according to the target alarm root cause.
In some specific embodiments, the link analysis module specifically includes:
a path determining unit, configured to determine the path of the data call request in the system according to the unique request identifier of the data call request sent by the preset monitoring collector to collect the data;
and a failure event searching unit, configured to search the path for a call failure event that occurred while the data call request was being processed, and to store the error information corresponding to the call failure event.
In some specific embodiments, the operation and maintenance warning device further includes:
a first level determining module, configured to judge, if the target resource is not a resource of the same type, the target alarm importance level corresponding to the alarm event of the target resource according to the relationship between the target monitoring data and a preset alarm importance;
and a first alarm module, configured to alarm for the alarm event of the target resource according to the target alarm importance level.
In some specific embodiments, the operation and maintenance warning device further includes:
a second level determining module, configured to judge, based on the preset alarm importance and according to the target alarm root cause, the target alarm importance level corresponding to the alarm event for the target resource;
a timely alarm module, configured to alarm for the alarm event of the target resource immediately if the target alarm importance level is higher than the preset level threshold;
and a delay alarm module, configured to alarm for the alarm event of the target resource at the alarm time point that follows the next preset alarm period if the target alarm importance level is not higher than the preset level threshold.
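The split between the timely alarm module and the delay alarm module can be illustrated by computing when an alarm should be sent. This is a minimal sketch with assumptions: importance levels are modeled as integers where a larger value is more important, times are seconds on a numeric clock, and "the alarm time point after the next preset alarm period" is interpreted as the next period boundary counted from a hypothetical `epoch`.

```python
def next_alarm_time(level, level_threshold, now, period, epoch=0.0):
    """Return the time at which the alarm should be sent.

    level > level_threshold: alarm immediately (returns `now`).
    otherwise: wait for the next boundary of the preset alarm period.
    """
    if level > level_threshold:
        return now
    elapsed = (now - epoch) % period   # time already spent in the current period
    return now + (period - elapsed)    # start of the following period
```

For example, with a 60-second period, a low-importance alarm raised at t = 100 would be sent at t = 120, while a high-importance alarm raised at the same moment would be sent at once.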
In some specific embodiments, the root cause determining module 13 specifically includes:
a state determining unit, configured to determine, by using the target monitoring data, the alarm event corresponding to the target monitoring data, and to infer the current state of the target resource from the alarm event;
and an alarm cause determining unit, configured to analyze the cause of the alarm event according to the current state of the target resource, and to determine the target alarm root cause of the alarm event for the target resource according to the association relationship among the resources in the target service.
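One way to picture root cause determination from the association relationship among resources is to follow dependencies upward while the upstream resource is also abnormal. The sketch below is illustrative only: the single-parent `depends_on` mapping, the `abnormal` set standing in for the inferred current states, and all resource names are assumptions, since the application does not specify how the association relationship is represented.

```python
def root_cause(resource, depends_on, abnormal):
    """Walk up the dependency chain from an alarming resource.

    depends_on: maps a resource to the single resource it depends on (assumed).
    abnormal:   set of resources whose inferred current state is abnormal.
    Returns the highest abnormal resource reachable from `resource`.
    """
    current = resource
    seen = {current}  # guards against cycles in the mapping
    while True:
        parent = depends_on.get(current)
        if parent is None or parent not in abnormal or parent in seen:
            return current
        seen.add(parent)
        current = parent


# Hypothetical storage hierarchy: a VM uses a volume path, which relies on
# the volume network, which in turn relies on the storage network.
deps = {
    "vm": "volume_path",
    "volume_path": "volume_network",
    "volume_network": "storage_network",
}
```

Under these assumptions, if every layer is abnormal the storage network is reported as the root cause, whereas if only the VM and the volume path are abnormal the volume path is reported.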
In some specific embodiments, the operation and maintenance warning device further includes:
a fault repairing unit, configured to determine the corresponding preset fault repair script according to the target alarm root cause, and to repair the fault corresponding to the target alarm root cause through the preset fault repair script;
an alarm recovery module, configured to perform alarm recovery and push a corresponding fault repair success event if the fault corresponding to the target alarm root cause is successfully repaired;
and a root cause pushing module, configured to push the target alarm root cause to the relevant operation and maintenance personnel, so that they can adjust the preset fault repair script or manually perform the corresponding fault repair, if the fault corresponding to the target alarm root cause is not successfully repaired.
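The repair, recovery, and push behavior of these modules can be sketched as below. This is a hedged illustration: repair scripts are modeled as Python callables that return `True` on success, `notify_ops` is a hypothetical callback for pushing the root cause to operation and maintenance personnel, and the returned strings are placeholder event names, none of which come from the application.

```python
def handle_root_cause(cause, repair_scripts, notify_ops):
    """Run the preset repair script for a root cause, falling back to a push.

    repair_scripts: maps a root cause to a callable returning True on success.
    notify_ops:     callable that receives the root cause when repair fails.
    """
    script = repair_scripts.get(cause)
    repaired = False
    if script is not None:
        try:
            repaired = bool(script())
        except Exception:
            # A crashing script counts as an unsuccessful repair.
            repaired = False
    if repaired:
        return "fault_repaired"   # alarm recovery and success event would follow
    notify_ops(cause)             # personnel adjust the script or repair manually
    return "pushed_to_ops"
```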
Fig. 7 illustrates an electronic device 20 according to an embodiment of the present application. The electronic device 20 may specifically include: at least one processor 21, at least one memory 22, a power supply 23, a communication interface 24, an input/output interface 25, and a communication bus 26. The memory 22 is used for storing a computer program, and the computer program is loaded and executed by the processor 21 to implement the relevant steps of the operation and maintenance alarm method disclosed in any one of the foregoing embodiments. In addition, the electronic device 20 in this embodiment may specifically be a computer.
In this embodiment, the power supply 23 is used to provide voltage for each hardware device on the electronic device 20; the communication interface 24 can create a data transmission channel between the electronic device 20 and an external device, and a communication protocol followed by the communication interface is any communication protocol applicable to the technical solution of the present application, and is not specifically limited herein; the input/output interface 25 is configured to obtain external input data or output data to the outside, and a specific interface type thereof may be selected according to specific application requirements, which is not specifically limited herein.
In addition, the memory 22 is used as a carrier for resource storage, and may be a read-only memory, a random access memory, a magnetic disk or an optical disk, etc., and the resources stored thereon may include an operating system 221, a computer program 222, etc., and the storage manner may be a transient storage manner or a permanent storage manner.
The operating system 221 is used for managing and controlling each hardware device on the electronic device 20 as well as the computer program 222, and may be Windows Server, Netware, Unix, Linux, or the like. In addition to the computer program that can be used to perform the operation and maintenance alarm method disclosed in any of the foregoing embodiments and executed by the electronic device 20, the computer programs 222 may further include computer programs that can be used to perform other specific tasks.
Further, the present application also discloses a computer readable storage medium for storing a computer program; wherein the computer program realizes the operation and maintenance warning method disclosed in the foregoing when being executed by a processor. For the specific steps of the method, reference may be made to the corresponding contents disclosed in the foregoing embodiments, which are not described herein again.
Finally, it should also be noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element.
The operation and maintenance alarm method, device, equipment and medium provided by the present invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the invention, and the description of the above embodiments is only intended to help understand the method and core idea of the invention. Meanwhile, a person skilled in the art may, according to the idea of the present invention, make changes to the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (10)

1. An operation and maintenance alarm method, characterized by comprising:
collecting, through a preset monitoring collector, monitoring data corresponding to resources in a target service, and if the monitoring data is collected, comparing the monitoring data with a preset alarm threshold;
acquiring a target resource whose monitoring data exceeds the preset alarm threshold, and judging, according to target monitoring data corresponding to the target resource, whether the target resource is a resource of the same type;
and if the target resource is a resource of the same type, determining, by using the target monitoring data and according to the association relationship among the resources in the target service, a target alarm root cause of an alarm event for the target resource, so as to perform corresponding alarm and fault repair according to the target alarm root cause.
2. The operation and maintenance alarm method according to claim 1, wherein after the collecting, through the preset monitoring collector, of the monitoring data corresponding to the resources in the target service, the method further comprises:
if the monitoring data corresponding to the resources in the target service cannot be collected through the preset monitoring collector, searching for and storing, through request call link analysis, error information of the data call request sent by the preset monitoring collector to collect the data;
correspondingly, the determining, by using the target monitoring data and according to the association relationship among the resources in the target service, of a target alarm root cause of an alarm event for the target resource, so as to perform corresponding alarm and fault repair according to the target alarm root cause, comprises:
determining, by using the error information and according to the association relationship among the resources in the target service, the target alarm root cause of the alarm event for the target resource, so as to perform corresponding alarm and fault repair according to the target alarm root cause.
3. The operation and maintenance alarm method according to claim 2, wherein the searching for and storing, through request call link analysis, of the error information of the data call request sent by the preset monitoring collector to collect the data comprises:
determining the path of the data call request in the system according to the unique request identifier of the data call request sent by the preset monitoring collector to collect the data;
and searching the path for a call failure event that occurred while the data call request was being processed, and storing the error information corresponding to the call failure event.
4. The operation and maintenance alarm method according to claim 1, wherein after the acquiring of the target resource whose monitoring data exceeds the preset alarm threshold and the judging, according to the target monitoring data corresponding to the target resource, of whether the target resource is a resource of the same type, the method further comprises:
if the target resource is not a resource of the same type, judging a target alarm importance level corresponding to the alarm event of the target resource according to the relationship between the target monitoring data and a preset alarm importance;
and alarming for the alarm event of the target resource according to the target alarm importance level.
5. The operation and maintenance alarm method according to claim 4, wherein after the determining, if the target resource is a resource of the same type, of the target alarm root cause of the alarm event for the target resource by using the target monitoring data according to the association relationship among the resources in the target service, the method further comprises:
judging, based on the preset alarm importance and according to the target alarm root cause, the target alarm importance level corresponding to the alarm event for the target resource;
if the target alarm importance level is higher than the preset level threshold, alarming for the alarm event of the target resource immediately;
and if the target alarm importance level is not higher than the preset level threshold, alarming for the alarm event of the target resource at the alarm time point that follows the next preset alarm period.
6. The operation and maintenance alarm method according to claim 1, wherein the determining, by using the target monitoring data and according to the association relationship among the resources in the target service, of a target alarm root cause of an alarm event for the target resource comprises:
determining, by using the target monitoring data, the alarm event corresponding to the target monitoring data, and inferring the current state of the target resource from the alarm event;
and analyzing the cause of the alarm event according to the current state of the target resource, and determining the target alarm root cause of the alarm event for the target resource according to the association relationship among the resources in the target service.
7. The operation and maintenance alarm method according to any one of claims 1 to 6, wherein after the determining, if the target resource is a resource of the same type, of the target alarm root cause of the alarm event for the target resource by using the target monitoring data according to the association relationship among the resources in the target service, the method further comprises:
determining a corresponding preset fault repair script according to the target alarm root cause, and repairing the fault corresponding to the target alarm root cause through the preset fault repair script;
if the fault corresponding to the target alarm root cause is successfully repaired, performing alarm recovery and pushing a corresponding fault repair success event;
and if the fault corresponding to the target alarm root cause is not successfully repaired, pushing the target alarm root cause to the relevant operation and maintenance personnel to adjust the preset fault repair script or manually perform the corresponding fault repair.
8. An operation and maintenance alarm device, comprising:
a data acquisition module, configured to collect, through a preset monitoring collector, monitoring data corresponding to resources in a target service, and if the monitoring data is collected, compare the monitoring data with a preset alarm threshold;
a type judgment module, configured to acquire a target resource whose monitoring data exceeds the preset alarm threshold, and judge, according to target monitoring data corresponding to the target resource, whether the target resource is a resource of the same type;
and a root cause determining module, configured to determine, if the target resource is a resource of the same type, a target alarm root cause of an alarm event for the target resource by using the target monitoring data according to the association relationship among the resources in the target service, so as to perform corresponding alarm and fault repair according to the target alarm root cause.
9. An electronic device, comprising:
a memory for storing a computer program;
a processor for executing the computer program to implement the steps of the operation and maintenance alarm method according to any one of claims 1 to 7.
10. A computer-readable storage medium for storing a computer program; wherein the computer program when executed by a processor implements the steps of the operation and maintenance alarm method according to any of claims 1 to 7.
CN202210764564.7A 2022-06-30 2022-06-30 Operation and maintenance warning method, device, equipment and medium Pending CN115174350A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210764564.7A CN115174350A (en) 2022-06-30 2022-06-30 Operation and maintenance warning method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN115174350A true CN115174350A (en) 2022-10-11

Family

ID=83490040

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210764564.7A Pending CN115174350A (en) 2022-06-30 2022-06-30 Operation and maintenance warning method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN115174350A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115471215A (en) * 2022-10-31 2022-12-13 江西省煤田地质局普查综合大队 Business process processing method and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020111755A1 (en) * 2000-10-19 2002-08-15 Tti-Team Telecom International Ltd. Topology-based reasoning apparatus for root-cause analysis of network faults
US8738972B1 (en) * 2011-02-04 2014-05-27 Dell Software Inc. Systems and methods for real-time monitoring of virtualized environments
CN111814999A (en) * 2020-07-08 2020-10-23 上海燕汐软件信息科技有限公司 Fault work order generation method, device and equipment
CN112148772A (en) * 2020-09-24 2020-12-29 创新奇智(成都)科技有限公司 Alarm root cause identification method, device, equipment and storage medium
CN114327964A (en) * 2020-10-10 2022-04-12 腾讯科技(深圳)有限公司 Method, device, equipment and storage medium for processing fault reasons of service system
CN114443437A (en) * 2022-01-28 2022-05-06 中国建设银行股份有限公司 Alarm root cause output method, apparatus, device, medium, and program product

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
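Zhao Jigang; Zhang Chao; Ding Jianli; Wang Jing: "Mining of alarm association rules for a civil aviation passenger service information system", Computer Applications and Software, no. 04 *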

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination