WO2023138058A1 - Method and apparatus for processing alarm events, and computer-readable storage medium - Google Patents

Method and apparatus for processing alarm events, and computer-readable storage medium

Info

Publication number
WO2023138058A1
Authority
WO
WIPO (PCT)
Prior art keywords
event
alarm
resource
alarm event
monitored
Prior art date
Application number
PCT/CN2022/115339
Other languages
English (en)
French (fr)
Inventor
武警贺
闫冬冬
Original Assignee
苏州浪潮智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 苏州浪潮智能科技有限公司
Publication of WO2023138058A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/30: Monitoring
    • G06F11/3003: Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/302: Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a software system
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/30: Monitoring
    • G06F11/3051: Monitoring arrangements for monitoring the configuration of the computing system or of the computing system component, e.g. monitoring the presence of processing resources, peripherals, I/O links, software programs

Definitions

  • the present application relates to the field of computer technology, and in particular to a method, device and non-volatile computer-readable storage medium for processing an alarm event.
  • Virtualization solves this problem by consolidating multiple servers into a single server, running multiple virtual environments, which ultimately saves physical space.
  • the server virtualization platform manages a large number of device resources, including physical resources such as underlying server hosts, disks, and networks, as well as virtual resources such as virtual machines, shared storage, and virtual networks that users build spontaneously according to business needs.
  • The virtualization platform continuously updates and receives the monitoring data of all resources (including physical resources and virtual resources) and the events reported from the bottom layer of the various resources.
  • the inventor realizes that in order to allow users to understand the operating status of the system (a system formed by integrating servers through virtualization technology), the current method is to directly display the events that trigger alarms (hereinafter referred to as alarm events) indiscriminately to users.
  • Although this method allows users to understand the operating status of the system, some of the events that trigger alarms are more harmful to system operation and need to be resolved urgently, while others are less harmful and can be handled later. With the current processing method, users cannot arrange a reasonable processing order for alarm events, which may even lead to system interruption or downtime and reduces the reliability of system operation.
  • a method for processing an alarm event including the following steps: acquiring an event generated during operation of a resource to be monitored;
  • the alarm parameter includes at least a target usage frequency corresponding to the alarm event corresponding to the resource to be monitored in the current time period, and an influencing factor used to characterize the severity of the alarm event;
  • the priorities of the obtained alarm events are determined according to the obtained alarm parameters so as to be displayed in the order of the determined priorities when a viewing request from the user is received.
  • an alarm event processing device including:
  • the first obtaining module is used to obtain events generated during the operation of the resource to be monitored
  • a judging module configured to judge whether the acquired event satisfies the alarm condition; trigger the second acquisition module in response to the event satisfying the alarm condition;
  • the second acquiring module is used to acquire the alarm parameters set for the alarm events satisfying the alarm conditions, wherein the alarm parameters include at least the target usage frequency of the resource to be monitored corresponding to the alarm event in the current time period, and the influence factor used to characterize the severity of the alarm event;
  • the determination module is configured to determine the priority of the obtained alarm events according to the obtained alarm parameters, so as to display them in the order of the determined priorities when a viewing request from the user is received.
  • the present application also provides a processing device for alarm events, including a memory for storing computer-readable instructions;
  • One or more processors configured to implement the steps of the above-mentioned alarm event processing method when executing computer-readable instructions.
  • the present application also provides a non-volatile computer-readable storage medium, on which computer-readable instructions are stored, and when the computer-readable instructions are executed by one or more processors, the steps of the above-mentioned alarm event processing method are implemented.
  • FIG. 1 is a flowchart of a method for processing an alarm event provided by one or more embodiments of the present application
  • Fig. 2 is a flow chart of another alarm event processing method provided by one or more embodiments of the present application.
  • FIG. 3 is a schematic diagram of functional modules corresponding to a method for processing an alarm event provided by one or more embodiments of the present application;
  • FIG. 4 is a structural diagram of an alarm event processing device provided by one or more embodiments of the present application.
  • Fig. 5 is a structural diagram of an apparatus for processing an alarm event provided by one or more embodiments of the present application.
  • the core of the present application is to provide an alarm event processing method, device and non-volatile computer-readable storage medium. It should be noted that the method for processing alarm events mentioned in this application can be applied to a single server or a server cluster, and can also be applied to a virtualization platform. Since a large number of device resources are managed in the virtualization platform, many alarm events are involved, so this method is especially applicable to this scenario.
  • Fig. 1 is a flow chart of a method for processing an alarm event provided by the embodiment of the present application. It is worth noting that this application is mainly applied to the processing and sorting of the various types of alarms caused by different events in the server virtualization platform. As shown in Fig. 1, the processing method of the alarm event includes the following steps.
  • S11: Determine whether the acquired event satisfies the alarm condition, and if so, go to step S12.
  • S13: Determine the priorities of the obtained alarm events according to the obtained alarm parameters, so as to display them in the order of the determined priorities when a viewing request from the user is received.
  • The resources to be monitored mentioned in this step can be devices such as a central processing unit (CPU), a graphics card, and a graphics processing unit (GPU).
  • Events can be divided into two categories: events concerning the utilization rate of the resource to be monitored, which can be called threshold alarm events, and events concerning occurrences on the resource to be monitored, which can be called event alarm events.
  • In step S11, it is judged whether the acquired event satisfies the alarm condition, and if so, the method goes to step S12. It is worth noting that this embodiment does not limit the alarm condition. Whether the event meets the alarm condition can be judged by whether the utilization rate of the resource to be monitored exceeds the preset range, or by whether the occurrence on the resource to be monitored is in a normal state, but the judgment is not limited to these methods.
  • In step S12, when the event satisfies the alarm condition, the alarm parameters set for the alarm event are acquired. The alarm parameters include at least the target usage frequency, in the current time period, of the resource to be monitored corresponding to the alarm event, and an influencing factor used to characterize the severity of the alarm event.
  • the usage frequency corresponding to each time period of the resource to be monitored corresponding to the alarm event is recorded, and the target usage frequency refers to the usage frequency of the time period when the resource to be monitored has an alarm event.
  • The target usage frequency evaluates the alarm event in terms of time, while the influencing factor used to characterize the severity of the alarm event evaluates it in terms of space, so this method evaluates the severity of the alarm event through a combination of time and space.
  • this embodiment does not limit the selection of the impact factor. It may be a preset weight coefficient for this event, or it may be the time when the utilization rate of the resource to be monitored exceeds the threshold. This embodiment does not limit the specific content of the impact factor. The impact factor can be selected according to the specific implementation situation.
  • Step S13 determines the priorities of the obtained alarm events according to the obtained alarm parameters, so that the alarm events are displayed in the order of the determined priorities when a viewing request from the user is received.
  • the setting of the priority of the alarm event is through the alarm parameter.
  • A value can be obtained by adding or multiplying the alarm parameters, and the priority is then determined according to the size of this value.
  • For example, suppose the alarm parameters are A, B, and C.
  • The product A × B × C or the sum A + B + C can be used as the priority of the alarm event, but the combination is not limited to multiplying or adding the alarm parameters.
  • There is no limitation on how to display the alarm events according to the priority. They can be arranged in descending or ascending order of priority, and the sorting method can be chosen according to the specific implementation.
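The combination of alarm parameters into a priority value, and the choice of display order, can be sketched as follows. This is only an illustration under stated assumptions: the function names `priority_score` and `sort_alarms`, the alarm names, and all numeric values are hypothetical and do not come from the application.

```python
from math import prod

def priority_score(params, mode="product"):
    """Combine alarm parameters (e.g. A, B, C) into one priority value."""
    if mode == "product":
        return prod(params)
    if mode == "sum":
        return sum(params)
    raise ValueError(f"unknown mode: {mode}")

def sort_alarms(alarms, mode="product", descending=True):
    """Return alarm names ordered by their combined priority value."""
    ranked = sorted(
        alarms.items(),
        key=lambda item: priority_score(item[1], mode),
        reverse=descending,
    )
    return [name for name, _ in ranked]

# Hypothetical alarm parameters (A, B, C) for three alarm events.
alarms = {
    "cpu_high": (0.9, 0.5, 2.0),
    "nic_down": (0.4, 0.8, 3.0),
    "disk_full": (0.2, 0.3, 1.0),
}

print(sort_alarms(alarms))                                # highest priority first
print(sort_alarms(alarms, mode="sum", descending=False))  # lowest priority first
```

Either combination rule and either sort direction is permitted by the text; the sketch simply makes both choices explicit parameters.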
  • the alarm parameter set for the alarm event meeting the alarm condition is acquired.
  • the alarm parameter includes at least a target usage frequency of the resource to be monitored corresponding to the alarm event in the current time period, and an impact factor used to characterize the severity of the alarm event.
  • the adoption of the above technical solution combines the target usage frequency of the resource to be monitored and the impact factor used to characterize the severity of the alarm event, which is equivalent to evaluating the alarm event from time and space, so it can reflect the degree of impact of the alarm event on the system operation to a greater extent. Therefore, determining the priority of each alarm event by this method allows the user to promptly determine the alarm event that has a greater impact on the system and take priority to take processing measures, so that the reliability of the system operation can be improved.
  • the alarm events can be divided into two categories, one is the alarm event when the utilization rate of the resource to be monitored exceeds the threshold, which is called the threshold alarm event, and the other is the alarm event when the occurrence of the resource to be monitored is in an abnormal state, which is called the event alarm event.
  • There are two types of events generated during the operation of the resource to be monitored: the monitoring values of the items to be monitored of the resource, and the events reported by the bottom layer of the resource.
  • When the monitoring value of the item to be monitored of the resource to be monitored is obtained, judging whether the event satisfies the alarm condition becomes judging whether the monitoring value exceeds the threshold. If yes, it is determined that the current event satisfies the alarm condition, that is, it is an alarm event; if not, it is determined that the current event does not meet the alarm condition, that is, it is not an alarm event.
  • This embodiment does not limit the size of the threshold, which depends on the resource to be monitored. For example, a CPU usage rate above 80% may be an alarm event, whereas for a GPU a usage rate of 85% may not constitute an alarm event.
  • Different resources to be monitored can correspond to the same threshold or to different thresholds.
  • the appropriate threshold can be selected according to the resource to be monitored corresponding to the alarm event.
  • the threshold is usually preset, and can also be dynamically set according to the actual situation, which is also within the protection scope of the present application.
  • When the event reported by the bottom layer of the resource is obtained, judging whether the event meets the alarm condition becomes judging whether the reported event is an event set on the alarm blacklist. If yes, it is determined that the current event satisfies the alarm condition, that is, it is an alarm event; if not, it is determined that the current event does not meet the alarm condition, that is, it is not an alarm event.
  • The events set on the alarm blacklist are unexpected events that have occurred before, as well as events that have not yet occurred but represent an abnormal state, such as the sudden disconnection of a host network card or the sudden failure of a CPU. It should be noted that the events set on the alarm blacklist can be modified according to actual conditions, for example by adding or deleting events, which is also within the protection scope of the present application.
  • There are two possibilities for acquiring the events generated during the operation of the resource to be monitored: one is to obtain the monitoring value of the item to be monitored, and the other is to obtain the event reported by the bottom layer of the resource. The two situations are analyzed separately. When the event is a monitoring value, the monitoring value is compared with the threshold to determine whether the current event is an alarm event. When the event is reported by the bottom layer of the resource, it is checked against the alarm blacklist.
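The two alarm-condition checks described above (threshold comparison for monitoring values, blacklist membership for bottom-layer reported events) can be sketched as follows. The `Event` type, the resource names, the GPU threshold of 90, and the blacklist entries are hypothetical choices made for illustration; only the CPU threshold of 80% and the two example blacklist situations appear in the text.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical per-resource thresholds (percent) and blacklist entries.
THRESHOLDS = {"cpu": 80.0, "gpu": 90.0}
ALARM_BLACKLIST = {"nic_disconnected", "cpu_failure"}

@dataclass
class Event:
    resource: str
    monitor_value: Optional[float] = None  # set for monitoring-value events
    reported_name: Optional[str] = None    # set for bottom-layer reported events

def is_alarm_event(event: Event) -> bool:
    """Return True if the event satisfies the alarm condition."""
    if event.monitor_value is not None:
        # Threshold alarm event: the monitoring value exceeds the threshold.
        return event.monitor_value > THRESHOLDS[event.resource]
    # Event alarm event: the reported event is on the alarm blacklist.
    return event.reported_name in ALARM_BLACKLIST

print(is_alarm_event(Event("cpu", monitor_value=90.0)))
print(is_alarm_event(Event("host1", reported_name="nic_disconnected")))
```

Keeping the thresholds and the blacklist in plain mappings reflects the statement that both can be preset or adjusted at run time.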
  • the influencing factors used to characterize the severity of the alarm event may also be different. Considering the occurrence of this situation, this embodiment describes the influencing factors, specifically:
  • the influencing factor used to characterize the severity of the alarm event is: the first weight corresponding to the difference between the monitoring value and the threshold, and the difference is positively correlated with the first weight. It can be understood that when the difference between the monitoring value and the threshold is greater, the first weight is greater, which proves that the severity of the current event is relatively high.
  • the first weight corresponding to the difference between the monitoring value and the threshold is only a preferred embodiment, and it is not limited.
  • the weight corresponding to the time when the monitoring value is greater than the threshold can also be used as the first weight, that is, the length of the alarm time as the first weight. This embodiment does not limit this, and the first weight may be selected according to specific implementation conditions. It should be noted that the first weight is usually preset, but it can be dynamically set according to the actual situation, which is also within the protection scope of the present application.
  • For example, if the usage rate of the CPU is 90% and the threshold is 80%, the difference between the two is 10%, and the corresponding first weight is 0.5, although it can be changed according to the actual situation. For instance, the first weight may be adjusted up to 0.6 during a busy business period and adjusted down to 0.4 during a non-busy business period.
  • the size of the first weight may be determined according to factors such as the importance of the resource to be monitored and whether the current business is busy.
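A minimal sketch of a first-weight lookup that is positively correlated with the difference between the monitoring value and the threshold. The only values taken from the text are the weight 0.5 for a 10% difference and the busy / non-busy adjustments to 0.6 and 0.4; the step table `WEIGHT_STEPS`, the other weights, and the function name are assumptions.

```python
# Hypothetical step table: (upper bound of the difference in percentage points, base weight).
WEIGHT_STEPS = [(5.0, 0.3), (10.0, 0.5), (float("inf"), 0.7)]
BUSY_ADJUSTMENT = 0.1  # matches the 0.5 -> 0.6 / 0.4 example in the text

def first_weight(monitor_value, threshold, busy=None):
    """Map the excess over the threshold to a weight.

    busy=True raises the weight, busy=False lowers it, None leaves the base weight.
    """
    diff = monitor_value - threshold
    if diff <= 0:
        return 0.0  # not over the threshold, so there is no alarm weight
    base = next(weight for upper, weight in WEIGHT_STEPS if diff <= upper)
    if busy is True:
        base += BUSY_ADJUSTMENT
    elif busy is False:
        base -= BUSY_ADJUSTMENT
    return round(base, 2)

print(first_weight(90, 80))             # base weight for a 10-point difference
print(first_weight(90, 80, busy=True))  # adjusted up during a busy period
```

A step table is just one way to realize a positive correlation; a continuous function of the difference would satisfy the description equally well.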
  • the influencing factor used to characterize the severity of the alarm event is: the second weight corresponding to the accumulated number of reported alarm events, wherein the accumulated number of reported times is positively correlated with the second weight.
  • This embodiment is only a preferred implementation that specifies the content of the impact factor when the obtained event is an event reported by the bottom layer of the resource. It is not limited to this mode, and the content of the impact factor can be selected according to the specific implementation situation.
  • the second weight is usually preset, but can be dynamically set according to actual conditions, which is also within the protection scope of the present application. For specific settings, refer to the first weight, which will not be repeated here.
  • the influencing factor used to characterize the severity of the alarm event is selected according to the type of the alarm event.
  • the influencing factor is the first weight corresponding to the difference between the monitoring value and the threshold.
  • the impact factor is the second weight corresponding to the accumulated reporting times of the alarm event.
  • the method for obtaining the target usage frequency is described.
  • A table records the usage frequency of each resource to be monitored in each time period.
  • Resource 1 to be monitored may have a high usage frequency at 01:00, but an alarm event occurs at 02:00, and the usage frequency of another resource to be monitored at 02:00 is higher than that of resource 1 to be monitored at 02:00. Therefore, obtaining the usage frequency of the corresponding time period can make priority setting more rigorous, and can better reflect the severity of the current event.
  • the usage frequency of each resource to be monitored is recorded in each time period, and the target usage frequency corresponds to the corresponding time period when the event occurs.
  • R0, R1, R2, R22 and R23 respectively represent the usage frequency of the resource to be monitored in the corresponding time period.
  • For example, the corresponding target usage frequency is R2 of resource 1 to be monitored. It is worth noting that the values of R1 and R2 differ between the resources to be monitored.
  • R1 represents only the usage frequency of the resource object to be monitored at the corresponding time, and the specific value of the usage frequency of each time period of the event corresponding to each resource to be monitored will not be described in this embodiment.
  • The usage frequency of each resource to be monitored in each time period is obtained through a linear fitting algorithm, but is not limited to this method.
  • The method for obtaining the target usage frequency is to obtain the target time period in which the alarm event occurs and the target resource corresponding to the alarm event, and then, in the correspondence relationship containing each resource, each time period and each usage frequency, to select the usage frequency corresponding to the target time period and the target resource as the target usage frequency. It can be seen that this method calculates the usage frequency of the resources to be monitored in each time period and determines the corresponding usage frequency from the time when the event occurs, so as to obtain the priority of the current event. This method ensures the rigor of priority setting and better reflects the severity of alarm events.
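The lookup of the target usage frequency can be sketched as a two-key selection on (resource, time period). This sketch assumes 24 hourly time periods per day (suggested by the labels R0 to R23, but not stated explicitly); the resource names and frequency values are hypothetical.

```python
from datetime import datetime

# Hypothetical correspondence: resource -> 24 hourly usage frequencies (R0..R23).
USAGE_FREQUENCY = {
    "resource_1": [0.1] * 24,
    "resource_2": [0.2] * 24,
}
USAGE_FREQUENCY["resource_1"][1] = 0.9  # resource 1 is busy at 01:00 ...
USAGE_FREQUENCY["resource_1"][2] = 0.3  # ... but much less so at 02:00
USAGE_FREQUENCY["resource_2"][2] = 0.7  # resource 2 is busier at 02:00

def target_usage_frequency(resource, occurred_at):
    """Select the usage frequency for the time period in which the alarm occurred."""
    return USAGE_FREQUENCY[resource][occurred_at.hour]

alarm_time = datetime(2022, 1, 21, 2, 15)  # an alarm event occurring at 02:15
print(target_usage_frequency("resource_1", alarm_time))
print(target_usage_frequency("resource_2", alarm_time))
```

The sample data mirrors the situation described in the text: resource 1 is heavily used at 01:00, yet an alarm at 02:00 on another resource can carry the higher target usage frequency.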
  • Fig. 2 is a flow chart of another alarm event processing method provided by the embodiment of the present application. As shown in Fig. 2, in order to prevent repeated alarms, the method further includes the following after step S11 and before step S12:
  • S14: Determine whether the current alarm event and an event already determined as an alarm event belong to the same monitoring item of the same resource, or to the same bottom-layer reported event of the same resource, and if so, go to step S15.
  • the determined alarm event may have been determined as an alarm event, resulting in repeated alarms. Therefore, as described in step S14, it is first judged whether the current alarm event and the event determined as an alarm event belong to the same monitoring item of the same resource or whether the event determined as an alarm event belongs to the bottom layer reporting event of the same resource.
  • the same monitoring item corresponds to the event corresponding to the utilization rate of the resource to be monitored. If the event that has been determined as an alarm event belongs to the event reported by the bottom layer of the same resource, it corresponds to the event corresponding to the occurrence of the resource to be monitored. That is to say, this embodiment makes corresponding processing for both cases.
  • S16: Determine whether the current event that does not meet the alarm condition and an event already determined as an alarm event belong to the same monitoring item of the same resource, or to the same bottom-layer reported event of the same resource, and if so, go to step S17.
  • This embodiment provides that, when the current event does not meet the alarm condition, it is judged whether the current event and an event already determined as an alarm event belong to the same monitoring item of the same resource, or to the same bottom-layer reported event of the same resource. This avoids the situation in which an alarm event has been eliminated but is still recorded, and improves the accuracy of recording alarm events.
  • this embodiment does not limit the preset number of times, and the preset number of times can be selected according to the specific implementation situation. In addition, this embodiment only provides a preferred implementation mode, but it is not limited to this kind of judgment method. It can also be judged according to the continuous time when the monitored value is lower than the threshold value, and this embodiment will not repeat it.
  • the number of times the monitoring value is lower than the threshold is judged, avoiding the instability caused by the jump of the monitoring value, and also avoiding the accidental deletion of the alarm event, and improving the accuracy of determining the alarm event.
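The duplicate check (S14) and the clearing check (S16) can be sketched with a small registry keyed by (resource, monitoring item) or (resource, reported event). The text here does not spell out steps S15 and S17, so the behaviour below is an assumption: a repeated alarm is treated as a duplicate rather than a new record, and an existing alarm is removed only after the condition has not been met a preset number of consecutive times. The class name, the return labels, and the default count of 3 are hypothetical.

```python
class AlarmRegistry:
    """Tracks active alarms and clears them only after repeated normal readings."""

    def __init__(self, clear_after=3):
        self.clear_after = clear_after  # hypothetical preset number of times
        self.active = {}                # key -> consecutive non-alarm observations

    def observe(self, key, meets_condition):
        """Return "new", "duplicate", "pending", "cleared", or "ignored"."""
        if meets_condition:
            is_duplicate = key in self.active
            self.active[key] = 0  # any alarm reading resets the clearing counter
            return "duplicate" if is_duplicate else "new"
        if key not in self.active:
            return "ignored"  # a normal reading with no recorded alarm
        self.active[key] += 1
        if self.active[key] >= self.clear_after:
            del self.active[key]  # the alarm has really gone away
            return "cleared"
        return "pending"  # possibly just a jump in the monitoring value

registry = AlarmRegistry(clear_after=2)
key = ("host1", "cpu_usage")
print([registry.observe(key, flag) for flag in (True, True, False, True, False, False)])
```

Requiring several consecutive normal readings is what protects against a jumping monitoring value deleting an alarm by accident; a time-based window, which the text also mentions, would serve the same purpose.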
  • The alarm parameters also include a user-defined alarm coefficient, and the priority is then set as follows:
  • For a threshold alarm event, the product of the target usage frequency of the current event, the first weight, and the user-defined alarm coefficient is used as the priority of the current event.
  • For an event alarm event, the product of the target usage frequency of the current event, the second weight, and the user-defined alarm coefficient is used as the priority of the current event.
  • the user-defined alarm coefficient is set by the user, and a larger number can be set for devices that are used more, and a smaller number can be set for devices that are used less.
  • This embodiment determines the priority of the event from the product of the three, but it is not limited to this method: a corresponding weight can also be applied to each of the three, and the product of the three weighted values can be used as the priority of the current event.
  • the alarm parameters provided in this embodiment also include custom alarm coefficients, and the custom alarm coefficients are set by the user.
  • For events obtained as monitoring values, the priority is determined by the user-defined alarm coefficient, the first weight, and the target usage frequency; for events reported by the bottom layer, the priority is determined by the user-defined alarm coefficient, the second weight, and the target usage frequency, which ensures the fairness of each alarm event.
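The two products described in this embodiment can be restated compactly. The symbols are chosen here for illustration and are not used in the application: $R$ is the target usage frequency, $W_1$ and $W_2$ are the first and second weights, and $C$ is the user-defined alarm coefficient. The values $R = 0.3$ and $C = 2$ in the worked line are hypothetical; $W_1 = 0.5$ is the example weight given earlier in the text.

```latex
\begin{align*}
P_{\text{threshold}} &= R \cdot W_1 \cdot C \\
P_{\text{event}}     &= R \cdot W_2 \cdot C \\[4pt]
\text{e.g. } P_{\text{threshold}} &= 0.3 \times 0.5 \times 2 = 0.3
\end{align*}
```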
  • FIG. 3 is a schematic diagram of functional modules corresponding to a method for processing an alarm event provided by an embodiment of the present application.
  • the functional modules participating in the method for processing an alarm event include an alarm sorting device 1, a space-time priority evaluation device 2, a resource busyness time distribution table 3, an alarm event reporting device 4, a monitoring threshold research and judgment device 5, and a resource monitoring storage medium 6.
  • the resource monitoring storage medium 6 saves the monitoring records of all resources to be monitored according to a fixed sampling period, and continuously updates the latest data and clears the earliest data.
  • the monitoring threshold research and judgment device 5 continuously reads the monitoring storage medium and compares the difference between the monitoring value of the alarm event and the threshold value, which is the determination method of the threshold type alarm event mentioned in the above embodiment.
  • the alarm event reporting device 4 is responsible for collecting various events sent by the bottom layer of resources, and screening, processing and reporting alarm-related events to the system.
  • The resource busyness time distribution table 3 uses a linear fitting algorithm to calculate, from recent historical data on the time distribution of resource busyness, the usage frequency of each resource in the system at different times of the day.
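The text does not say how the linear fit is set up, so the following is only one possible reading, stated as an assumption: for a single resource and a single time-of-day slot, fit a least-squares line through the usage observed on consecutive recent days and extrapolate one day ahead, clamping the result to the range 0 to 1. The function names and sample values are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept (needs >= 2 points)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def estimate_usage(history):
    """Estimate the next day's usage for one resource in one time-of-day slot.

    history holds the usage observed in that slot on consecutive recent days.
    """
    days = list(range(len(history)))
    slope, intercept = fit_line(days, history)
    estimate = slope * len(history) + intercept  # extrapolate one day ahead
    return min(1.0, max(0.0, estimate))          # assumed valid range of a usage frequency

# Hypothetical usage of one resource at 02:00 over the last three days.
print(estimate_usage([0.2, 0.4, 0.6]))
```

Running this per resource and per slot would fill a table of the kind the resource busyness time distribution table 3 is said to hold.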
  • The space-time priority evaluation device 2 grades the importance of each alarm by feeding the alarm's space-time data into the parameter model, and dynamically updates the grade according to the latest monitoring values and reported events.
  • The method for processing an alarm event is described in detail in the above embodiments, and the present application also provides an embodiment corresponding to an apparatus for processing an alarm event. It should be noted that this application describes the embodiments of the device part from two perspectives: one based on functional modules, and the other based on hardware.
  • FIG. 4 is a structural diagram of an alarm event processing device provided in an embodiment of the present application. As shown in FIG. 4 , the alarm event processing device includes:
  • the first acquisition module 10 is configured to acquire events generated during operation of the resource to be monitored.
  • The judging module 11 is used to judge whether the acquired event satisfies the alarm condition; if so, it triggers the second acquiring module.
  • the second acquiring module 12 is configured to acquire an alarm parameter set for an alarm event that satisfies the alarm condition, wherein the alarm parameter includes at least a target usage frequency of the resource to be monitored corresponding to the alarm event in the current time period, and an impact factor used to characterize the severity of the alarm event.
  • the determining module 13 is configured to determine the priorities of the obtained alarm events according to the obtained alarm parameters so as to display them in the order of the determined priorities when a viewing request from the user is received.
  • FIG. 5 is a structural diagram of an alarm event processing device provided in another embodiment of the present application.
  • the alarm event processing device includes: a memory 20 for storing computer-readable instructions;
  • the processor 21 is configured to implement the steps of the method for processing an alarm event as mentioned in the foregoing embodiments when executing computer-readable instructions.
  • The device for processing an alarm event may include, but is not limited to, a smart phone, a tablet computer, a notebook computer, or a desktop computer.
  • the processor 21 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like.
  • the processor 21 may be realized by at least one hardware form of a digital signal processor (Digital Signal Processor, DSP), a field-programmable gate array (Field-Programmable Gate Array, FPGA), and a programmable logic array (Programmable Logic Array, PLA).
  • the processor 21 may also include a main processor and a coprocessor.
  • the main processor is a processor for processing data in a wake-up state, also called a central processing unit; the coprocessor is a low-power processor for processing data in a standby state.
  • the processor 21 may be integrated with a GPU, and the GPU is used for rendering and drawing the content that needs to be displayed on the display screen.
  • the processor 21 may also include an artificial intelligence (Artificial Intelligence, AI) processor, and the AI processor is used to process computing operations related to machine learning.
  • Memory 20 may include one or more non-volatile computer-readable storage media, which may be non-transitory.
  • the memory 20 may also include high-speed random access memory, and non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices.
  • the memory 20 is at least used to store the following computer-readable instructions 201, wherein, after the computer-readable instructions are loaded and executed by the processor 21, relevant steps of the method for processing an alarm event disclosed in any one of the foregoing embodiments can be implemented.
  • the resources stored in the memory 20 may also include an operating system 202 and data 203, etc., and the storage method may be temporary storage or permanent storage.
  • the operating system 202 may include Windows, Unix, Linux and so on.
  • The data 203 may include, but is not limited to, the data involved in the method for processing an alarm event.
  • The alarm event processing device may further include a display screen 22, an input/output interface 23, a communication interface 24, a power supply 25 and a communication bus 26.
  • The structure shown in FIG. 5 does not constitute a limitation on the apparatus for processing alarm events, which may include more or fewer components than those shown.
  • the present application also provides an embodiment corresponding to a non-volatile computer-readable storage medium.
  • the non-volatile computer-readable storage medium stores computer-readable instructions, and when the computer-readable instructions are executed by one or more processors, the steps described in the above method embodiments are implemented.
  • if the methods in the above embodiments are implemented in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium.
  • the technical solution of the present application, in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and performs all or part of the steps of the methods in the embodiments of the application.
  • the aforementioned storage media include various media that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (Read-Only Memory, ROM), random access memory (Random Access Memory, RAM), a magnetic disk or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Alarm Systems (AREA)

Abstract

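The present application discloses a method and device for processing an alarm event, and a computer-readable storage medium. When it is determined that an event generated by a resource to be monitored during operation satisfies an alarm condition, the alarm parameters set for the event that satisfies the alarm condition are acquired (S12). The alarm parameters include at least a target usage frequency, in the current time period, of the resource to be monitored that corresponds to the event, and an impact factor characterizing the severity of the alarm event. The priority of the alarm event is determined according to the alarm parameters, so that alarm events are displayed in priority order when a user viewing request is received (S13).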

Description

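Method and device for processing an alarm event, and computer-readable storage medium
Cross-reference to related applications
This application claims priority to Chinese patent application No. 202210073375.5, filed with the China Patent Office on January 21, 2022 and entitled "Method and device for processing an alarm event, and computer-readable storage medium", the entire contents of which are incorporated herein by reference.
Technical field
The present application relates to the field of computer technology, and in particular to a method and device for processing an alarm event and a non-volatile computer-readable storage medium.
Background
Most enterprises dedicate each server to one specific task or application, because different applications or programs are not suited to running in the same system. The problem is that most servers use only a small fraction of their overall processing capacity when performing computation, so the servers' processing capacity is underused. Virtualization solves this problem by consolidating multiple servers onto a single server that runs multiple virtual environments, which ultimately saves physical space. A server virtualization platform manages a large number of device resources, including underlying physical resources such as server hosts, disks and networks, as well as virtual resources such as virtual machines, shared storage and virtual networks that users build on their own according to business needs. The virtualization platform continuously receives updated monitoring data for all resources (both physical and virtual), together with the events reported from the underlying layer of each type of resource.
However, the inventors realized that, in order to let users understand the running state of the system (the system formed by integrating the servers through virtualization technology), the current approach is to display the events that trigger alarms (hereinafter, alarm events) to the user without any differentiation. Although this lets the user understand the running state of the system, some of the events that trigger alarms are highly harmful to system operation and need to be resolved urgently, while others are less harmful and can be handled later. With the current approach, the user cannot work out a reasonable order in which to handle alarm events, which may even lead to system interruption or downtime and reduces the reliability of system operation.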
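Summary
In one aspect, the present application provides a method for processing an alarm event, comprising the following steps: acquiring an event generated by a resource to be monitored during operation;
determining whether the acquired event satisfies an alarm condition;
in response to the acquired event satisfying the alarm condition, acquiring the alarm parameters set for the alarm event that satisfies the alarm condition, wherein the alarm parameters include at least a target usage frequency, in the current time period, of the resource to be monitored that corresponds to the alarm event, and an impact factor characterizing the severity of the alarm event; and
determining the priority of the alarm event according to the acquired alarm parameters, so that alarm events are displayed in the determined priority order when a user viewing request is received.
Correspondingly, the present application further provides a device for processing an alarm event, comprising:
a first acquisition module, configured to acquire an event generated by a resource to be monitored during operation;
a judgment module, configured to determine whether the acquired event satisfies an alarm condition and, in response to the event satisfying the alarm condition, to trigger a second acquisition module;
the second acquisition module, configured to acquire the alarm parameters set for the alarm event that satisfies the alarm condition, wherein the alarm parameters include at least a target usage frequency, in the current time period, of the resource to be monitored that corresponds to the alarm event, and an impact factor characterizing the severity of the alarm event; and
a determination module, configured to determine the priority of the alarm event according to the acquired alarm parameters, so that alarm events are displayed in the determined priority order when a user viewing request is received.
To solve the above technical problem, the present application further provides a device for processing an alarm event, comprising a memory configured to store computer-readable instructions; and
one or more processors configured to implement the steps of the above method for processing an alarm event when executing the computer-readable instructions.
To solve the above technical problem, the present application further provides a non-volatile computer-readable storage medium storing computer-readable instructions which, when executed by one or more processors, implement the steps of the above method for processing an alarm event.
Details of one or more embodiments of the present application are set forth in the accompanying drawings and the description below. Other features and advantages of the present application will become apparent from the description, the drawings and the claims.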
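Brief description of the drawings
To describe the embodiments of the present application more clearly, the drawings used in the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a flowchart of a method for processing an alarm event provided by one or more embodiments of the present application;
FIG. 2 is a flowchart of another method for processing an alarm event provided by one or more embodiments of the present application;
FIG. 3 is a schematic diagram of the functional modules corresponding to a method for processing an alarm event provided by one or more embodiments of the present application;
FIG. 4 is a structural diagram of a device for processing an alarm event provided by one or more embodiments of the present application;
FIG. 5 is a structural diagram of a device for processing an alarm event provided by one or more embodiments of the present application.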
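Detailed description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without creative effort fall within the scope of protection of the present application.
The core of the present application is to provide a method and device for processing an alarm event and a non-volatile computer-readable storage medium. It should be noted that the method described in the present application can be applied to a single server or a server cluster, and can also be applied to a virtualization platform. Because a virtualization platform manages a large number of device resources and therefore involves many alarm events, the method is particularly suitable for that scenario.
To help those skilled in the art better understand the solution of the present application, the application is described in further detail below with reference to the drawings and specific embodiments.
FIG. 1 is a flowchart of a method for processing an alarm event provided by an embodiment of the present application. It is worth noting that the present application is mainly applied to processing, sorting and displaying the alarms triggered by various kinds of events in a server virtualization platform. As shown in FIG. 1, the method for processing an alarm event includes the following steps.
S10: Acquire an event generated by a resource to be monitored during operation.
S11: Determine whether the acquired event satisfies an alarm condition; if so, proceed to step S12.
S12: Acquire the alarm parameters set for the alarm event that satisfies the alarm condition.
S13: Determine the priority of the alarm event according to the acquired alarm parameters, so that alarm events are displayed in the determined priority order when a user viewing request is received.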
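Regarding step S10, acquiring an event generated by a resource to be monitored during operation: the resource to be monitored mentioned in this step may be a component such as a central processing unit (Central Processing Unit, CPU), a graphics card or a graphics processing unit (Graphics Processing Unit, GPU). An event generated by the resource during operation refers to, for example, the CPU usage, the graphics card usage, the GPU usage, or a host network card being disconnected. In other words, the current usage of a resource to be monitored is an event, and the occurrence status of the resource is also an event. Events can therefore be divided into two broad classes: one concerns how much of the resource is in use, and can be called threshold-type alarm events; the other concerns the occurrence status of the resource, and can be called event-type alarm events. Regarding step S11, it is determined whether the acquired event satisfies an alarm condition, and if so, the method proceeds to step S12. It is worth noting that this embodiment does not limit the alarm condition. It may be judged by whether the usage of the resource exceeds a preset range and by whether the occurrence status of the resource is normal, but the judgment is not limited to this approach.
Regarding step S12, when the event satisfies the alarm condition, the parameters set for the alarm event that satisfies the alarm condition are acquired. The alarm parameters include at least a target usage frequency, in the current time period, of the resource to be monitored that corresponds to the alarm event, and an impact factor characterizing the severity of the alarm event. The usage frequency of the resource corresponding to the alarm event is recorded for every time period, and the target usage frequency is the usage frequency for the period in which the alarm event occurred on that resource. The target usage frequency amounts to evaluating the alarm event in terms of time, whereas the impact factor characterizing the severity of the alarm event evaluates it in terms of space, so the method assesses the severity of an alarm event by combining time and space. This embodiment does not limit the choice of impact factor: it may be a weight coefficient preset for the event, or the length of time for which the usage of the resource has exceeded the threshold. The specific content of the impact factor can be chosen according to the actual implementation.
On this basis, step S13 determines the priority of the alarm event according to the acquired alarm parameters, so that alarm events are displayed in the determined priority order when a user viewing request is received. In this embodiment the priority of an alarm event is set through the alarm parameters: a single value may be obtained by adding or multiplying the alarm parameters, and the priority is then determined by the magnitude of that value. Specifically, if the alarm parameters are A, B and C, the product or the sum of A, B and C can serve as the priority of the alarm event, although the combination is not limited to multiplication or addition. In addition, how alarm events are displayed according to priority is not limited. They may be arranged in descending or in ascending order of priority, and the ordering can be chosen according to the actual implementation.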
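The combination rule described above can be sketched as follows. This is only an illustration under stated assumptions: the function names `priority` and `order_for_display`, the sample alarms and their parameter values are hypothetical, and the embodiment leaves open both the combination rule (product, sum, or something else) and the display order.

```python
from math import prod


def priority(params, combine="product"):
    """Combine alarm parameters (e.g. A, B, C) into one priority value."""
    if combine == "product":
        return prod(params)
    if combine == "sum":
        return sum(params)
    raise ValueError(f"unknown combine mode: {combine}")


def order_for_display(alarms, combine="product", descending=True):
    """Sort (name, params) pairs by priority; the order direction is a choice."""
    return sorted(alarms, key=lambda a: priority(a[1], combine), reverse=descending)


# Hypothetical alarms with made-up parameter values (A, B, C).
alarms = [("disk", (2, 3, 1)), ("cpu", (4, 2, 2)), ("nic", (1, 1, 5))]
ordered = [name for name, _ in order_for_display(alarms)]
```

With these sample values the product rule ranks `cpu` first (4 × 2 × 2 = 16), while passing `combine="sum"` swaps the order of the other two alarms, which shows that the choice of combination rule can change what the user sees first.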
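In the method for processing an alarm event provided by this embodiment, when it is determined that an event generated by a resource to be monitored during operation satisfies the alarm condition, the alarm parameters set for that alarm event are acquired. The alarm parameters include at least the target usage frequency, in the current time period, of the resource corresponding to the alarm event, and an impact factor characterizing the severity of the alarm event. Finally, the priority of the alarm event is determined according to the acquired alarm parameters, so that alarm events are displayed in the determined priority order when a user viewing request is received. Because this technical solution combines the target usage frequency of the resource with an impact factor characterizing severity, it evaluates the alarm event in terms of both time and space and therefore largely reflects how much the alarm event affects system operation. Determining the priority of each alarm event on this basis lets the user promptly identify the alarm events with the greatest impact on the system and handle them first, which improves the reliability of system operation.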
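On the basis of the above embodiment, how to acquire the events generated by a resource to be monitored during operation is described next. Alarm events can be divided into two broad classes: threshold-type alarm events, where the usage of the resource exceeds a threshold, and event-type alarm events, where the occurrence status of the resource is abnormal. In this embodiment the acquired events accordingly fall into two kinds: what is acquired may be the monitoring value of an item to be monitored of the resource, or an event reported by the resource's underlying layer. When a monitoring value is acquired, determining whether the event satisfies the alarm condition becomes determining whether the monitoring value exceeds a threshold. If so, the current event satisfies the alarm condition and is an alarm event; if not, the current event does not satisfy the alarm condition and is not an alarm event. It is worth noting that this embodiment does not limit the size of the threshold, which depends on the resource being monitored. For example, a CPU usage above 80% may be an alarm event, whereas a GPU usage above 85% may not be. Different resources may share the same threshold or use different thresholds, and a suitable threshold can be chosen for the resource that corresponds to the alarm event. The threshold is usually preset, but it can also be set dynamically according to actual conditions, which also falls within the scope of protection of the present application.
When what is acquired is an event reported by the resource's underlying layer, determining whether the event satisfies the alarm condition becomes determining whether the reported event is one of the events set on an alarm blacklist. If so, the current event satisfies the alarm condition and is an alarm event; if not, the current event does not satisfy the alarm condition and is not an alarm event. It is worth noting that the events set on the alarm blacklist are sudden events that have occurred before, as well as events that have not yet occurred but would put the resource in an abnormal state if they did. For example, a host network card suddenly disconnecting or a CPU suddenly ceasing to work are events set on the alarm blacklist. The events on the alarm blacklist can be modified according to actual conditions, for example by adding or deleting events, which also falls within the scope of protection of the present application.
In this embodiment there are thus two possibilities for the event acquired from a resource to be monitored during operation: a monitoring value of an item to be monitored, or an event reported by the resource's underlying layer. Both cases are handled. For a monitoring value, the value is compared with the threshold to determine whether the current event is an alarm event; for an underlying reported event, the alarm blacklist is used to determine whether the current event is an alarm event. Covering both cases makes the determination of alarm events more accurate.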
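The two-branch check described above might look like the following sketch. All names (`MetricSample`, `ReportedEvent`, `THRESHOLDS`, `ALARM_BLACKLIST`, `is_alarm`) and the sample resources, thresholds and event names are assumptions made for illustration; the source fixes neither a data model nor concrete values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricSample:
    """Monitoring value of one monitored item of one resource."""
    resource: str
    item: str
    value: float


@dataclass(frozen=True)
class ReportedEvent:
    """Event reported by the underlying layer of a resource."""
    resource: str
    name: str


# Hypothetical per-resource thresholds and blacklist entries.
THRESHOLDS = {("host-1", "cpu_usage"): 80.0, ("host-1", "gpu_usage"): 90.0}
ALARM_BLACKLIST = {"nic_disconnected", "cpu_stopped"}


def is_alarm(event) -> bool:
    """Return True if the event satisfies the alarm condition."""
    if isinstance(event, MetricSample):
        threshold = THRESHOLDS.get((event.resource, event.item))
        # Threshold-type: alarm only when the value exceeds the threshold.
        return threshold is not None and event.value > threshold
    if isinstance(event, ReportedEvent):
        # Event-type: alarm only when the event is on the blacklist.
        return event.name in ALARM_BLACKLIST
    raise TypeError(f"unsupported event type: {type(event).__name__}")
```

Keeping the thresholds and the blacklist in plain mappings reflects the statement that both can be adjusted at run time; an unknown monitored item is treated here as "no alarm", which is a design choice of this sketch rather than something the source prescribes.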
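In a specific embodiment, when the events acquired from a resource to be monitored during operation differ, the corresponding impact factor characterizing the severity of the alarm event may differ as well. With this in mind, this embodiment describes the impact factor as follows:
When a monitoring value is acquired and the event corresponding to the monitoring value is an alarm event, the impact factor characterizing the severity of the alarm event is a first weight corresponding to the difference between the monitoring value and the threshold, where the difference and the first weight are positively correlated. That is, the larger the difference between the monitoring value and the threshold, the larger the first weight, indicating that the current event is more severe. Using the difference between the monitoring value and the threshold is only a preferred implementation and is not limiting. The first weight may instead correspond to the length of time for which the monitoring value has exceeded the threshold, that is, the duration of the alarm. This embodiment does not limit the choice, and the first weight can be selected according to the actual implementation. The first weight is usually preset, but it can be set dynamically according to actual conditions, which also falls within the scope of protection of the present application.
For example, suppose the CPU usage is 90% and the threshold is 80%, so the difference is 10%. The first weight corresponding to this difference is normally 0.5, but it can be changed according to actual conditions. For example, the first weight can be raised to 0.6 during busy business periods and lowered to 0.4 during non-busy periods. The size of the first weight can depend on factors such as the importance of the resource being monitored and whether business is currently busy.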
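A minimal sketch of a positively correlated first weight, assuming a step table keyed on how far the monitoring value exceeds the threshold. Only the pairing "difference of 10 percentage points gives 0.5" and the busy/non-busy adjustments to 0.6 and 0.4 come from the example above; the other breakpoints and weights, and the names `BREAKPOINTS`, `WEIGHTS` and `first_weight`, are hypothetical.

```python
import bisect

# Upper bounds of each excess band, in percentage points (hypothetical except 10.0).
BREAKPOINTS = [5.0, 10.0, 20.0]
# One weight per band, non-decreasing so that weight grows with the excess.
WEIGHTS = [0.3, 0.5, 0.7, 0.9]


def first_weight(value, threshold, adjustment=0.0):
    """Map (value - threshold) to a weight; larger excess never lowers the weight.

    `adjustment` models the dynamic tuning in the text, e.g. +0.1 when business
    is busy and -0.1 when it is not.
    """
    excess = value - threshold
    if excess <= 0:
        raise ValueError("value does not exceed the threshold, so there is no alarm")
    base = WEIGHTS[bisect.bisect_left(BREAKPOINTS, excess)]
    return round(base + adjustment, 2)


normal = first_weight(90.0, 80.0)
busy = first_weight(90.0, 80.0, adjustment=0.1)
idle = first_weight(90.0, 80.0, adjustment=-0.1)
```

A step table is only one way to obtain a positive correlation; a linear or otherwise monotone function of the excess would satisfy the description equally well.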
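When what is acquired is an event reported by the resource's underlying layer and that event is an alarm event, the impact factor characterizing the severity of the alarm event is a second weight corresponding to the cumulative number of times the alarm event has been reported, where the cumulative number of reports and the second weight are positively correlated. When the alarm event is an underlying reported event, each alarm event has a report count, meaning the same alarm has been raised before, and the more times it has been reported, the more severe the current alarm event is. This embodiment describes the content of the impact factor for underlying reported events only as a preferred implementation; it is not limited to this approach, and the content of the impact factor can be chosen according to the actual implementation. The second weight is usually preset, but it can be set dynamically according to actual conditions, which also falls within the scope of protection of the present application. It can be set in the same way as the first weight, so the details are not repeated here.
In this embodiment, the impact factor characterizing the severity of an alarm event is selected according to the type of alarm event. For a threshold-type alarm event, the impact factor is the first weight corresponding to the difference between the monitoring value and the threshold. For an event-type alarm event, the impact factor is the second weight corresponding to the cumulative number of reports. The method thus selects an impact factor that matches the kind of event acquired, which makes the severity assessment more rigorous, and because the impact factor is simple to obtain, overall efficiency is improved.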
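The second weight can be sketched along the same lines. The source only requires that the weight grow with the cumulative number of reports; the particular function `count / (count + 1)`, as well as the names `report_counts`, `record_report` and `second_weight`, are assumptions chosen for this illustration.

```python
from collections import Counter

# Cumulative report count per (resource, event name); hypothetical storage.
report_counts = Counter()


def record_report(resource, event_name):
    """Count one more report of a blacklisted event and return the new total."""
    report_counts[(resource, event_name)] += 1
    return report_counts[(resource, event_name)]


def second_weight(count):
    """Strictly increasing in `count` and bounded below 1 (an assumed shape)."""
    if count < 1:
        raise ValueError("an alarm event has been reported at least once")
    return count / (count + 1)
```

A bounded function keeps a frequently repeated event from dominating the ranking without limit; a preset lookup table, as suggested for the first weight, would serve just as well.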
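As a preferred implementation, the method for obtaining the target usage frequency is described. The table below gives the usage frequency of each resource to be monitored in each time period. The target usage frequency is obtained as follows:
The target time period in which the alarm event occurred and the target resource corresponding to the alarm event are acquired, and from a correspondence containing the resources, the time periods and the usage frequencies, the usage frequency corresponding to the target time period and the target resource is selected as the target usage frequency.
For example, monitored resource 1 may have a high usage frequency at 01:00 while the alarm event occurs at 02:00, and another monitored resource may have a higher usage frequency at 02:00 than monitored resource 1 has at 02:00. Using the usage frequency of the corresponding time period therefore makes the priority setting more rigorous and better reflects the severity of the current event.
It is worth noting that the usage frequency of each resource to be monitored is recorded for every time period, and the target usage frequency is the one for the time period in which the event occurred. As shown in the table below, R0, R1, R2, ..., R22 and R23 denote the usage frequency of a monitored resource in the corresponding time periods. Specifically, if the event corresponding to monitored resource 1 satisfies the alarm condition at 02:00, the corresponding target usage frequency is R2 of monitored resource 1. Note that the values R1, R2 and so on differ from one monitored resource to another: R1 merely denotes the usage frequency of a given monitored resource in the corresponding time period, and the specific values for each resource and time period are not detailed in this embodiment. The usage frequency of each monitored resource in each time period is obtained by a linear fitting algorithm, although it is not limited to this approach.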
Figure PCTCN2022115339-appb-000001
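In this embodiment, the target usage frequency is obtained by acquiring the target time period in which the alarm event occurred and the target resource corresponding to the alarm event, and then selecting, from a correspondence containing the resources, the time periods and the usage frequencies, the usage frequency that corresponds to the target time period and the target resource. The method thus computes the usage frequency of each monitored resource for every time period and uses the time at which the event occurred to determine the applicable usage frequency, from which the priority of the current event is derived. This makes the priority setting more rigorous and better reflects the severity of the alarm event.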
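The lookup described above reduces to indexing a table by resource and time period. The sketch below assumes one-hour periods R0 to R23 and invents the resource names and frequency values; `USAGE_FREQUENCY` and `target_usage_frequency` are hypothetical names.

```python
from datetime import datetime

# resource -> [R0, R1, ..., R23]; all values here are made up for illustration.
USAGE_FREQUENCY = {
    "resource-1": [0.1] * 24,
    "resource-2": [0.2] * 24,
}
USAGE_FREQUENCY["resource-1"][1] = 0.9   # busy at 01:00
USAGE_FREQUENCY["resource-1"][2] = 0.3   # quieter at 02:00
USAGE_FREQUENCY["resource-2"][2] = 0.6   # busier than resource-1 at 02:00


def target_usage_frequency(resource, occurred_at):
    """Pick the usage frequency of the period in which the alarm occurred."""
    return USAGE_FREQUENCY[resource][occurred_at.hour]


alarm_time = datetime(2022, 1, 21, 2, 0)
r1 = target_usage_frequency("resource-1", alarm_time)
r2 = target_usage_frequency("resource-2", alarm_time)
```

For an alarm at 02:00 the lookup returns the R2 entry of each resource, so the high 01:00 value of `resource-1` does not inflate its priority.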
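In a specific embodiment, whether the current event is an alarm event is determined according to the type of the acquired event, but the event may already have been determined to be an alarm event, which leads to duplicate alarms. FIG. 2 shows another method for processing an alarm event provided by an embodiment of the present application. As shown in FIG. 2, to prevent this situation, the method further includes, after step S11 and before step S12:
S14: Determine whether the current alarm event and an event already determined to be an alarm event belong to the same monitoring item of the same resource, or whether the current alarm event and an event already determined to be an alarm event are the same underlying reported event of the same resource; if so, proceed to step S15.
S15: Delete the corresponding event already determined to be an alarm event.
In a specific embodiment, the alarm event just determined may already have been determined to be an alarm event, which causes duplicate alarms. Therefore, as described in step S14, it is first determined whether the current alarm event and an event already determined to be an alarm event belong to the same monitoring item of the same resource, or are the same underlying reported event of the same resource. In other words, the judgment follows the two classes of alarm events described in the above embodiments. If the current alarm event and the existing alarm event belong to the same monitoring item of the same resource, the event concerns the usage of the monitored resource. If they are the same underlying reported event of the same resource, the event concerns the occurrence status of the monitored resource. This embodiment therefore handles both cases.
In this embodiment, after an event is determined to be an alarm event, it is determined whether the current alarm event and an event already determined to be an alarm event belong to the same monitoring item of the same resource or are the same underlying reported event of the same resource. This avoids duplicate alarms for the same alarm event and makes the determination of alarm events more rigorous.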
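Steps S14 and S15 amount to keeping at most one active alarm per key. The sketch below assumes the key is the triple (alarm kind, resource, monitored item or event name); the names `active_alarms` and `register_alarm` and the payload layout are hypothetical.

```python
# (kind, resource, item_or_event_name) -> latest alarm payload
active_alarms = {}


def register_alarm(kind, resource, name, payload):
    """Store a new alarm, deleting any earlier alarm with the same key first.

    Returns True when an existing alarm was replaced (steps S14 and S15).
    """
    key = (kind, resource, name)
    replaced = active_alarms.pop(key, None) is not None   # S14 check, S15 delete
    active_alarms[key] = payload                          # continue towards S12
    return replaced
```

Calling `register_alarm("threshold", "host-1", "cpu_usage", {...})` twice leaves a single entry holding the most recent payload, so the list shown to the user never contains the same alarm twice.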
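In a specific embodiment, some events are determined to be alarm events at 01:00 but the alarm has cleared and the danger has passed by the next moment. Continuing to record them as alarm events wastes resources. With this in mind, as shown in FIG. 2, if the current event does not satisfy the alarm condition, the method further includes:
S16: Determine whether the current event that does not satisfy the alarm condition and an event already determined to be an alarm event belong to the same monitoring item of the same resource, or are the same underlying reported event of the same resource; if so, proceed to step S17.
S17: Delete the corresponding event already determined to be an alarm event.
When an event does not satisfy the alarm condition, a further judgment is needed: whether this event and an event already determined to be an alarm event belong to the same monitoring item of the same resource, or are the same underlying reported event of the same resource. The judgment is made separately for the two types of alarm event. If the result is yes, the corresponding event already determined to be an alarm event is deleted, because the alarm for the current event has cleared and there is no need to keep recording it.
In this embodiment, when the current event does not satisfy the alarm condition, it is determined whether the current event and an event already determined to be an alarm event belong to the same monitoring item of the same resource or are the same underlying reported event of the same resource. This avoids the possibility that an alarm event whose alarm has already cleared is still recorded, and improves the accuracy of the recorded alarm events.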
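On the basis of the above embodiment, for an event that does not satisfy the alarm condition but belongs to the same monitoring item of the same resource as an event already determined to be an alarm event, it is also necessary to determine whether the number of consecutive times the monitoring value has been below the threshold exceeds a preset number.
It is worth noting that the monitoring value may fluctuate: for example, it may be above the threshold at 01:00, below the threshold at 02:00 and above the threshold again at 03:00. To prevent the instability caused by such fluctuation, the alarm is considered resolved, and the corresponding alarm event is deleted, only after the monitoring value has been below the threshold several times in a row. This embodiment does not limit the preset number, which can be chosen according to the actual implementation. This is also only one preferred implementation; the judgment is not limited to it and may instead be based on the continuous length of time for which the monitoring value stays below the threshold, which is not described further here.
Correspondingly, for an event that does not satisfy the alarm condition but is the same underlying reported event of the same resource as an event already determined to be an alarm event, the cumulative number of reports also needs to be reset to zero. For underlying reported events, the number of alarms of the current event is recorded each time an alarm occurs. If the current event does not satisfy the alarm condition but is the same underlying reported event of the same resource as an event already determined to be an alarm event, the alarm has cleared, so the report count needs to be reset to zero.
In this embodiment, for an event that does not satisfy the alarm condition but belongs to the same monitoring item of the same resource as an event already determined to be an alarm event, the number of times the monitoring value is below the threshold is checked. This avoids the instability caused by fluctuating monitoring values, avoids deleting alarm events by mistake, and improves the accuracy of alarm event determination.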
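The clearing rule for threshold-type alarms can be sketched as a small state machine that counts consecutive below-threshold samples. The class name `ThresholdAlarmState`, the parameter `clear_after` (the "preset number") and the helper `clear_reported_alarm` are hypothetical, as are the sample values.

```python
class ThresholdAlarmState:
    """Tracks one monitored item; clears the alarm only after a sustained recovery."""

    def __init__(self, threshold, clear_after):
        self.threshold = threshold
        self.clear_after = clear_after   # preset number of consecutive low samples
        self.active = False
        self.below_streak = 0

    def observe(self, value):
        """Feed one monitoring value and return whether the alarm is still active."""
        if value > self.threshold:
            self.active = True
            self.below_streak = 0        # a new excess breaks the recovery streak
        elif self.active:
            self.below_streak += 1
            if self.below_streak > self.clear_after:
                self.active = False      # streak exceeds the preset number: delete
                self.below_streak = 0
        return self.active


def clear_reported_alarm(report_counts, key):
    """Event-type alarm has cleared: reset its cumulative report count to zero."""
    report_counts[key] = 0


state = ThresholdAlarmState(threshold=80.0, clear_after=2)
history = [state.observe(v) for v in [90, 70, 85, 70, 70, 70]]
```

In the sample run a single dip to 70 between two readings above the threshold does not clear the alarm; it is cleared only by the third consecutive reading below the threshold, because the streak must exceed the preset number of 2.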
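On the basis of the above embodiment, the alarm parameters further include a custom alarm coefficient, and the priority is set as follows:
For a threshold-type alarm event, the priority of the current event is the product of its target usage frequency, the first weight and the custom alarm coefficient. For an event-type alarm event, the priority of the current event is the product of its target usage frequency, the second weight and the custom alarm coefficient.
It is worth noting that the custom alarm coefficient is set by the user, who can assign a larger value to heavily used components and a smaller value to rarely used ones. In addition, although this embodiment determines the priority of an event from the product of the three values, it is not limited to this method. Each of the three values may also be given its own weight, and the product of the three weighted values may then be used as the priority of the current event.
In this embodiment the alarm parameters further include a custom alarm coefficient set by the user. For an event acquired as a monitoring value, the priority is determined from the custom alarm coefficient, the first weight and the target usage frequency; for an event acquired as an underlying reported event, the priority is determined from the custom alarm coefficient, the second weight and the target usage frequency. This keeps the treatment of each alarm event fair, and also lets users assign a higher custom alarm coefficient to particular monitored resources according to personal preference, which improves the user experience.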
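As a sketch of the three-factor product, the `Alarm` dataclass below stores the target usage frequency, the type-specific weight and the custom coefficient, and derives the priority from them. The class, its field names and the numeric values are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
from dataclasses import dataclass


@dataclass
class Alarm:
    name: str
    kind: str                      # "threshold" or "event"
    target_usage_frequency: float
    weight: float                  # first weight (threshold) or second weight (event)
    custom_coefficient: float = 1.0

    @property
    def priority(self):
        """Product of usage frequency, type-specific weight and custom coefficient."""
        return self.target_usage_frequency * self.weight * self.custom_coefficient


alarms = [
    Alarm("cpu_usage", "threshold", 0.5, 0.5),            # 0.25
    Alarm("nic_disconnected", "event", 0.25, 0.5, 4.0),   # 0.5 after user boost
    Alarm("disk_usage", "threshold", 0.25, 0.5),          # 0.125
]
ranked = [a.name for a in sorted(alarms, key=lambda a: a.priority, reverse=True)]
```

Without the custom coefficient of 4.0 the network card alarm would tie with the disk alarm at 0.125; with it, the alarm moves to the top of the list, which is the kind of user-driven adjustment the embodiment describes.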
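To help those skilled in the art understand the technical solution of the present application more clearly, refer to FIG. 3. FIG. 3 is a schematic diagram of the functional modules corresponding to a method for processing an alarm event provided by an embodiment of the present application. As shown in FIG. 3, the functional modules involved in the method include an alarm sorting device 1, a spatio-temporal priority evaluation device 2, a resource busy/idle time distribution table 3, an alarm event reporting device 4, a monitoring threshold judgment device 5 and a resource monitoring storage medium 6.
These devices are all implemented in software and are presented to aid understanding of the method for processing an alarm event provided by the present application. The resource monitoring storage medium 6 saves the monitoring records of all resources to be monitored at a fixed sampling period, continuously adding the latest data and discarding the oldest data. The monitoring threshold judgment device 5 continuously reads the monitoring storage medium and compares the monitoring value of an alarm event with the threshold, which is the determination method for threshold-type alarm events described in the above embodiments. The alarm event reporting device 4 collects the various events sent by the underlying layer of the resources, filters out the alarm-related events and reports them to the system. The resource busy/idle time distribution table 3 uses a linear fitting algorithm to estimate, from the recent historical busy/idle time distribution data, the usage frequency of each resource in the system at different times of the current day. The spatio-temporal priority evaluation device 2 feeds the time and space data of an alarm into a parameter model to score the importance of the alarm, and updates the score dynamically according to the latest monitoring values and reported events. The alarm sorting device 1 sorts the alarms according to the scores assigned by the spatio-temporal priority evaluation device.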
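The source does not say how the linear fit is applied to the historical data. One possible reading, sketched below purely as an assumption, is to fit a straight line through each hour's usage frequency over the most recent days and extrapolate it to the current day. The functions `linear_fit` and `predict_today` and the sample history are hypothetical.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares line through (xs, ys); returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct x values to fit a line")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_today(history):
    """history: one list of per-period usage frequencies per past day, oldest first.

    Returns the extrapolated usage frequency of each period for the next day.
    """
    days = list(range(len(history)))
    today = len(history)
    prediction = []
    for period in range(len(history[0])):
        slope, intercept = linear_fit(days, [day[period] for day in history])
        prediction.append(slope * today + intercept)
    return prediction


# Two periods per day for brevity; a real table would hold R0..R23.
history = [[1.0, 4.0], [2.0, 4.0], [3.0, 4.0]]
predicted = predict_today(history)
```

In this toy history the first period grows by 1.0 per day and the second stays flat, so the extrapolation for the next day is about 4.0 for both. A production system would probably clamp the result to a valid range and use a sliding window of recent days.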
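The above embodiments describe the method for processing an alarm event in detail. The present application also provides embodiments corresponding to a device for processing an alarm event. It should be noted that the device embodiments are described from two perspectives: one based on functional modules and the other based on hardware.
Because the device embodiments correspond to the method embodiments, refer to the description of the method embodiments for the device embodiments; the details are not repeated here.
FIG. 4 is a structural diagram of a device for processing an alarm event provided by an embodiment of the present application. As shown in FIG. 4, the device for processing an alarm event includes:
a first acquisition module 10, configured to acquire an event generated by a resource to be monitored during operation;
a judgment module 11, configured to determine whether the acquired event satisfies an alarm condition and, if so, to trigger a second acquisition module;
the second acquisition module 12, configured to acquire the alarm parameters set for the alarm event that satisfies the alarm condition, wherein the alarm parameters include at least a target usage frequency, in the current time period, of the resource to be monitored that corresponds to the alarm event, and an impact factor characterizing the severity of the alarm event;
a determination module 13, configured to determine the priority of the alarm event according to the acquired alarm parameters, so that alarm events are displayed in the determined priority order when a user viewing request is received.
FIG. 5 is a structural diagram of a device for processing an alarm event provided by another embodiment of the present application. As shown in FIG. 5, the device for processing an alarm event includes: a memory 20, configured to store computer-readable instructions;
a processor 21, configured to implement the steps of the method for processing an alarm event described in the above embodiments when executing the computer-readable instructions.
The device for processing an alarm event provided by this embodiment may include, but is not limited to, a smartphone, a tablet computer, a laptop computer or a desktop computer.
The processor 21 may include one or more processing cores, for example a 4-core processor or an 8-core processor. The processor 21 may be implemented in at least one of the following hardware forms: a digital signal processor (Digital Signal Processor, DSP), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or a programmable logic array (Programmable Logic Array, PLA).
The processor 21 may also include a main processor and a coprocessor. The main processor is a processor for processing data in the wake-up state, also called a central processing unit; the coprocessor is a low-power processor for processing data in the standby state. In some embodiments, the processor 21 may be integrated with a GPU, which is responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 21 may also include an artificial intelligence (Artificial Intelligence, AI) processor, which handles computing operations related to machine learning.
The memory 20 may include one or more non-volatile computer-readable storage media, which may be non-transitory. The memory 20 may also include high-speed random access memory and non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In this embodiment, the memory 20 is at least used to store the following computer-readable instructions 201, which, after being loaded and executed by the processor 21, can implement the relevant steps of the method for processing an alarm event disclosed in any one of the foregoing embodiments. In addition, the resources stored in the memory 20 may also include an operating system 202 and data 203, and the storage may be temporary or permanent. The operating system 202 may include Windows, Unix, Linux and so on. The data 203 may include, but is not limited to, data used by the method for processing an alarm event.
In some embodiments, the device for processing an alarm event may further include a display screen 22, an input/output interface 23, a communication interface 24, a power supply 25 and a communication bus 26.
Those skilled in the art will understand that the structure shown in FIG. 5 does not constitute a limitation on the device for processing an alarm event, which may include more or fewer components than those illustrated.
Finally, the present application also provides an embodiment corresponding to a non-volatile computer-readable storage medium. The non-volatile computer-readable storage medium stores computer-readable instructions which, when executed by one or more processors, implement the steps described in the above method embodiments.
It can be understood that if the methods in the above embodiments are implemented in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and performs all or part of the steps of the methods in the embodiments of the present application. The aforementioned storage media include various media that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (Read-Only Memory, ROM), random access memory (Random Access Memory, RAM), a magnetic disk or an optical disc.
The method and device for processing an alarm event and the non-volatile computer-readable storage medium provided by the present application have been described in detail above. The embodiments in the specification are described progressively: each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments can be cross-referenced. Because the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and the relevant points can be found in the description of the method. It should be pointed out that a person of ordinary skill in the art can make several improvements and modifications to the present application without departing from its principles, and these improvements and modifications also fall within the scope of protection of the claims of the present application.
It should also be noted that, in this specification, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article or device that includes the element.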

Claims (16)

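  1. A method for processing an alarm event, characterized by comprising:
    acquiring an event generated by a resource to be monitored during operation;
    determining whether the acquired event satisfies an alarm condition;
    in response to the acquired event satisfying the alarm condition, acquiring alarm parameters set for the alarm event that satisfies the alarm condition, wherein the alarm parameters include at least a target usage frequency, in the current time period, of the resource to be monitored that corresponds to the alarm event, and an impact factor characterizing the severity of the alarm event; and
    determining the priority of the alarm event according to the acquired alarm parameters, so that alarm events are displayed in the determined priority order when a user viewing request is received.
  2. The method for processing an alarm event according to claim 1, characterized in that acquiring an event generated by a resource to be monitored during operation comprises: acquiring a monitoring value of an item to be monitored of the resource to be monitored and/or an event reported by the resource's underlying layer;
    in response to the monitoring value being acquired, determining whether the acquired event satisfies an alarm condition comprises: determining whether the monitoring value exceeds a threshold; in response to the monitoring value exceeding the threshold, determining that the event satisfies the alarm condition; and in response to the monitoring value not exceeding the threshold, determining that the event does not satisfy the alarm condition;
    in response to the underlying reported event being acquired, determining whether the acquired event satisfies an alarm condition comprises: determining whether the underlying reported event is an event set on an alarm blacklist; in response to the underlying reported event being an event set on the alarm blacklist, determining that the event satisfies the alarm condition; and in response to the underlying reported event not being an event set on the alarm blacklist, determining that the event does not satisfy the alarm condition.
  3. The method for processing an alarm event according to claim 2, characterized in that, if the monitoring value is acquired and the event is the alarm event, the impact factor characterizing the severity of the alarm event is: a first weight corresponding to the difference between the monitoring value and the threshold; wherein the difference and the first weight are positively correlated.
  4. The method for processing an alarm event according to claim 2, characterized in that, if the underlying reported event is acquired and the event is the alarm event, the impact factor characterizing the severity of the alarm event is: a second weight corresponding to the cumulative number of reports of the alarm event; wherein the cumulative number of reports and the second weight are positively correlated.
  5. The method for processing an alarm event according to claim 1, characterized in that the target usage frequency is determined as follows:
    acquiring the target time period in which the alarm event occurred and the target resource corresponding to the alarm event; and
    selecting, from a correspondence containing the resources, the time periods and the usage frequencies, the usage frequency corresponding to the target time period and the target resource as the target usage frequency.
  6. The method for processing an alarm event according to claim 1 or 5, characterized in that the usage frequency of the resource to be monitored that corresponds to the alarm event is recorded for each time period, and the target usage frequency is the usage frequency of the period in which the alarm event occurred on the resource to be monitored.
  7. The method for processing an alarm event according to claim 5, characterized in that the correspondence containing the resources, the time periods and the usage frequencies is determined by applying a linear fitting algorithm to the historical usage frequency of each resource in each time period.
  8. The method for processing an alarm event according to claim 2, characterized in that, if the acquired event satisfies the alarm condition, before the step of acquiring the alarm parameters set for the alarm event that satisfies the alarm condition, the method further comprises:
    determining whether the current alarm event and an event already determined to be an alarm event belong to the same monitoring item of the same resource, or whether the current alarm event and an event already determined to be an alarm event are the same underlying reported event of the same resource; and
    in response to the determination result being yes, deleting the corresponding event already determined to be an alarm event, and proceeding to the step of acquiring the alarm parameters set for the alarm event that satisfies the alarm condition.
  9. The method for processing an alarm event according to claim 2, characterized in that the method further comprises:
    in response to the acquired event not satisfying the alarm condition, determining whether the current event that does not satisfy the alarm condition and an event already determined to be an alarm event belong to the same monitoring item of the same resource, or are the same underlying reported event of the same resource; and
    in response to the determination result being yes, deleting the corresponding event already determined to be an alarm event.
  10. The method for processing an alarm event according to claim 9, characterized in that, in response to the current event that does not satisfy the alarm condition and the event already determined to be an alarm event belonging to the same monitoring item of the same resource, before the step of deleting the corresponding event already determined to be an alarm event, the method further comprises:
    recording the number of consecutive times that the monitoring value of the same item to be monitored of the same resource to be monitored is below the threshold;
    determining whether the number exceeds a preset number; and
    in response to the number exceeding the preset number, proceeding to the step of deleting the corresponding event already determined to be an alarm event.
  11. The method for processing an alarm event according to claim 4, characterized in that the method further comprises:
    in response to the acquired event not satisfying the alarm condition, determining whether the event that does not satisfy the alarm condition and an event already determined to be an alarm event are the same underlying reported event of the same resource; and
    in response to the event that does not satisfy the alarm condition and the event already determined to be an alarm event being the same underlying reported event of the same resource, resetting the cumulative number of reports to zero.
  12. The method for processing an alarm event according to claim 3 or 4, characterized in that the alarm parameters further include a custom alarm coefficient, and correspondingly, determining the priority of the alarm event according to the acquired alarm parameters comprises:
    taking the product of the target usage frequency, the first weight and the custom alarm coefficient as the priority of the alarm event.
  13. The method for processing an alarm event according to claim 3 or 4, characterized in that the alarm parameters further include a custom alarm coefficient, and correspondingly, determining the priority of the alarm event according to the acquired alarm parameters further comprises:
    taking the product of the target usage frequency, the second weight and the custom alarm coefficient as the priority of the alarm event.
  14. A device for processing an alarm event, characterized by comprising:
    a first acquisition module, configured to acquire an event generated by a resource to be monitored during operation;
    a judgment module, configured to determine whether the acquired event satisfies an alarm condition and, in response to the event satisfying the alarm condition, to trigger a second acquisition module;
    the second acquisition module, configured to acquire alarm parameters set for the alarm event that satisfies the alarm condition, wherein the alarm parameters include at least a target usage frequency, in the current time period, of the resource to be monitored that corresponds to the alarm event, and an impact factor characterizing the severity of the alarm event; and
    a determination module, configured to determine the priority of the alarm event according to the acquired alarm parameters, so that alarm events are displayed in the determined priority order when a user viewing request is received.
  15. A device for processing an alarm event, characterized by comprising a memory configured to store computer-readable instructions; and
    one or more processors configured to implement the steps of the method for processing an alarm event according to any one of claims 1 to 13 when executing the computer-readable instructions.
  16. A non-volatile computer-readable storage medium, characterized in that the non-volatile computer-readable storage medium stores computer-readable instructions which, when executed by one or more processors, implement the steps of the method for processing an alarm event according to any one of claims 1 to 13.
PCT/CN2022/115339 2022-01-21 2022-08-28 Method and device for processing an alarm event, and computer-readable storage medium WO2023138058A1 (zh)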
PCT/CN2022/115339 2022-01-21 2022-08-28 一种告警事件的处理方法、装置及计算机可读存储介质 WO2023138058A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210073375.5A CN114443429B (zh) 2022-01-21 2022-01-21 Method and device for processing an alarm event, and computer-readable storage medium
CN202210073375.5 2022-01-21

Publications (1)

Publication Number Publication Date
WO2023138058A1 (zh) 2023-07-27

Family

ID=81367483

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/115339 WO2023138058A1 (zh) 2022-01-21 2022-08-28 Method and device for processing an alarm event, and computer-readable storage medium

Country Status (2)

Country Link
CN (1) CN114443429B (zh)
WO (1) WO2023138058A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
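CN116962080A (zh) * 2023-09-19 2023-10-27 中孚信息股份有限公司 Alarm filtering method, system and medium based on network node risk assessment
CN117554385A (zh) * 2023-11-02 2024-02-13 上海感图网络科技有限公司 Acceptance alarm method, apparatus, device and storage medium for consecutive scrapping in the same direction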

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
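CN114443429B (zh) * 2022-01-21 2024-05-28 苏州浪潮智能科技有限公司 Method and device for processing an alarm event, and computer-readable storage medium
CN115562894A (zh) * 2022-12-07 2023-01-03 云账户技术(天津)有限公司 Message processing method and apparatus, electronic device and readable storage medium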

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090193436A1 (en) * 2008-01-30 2009-07-30 Inventec Corporation Alarm display system of cluster storage system and method thereof
CN102034148A (zh) * 2010-12-08 2011-04-27 山东浪潮齐鲁软件产业股份有限公司 Method for implementing event early warning and an anti-storm strategy in a monitoring system
US20130290783A1 (en) * 2012-04-27 2013-10-31 General Instrument Corporation Estimating a Severity Level of a Network Fault
US8738972B1 (en) * 2011-02-04 2014-05-27 Dell Software Inc. Systems and methods for real-time monitoring of virtualized environments
CN104750596A (zh) * 2013-12-30 2015-07-01 中国移动通信集团公司 Alarm information processing method and service subsystem
CN106844165A (zh) * 2016-12-16 2017-06-13 华为技术有限公司 Alarm method and device
US20180341566A1 (en) * 2017-05-24 2018-11-29 Vmware, Inc. Methods and systems to prioritize alerts with quantification of alert impacts
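CN109284215A (zh) * 2018-09-20 2019-01-29 郑州云海信息技术有限公司 Alarm method and device for a monitoring platform of a data center
CN111782462A (zh) * 2020-06-13 2020-10-16 华青融天(北京)软件股份有限公司 Alarm method, apparatus and electronic device
CN114443429A (zh) * 2022-01-21 2022-05-06 苏州浪潮智能科技有限公司 Method and device for processing an alarm event, and computer-readable storage medium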

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
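CN106341253A (zh) * 2015-07-13 2017-01-18 中兴通讯股份有限公司 Alarm management method and device, and communication system
CN108920098A (zh) * 2018-06-20 2018-11-30 郑州云海信息技术有限公司 Method, system and device for collecting information by a storage management system
CN110875841A (zh) * 2018-09-04 2020-03-10 广东神马搜索科技有限公司 Alarm information pushing method and device, and readable storage medium
CN109559018A (zh) * 2018-10-31 2019-04-02 中国石油天然气集团有限公司 Alarm level evaluation method and system
CN110365642B (zh) * 2019-05-31 2022-06-03 平安科技(深圳)有限公司 Method and apparatus for monitoring information operations, computer device and storage medium
CN110704283A (zh) * 2019-09-05 2020-01-17 北京浪潮数据技术有限公司 Method, device and medium for uniformly generating alarm information
CN112714030B (zh) * 2021-03-24 2021-06-22 腾讯科技(深圳)有限公司 Alarm method, apparatus, device and computer-readable storage medium
CN113835916A (zh) * 2021-08-31 2021-12-24 济南浪潮数据技术有限公司 Alarm method, system and device based on the Ambari big data platform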

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090193436A1 (en) * 2008-01-30 2009-07-30 Inventec Corporation Alarm display system of cluster storage system and method thereof
CN102034148A (zh) * 2010-12-08 2011-04-27 山东浪潮齐鲁软件产业股份有限公司 Method for implementing event early warning and an anti-storm strategy in a monitoring system
US8738972B1 (en) * 2011-02-04 2014-05-27 Dell Software Inc. Systems and methods for real-time monitoring of virtualized environments
US20130290783A1 (en) * 2012-04-27 2013-10-31 General Instrument Corporation Estimating a Severity Level of a Network Fault
CN104750596A (zh) * 2013-12-30 2015-07-01 中国移动通信集团公司 Alarm information processing method and service subsystem
CN106844165A (zh) * 2016-12-16 2017-06-13 华为技术有限公司 Alarm method and device
US20180341566A1 (en) * 2017-05-24 2018-11-29 Vmware, Inc. Methods and systems to prioritize alerts with quantification of alert impacts
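CN109284215A (zh) * 2018-09-20 2019-01-29 郑州云海信息技术有限公司 Alarm method and device for a monitoring platform of a data center
CN111782462A (zh) * 2020-06-13 2020-10-16 华青融天(北京)软件股份有限公司 Alarm method, apparatus and electronic device
CN114443429A (zh) * 2022-01-21 2022-05-06 苏州浪潮智能科技有限公司 Method and device for processing an alarm event, and computer-readable storage medium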

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
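CN116962080A (zh) * 2023-09-19 2023-10-27 中孚信息股份有限公司 Alarm filtering method, system and medium based on network node risk assessment
CN116962080B (zh) * 2023-09-19 2023-12-15 中孚信息股份有限公司 Alarm filtering method, system and medium based on network node risk assessment
CN117554385A (zh) * 2023-11-02 2024-02-13 上海感图网络科技有限公司 Acceptance alarm method, apparatus, device and storage medium for consecutive scrapping in the same direction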

Also Published As

Publication number Publication date
CN114443429A (zh) 2022-05-06
CN114443429B (zh) 2024-05-28

Similar Documents

Publication Publication Date Title
WO2023138058A1 (zh) Method and device for processing an alarm event, and computer-readable storage medium
US10972344B2 (en) Automated adjustment of subscriber policies
US11212208B2 (en) Adaptive metric collection, storage, and alert thresholds
CN106961352B (zh) Monitoring system and monitoring method
US9584617B2 (en) Allocating cache request in distributed cache system based upon cache object and marker identifying mission critical data
US9292407B2 (en) System and method for adaptively collecting performance and event information
CN105471671A (zh) Method for customizing monitoring rules for cloud platform resources
US9027025B2 (en) Real-time database exception monitoring tool using instance eviction data
US10896073B1 (en) Actionability metric generation for events
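CN110955586A (zh) Log-based system fault prediction method, apparatus and device
CN114091704B (zh) Alarm suppression method and device
CN110321364B (zh) Transaction data query method, device and terminal for a credit card management system
CN111339466A (zh) Interface management method and apparatus, electronic device and readable storage medium
CN111782488B (zh) Message queue monitoring method and apparatus, electronic device and medium
CN114490160A (zh) Method, apparatus, device and medium for automatically adjusting a data skew optimization factor
CN117472652A (zh) Data backup method, device and system for a cloud computing operation and maintenance platform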
US10223189B1 (en) Root cause detection and monitoring for storage systems
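CN108255710B (zh) Script anomaly detection method and terminal thereof
CN115718732A (zh) Disk file management method, apparatus, device and storage medium
CN116436821A (zh) Operation and maintenance management software system based on an artificial intelligence computing platform
CN112905119B (zh) Data write control method, apparatus and device for a distributed storage system
CN110493071B (zh) Message system resource balancing apparatus, method and device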
US9898357B1 (en) Root cause detection and monitoring for storage systems
KR20180047079A (ko) Method and apparatus for determining the event grade of monitoring results
CN115114133B (zh) Java-based system adaptive rate limiting method, device and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22921486

Country of ref document: EP

Kind code of ref document: A1