CN117806916A - Multi-unit server lightweight alarm correlation mining and converging method and system - Google Patents

Multi-unit server lightweight alarm correlation mining and converging method and system

Info

Publication number
CN117806916A
Authority
CN
China
Prior art keywords
alarm
new
convergence
alarms
time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410230348.3A
Other languages
Chinese (zh)
Inventor
袁远
周桐庆
王俊
邢建英
李志星
谢徐超
宋振龙
魏登萍
张根
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN202410230348.3A priority Critical patent/CN117806916A/en
Publication of CN117806916A publication Critical patent/CN117806916A/en
Pending legal-status Critical Current

Landscapes

  • Alarm Systems (AREA)

Abstract

The invention discloses a multi-unit server lightweight alarm correlation mining and converging method and system, wherein the method firstly provides an alarm deduplication method based on level correlation for deduplicating alarms; the structure of the alarm convergence support tree is adopted as an adaptation improvement for the alarm convergence of the traditional dictionary tree, and the structure realizes the efficient alarm deduplication and convergence based on the shared memory by adding an alarm time list and an associated alarm pointer linked list. On the basis, an alarm convergence method based on space-time correlation is also provided for mining the support degree of different alarm combinations and screening the alarm association relation based on space-time confidence. The invention aims to realize the aggregation and convergence of massive alarm information based on the alarm time-space correlation, and automatically and efficiently reduce the redundant alarm information of the multi-unit blade server system through the alarm record data structure and alarm association discovery so as to realize the efficient management of the alarm data of the multi-unit server system.

Description

Multi-unit server lightweight alarm correlation mining and converging method and system
Technical Field
The invention relates to the technical field of multi-unit server lightweight fault management or alarm convergence/aggregation, in particular to a multi-unit server lightweight alarm correlation mining and convergence method and system.
Background
Supercomputer centers and data centers typically deploy a large number of blade servers with high component density. Each frame holds high-density blade servers comprising several tens of computing motherboards, several switching motherboards (service data network), a monitoring motherboard, several frame power modules and several frame heat dissipation modules (fans). These form a multi-unit server system organized in a three-level architecture of motherboard, frame and system, which provides computing power and services externally. Comprehensive status monitoring, configuration control and debugging and maintenance capabilities are an important guarantee for the stable operation of such a system. At the motherboard level, each motherboard (computing motherboard or switching motherboard) generally integrates a board-level management unit (BMU, Base Management Unit) that monitors and manages a single motherboard. At the frame level (several blades), the management unit is the CMU (Chassis Management Unit) monitoring motherboard, which runs as an independent plug-in unit and is responsible for collecting, storing and processing the monitoring and management information of the BMU of each server motherboard. At the system level, the SMU collects the information and alarms reported by the CMU of each frame and carries out the monitoring and management of the whole system.
In recent years, the component integration of blade servers has kept increasing. The alarm information generated by densely packed components acting as alarm sources is multi-source, mixes multiple levels and contains redundant content, which challenges the efficient operation and management of a multi-unit server system. On the one hand, the same fault or problem in one component can cause a series of chained alarms, so that repeated alarms accumulate at the CMU; on the other hand, the fault of one component can propagate as a chain reaction, so that many alarms are triggered at the CMU within a short time and mask the truly important fault. Therefore, if the CMU cannot effectively filter the alarms within its frame, the alarm data of all motherboards in the frame directly occupies the limited local storage of the CMU, and the CMU further reports a large number of frame alarms to the SMU. This consumes a large amount of storage and communication resources to record redundant, complicated and even useless alarm information (most of which does not contain the real cause of the fault) that belongs to only a small number of faults. For larger and more complex domestic exascale (E-level) supercomputing systems, the number of nodes multiplies (reaching the order of hundreds of thousands, with each computing node carrying tens or more software and hardware monitoring metrics), the number of frames multiplies, and the redundant-alarm problem multiplies with them. This very easily causes system alarm storms and makes it harder for operation and maintenance personnel to locate and handle server faults.
Traditional fault management relies on manual reasoning or preset rules. In fault management based on manual reasoning, operation and maintenance personnel analyze the correlation of fault alarms using professional knowledge, their own experience and preset rules, manually filter and screen the alarm information, identify the fault source among all alarms, and then carry out on-site operation and maintenance. The method based on preset rules merges alarms according to a certain policy, typically a time-window policy (for example, a maximum number of alarms sent within 5 minutes, or an upper limit on the number of alarms per day). Retrieval-based alarm deduplication borrows information retrieval techniques: a new alarm record is matched against historical alarm data, and if a match is found the new alarm is considered a duplicate and is not recorded. The core of this method is to compare the similarity of alarm texts using natural language processing techniques, for example comparing structured alarm reports with BM25F. Technical improvements mainly consist of combining the contexts of multiple alarms for joint comparison on top of text matching, or modeling the alarm matching process as a support vector machine (SVM) classification model. Clustering-based fault alarm aggregation uses a clustering algorithm to select representative alarms.
The basic process is as follows: all alarms are mapped (encoded) into a common feature space, the distance between encoded alarms is measured by metrics such as cosine similarity, and a clustering model such as DBSCAN, K-means or spectral clustering divides the mass of alarms into a limited number of alarm clusters. The centroid alarm of each cluster is then selected to represent all alarms in the cluster, and the other alarms in the cluster are regarded as repeated or associated alarms and are not reported to the system. One application of the clustering approach is conceptual clustering of alarms: the alarms are divided according to hierarchical attributes (machine room, service pool, frame) using the conceptual clustering algorithm AOI (Attribute Oriented Induction), and can be abstracted at different levels (for example, aggregated at the machine room level so that only one alarm is retained). Alarm aggregation based on supervised encoding discovers repeated or similar alarms with the help of a deep learning model. This method extracts alarm features through word embedding and detects alarm repetition and similarity by classification, using models such as a multilayer perceptron (MLP) or more complex models such as a Siamese convolutional neural network or an LSTM. The input is mostly a pair of alarms and the output is mostly the probability that the pair belongs to the same class; alarms detected as similar can finally be deduplicated according to policies such as time order. The deep model needs a large amount of labeled sample data for parameter training and can generally provide outstanding redundant-alarm detection accuracy.
In addition, in practice an alarm can be bound to a corresponding intervention (automatic capacity expansion, restart) through an automatic recovery policy: when the SMU finds the alarm, it automatically executes the intervention bound to it, so that the alarm is eliminated directly and manual intervention is reduced. However, existing methods of this kind are often limited to eliminating individual alarms, and the alarm storm problem remains for a multi-unit server system with a massive number of alarms.
In summary, existing alarm convergence techniques or methods suffer from the following drawbacks: 1) The traditional methods depend excessively on the skill and experience of operation and maintenance personnel, cannot be effectively extended to automatic alarm management of large-scale servers, and cannot quickly adapt to new alarm characteristics in dynamic scenarios. In particular, the method based on manual reasoning has low operation and maintenance efficiency and high operation and maintenance cost. The traditional methods can therefore generally serve only as the last resort for alarm deduplication. 2) Retrieval-based alarm deduplication and clustering-based alarm aggregation mainly examine the syntactic or contextual similarity of alarms and ignore the logical correlation between them. For example, alarm A and alarm B may differ greatly in their text yet actually be alarms from two devices caused by the same fault, and are therefore logically related; clearly these two methods cannot effectively merge such logically related alarms. 3) Alarm aggregation based on supervised learning requires training a deep model on labeled data. The cost of labeling and preprocessing the data is high, the parameter training of the deep model introduces a large computational cost, and high demands are placed on the running hardware. The computing power of the SMU is typically small compared with that of a computing module; for example, an SMU typically uses a D2000 chip (4 cores, 2 GHz clock frequency) with 16 GB of memory, which can hardly support model training and real-time inference tasks. In addition, deep models lack interpretability and are unfriendly to downstream alarm analysis or root cause localization tasks.
In summary, for alarm analysis in a multi-unit server system, existing alarm convergence technology is limited both in mining semantic associations and in analysis efficiency under limited hardware resources, and can hardly provide efficient alarm association analysis and alarm convergence processing.
Disclosure of Invention
The invention aims to solve the technical problems: aiming at the problems in the prior art, the invention provides a multi-unit server lightweight alarm correlation mining and converging method and a system, which aim to realize the aggregation and convergence of massive alarm information based on alarm time-space correlation, and automatically and efficiently reduce redundant alarm information of a large-scale blade server system through an alarm record data structure and alarm correlation discovery so as to realize the efficient management of alarm data of the multi-unit server system.
In order to solve the technical problems, the invention adopts the following technical scheme:
a multi-unit server lightweight alarm correlation mining and convergence method comprises the following steps:
S101, receiving a new alarm A_new;
S102, finding the nearest-neighbor alarm A_ngb, i.e. the alarm with the latest time, in the alarm timestamp list;
S103, judging whether the alarm level of the new alarm A_new is greater than that of the nearest-neighbor alarm A_ngb; if it is greater, further judging whether the time interval between the new alarm A_new and the nearest-neighbor alarm A_ngb is within a preset time window length T_thres; if the time interval is within the preset time window length T_thres, deleting the nearest-neighbor alarm A_ngb and recording the new alarm A_new, which completes one alarm deduplication; otherwise, performing no alarm deduplication and recording the new alarm A_new, wherein if a timestamp already exists for the alarm level of the new alarm A_new, the alarm of the same alarm level is replaced;
the alarm level of the new alarm A_new and the alarm level of the nearest-neighbor alarm A_ngb are recorded in an alarm convergence support tree; the alarm convergence support tree takes all sensors in the multi-unit server system as alarm sources, which are named by prefix according to the physical device to which each sensor belongs; a generated alarm is a two-tuple of sensor string and alarm level; all sensor strings are encoded into a tree structure in which the nodes of each layer are arranged from left to right in lexicographic order and sensors with the same prefix share a common parent node; each node carries a binary flag indicating whether the string formed from the root node to the current node corresponds to a sensor alarm source; each leaf node or internal node flagged as an alarm source is linked to a three-level alarm timestamp list, which is a three-element list whose elements record, in order, the timestamps of the most recent non-critical, critical and non-recoverable alarms, the element at a position with no alarm being empty.
Step S103 is further followed by associated alarm aggregation based on spatio-temporal correlation: S201, dividing the alarms in the alarm convergence support tree into several sets according to occurrence time, and regarding alarms that occur in the same time window as co-occurring, thereby obtaining several co-occurrence alarm sets, each of which contains at least one alarm; S202, calculating the support of the 1-item sets and the support of the 2-item sets of the different alarms using the Apriori algorithm, wherein a 1-item set is a single alarm and a 2-item set consists of two alarms; S203, calculating the spatio-temporal confidence of each potential alarm association from the weights preset according to the proximity relation of the sensors, the support of the 1-item sets and the support of the 2-item sets; S204, deleting from the alarm convergence support tree the alarm associations whose spatio-temporal confidence is lower than a preset threshold, and updating the alarm convergence support tree.
Optionally, step S103 further includes: if the alarm level of the new alarm A_new is less than or equal to that of the nearest-neighbor alarm A_ngb, judging whether the time interval between the new alarm A_new and the nearest-neighbor alarm A_ngb is within the preset time window length T_thres; if the time interval is within the preset time window length T_thres, ignoring the nearest-neighbor alarm A_ngb, which completes one alarm deduplication; otherwise, performing no alarm deduplication and recording the new alarm A_new, wherein if a timestamp already exists for the alarm level of the new alarm A_new, the alarm of the same alarm level is replaced.
Optionally, the alarm convergence support tree is stored in a shared memory to support updating and accessing of structures by different alarm source processing processes.
Optionally, the alarm convergence support tree does not make a full record of alarms, and the callback function is set to record the alarms in the list to the database when the alarm timestamp list is updated or periodically.
Optionally, in step S203, the functional expression for calculating the spatial-temporal confidence coefficient of each potential alarm association relationship is:
[The formula is not reproduced in this text.]
In the above formula, the left-hand side is the spatio-temporal confidence of the association between alarms A_i and A_j; min denotes taking the minimum value; S_ij is the weight set for the proximity relation of alarms A_i and A_j; the 2-item-set support terms are that of alarms A_i and A_j and that of alarms A_i and A_k, where A_k denotes any alarm that has an association with A_i; and the 1-item-set support term is that of alarm A_i.
Optionally, the preset weight set in step S203 according to the proximity relation of the sensor satisfies the following constraint relation: the proximity weight between different sensors of the same device > the proximity weight of the adjacent device sensor > the proximity weight of the spacing device sensor.
In addition, the invention also provides a multi-unit server lightweight alarm correlation mining and convergence system, which comprises a microprocessor and a memory which are connected with each other, wherein the microprocessor is programmed or configured to execute the multi-unit server lightweight alarm correlation mining and convergence method.
Furthermore, the present invention provides a computer readable storage medium having stored therein a computer program for programming or configuring by a microprocessor to perform the multi-unit server lightweight alarm correlation mining and convergence method.
Furthermore, the present invention provides a computer program product comprising a computer program/instructions programmed or configured to execute the multi-unit server lightweight alarm correlation mining and convergence method by a processor.
Compared with the prior art, the invention has the following advantages: the invention can efficiently filter and converge the massive alarm data generated by a multi-unit server system (supercomputing center or data center) with little operation and maintenance manpower and with limited hardware resources. By eliminating frame-level redundant alarms, mining the spatio-temporal associations among alarms and aggregating alarm data, the invention distills, in a controllable way, the key alarms to be reported to the system, thereby reducing the workload and difficulty for operation and maintenance personnel and lowering the operation and maintenance cost.
Drawings
FIG. 1 is a schematic diagram of a redundant alarm deduplication process in an embodiment of the present invention.
Fig. 2 is a schematic structural diagram of an alarm convergence support tree according to an embodiment of the present invention.
FIG. 3 is a schematic diagram of association alarm aggregation based on spatio-temporal correlation in an embodiment of the present invention.
Detailed Description
As shown in fig. 1, the multi-unit server lightweight alarm correlation mining and convergence method of the present embodiment includes:
S101, receiving a new alarm A_new;
S102, finding the nearest-neighbor alarm A_ngb, i.e. the alarm with the latest time, in the alarm timestamp list;
S103, judging whether the alarm level of the new alarm A_new is greater than that of the nearest-neighbor alarm A_ngb; if it is greater, further judging whether the time interval between the new alarm A_new and the nearest-neighbor alarm A_ngb is within a preset time window length T_thres; if the time interval is within the preset time window length T_thres, deleting the nearest-neighbor alarm A_ngb and recording the new alarm A_new, which completes one alarm deduplication (the alarm count decreases by 1); otherwise, performing no alarm deduplication and recording the new alarm A_new, wherein if a timestamp already exists for the alarm level of the new alarm A_new, the alarm of the same alarm level is replaced.
As an optional implementation, as shown in fig. 1, step S103 in this embodiment further includes: if the alarm level of the new alarm A_new is less than or equal to that of the nearest-neighbor alarm A_ngb, judging whether the time interval between the new alarm A_new and the nearest-neighbor alarm A_ngb is within the preset time window length T_thres; if the time interval is within the preset time window length T_thres, ignoring the nearest-neighbor alarm A_ngb, which completes one alarm deduplication (the alarm count decreases by 1); otherwise, performing no alarm deduplication and recording the new alarm A_new, wherein if a timestamp already exists for the alarm level of the new alarm A_new, the alarm of the same alarm level is replaced.
During operation of a multi-unit server system, the chassis management unit CMU collects alarm data from the board-level management units (BMUs) through SNMP Trap. Operation and maintenance practice shows that when a BMU detects that a sensor is in the highest-level nr state, it triggers the non-critical (nc), critical (cr) and non-recoverable (nr) alarms in sequence from low to high, and the times of the 3 alarms usually differ very little (empirically, within 1 second); similarly, for an alarm in the cr state, both the nc and cr alarms are triggered together. This redundancy is caused by the reporting mechanism of the BMU. Clearly, only the highest-level alarm needs to be reported; the lower-level alarms triggered along with it are redundant and need to be removed. Because the computing power of the BMU is low, it is impractical to rely on the BMU itself to deduplicate redundant alarms. This embodiment exploits the correlation among the three alarm levels nc, cr and nr: deduplication is performed efficiently by maintaining the three-level alarm timestamp list of the alarm convergence support tree of the security management unit SMU and by setting a time correlation window T_thres, with the goal of retaining only the single highest-level alarm within one window, thereby achieving alarm convergence.
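The level-correlated deduplication of steps S101 to S103 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `TimestampList`, `handle_alarm` and `T_THRES`, the 1-second window and the use of float timestamps are all hypothetical. For the branch where the new alarm's level is less than or equal to that of its nearest neighbor, the sketch assumes that the redundant lower-level alarm is dropped so that only the highest-level alarm of the window survives; this is an interpretation of the stated goal rather than the literal wording of the step.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Alarm levels ordered from low to high: non-critical < critical < non-recoverable.
LEVELS = {"nc": 0, "cr": 1, "nr": 2}
T_THRES = 1.0  # assumed time window length in seconds (hypothetical value)


@dataclass
class TimestampList:
    """Three-level alarm timestamp list; None marks a level with no recorded alarm."""
    stamps: List[Optional[float]] = field(default_factory=lambda: [None, None, None])

    def nearest_neighbor(self) -> Optional[Tuple[int, float]]:
        """Return (level index, timestamp) of the most recent recorded alarm, if any."""
        recorded = [(i, t) for i, t in enumerate(self.stamps) if t is not None]
        return max(recorded, key=lambda p: p[1]) if recorded else None


def handle_alarm(ts: TimestampList, level: str, now: float) -> str:
    """Apply S101-S103 to one alarm source and report what happened."""
    new_idx = LEVELS[level]
    ngb = ts.nearest_neighbor()                      # S102
    if ngb is not None and abs(now - ngb[1]) <= T_THRES:
        ngb_idx, _ = ngb
        if new_idx > ngb_idx:                        # S103, higher-level branch
            ts.stamps[ngb_idx] = None                # delete the nearest neighbor
            ts.stamps[new_idx] = now                 # record the new alarm
            return "replaced_neighbor"
        # Assumed reading of the optional branch: drop the redundant alarm.
        return "ignored_new"
    # Outside the window (or first alarm): record, overwriting the same level.
    ts.stamps[new_idx] = now
    return "recorded"
```

Feeding the sequence nc, cr, nr within one second into `handle_alarm` leaves only the nr timestamp in the list, which matches the stated goal of keeping one highest-level alarm per window.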
In this embodiment, the alarm level of the new alarm A_new and the alarm level of the nearest-neighbor alarm A_ngb are recorded in an alarm convergence support tree; after the structure update of step S103 is completed, the result is written to the fault database so that the front end can update alarms in real time. As shown in FIG. 2, the alarm convergence support tree takes all sensors in the multi-unit server system as alarm sources, which are named by prefix according to the physical device to which each sensor belongs (for example, frame 1-blade 2-HDD-temperature). A generated alarm is a two-tuple of sensor string and alarm level. All sensor strings are encoded into a tree structure in which the nodes of each layer are arranged from left to right in lexicographic order and sensors with the same prefix share a common parent node. Each node carries a binary flag indicating whether the string formed from the root node to the current node corresponds to a sensor alarm source. Each leaf node or internal node flagged as an alarm source is linked to a three-level alarm timestamp list, a three-element list whose elements record, in order, the timestamps of the most recent non-critical (nc), critical (cr) and non-recoverable (nr) alarms (named timeStamp_lastNC, timeStamp_lastCR and timeStamp_lastNR respectively). The list is maintained dynamically by the level-correlation-based alarm deduplication method shown in fig. 1, and the element at a position with no alarm is empty. Referring to fig. 2, the alarm convergence support tree has a unique root node, under which there are several chassis nodes (chassis 1 to chassis 3); under each chassis node there are several server nodes (server node 1, server node 2 and so on). Taking server node 2 as an example, its prefix organization names include the hard disk (HDD) and the CPU, and under each of these there are several sensors acting as alarm sources, for example sensor 1 and sensor 2.
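The tree structure described above can be sketched as an ordinary prefix tree keyed by name segments. The sketch below is illustrative only: the class names `Node` and `AlarmTree`, the `-` separator and the sample sensor names are assumptions, and placement in shared memory is omitted.

```python
from typing import Dict, List, Optional


class Node:
    """One node of the tree; only nodes flagged as alarm sources carry timestamps."""

    def __init__(self) -> None:
        self.children: Dict[str, "Node"] = {}
        self.is_source = False                              # binary flag
        self.stamps: Optional[List[Optional[float]]] = None  # [nc, cr, nr]


class AlarmTree:
    def __init__(self) -> None:
        self.root = Node()

    def add_source(self, name: str) -> Node:
        """Insert a sensor name such as 'chassis1-server2-HDD-sensor2'."""
        node = self.root
        for part in name.split("-"):
            # Sensors sharing a prefix reuse the same parent nodes.
            node = node.children.setdefault(part, Node())
        if not node.is_source:
            node.is_source = True
            node.stamps = [None, None, None]  # empty until an alarm arrives
        return node

    def find(self, name: str) -> Optional[Node]:
        """Return the alarm-source node for a sensor name, or None."""
        node = self.root
        for part in name.split("-"):
            child = node.children.get(part)
            if child is None:
                return None
            node = child
        return node if node.is_source else None

    def sources(self) -> List[str]:
        """List all alarm sources, visiting each layer in lexicographic order."""
        out: List[str] = []

        def walk(node: Node, prefix: List[str]) -> None:
            if node.is_source:
                out.append("-".join(prefix))
            for key in sorted(node.children):
                walk(node.children[key], prefix + [key])

        walk(self.root, [])
        return out
```

Lookup cost depends only on the number of name segments, not on the number of sensors, which is what makes per-alarm access cheap.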
In this embodiment, the alarm convergence support tree is maintained in shared memory so that different alarm source processing processes can update and access the structure. Completing the alarm association analysis at the memory level accelerates alarm convergence and reduces storage accesses and resource overhead.
In this embodiment, the alarm convergence support tree does not make a full record of alarms, and sets a callback function to record the alarms in the list to the database when the alarm timestamp list is updated or periodically.
Referring to fig. 2, the alarm convergence support tree in this embodiment also links an associated alarm source pointer list to each leaf node or internal node flagged as an alarm source, in order to support the alarm history and association records needed for efficient alarm convergence. The associated alarm source pointer list records the addresses of the alarm timestamp lists of the associated alarm sources discovered by the "associated alarm aggregation based on spatio-temporal correlation". For example, if CPU-sensor1 is identified by that method as a preceding associated alarm of HDD-sensor2, a pointer to the alarm timestamp list of CPU-sensor1 is added to the associated alarm source pointer list of HDD-sensor2. Through the pointer list, when an HDD-sensor2 alarm is received, the associated alarms can be accessed at O(1) time cost to confirm whether a preceding alarm exists and to complete the merging of several associated alarms. A linked list is used so that recorded association addresses can easily be added and deleted as new associated alarms are discovered.
Alarms from different sensors have a 'cause-effect' relationship and often appear in two aspects, namely co-occurrence (time) of alarms and spatial adjacency (space) of alarm sources. Step S103 of this embodiment further includes performing association alarm aggregation based on spatio-temporal correlation:
S201, dividing the alarms in the alarm convergence support tree into several sets according to occurrence time, and regarding alarms that occur in the same time window as co-occurring, thereby obtaining several co-occurrence alarm sets, each of which contains at least one alarm;
S202, calculating the support of the 1-item sets and the support of the 2-item sets of the different alarms using the Apriori algorithm, wherein a 1-item set is a single alarm and a 2-item set consists of two alarms;
S203, calculating the spatio-temporal confidence of each potential alarm association from the weights preset according to the proximity relation of the sensors, the support of the 1-item sets and the support of the 2-item sets;
S204, deleting from the alarm convergence support tree the alarm associations whose spatio-temporal confidence is lower than a preset threshold, and updating the alarm convergence support tree.
In step S201 of this embodiment, by setting the time window T_thres, alarms can be divided into several sets according to their occurrence time, and alarms occurring in the same time window are regarded as co-occurring. Table 1 shows 8 co-occurrence alarm sets: {A_1, A_2, A_3, A_5}, {A_1, A_2, A_4}, {A_5}, {A_2, A_4}, {A_1, A_2, A_4}, {A_1, A_2, A_3}, {A_1, A_2, A_3, A_4}, {A_4}.
Table 1: Examples of alarm co-occurrence counted by time window.
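The window partition of step S201 can be illustrated with a short sketch. It assumes fixed, non-overlapping windows aligned to time zero and alarm records given as (timestamp, alarm id) pairs; the function name `co_occurrence_sets` and the sample data are hypothetical, and the text does not state how window boundaries are actually chosen.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Set, Tuple


def co_occurrence_sets(
    alarms: List[Tuple[float, str]], window: float
) -> List[FrozenSet[str]]:
    """Group (timestamp, alarm_id) records into co-occurrence sets per window."""
    buckets: Dict[int, Set[str]] = defaultdict(set)
    for ts, alarm_id in alarms:
        # Alarms whose timestamps fall into the same window index co-occur.
        buckets[int(ts // window)].add(alarm_id)
    # Windows without alarms produce no set, so every set has at least one alarm.
    return [frozenset(buckets[k]) for k in sorted(buckets)]
```

With a 1-second window, records at 0.1 s and 0.4 s land in the same set, while a record at 1.2 s starts a new one.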
The Apriori algorithm is a classical data mining algorithm for mining frequent item sets and association rules. As shown in fig. 3, step S202 of this embodiment uses the Apriori algorithm on the partition obtained in step S201 to calculate the support of the 1-item sets and of the 2-item sets of the different alarms (the number of time windows in which the item set appears), where a 1-item set is a single alarm and a 2-item set consists of two alarms. The method of this embodiment calculates only 2-item sets and no higher-order item sets, because a many-to-one alarm association (e.g., A_1 A_2 → A_3) necessarily implies several one-to-one associations (e.g., A_1 → A_3 and A_2 → A_3) and can therefore be obtained from 2-item sets, while a one-to-many association (e.g., A_1 → A_2 A_3) is not suitable for dynamic query and merging of alarms (when A_2 occurs, A_3 would have to be checked at the same time, but A_3 may well not have occurred or may occur after A_2). The calculation first specifies a minimum support (for example, 3) and then performs two rounds of joining (combining item sets in lexicographic order) and pruning (deleting item sets whose support is below the minimum support) to obtain the 2-item-set supports.
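As a worked illustration of the support counting described above, the following sketch applies the two Apriori passes to the 8 co-occurrence sets listed for Table 1, with the minimum support of 3 used as the example value in the text. The function name `frequent_items_and_pairs` and the string ids `A1` to `A5` are illustrative choices.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

# The 8 co-occurrence alarm sets listed for Table 1.
WINDOWS: List[Set[str]] = [
    {"A1", "A2", "A3", "A5"},
    {"A1", "A2", "A4"},
    {"A5"},
    {"A2", "A4"},
    {"A1", "A2", "A4"},
    {"A1", "A2", "A3"},
    {"A1", "A2", "A3", "A4"},
    {"A4"},
]


def frequent_items_and_pairs(
    windows: List[Set[str]], min_support: int
) -> Tuple[Dict[str, int], Dict[Tuple[str, str], int]]:
    """Return the frequent 1-item and 2-item sets with their supports."""
    # Pass 1: count each single alarm, then prune by minimum support.
    counts1: Dict[str, int] = {}
    for w in windows:
        for alarm in w:
            counts1[alarm] = counts1.get(alarm, 0) + 1
    freq1 = {a: c for a, c in counts1.items() if c >= min_support}

    # Join: candidate pairs are built only from frequent single alarms,
    # combined in lexicographic order.
    candidates = combinations(sorted(freq1), 2)

    # Pass 2: count the windows containing both alarms, then prune again.
    freq2: Dict[Tuple[str, str], int] = {}
    for a, b in candidates:
        support = sum(1 for w in windows if a in w and b in w)
        if support >= min_support:
            freq2[(a, b)] = support
    return freq1, freq2
```

On this data, A5 (support 2) is pruned in the first pass and the pair (A3, A4) (support 1) in the second, leaving five frequent pairs.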
In this embodiment, the functional expression for calculating the spatial-temporal confidence coefficient of each potential alarm association relationship in step S203 is:
[The formula is not reproduced in this text.]
In the above formula, the left-hand side is the spatio-temporal confidence of the association between alarms A_i and A_j; min denotes taking the minimum value; S_ij is the weight set for the proximity relation of alarms A_i and A_j; the 2-item-set support terms are that of alarms A_i and A_j and that of alarms A_i and A_k, where A_k denotes any alarm that has an association with A_i; and the 1-item-set support term is that of alarm A_i. The more often alarm A_j occurs in the time windows in which alarm A_i occurs, and the closer the sensors of the two alarms are in space, the higher the confidence, based on the current statistics, that A_i is a preceding cause of A_j. By setting a confidence threshold, the CMU can screen out the association pairs that meet the empirical confidence requirement. As shown in fig. 3, a total of 3 association pairs are screened out. The association is not symmetric, which is consistent with the cause-and-effect pattern of alarms.
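Because the exact expression is not available here, the following sketch does not implement the patented formula. It uses an assumed stand-in, the classical association-rule confidence sup(A_i, A_j) / sup(A_i) scaled by the proximity weight S_ij, only to show how supports, proximity weights and a threshold can combine into an asymmetric screening step. The minimum operation and the supports of other associated alarms A_k mentioned in the definitions are not modeled. The weights and the threshold of 0.8 are hypothetical values, and the supports are those of the Table 1 example.

```python
from typing import Dict, Tuple

Pair = Tuple[str, str]

# Supports of the Table 1 example (minimum support 3).
SUP1: Dict[str, int] = {"A1": 5, "A2": 6, "A3": 3, "A4": 5}
SUP2: Dict[Pair, int] = {
    ("A1", "A2"): 5, ("A1", "A3"): 3, ("A1", "A4"): 3,
    ("A2", "A3"): 3, ("A2", "A4"): 4,
}
# Hypothetical proximity weights per unordered pair.
WEIGHT: Dict[Pair, float] = {
    ("A1", "A2"): 1.0, ("A1", "A3"): 0.9, ("A1", "A4"): 0.8,
    ("A2", "A3"): 0.9, ("A2", "A4"): 0.8,
}


def assumed_confidence(cause: str, effect: str) -> float:
    """Assumed stand-in: S_ij * sup(cause, effect) / sup(cause)."""
    a, b = sorted((cause, effect))
    return WEIGHT[(a, b)] * SUP2[(a, b)] / SUP1[cause]


def screen(threshold: float) -> Dict[Pair, float]:
    """Keep the directed (cause, effect) pairs whose confidence meets the threshold."""
    kept: Dict[Pair, float] = {}
    for a, b in SUP2:
        for cause, effect in ((a, b), (b, a)):
            conf = assumed_confidence(cause, effect)
            if conf >= threshold:
                kept[(cause, effect)] = conf
    return kept
```

Under these assumptions the direction matters: A3 as a cause of A1 scores about 0.9, whereas A1 as a cause of A3 scores about 0.54, which illustrates why the screened associations need not be symmetric.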
The weights preset in step S203 according to the proximity relation of the sensors satisfy the following constraint: proximity weight between different sensors of the same device > proximity weight between sensors of adjacent devices > proximity weight between sensors of devices separated by other devices. In this embodiment, the security management unit SMU synchronously maintains a predefined alarm source spatial association graph, as shown in fig. 3. This undirected, fully connected graph sets a weight S_ij according to the proximity relation of the sensors. In the example of the figure, A_1 and A_2 belong to the same device and are assigned a weight of 1, and the weights for adjacent and separated devices are set to 0.9 and 0.8 respectively. In this embodiment, each association obtained by mining is recorded, in pointer form, in the pointer linked list of the effect sensor node in the alarm convergence support tree. As more alarm records accumulate or the support parameters are adjusted, the above steps can be repeated to update the alarm associations and their records in the tree structure. When an alarm arrives, the CMU finds the alarm source node in the alarm convergence support tree, visits the addresses in the associated alarm pointer linked list of that node one by one, and checks whether the alarm timestamp list at each preceding-cause address records an alarm in the same time window. If so, the newly arrived alarm is merged and not recorded; otherwise, the alarm record is added to the node.
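The arrival-time check described above can be sketched as follows. The class `Source`, the function `on_alarm` and the sample timestamps are hypothetical, Python list references stand in for the address pointers, and "same time window" is approximated as "at most `window` seconds after a recorded preceding alarm", which is an assumption.

```python
from typing import List, Optional

Stamps = List[Optional[float]]  # [nc, cr, nr] timestamps of one alarm source


class Source:
    """An alarm-source node with its timestamp list and associated-source list."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.stamps: Stamps = [None, None, None]
        # References to the timestamp lists of preceding-cause alarm sources.
        self.related: List[Stamps] = []


def on_alarm(node: Source, level_idx: int, now: float, window: float) -> bool:
    """Return True if the alarm is recorded, False if it is merged away."""
    for stamps in node.related:
        # A preceding-cause alarm inside the window explains the new alarm.
        if any(t is not None and 0 <= now - t <= window for t in stamps):
            return False
    node.stamps[level_idx] = now
    return True
```

Because `related` holds references rather than copies, an update to the cause's timestamp list is visible to every effect node without further bookkeeping; the per-alarm cost grows with the length of that list, which is expected to stay short.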
In summary, the method of this embodiment first provides an alarm deduplication method based on level correlation, which enables the CMU to dynamically remove redundancy from the alarms reported by the BMU. It further provides an alarm convergence support tree structure, an adaptation of the traditional dictionary tree for alarm convergence; by adding an alarm time list and an associated alarm pointer linked list, this structure achieves efficient alarm deduplication and convergence based on shared memory, with a time cost of O(1) for both deduplication and convergence. On this basis, an alarm convergence method based on spatio-temporal correlation is also provided: the support of different alarm combinations is mined by means of the Apriori algorithm, a spatio-temporal confidence evaluation index is designed for the first time, and the alarm associations are screened. The method of this embodiment has the following advantages:
1. The alarm management process is efficient: both methods in this embodiment are implemented on the proposed alarm convergence support tree, which is maintained in shared memory and can be accessed by multiple alarm management processes. The three-element list and the pointer linked list allow the method to complete dynamic alarm deduplication and merging with O(1) time complexity, so that a resource-constrained SMU can flexibly extend its monitoring capability to larger-scale server systems. Meanwhile, the whole process does not require operation and maintenance personnel to preset numerous rules or intervene frequently, which reduces operation and maintenance labor costs.
2. The method is lightweight and interpretable: this embodiment runs in the SMU operating environment without complex deep model training or inference, and the operating cost is concentrated in a small amount of memory used to maintain the alarm convergence support tree structure. Meanwhile, the deduplication and convergence processes strictly follow level correlation and spatio-temporal correlation, and the related processes can be synchronously recorded in a database or a log, which facilitates operation and maintenance tracing and evaluation.
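The support tree and the level-based deduplication summarized above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the tree is built per character (the source only says names are organized by prefix), the sensor names in the usage note are made up, shared-memory placement is not modeled, and `report` covers only the branch in which a higher-level alarm arriving within T_thres replaces its nearest-neighbor alarm, with every other case simply recorded.

```python
class Node:
    def __init__(self):
        self.children = {}
        self.is_source = False  # binary identifier: does root-to-here spell a sensor name?
        self.times = None       # three-element timestamp list once is_source is True
        self.assoc = []         # associated-alarm pointer list (unused in this sketch)


class SupportTree:
    def __init__(self, sensor_names):
        self.root = Node()
        for name in sensor_names:
            node = self.root
            for ch in name:
                node = node.children.setdefault(ch, Node())
            node.is_source = True
            # Index 0: non-important, 1: important, 2: unrecoverable.
            node.times = [None, None, None]

    def find(self, name):
        """Return the alarm source node for `name`, or None if it is not a source."""
        node = self.root
        for ch in name:
            node = node.children.get(ch)
            if node is None:
                return None
        return node if node.is_source else None

    def report(self, name, level, ts, t_thres):
        """Record an alarm; return True if an older, lower-level alarm was removed."""
        node = self.find(name)
        if node is None:
            raise KeyError(name)
        recorded = [(t, lv) for lv, t in enumerate(node.times) if t is not None]
        deduped = False
        if recorded:
            ngb_ts, ngb_level = max(recorded)  # nearest neighbor: latest timestamp
            if level > ngb_level and ts - ngb_ts <= t_thres:
                node.times[ngb_level] = None   # delete the nearest-neighbor alarm
                deduped = True
        node.times[level] = ts                 # replaces any same-level timestamp
        return deduped
```

For example, with hypothetical sensors `blade1.cpu.temp` and `blade1.cpu.volt`, a non-important alarm followed a few seconds later by an unrecoverable alarm on the same sensor leaves only the unrecoverable timestamp in the list. Lookup cost depends on the length of the sensor name rather than on the number of alarms recorded.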
In addition, this embodiment also provides a multi-unit server lightweight alarm correlation mining and convergence system, which comprises a microprocessor and a memory connected to each other, wherein the microprocessor is programmed or configured to execute the multi-unit server lightweight alarm correlation mining and convergence method. Furthermore, this embodiment also provides a computer-readable storage medium storing a computer program which, when executed by a microprocessor, performs the multi-unit server lightweight alarm correlation mining and convergence method. Furthermore, this embodiment also provides a computer program product comprising a computer program/instructions which, when executed by a processor, perform the multi-unit server lightweight alarm correlation mining and convergence method.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-readable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein. The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks. 
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above examples, and all technical solutions belonging to the concept of the present invention belong to the protection scope of the present invention. It should be noted that modifications and adaptations to the present invention may occur to one skilled in the art without departing from the principles of the present invention and are intended to be within the scope of the present invention.

Claims (9)

1. A multi-unit server lightweight alarm correlation mining and convergence method, comprising:
S101, receiving a new alarm A_new;
S102, finding, in an alarm timestamp list, the nearest-neighbor alarm A_ngb with the latest time;
S103, judging whether the alarm level of the new alarm A_new is greater than that of the nearest-neighbor alarm A_ngb; if the alarm level of the new alarm A_new is greater than that of the nearest-neighbor alarm A_ngb, judging whether the time interval between the new alarm A_new and the nearest-neighbor alarm A_ngb is within a preset time window length T_thres; if the time interval is within the preset time window length T_thres, deleting the nearest-neighbor alarm A_ngb and recording the new alarm A_new, thereby completing one alarm deduplication; otherwise, performing no alarm deduplication and recording the new alarm A_new, wherein, if a timestamp already exists for the alarm level of the new alarm A_new, the alarm of the same alarm level is replaced;
wherein the alarm level of the new alarm A_new and the alarm level of the nearest-neighbor alarm A_ngb are recorded by an alarm convergence support tree; the alarm convergence support tree takes all sensors in the multi-unit server system as alarm sources, whose names are organized by prefix according to physical device; each generated alarm is a two-tuple of a sensor character string and an alarm level; all sensor character strings are encoded into a tree structure in which the strings are arranged from left to right in dictionary order within each layer and sensors with the same prefix share a common parent node; each node carries a binary identifier indicating whether the character string formed from the root node to the current node corresponds to a sensor alarm source; each leaf node or internal node identified as an alarm source is linked to a three-level alarm timestamp list, which is a three-element list whose elements record, in order, the timestamps of the most recent alarms at the three alarm levels non-important, important, and unrecoverable, the element for a level with no alarm being empty;
step S103 further comprises performing associated alarm convergence based on spatio-temporal correlation: S201, dividing the alarms in the alarm convergence support tree into a plurality of sets according to occurrence time, and regarding alarms occurring in the same time window as co-occurring, thereby obtaining a plurality of co-occurrence alarm sets, each alarm set comprising at least one alarm; S202, calculating the support of the 1-item sets and the support of the 2-item sets of different alarms by using the Apriori algorithm, wherein a 1-item set is a single alarm and a 2-item set consists of two alarms; S203, calculating the spatio-temporal confidence of each potential alarm association according to the weights preset according to the proximity relation of the sensors, the support of the 1-item sets, and the support of the 2-item sets; S204, deleting from the alarm convergence support tree the alarm associations whose spatio-temporal confidence is lower than a preset threshold, and updating the alarm convergence support tree.
2. The multi-unit server lightweight alarm correlation mining and convergence method of claim 1, wherein step S103 further comprises: if the alarm level of the new alarm A_new is less than or equal to that of the nearest-neighbor alarm A_ngb, judging whether the time interval between the new alarm A_new and the nearest-neighbor alarm A_ngb is within the preset time window length T_thres; if so, ignoring the nearest-neighbor alarm A_ngb, thereby completing one alarm deduplication; otherwise, performing no alarm deduplication and recording the new alarm A_new, wherein, if a timestamp already exists for the alarm level of the new alarm A_new, the alarm of the same alarm level is replaced.
3. The multi-unit server lightweight alarm correlation mining and convergence method of claim 1 wherein the alarm convergence support tree is stored in a shared memory to support updates and accesses to structures by different alarm source processing processes.
4. The multi-unit server lightweight alarm correlation mining and convergence method of claim 3, wherein the alarm convergence support tree does not keep a full record of alarms; instead, a callback function is set to record the alarms in the alarm timestamp list to a database whenever the list is updated, or periodically.
5. The multi-unit server lightweight alarm correlation mining and convergence method of claim 1, wherein the functional expression for calculating the spatio-temporal confidence of each potential alarm association in step S203 is:
In the above formula, the left-hand side denotes the spatio-temporal confidence of the association between alarms A_i and A_j; min denotes taking the minimum value; S_ij is the weight set according to the proximity relation between alarms A_i and A_j; the two 2-item-set support terms are, respectively, the support of the 2-item set formed by A_i and A_j and the support of a 2-item set that includes an arbitrary alarm with which an association exists; and the 1-item-set support term is the support of the 1-item set of a single alarm.
6. The multi-unit server lightweight alarm correlation mining and convergence method according to claim 5, wherein the preset weight setting according to the proximity relation of the sensor in step S203 satisfies the following constraint relation: the proximity weight between different sensors of the same device > the proximity weight of the adjacent device sensor > the proximity weight of the spacing device sensor.
7. A multi-unit server lightweight alarm correlation mining and convergence system comprising a microprocessor and a memory interconnected, wherein the microprocessor is programmed or configured to perform the multi-unit server lightweight alarm correlation mining and convergence method of any one of claims 1-6.
8. A computer-readable storage medium having a computer program stored therein, wherein the computer program, when executed by a microprocessor, performs the multi-unit server lightweight alarm correlation mining and convergence method of any one of claims 1-6.
9. A computer program product comprising a computer program/instructions, characterized in that the computer program/instructions, when executed by a processor, perform the multi-unit server lightweight alarm correlation mining and convergence method of any one of claims 1 to 6.
CN202410230348.3A 2024-02-29 2024-02-29 Multi-unit server lightweight alarm correlation mining and converging method and system Pending CN117806916A (en)

Publications (1)

Publication Number Publication Date
CN117806916A true CN117806916A (en) 2024-04-02

Family

ID=90428253

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102098175A (en) * 2011-01-26 2011-06-15 浪潮通信信息系统有限公司 Alarm association rule obtaining method of mobile internet
WO2016029570A1 (en) * 2014-08-28 2016-03-03 北京科东电力控制系统有限责任公司 Intelligent alert analysis method for power grid scheduling
CN110399262A (en) * 2019-06-17 2019-11-01 平安科技(深圳)有限公司 O&M monitoring alarm convergence method, device, computer equipment and storage medium
CN111767195A (en) * 2020-09-02 2020-10-13 江苏达科云数据科技有限公司 Intelligent noise reduction processing method for alarm information
CN111880999A (en) * 2020-07-30 2020-11-03 中国人民解放军国防科技大学 High-availability monitoring management device for high-density blade server and redundancy switching method
CN112181712A (en) * 2020-09-28 2021-01-05 中国人民解放军国防科技大学 Method and device for improving reliability of processor core
CN113225337A (en) * 2021-05-07 2021-08-06 广州大学 Multi-step attack alarm correlation method, system and storage medium
CN115776409A (en) * 2023-01-29 2023-03-10 信联科技(南京)有限公司 Industrial network security event basic data directional acquisition method and system
CN116185758A (en) * 2022-12-21 2023-05-30 浪潮云信息技术股份公司 Alarm data convergence method based on sliding window and association rule analysis
CN116527481A (en) * 2022-12-19 2023-08-01 武汉科技大学 Network alarm association rule mining and fault positioning method and system based on statistics
CN117221087A (en) * 2023-10-07 2023-12-12 中国联合网络通信集团有限公司 Alarm root cause positioning method, device and medium
CN117221078A (en) * 2023-09-06 2023-12-12 中国联合网络通信集团有限公司 Association rule determining method, device and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
柴思跃;苏奋振;周成虎;: "基于周期表的时空关联规则挖掘方法与实验", 地球信息科学学报, no. 04, 31 August 2011 (2011-08-31), pages 455 - 464 *
郑明玲;蒋句平;袁远;李宝峰;: "一种面向大规模计算机的监控管理系统", 湖南大学学报(自然科学版), no. 04, 25 April 2015 (2015-04-25), pages 107 - 113 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination