CN111338915B - Dynamic alarm grading method and device, electronic equipment and storage medium

Info

Publication number: CN111338915B
Application number: CN202010411127.8A
Authority: CN
Inventors: 赵能文; 刘大鹏; 隋楷心; 张文池
Original assignee: Beijing Bishi Technology Co ltd
Current assignee: Beijing Bishi Technology Co ltd
Priority date: 2020-05-15
Filing date: 2020-05-15
Publication date: 2020-09-01
Anticipated expiration: 2040-05-15
Also published as: CN111338915A

Abstract

The invention discloses a dynamic alarm grading method and device, electronic equipment and a computer-readable storage medium. The method comprises the following steps: training the sequencing model by using the historical data of the alarm to obtain a training model; and sequencing the on-line data of the alarm by using the training model to obtain an alarm rating. The invention is the first to model the dynamic alarm grading problem as a machine-learning-based sequencing (ranking) problem; it gives the severity grading of alarms online and adaptively based on the training model, with high grading accuracy, so that engineers can handle serious alarms first according to the alarm grading, which improves the efficiency of fault resolution.

Description

Dynamic alarm grading method and device, electronic equipment and storage medium

Technical Field

The present application relates to the field of system alarms, and in particular, to a dynamic alarm ranking method, apparatus, electronic device, and computer-readable storage medium.

Background

A large online service system consists of many components to support a large number of concurrent users. In order to ensure the service quality and the user experience, various monitoring data such as indexes, logs, call chains and the like need to be collected from each component, and a plurality of alarm rules are manually set, so that an alarm is generated once the monitoring data violates the alarm rules (for example, the CPU utilization rate exceeds 80%, a fail keyword appears in a log file and the like), and is sent to an engineer for checking. If the alarm is severe, the engineer may create a work order for troubleshooting and diagnosis. The alarm data may contain a number of attributes such as alarm time, alarm content, alarm type, alarm source system, alarm source machine, alarm level, and alarm off time.

Due to the complexity and dynamics of online services, the system may concurrently generate a large number of alarms, beyond the processing capacity of the engineers. Therefore, in practice, classification rules are often defined manually, and alarms are classified into different priorities (e.g., P1-error, P2-warning, P3-info; CPU utilization over 90% is P1 and over 70% is P2). Engineers are mainly concerned with the highest level of alarms, i.e., critical alarms. Even so, the number of critical alarms triggered is still large. In addition, manually defined and maintained rules are difficult to hold to a unified standard and consume manpower; rule-based alarm grading is not very accurate and cannot adapt to dynamic changes of the system.

Disclosure of Invention

In order to solve the above technical problems, embodiments of the present invention provide an accurate and adaptive alarm ranking algorithm, which can automatically rank the severity of a large number of concurrent alarms, and preferentially recommend the severe alarms to an engineer, thereby helping the engineer to quickly find a potential fault and reduce fault repairing time.

One aspect of the present invention provides a dynamic alarm ranking method, which includes the following steps:

training the sequencing model by using the historical data of the alarm to obtain a training model; and

sequencing the on-line data of the alarm by using the training model to obtain an alarm rating.

Optionally, the historical data includes a work order, alarm data and index data;

the step of training the sequencing model by using the historical data of the alarm to obtain a training model comprises the following steps:

extracting the label of the work order;

extracting alarm characteristics of the alarm data, extracting index characteristics of the index data, and combining the alarm characteristics and the index characteristics to obtain a characteristic vector; and

inputting the labels and the feature vectors into the sequencing model, and training the sequencing model to obtain the training model.

Optionally, the online data includes online alarm data and online index data;

the step of using the training model to sequence the on-line data of the alarm to obtain the alarm rating comprises the following steps:

extracting alarm characteristics of the online alarm data, extracting index characteristics of the online index data, and combining the alarm characteristics and the index characteristics to obtain an online characteristic vector; and

inputting the online characteristic vector into the training model to obtain the alarm rating.

Optionally, the alarm feature includes at least one of the following features: text features, text entropy, timing features; wherein the text features are obtained using a learning-based biterm topic model (BTM); the text entropy is calculated by adopting Inverse Document Frequency (IDF); the timing features comprise the alarm frequency, the alarm period, the alarm quantity in unit time or the alarm interval time;

the index features are obtained by adopting a Long Short-Term Memory (LSTM) network-based multi-time-series anomaly detection algorithm.

In another aspect of the present invention, there is provided a dynamic alarm rating device, including:

the off-line training module is used for training the sequencing model by using the historical data of the alarm to obtain a training model; and

the online sequencing module is used for sequencing the online data of the alarm by using the training model to obtain the alarm rating.

the offline training module comprises:

the label extraction module is used for extracting the label of the work order;

the characteristic vector extraction module is used for extracting the alarm characteristics of the alarm data, extracting the index characteristics of the index data and combining the alarm characteristics and the index characteristics to obtain a characteristic vector; and

the model training module is used for inputting the labels and the feature vectors into the sequencing model and training the sequencing model to obtain the training model.

Optionally, the online data includes online alarm data and online index data;

the online ranking module comprises:

the online characteristic vector extraction module is used for extracting the alarm characteristics of the online alarm data, extracting the index characteristics of the online index data and combining the alarm characteristics and the index characteristics to obtain an online characteristic vector; and

the grading module is used for inputting the online characteristic vector into the training model to obtain the alarm grading.

Another aspect of the present invention is to provide an electronic device, including:

at least one processor; and

a memory coupled to the at least one processor; wherein,

the memory stores a computer program executable by the at least one processor to implement the method of the present invention.

Another aspect of the present invention is to provide a computer-readable storage medium, in which a computer program is stored, which, when executed, is capable of implementing the method of the present invention.

According to the method, work orders and alarm handling records in the historical data are used to automatically label each historical alarm with a severity score, and, based on the ideas of data fusion and feature fusion, a series of interpretable and physically meaningful features are extracted from the alarm data and index data to represent the severity of the alarm. The invention is the first to model the dynamic alarm grading problem as a machine-learning-based sequencing (ranking) problem; it gives the severity grading of alarms online and adaptively based on the training model, with high grading accuracy, so that engineers can handle serious alarms first according to the alarm grading, which improves the efficiency of fault resolution.

Drawings

FIG. 1 is a flow chart of a dynamic alarm ranking method in an embodiment of the invention;

FIG. 2 is a flow chart of a dynamic alarm ranking method in an embodiment of the invention;

FIG. 3 is a flowchart illustrating the step S1 of the dynamic alarm ranking method in accordance with an embodiment of the present invention;

FIG. 4 is a flowchart illustrating the step S2 of the dynamic alarm ranking method in accordance with an embodiment of the present invention;

FIG. 5a is a bar graph of the alarm severity score (Severity score) corresponding to the 14 topics (Topic) in an embodiment of the present invention;

FIG. 5b is a line graph of the alarm severity score (Severity score) corresponding to text entropy (Entropy) in an embodiment of the present invention;

FIG. 5c is a line graph of the alarm severity score (Severity score) corresponding to business system anomaly scores (Multivariate error of business KPIs) and machine anomaly scores (Multivariate error of server KPIs) in an embodiment of the present invention;

FIG. 6 is a block diagram of a dynamic alarm rating device in an embodiment of the present invention;

FIG. 7 is a block diagram of an offline training module in an embodiment of the present invention;

FIG. 8 is a block diagram of an online ranking module in an embodiment of the invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, specific embodiments of the present invention will be described in detail below with reference to the accompanying drawings.

As shown in FIG. 1, the technical solution of the present invention is divided into two stages of off-line training and on-line sequencing. In the off-line training stage, because a work order is generally created for serious alarm and historical alarm processing records are stored, the severity score of the historical alarm data can be obtained as a label according to the work order in the historical data; based on the ideas of data fusion and feature fusion, respectively extracting features from alarm data and index data of historical data to obtain feature vectors, wherein the feature vectors can be used for representing the abnormal degree of the alarm; and training the sequencing model according to the obtained labels and the feature vectors to obtain a training model. In the on-line sequencing stage, the feature vectors of the alarm data arriving in real time are obtained by the same feature extraction method and input into the learned sequencing model, and the sequencing model can output the severity ranking of a large number of concurrent alarms at the current moment, so that an engineer can be guided to preferentially process the alarms with higher severity.

According to one aspect of the invention, a dynamic alarm ranking method is provided.

As shown in fig. 2, the method specifically includes the following steps:

S1: training the sequencing model by using the historical data of the alarm to obtain a training model;

S2: sequencing the on-line data of the alarm by using the training model to obtain an alarm rating.

In one embodiment, the historical data of the alarm may include work orders, alarm data, index data, and the like, and the online data of the alarm may include online alarm data, online index data, and the like.

As shown in fig. 3, the S1 step further includes:

S101: extracting the label of the work order;

S102: extracting alarm characteristics of the alarm data, extracting index characteristics of the index data, and combining the alarm characteristics and the index characteristics to obtain a characteristic vector; and

S103: inputting the labels and the feature vectors into the sequencing model, and training the sequencing model to obtain the training model.

As shown in fig. 4, the S2 step further includes:

S201: extracting alarm characteristics of the online alarm data, extracting index characteristics of the online index data, and combining the alarm characteristics and the index characteristics to obtain an online characteristic vector; and

S202: inputting the online characteristic vector into the training model to obtain the alarm rating.
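The two stages can be illustrated with a minimal sketch. The patent does not fix a particular sequencing model, so a pointwise linear scorer trained by gradient descent is used here purely as a stand-in; the function names `train` and `rank`, the two-dimensional feature vectors and all numeric values are hypothetical.

```python
def train(vectors, labels, epochs=2000, lr=0.05):
    """Offline stage (S103): fit a linear scorer by gradient descent on squared error."""
    dim = len(vectors[0])
    n = len(vectors)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(vectors, labels):
            err = sum(wi * xi for wi, xi in zip(w, x)) + b - y
            for j in range(dim):
                grad_w[j] += err * x[j] / n
            grad_b += err / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def rank(model, online):
    """Online stage (S202): alarm ids sorted from most to least severe."""
    w, b = model

    def score(x):
        return sum(wi * xi for wi, xi in zip(w, x)) + b

    return sorted(online, key=lambda alarm_id: score(online[alarm_id]), reverse=True)


# S101/S102: labels from work orders and combined feature vectors (made-up values).
history = [[0.9, 0.8], [0.1, 0.2], [0.5, 0.4], [0.2, 0.1]]
labels = [1.0, 0.1, 0.6, 0.0]
model = train(history, labels)

# S201: feature vectors of concurrently arriving alarms (made-up values).
order = rank(model, {"a": [0.2, 0.3], "b": [0.95, 0.9], "c": [0.5, 0.5]})
print(order)
```

A listwise or pairwise learning-to-rank model could be substituted for the linear scorer without changing the surrounding flow.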

The method for acquiring the label, the alarm characteristic and the index characteristic of the work order is explained in detail below.

First, label

The severity score of the historical alarm data is labeled using the work orders in the historical data. Generally, for historical alarms, engineers review the alarms, contact the responsible administrator for handling, and then record how each alarm was handled; for serious alarms, a work order is created that is followed up and reviewed by the person in charge. The handling records of the work orders are filled in manually, so they reflect the severity of the alarm truthfully and reliably. The alarm handling records can be clustered using a text clustering method based on TF-IDF and k-means, and the alarms in each resulting cluster are assigned a unified severity score, which serves as the label. Several categories of alarm handling records and their corresponding severity scores are presented below, with 1 being the highest severity and 0 the lowest. For example, if a work order is created for a certain alarm, the engineer has judged the alarm to be of high severity; if an alarm is on the white list or has no handling record, its severity is low.

First, None (0)

Second, alarm in white list (0.1)

Third, the alarm has been automatically restored (0.2)

Fourth, contact the application manager, confirm the impact on the business (0.4)

Fifth, for known reasons, the alarm is fixed (0.6)

Sixth, contact the application manager, having an impact on the business, is now repaired (0.8)

Seventh, create event ticket, continue follow-up (1)
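Once the handling records have been clustered, labeling reduces to a lookup from cluster to score. The following minimal sketch assumes the clustering has already been done; the cluster names are hypothetical, while the scores are the seven values listed above.

```python
# Hypothetical cluster names; the scores are the ones listed above.
SEVERITY_BY_CLUSTER = {
    "none": 0.0,
    "whitelisted": 0.1,
    "auto_recovered": 0.2,
    "manager_contacted_impact_checked": 0.4,
    "known_cause_fixed": 0.6,
    "business_impact_repaired": 0.8,
    "ticket_created": 1.0,
}


def label_alarms(cluster_of_alarm):
    """Map each alarm id to the severity score of its handling-record cluster."""
    return {
        alarm_id: SEVERITY_BY_CLUSTER[cluster]
        for alarm_id, cluster in cluster_of_alarm.items()
    }


labels = label_alarms({"a1": "ticket_created", "a2": "whitelisted", "a3": "none"})
print(labels)
```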

Second, alarm feature

Before alarm features are extracted, the alarm data must be preprocessed. If the alarm data is text that mixes Chinese and English, the alarm content needs to be segmented with a Chinese word-segmentation tool. In addition, the alarm content is generally semi-structured text containing many stop words, symbols and variables. Therefore, meaningless stop words and symbols are removed, and the alarm data is then processed with a log template extraction method that filters out the variables to obtain the alarm template. Thereafter, the alarm features may be extracted from the following aspects:

First, topic features

The alarm template can be regarded as a short text in the operation and maintenance field. Different alarm contents carry different semantic information, and different semantics often correspond to different severities. Therefore, the invention applies topic models, which are popular in natural language processing, to alarm data for the first time to extract hidden semantic features from the alarm content. Because alarm content is short text, the conventional LDA (Latent Dirichlet Allocation) topic model is not ideal when used directly; therefore, the Biterm Topic Model (BTM), which is designed for short text, may be used. Preprocessing the alarm data overcomes some of its problems (such as mixed Chinese and English, stop words and the like), so that the BTM can mine interpretable semantic information. Given a topic number n, the BTM can find hidden topics and the keywords corresponding to each topic, and the optimal topic number is selected according to the coherence score. The topics learned by the BTM from actual alarm data and the corresponding keywords (n = 14) are shown below.

T # 1: oracle, connection, database, space, pool, process, lock.

T # 2: syslog, alarm, error, stack, record, hardware, alarm.

T # 3: monitoring, environment, host, battery, humidity, machine room, voltage.

...

T # 13: system, transaction amount, response, threshold, time, traffic, value.

T # 14: switch, virtual, communication, connection, response, ping, network.

It can be seen that the keywords given by the BTM have a certain physical meaning and interpretability; for example, topic 1 (T # 1) can be presumed to relate to database alarms, and topic 2 (T # 2) to system log alarms. For a piece of alarm information, the BTM gives the probability that it belongs to each topic, and these per-topic probabilities form the topic feature of that alarm. For example, for the alarm "oracle table space usage reaches 78% and exceeds the threshold", the BTM gives a topic feature that is a vector of length 14: [0.78, 0.05, 0.02, …, 0.19, 0.04].

Fig. 5a shows the alarm severity scores (Severity score) for the 14 topics (Topic) described above. It can be seen that the topic features help differentiate the severity of an alarm.
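The preprocessing step and the input that a biterm topic model works on can be sketched as follows. This is only an illustration under stated assumptions: the stop-word list and the regular expressions are hypothetical, the sample alarms are made up, and only biterm extraction (the unordered word pairs that BTM models across the corpus) is shown, not topic inference itself.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical stop-word list; a real deployment would use a domain-specific one.
STOP_WORDS = {"the", "a", "an", "is", "of", "on", "and", "has", "to"}


def to_template(alarm_text):
    """Lower-case the text, filter out variables, symbols and stop words."""
    text = alarm_text.lower()
    # Remove IPv4 addresses first so their digits are not handled as plain numbers.
    text = re.sub(r"\b\d{1,3}(?:\.\d{1,3}){3}\b", " ", text)
    # Remove numbers, including decimals and percentages.
    text = re.sub(r"\d+(?:\.\d+)?%?", " ", text)
    # Replace remaining punctuation and symbols with spaces.
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(t for t in text.split() if t not in STOP_WORDS)


def biterms(tokens):
    """Return all unordered word pairs (biterms) co-occurring in one short text."""
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


alarms = [
    "Oracle table space usage on 10.0.3.17 reaches 78% and exceeds the threshold",
    "Oracle connection pool on 10.0.3.18 is exhausted",
]
templates = [to_template(a) for a in alarms]
biterm_counts = Counter(b for t in templates for b in biterms(t.split()))
print(templates)
```

Counting word pairs over the whole corpus, rather than word counts per document, is what lets this family of models cope with very short texts such as alarm templates.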

Second, text entropy

The alarm content is often a combination of words, different words having different weights in identifying serious alarms, e.g. "break" may be more serious than "port". Therefore, the text entropy of the alarm data can be extracted to measure the severity degree of the alarm. In the text mining technology, Inverse Document Frequency (IDF) is a commonly used method for measuring the importance of words, and the method can reduce the weight of commonly used words and increase the weight of rare words. For the word ω, its word entropy is calculated:

$$\mathrm{IDF}(\omega) = \log\frac{N}{N_\omega}$$

where N is the number of all alarms and N_ω is the number of alarms containing the word ω. By this definition, words that appear frequently across alarms have lower word entropy and indicate lower severity. The text entropy of an alarm is the average of the word entropies of all the words it contains.
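A minimal sketch of the text entropy computation, assuming the word entropy is the usual inverse document frequency log(N / N_ω) with a natural logarithm (the base is an assumption). The function names and the sample alarms are hypothetical.

```python
import math


def word_entropy(alarms):
    """IDF of every word: log(N / N_w), with N alarms and N_w alarms containing w."""
    n = len(alarms)
    doc_freq = {}
    for tokens in alarms:
        for w in set(tokens):
            doc_freq[w] = doc_freq.get(w, 0) + 1
    return {w: math.log(n / df) for w, df in doc_freq.items()}


def text_entropy(tokens, idf):
    """Average word entropy of one alarm; words never seen before are skipped."""
    values = [idf[w] for w in tokens if w in idf]
    return sum(values) / len(values) if values else 0.0


alarms = [["port", "down"], ["port", "usage", "high"], ["port", "break"], ["port", "usage"]]
idf = word_entropy(alarms)
print(text_entropy(["port", "break"], idf), text_entropy(["port", "usage"], idf))
```

In this toy corpus "port" appears in every alarm and so contributes nothing, while the rare word "break" raises the entropy of the alarm that contains it.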

FIG. 5b illustrates the alarm severity score (Severity score) as a function of text entropy (Entropy). It can be seen that as the text entropy increases, the corresponding alarm severity score also tends to increase.

Third, timing characteristics

The timing characteristics may include the frequency of alarm occurrences, the period, the number of alarms, the interval time, etc.

Frequency: generally, the more frequently an alarm has occurred historically (e.g., CPU usage exceeding a threshold value), the less severe it is; conversely, if an alarm has been infrequent historically (low-frequency alarms, such as a server going down), its severity may be relatively high and it requires the engineers' attention.

Period: some alarms occur periodically; for example, batch processing tasks that run every night typically cause redundant high-CPU-utilization alarms. For an alarm a, we can obtain the time series c(a) = {c^1(a), c^2(a), …, c^k(a)}, where c^k(a) is the number of times the alarm occurred in the k-th time slice. Obviously, if the alarm a is periodic, the corresponding time series c(a) is also periodic.

The periodicity of the time series is characterized by the autocorrelation function (ACF). Given a time series x(i) of length N, ACF(l) is calculated for different lag times l.

If the time series is periodic with period T, then ACF(l) shows sharp peaks at l = T, 2T, 3T, …. Thus the maximum value of ACF(l) can be used to characterize the periodicity of the alarm.
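The periodicity feature can be sketched as follows. The exact ACF definition used by the invention is not reproduced here; the sketch assumes the common mean-removed, variance-normalized sample autocorrelation, and the function names and the sample series are hypothetical.

```python
def acf(x, lag):
    """Sample autocorrelation of x at the given lag (mean removed, variance normalized)."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    if var == 0:
        return 0.0
    cov = sum((x[i] - mean) * (x[i + lag] - mean) for i in range(n - lag))
    return cov / var


def periodicity(x, max_lag=None):
    """Return (best_lag, peak), where peak is the maximum ACF over lags 1..max_lag."""
    max_lag = max_lag or len(x) // 2
    scores = {lag: acf(x, lag) for lag in range(1, max_lag + 1)}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Alarm counts per time slice: a burst of 5 alarms in every fourth slice.
counts = [0, 0, 0, 5] * 6
lag, peak = periodicity(counts)
print(lag, round(peak, 3))
```

For this series the strongest peak falls at the true period of four slices, and weaker peaks appear at its multiples because fewer overlapping samples remain at larger lags.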

Alarm quantity: generally, when a large number of alarms occur in a short time, a serious malfunction may have occurred, requiring the attention of an engineer.

Interval time: the interval time is the time between the current alarm and the previous alarm. Engineer attention is often required if an alarm suddenly appears after a long period without alarms.
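The frequency, alarm quantity and interval time features can be derived from a sorted list of earlier alarm timestamps. In this minimal sketch the function name, the 300-second window and the sample timestamps are hypothetical choices.

```python
from bisect import bisect_left


def timing_features(history, now, window=300):
    """Timing features of an alarm raised at `now`.

    history: sorted timestamps (in seconds) of earlier alarm occurrences.
    """
    lo = bisect_left(history, now - window)
    hi = bisect_left(history, now)
    return {
        "frequency": hi,                                    # occurrences before `now`
        "recent_count": hi - lo,                            # occurrences in [now - window, now)
        "interval": now - history[hi - 1] if hi else None,  # gap to the previous alarm
    }


features = timing_features([100, 160, 900, 950, 980], now=1000)
print(features)
```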

Fourth, other features

In addition to the above-mentioned feature types, some key features may also be extracted from some other attributes of the alarm data itself, such as:

Rule-based severity: the rule-based severity of the alarm data has certain reference significance for alarm grading;

Alarm time: the time at which an alarm occurs also affects its severity; for example, alarms during peak business periods are relatively more important. From the time the alarm is raised, a series of time features can be extracted, such as whether it falls on a workday or holiday, during the day or night, or within a peak business period;

Alarm type: generally, application-class alarms are important because they are strongly related to the quality of service.
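The time features derived from the alarm time can be sketched as below. The holiday calendar, the peak hours and the night-time boundaries are hypothetical placeholders that would come from the business in practice.

```python
from datetime import datetime

# Hypothetical calendar and peak hours; both would come from the business in practice.
HOLIDAYS = {(1, 1), (10, 1)}
PEAK_HOURS = range(9, 12)


def time_features(ts):
    """Workday/holiday, day/night and peak-period flags for one alarm timestamp."""
    is_holiday = (ts.month, ts.day) in HOLIDAYS
    return {
        "is_workday": ts.weekday() < 5 and not is_holiday,
        "is_night": ts.hour < 6 or ts.hour >= 22,
        "is_peak": ts.hour in PEAK_HOURS,
    }


features = time_features(datetime(2020, 5, 15, 10, 30))
print(features)
```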

Third, index features

Some key business indexes can directly reflect the health state of the system. When an alarm occurs, if the key service indexes (such as transaction amount, response time, success rate and the like) of the corresponding service system and the key indexes (such as CPU utilization rate, memory utilization rate and the like) of the machine are greatly abnormal, the alarm is a serious alarm with high probability. A Long Short Term Memory (LSTM) network-based multi-time sequence anomaly detection algorithm may be employed to capture the anomaly scores corresponding to each key business index and machine index and the anomaly scores of the overall business system as the index features.

FIG. 5c shows the alarm severity scores (Severity score) corresponding to business system anomaly scores (Multivariate error of business KPIs) and machine anomaly scores (Multivariate error of server KPIs). It can be seen that the higher the system's anomaly score, the higher the severity of the alarm.
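The idea of an anomaly score as a prediction error can be illustrated with a small sketch. It deliberately does not implement an LSTM: a moving-average forecast stands in for the learned predictor, and the function name, window size and sample response-time series are hypothetical.

```python
def anomaly_scores(series, window=5):
    """Relative prediction error for each point after the first `window` points."""
    scores = []
    for i in range(window, len(series)):
        past = series[i - window:i]
        predicted = sum(past) / window  # stand-in for the LSTM forecast
        scores.append(abs(series[i] - predicted) / max(abs(predicted), 1e-9))
    return scores


# Response time (ms) of a business system; the last point is a sudden spike.
response_time = [100, 102, 98, 101, 99, 100, 400]
scores = anomaly_scores(response_time)
print(scores)
```

The score at the moment an alarm fires, computed per key index and for the system as a whole, would then be appended to the alarm's feature vector.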

According to one aspect of the present invention, a dynamic alarm rating device is provided.

As shown in fig. 6, the apparatus includes:

the off-line training module 10 is used for training the sequencing model by using the historical data of the alarm to obtain a training model; and

the online sequencing module 20 is configured to sequence the online data of the alarm by using the training model to obtain an alarm rating.

In one embodiment, the historical data includes work orders, alarm data, and indicator data.

As shown in fig. 7, the offline training module 10 includes:

a label extraction module 101, configured to extract a label of the work order;

a feature vector extraction module 102, configured to extract an alarm feature of the alarm data, extract an index feature of the index data, and combine the alarm feature and the index feature to obtain a feature vector; and

the model training module 103 is configured to input the labels and the feature vectors into the sequencing model, and train the sequencing model to obtain the training model.

In one embodiment, the online data includes online alarm data and online indicator data.

As shown in fig. 8, the online ranking module 20 includes:

an online feature vector extraction module 201, configured to extract an alarm feature of the online alarm data, extract an index feature of the online index data, and combine the alarm feature and the index feature to obtain an online feature vector; and

the grading module 202 is used for inputting the online feature vector into the training model to obtain the alarm grading.

According to another aspect of the present invention, there is provided an electronic apparatus, comprising:

at least one processor; and

a memory coupled to the at least one processor; wherein,

the memory stores a computer program executable by the at least one processor to implement the method of the present invention.

According to another aspect of the present invention, there is provided a computer-readable storage medium, wherein a computer program is stored in the computer-readable storage medium, and when the computer program is executed, the method of the present invention can be implemented.

Those of ordinary skill in the art will appreciate that the various illustrative modules and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses and devices may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is merely a logical division, and in actual implementation, there may be other divisions, for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.

The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present invention.

In addition, functional modules in the embodiments of the present invention may be integrated into one processing module, or each of the modules may exist alone physically, or two or more modules are integrated into one module.

The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.

The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by a person skilled in the art that the scope of the invention as referred to in the present application is not limited to the embodiments with a specific combination of the above-mentioned features, but also covers other embodiments with any combination of the above-mentioned features or their equivalents without departing from the inventive concept. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims

1. A dynamic alarm rating method, comprising the steps of:

sequencing the on-line data of the alarm by using the training model to obtain an alarm rating;

the online data comprises online alarm data and online index data;

inputting the online characteristic vector into the training model to obtain the alarm rating;

the historical data comprises work orders, alarm data and index data;

extracting the label of the work order;

2. The dynamic alarm rating method of claim 1,

the alarm feature comprises at least one of the following features: text features, text entropy and timing features; wherein the text features are obtained using a learning-based biterm topic model (BTM); the text entropy is calculated by adopting Inverse Document Frequency (IDF); the timing features comprise the alarm frequency, the alarm period, the alarm quantity in unit time or the alarm interval time;

3. A dynamic alarm rating device, the device comprising:

the online sequencing module is used for sequencing the online data of the alarm by using the training model to obtain an alarm rating;

the online data comprises online alarm data and online index data;

the online ranking module comprises:

the grading module is used for inputting the online feature vector into the training model to obtain the alarm grading;

the historical data comprises work orders, alarm data and index data;

the offline training module comprises:

the label extraction module is used for extracting the label of the work order;

4. The dynamic alarm rating device of claim 3,

the alarm feature comprises at least one of the following features: text features, text entropy and timing features; wherein the text features are obtained using a learning-based biterm topic model (BTM); the text entropy is calculated by adopting Inverse Document Frequency (IDF); the timing features comprise the alarm frequency, the alarm period, the alarm quantity in unit time or the alarm interval time;

5. An electronic device, comprising:

at least one processor; and

a memory coupled to the at least one processor; wherein,

the memory stores a computer program executable by the at least one processor to implement the method of any one of claims 1-2.

6. A computer-readable storage medium, in which a computer program is stored which, when executed, is capable of carrying out the method of any one of claims 1-2.