CN113806180A

CN113806180A - Unsupervised intelligent noise reduction processing method

Info

Publication number: CN113806180A
Application number: CN202111117474.0A
Authority: CN
Inventors: 雷建椿
Original assignee: Tengyun Yuezhi Technology Shenzhen Co ltd
Current assignee: Tengyun Yuezhi Technology Shenzhen Co ltd
Priority date: 2021-09-23
Filing date: 2021-09-23
Publication date: 2021-12-17
Anticipated expiration: 2041-09-23
Also published as: CN113806180B

Abstract

The invention discloses an unsupervised intelligent noise reduction processing method that enables operation and maintenance personnel to make sense of an alarm storm, saves labor cost, and quickly locates the root cause alarm, so that the problem can be solved before a larger accident occurs. The method first uses the correlation among alarms to cluster them without supervision, then applies business rules and expert experience to perform root cause analysis, and finally absorbs the feedback of operation and maintenance personnel automatically, so that the algorithm becomes more accurate the more it is used. Operation and maintenance personnel can thereby focus on the actual problem and solve it, and the method has the property of self-learning improvement.

Description

Unsupervised intelligent noise reduction processing method

Technical Field

The invention relates to the technical field of AIOPS intelligent operation and maintenance systems, in particular to an unsupervised intelligent noise reduction processing method.

Background

Traditional operation and maintenance deals with a single structure, such as application services deployed on one server, where a person only needs to log in and check the application services to judge whether the current services are normal. With the development of big data, operation and maintenance personnel often face many associated service objects, and therefore use operation and maintenance tools to monitor each of them. Because these service objects are associated with one another, a problem in one service can trigger alarms in other services, and an alarm storm forms when a large number of alarms occurs in a short time. When an alarm storm arrives, operation and maintenance personnel often need to spend a large amount of time extracting the effective information. Sometimes the effective information is drowned out by the sheer number of alarms and cannot be extracted at all, which makes the work of the operation and maintenance personnel more difficult.

The existing method builds a statistical model from historical data to predict a threshold for the future number of alarms. When the actual number of alarms exceeds the threshold, alarms are merged using Apriori or policy-based association rules and sent to the same operation and maintenance personnel. This method has some noise reduction value, but has the following defects in specific application scenarios:

1. it lacks root cause alarm localization; in an actual scenario, operation and maintenance personnel care most about the root cause alarm in an alarm storm, because resolving the root cause alarm resolves the alarm storm;

2. the algorithm model has no self-adaptive capacity; it compresses alarms based on historical data only and cannot absorb the experience that operation and maintenance personnel gain each time they subsequently handle an alarm storm.

Accordingly, the prior art is deficient and needs improvement.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provide an unsupervised intelligent noise reduction processing method.

The technical scheme of the invention is as follows: an unsupervised intelligent noise reduction processing method is provided, comprising the following steps:

step 1: generating a model, and performing initial training through five modules in sequence: data requirement, data preprocessing, feature processing, category generation and alarm importance generation, and intelligent noise reduction and root cause output;

step 2: updating the relation matrix and the importance matrix on the basis of initial training, absorbing information contained in a newly added sample, and performing incremental training;

step 3: taking the operation and maintenance personnel's handling of the algorithm results as user feedback information, writing it back to the database, and iterating with the user feedback information as newly added data;

step 4: processing the alarm storm according to the training and iterative learning.

Further, the alarm data in the data requirement in the step 1 is collected by the cloud platform and stored in the database.

Further, the data preprocessing in step 1 specifically comprises the following steps:

according to business knowledge, NAME + '|' + IP_ADDRESS is used as the alarm primary key, and missing values and abnormal values are processed at the same time: missing values are handled by one of three methods (supplementing, discarding, or retaining), and abnormal values are directly discarded.

Further, the specific steps of the feature processing in step 1 are as follows:

converting the alarm attributes into useful features; the alarm attributes are divided into time attributes and non-time attributes, and a conversion method suited to each kind is adopted.

Further, for the time attribute, the specific processing manner is as follows:

a. extracting the maximum time and the minimum time in all samples;

b. taking the minimum time as the start and the maximum time as the end, dividing the range into time points with a step of one minute; if the alarm's survival period contains the time point, the point is marked '1', and otherwise '0';

further, for non-time attributes, if two alarms belong to the same tenant, the relationship is marked 1.0; if they do not, it is marked 0, and a relation matrix is thereby obtained.

Further, the specific steps of the category generation and the alarm importance generation in step 1 are as follows:

after the feature processing is completed, the vector expression of each alarm is obtained, and cluster analysis is carried out on that basis; a KNN algorithm based on cosine distance screens out the n alarms closest to each alarm to form links, and a threshold is set to filter out links with little correlation. Each alarm is taken as a node and each link as an edge to form a graph, and each connected subgraph in the graph is a category.

Further, the intelligent noise reduction and root cause output in step 1 specifically comprises the following steps:

calculating the corresponding confidence from the classification results and the importance of each alarm:

$$\text{confidence} = \frac{\text{time correlation coefficient} + 1}{4} + \frac{\text{master alarm importance} - \text{slave alarm importance}}{2 \times (\text{maximum importance in class} - \text{minimum importance in class})}$$

Further, the update rule in the incremental training of step 2 is as follows:

if the alarm is new, similarities for which data is available are calculated and filled in directly, similarities for which no data is available are filled with 0, and the importance matrix is handled in the same way;

if the alarm is an original alarm, its new similarity is calculated and averaged with the original similarity, and the importance matrix is handled in the same way.

With this scheme, the invention first uses the correlation among alarms to cluster them without supervision, then applies business rules and expert experience to perform root cause analysis, and finally absorbs the feedback of operation and maintenance personnel automatically, so that the algorithm becomes more accurate the more it is used. Operation and maintenance personnel can thereby focus on the actual problem and solve it, and the method has the property of self-learning improvement.

Drawings

FIG. 1 is a block flow diagram of the present invention.

Fig. 2 is an initial training flowchart.

Fig. 3 is a flow chart of incremental training.

Fig. 4 is a flow chart of user feedback iteration.

Detailed Description

The invention is described in detail below with reference to the figures and the specific embodiments.

Referring to fig. 1, the present invention provides an unsupervised intelligent noise reduction processing method, which includes the following steps:

step 1: referring to fig. 2, a model is generated, and initial training is performed through five modules in sequence: data requirement, data preprocessing, feature processing, category generation and alarm importance generation, and intelligent noise reduction and root cause output.

Data requirements (database):

The alarm data is collected by the cloud platform and stored in the database. In this embodiment, the database includes the following attributes: NAME (event name), CREATE_TIME (creation time), receiver_TIME (recovery time), CLOSE_TIME (closing time), IP_ADDRESS (IP address), age (alarm source), ower (tenant), and company ID.

Data preprocessing:

The alarm primary key is the identity of an alarm. According to business knowledge, NAME + '|' + IP_ADDRESS is used as the alarm primary key. Missing values and abnormal values are processed at the same time: missing values are handled by one of three methods (supplementing, discarding, or retaining), and abnormal values are directly discarded.

1. if the primary key or CREATE_TIME is missing, the alarm sample is directly discarded;

2. if CLOSE_TIME is missing, it is filled in with the recovery time; if the recovery time is also missing, the alarm sample is discarded;

3. if any other field is missing, the record is retained without further processing;

4. abnormal alarm data, such as a creation time later than the closing time, is directly discarded.
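The four preprocessing rules can be sketched as a single function. This is a minimal illustration, not the patented implementation: the dictionary layout, the function name `preprocess`, the `KEY` field, and the `RECOVERY_TIME` key (standing in for the recovery-time attribute) are all assumptions made for the example.

```python
from datetime import datetime
from typing import Optional

FMT = "%Y-%m-%d %H:%M:%S"


def preprocess(record: dict) -> Optional[dict]:
    """Apply the four rules to one raw alarm; return None when it is discarded."""
    name = record.get("NAME")
    ip = record.get("IP_ADDRESS")
    created = record.get("CREATE_TIME")
    # Rule 1: a missing primary-key part or CREATE_TIME discards the sample.
    if not name or not ip or not created:
        return None
    # Rule 2: fall back to the recovery time when CLOSE_TIME is missing
    # (RECOVERY_TIME is a hypothetical key name for that attribute).
    closed = record.get("CLOSE_TIME") or record.get("RECOVERY_TIME")
    if not closed:
        return None
    # Rule 4: a creation time later than the closing time is abnormal.
    if datetime.strptime(created, FMT) > datetime.strptime(closed, FMT):
        return None
    # Rule 3: every other field is kept as-is, even when it is empty.
    cleaned = dict(record)
    cleaned["CLOSE_TIME"] = closed
    cleaned["KEY"] = f"{name}|{ip}"  # alarm primary key: NAME + '|' + IP_ADDRESS
    return cleaned


example = preprocess({
    "NAME": "cpu_high",
    "IP_ADDRESS": "10.0.0.1",
    "CREATE_TIME": "2021-08-09 16:10:37",
    "CLOSE_TIME": None,
    "RECOVERY_TIME": "2021-08-09 16:13:37",
})
```

Returning `None` for discarded samples keeps the caller simple: a list comprehension can filter the stream of raw alarms in one pass.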

Feature processing:

For the time attribute, the specific processing mode is as follows:

a. extracting the maximum time and the minimum time in all samples;

b. taking the minimum time as the start and the maximum time as the end, the range is divided into time points with a step of one minute; if the alarm's survival period contains the time point, the point is marked '1', and otherwise '0'.

TABLE 1 Alarm example

ID	CREATE_TIME	CLOSE_TIME
A	2021-08-09 16:10:37	2021-08-09 16:13:37
B	2021-08-09 16:12:37	2021-08-09 16:18:37

The time features generated from the alarm samples are as follows (dates omitted):

TABLE 2 Alarm time feature table

ID	16:10	16:11	16:12	16:13	16:14	16:15	16:16	16:17	16:18	16:19	16:20
A	1	1	1	1	0	0	0	0	0	0	0
B	0	0	1	1	1	1	1	1	1	0	0
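The minute-by-minute encoding can be sketched as follows. The sketch assumes both example alarms fall on 2021-08-09 and that timestamps are truncated to the minute before comparison, which is what the 0/1 pattern in Table 2 suggests; the function names are hypothetical. With only alarms A and B as input the grid ends at 16:18, the latest closing time, so it covers the first nine columns of Table 2.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"


def to_minute(ts: str) -> datetime:
    """Parse a timestamp and truncate it to the minute (an assumption)."""
    return datetime.strptime(ts, FMT).replace(second=0)


def time_features(alarms: dict):
    """alarms: {id: (create_time, close_time)} -> ({id: [0/1, ...]}, slot labels)."""
    spans = {k: (to_minute(c), to_minute(e)) for k, (c, e) in alarms.items()}
    first = min(start for start, _ in spans.values())  # minimum time = start
    last = max(end for _, end in spans.values())       # maximum time = end
    n_slots = int((last - first).total_seconds() // 60) + 1
    slots = [first + timedelta(minutes=i) for i in range(n_slots)]
    # A slot is 1 when it lies inside the alarm's survival period, else 0.
    flags = {
        k: [1 if start <= t <= end else 0 for t in slots]
        for k, (start, end) in spans.items()
    }
    return flags, [t.strftime("%H:%M") for t in slots]


ALARMS = {
    "A": ("2021-08-09 16:10:37", "2021-08-09 16:13:37"),
    "B": ("2021-08-09 16:12:37", "2021-08-09 16:18:37"),
}
flags, labels = time_features(ALARMS)
```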

Based on the alarm time feature table of Table 2, a time correlation coefficient matrix is generated. Assume that the time correlation coefficient matrix (named coe_time) is as follows:

TABLE 3 Time correlation coefficient table (coe_time)

	A	B	C	D
A	1.0	0.1	0.1	0.1
B	0.1	1.0	0.7	0.6
C	0.5	0.7	1.0	0.4
D	0.8	0.6	0.4	1.0

For non-time attributes such as the tenant field, features are extracted as follows: if two alarms belong to the same tenant, the relationship is marked 1.0, and otherwise 0, which gives the following tenant relation matrix.

TABLE 4 Tenant relationship table

	A	B	C	D
A	1.0	1.0	0	1.0
B	1.0	1.0	0	0
C	0	0	1.0	1.0
D	1.0	0	1.0	1.0

Similarly, corresponding relation matrices can be derived for the alarm source, company and network segment; here they are assumed to be identical to the tenant relation matrix.
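Building such a relation matrix from a categorical attribute takes only a few lines. The sketch below uses hypothetical tenant labels (`t1`, `t2`) and a hypothetical function name; it illustrates the marking rule only and does not attempt to reproduce the assumed values of Table 4.

```python
def relation_matrix(values: dict) -> dict:
    """values: {alarm id: attribute value} -> nested dict of 1.0 / 0.0 marks."""
    ids = list(values)
    return {
        a: {b: 1.0 if values[a] == values[b] else 0.0 for b in ids}
        for a in ids
    }


# Hypothetical tenant assignment, for illustration only.
TENANTS = {"A": "t1", "B": "t1", "C": "t2"}
tenant_rel = relation_matrix(TENANTS)
```

The same function can be applied to the alarm source, company, or network segment attribute, giving one matrix per attribute.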

Category generation and alarm importance generation (generating coefficient matrix, KNN, connected subgraph, generating alarm importance matrix, generating alarm importance):

After the feature processing is completed, the vector expression of each alarm is obtained, and cluster analysis is carried out on that basis. Because the training is unsupervised, a distance must first be defined. Since the element values of the vector expression are not on an equivalent scale, the cosine distance is used to measure the distance between two alarms. The invention uses a KNN algorithm based on cosine distance to screen out the n alarms nearest to each alarm, thereby forming links, and sets a threshold to filter out links with little correlation. Each alarm is taken as a node and each link as an edge to form a graph, and each connected subgraph in the graph is a category.

The specific exemplary steps in this embodiment are as follows:

1. Merging the features of the various relation tables:

All relationships of the alarms are as follows:

TABLE 5 Alarm relationship table

Alarm	Time features	Tenant features	Alarm source features	Company features	Network segment features
A	(1.0,0.1,0.1,0.1)	(1.0,1.0,0,1.0)	(1.0,1.0,0,1.0)	(1.0,1.0,0,1.0)	(1.0,1.0,0,1.0)
B	(0.1,1.0,0.7,0.6)	(1.0,1.0,0,0)	(1.0,1.0,0,0)	(1.0,1.0,0,0)	(1.0,1.0,0,0)
C	(0.5,0.7,1.0,0.4)	(0,0,1.0,1.0)	(0,0,1.0,1.0)	(0,0,1.0,1.0)	(0,0,1.0,1.0)
D	(0.8,0.6,0.4,1.0)	(1.0,0,1.0,1.0)	(1.0,0,1.0,1.0)	(1.0,0,1.0,1.0)	(1.0,0,1.0,1.0)

The vector expression of each alarm is:

TABLE 6 Alarm vector expression table

Alarm	Features
A	(1.0,0.1,0.1,0.1,1.0,1.0,0,1.0,1.0,1.0,0,1.0,1.0,1.0,0,1.0,1.0,1.0,0,1.0)
B	(0.1,1.0,0.7,0.6,1.0,1.0,0,0,1.0,1.0,0,0,1.0,1.0,0,0,1.0,1.0,0,0)
C	(0.5,0.7,1.0,0.4,0,0,1.0,1.0,0,0,1.0,1.0,0,0,1.0,1.0,0,0,1.0,1.0)
D	(0.8,0.6,0.4,1.0,1.0,0,1.0,1.0,1.0,0,1.0,1.0,1.0,0,1.0,1.0,1.0,0,1.0,1.0)

2. Using a KNN unsupervised clustering algorithm based on cosine distance, the n nearest alarms are returned for each alarm:

The original cosine formula, for the vector expressions x and y of two alarms, is:

$$\cos(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}$$

KNN uses 1 minus the above formula as the distance. Here, with n set to 2, the 2 nearest alarms and their distances are as follows:

TABLE 7 Two nearest alarms and their distances

Alarm	Two nearest alarms	Corresponding distances
A	(A,B)	(0,0.2650897)
B	(B,A)	(0,0.2650897)
C	(C,D)	(0,0.18749536)
D	(D,C)	(0,0.18749536)

Since the set of nodes over which distances are calculated contains the alarm itself, the closest node is always the alarm itself, at distance 0. Assuming the threshold is 0.2, all links with a distance greater than 0.2 are discarded (the smaller the distance, the more similar the alarms). According to the table above, the link between A and B is discarded and only the link between D and C remains. The four alarms A, B, C and D are used as the nodes of the graph and (D, C) as its edge, and solving for the connected subgraphs yields three of them: A, B and (C, D).

The alarms are thereby classified. The alarms within a class then need to be sorted by importance to determine the master-slave relationship.
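The nearest-neighbour search, threshold filter and connected-subgraph step can be checked with a short pure-Python sketch. The vectors are copied from Table 6; the function names and the use of a small union-find structure to extract connected subgraphs are choices made for this illustration, not something the text prescribes.

```python
import math

# Alarm vectors from Table 6: 4 time features followed by 4 identical
# blocks (tenant, alarm source, company, network segment).
VECTORS = {
    "A": [1.0, 0.1, 0.1, 0.1] + [1.0, 1.0, 0.0, 1.0] * 4,
    "B": [0.1, 1.0, 0.7, 0.6] + [1.0, 1.0, 0.0, 0.0] * 4,
    "C": [0.5, 0.7, 1.0, 0.4] + [0.0, 0.0, 1.0, 1.0] * 4,
    "D": [0.8, 0.6, 0.4, 1.0] + [1.0, 0.0, 1.0, 1.0] * 4,
}


def cosine_distance(x, y):
    """1 minus the cosine of the angle between x and y (non-zero vectors)."""
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / norms


def nearest(vectors, n):
    """For each alarm, its n closest alarms (itself included) as (distance, id)."""
    return {
        k: sorted((cosine_distance(v, w), j) for j, w in vectors.items())[:n]
        for k, v in vectors.items()
    }


def classes(vectors, n=2, threshold=0.2):
    """Keep links with distance <= threshold and return the connected subgraphs."""
    parent = {k: k for k in vectors}

    def find(k):
        while parent[k] != k:
            k = parent[k]
        return k

    for k, ranked in nearest(vectors, n).items():
        for dist, j in ranked:
            if j != k and dist <= threshold:
                parent[find(j)] = find(k)  # merge the two components
    groups = {}
    for k in vectors:
        groups.setdefault(find(k), []).append(k)
    return sorted(sorted(g) for g in groups.values())


neighbours = nearest(VECTORS, 2)
result = classes(VECTORS)
```

With the threshold of 0.2 from the text, `result` contains the three classes A, B and (C, D), and the second entry of `neighbours["A"]` and `neighbours["C"]` carries the distances listed in Table 7.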

3. Historically, most root cause alarms have a shorter survival time than other alarms, because the other alarms are caused by the root cause alarm, and resolving the root cause alarm resolves the others.

Rule one: if the time period of alarm A is a proper subset of the time period of alarm B, alarm A is considered more important than alarm B, and important[A][B] = 2 and important[B][A] = -2 are assigned (important is the importance matrix).

Rule two: if the time period of alarm A is equal to that of alarm B, the two alarms are considered equally important, and important[A][B] = 1 and important[B][A] = 1 are assigned.

Rule three: if the time periods of alarms A and B satisfy neither of the above relations, their importance is considered indistinguishable, and important[A][B] = 0 and important[B][A] = 0 are assigned.
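The three rules translate directly into a pairwise comparison of time periods. The sketch below represents each period as a (start, end) pair of minutes; the alarm names P, Q, R, S and their periods are hypothetical values chosen to trigger each rule, and the function names are assumptions.

```python
def pair_importance(a, b):
    """Return (important[a][b], important[b][a]) for two (start, end) periods."""
    if a == b:
        return 1, 1        # rule two: identical periods
    if b[0] <= a[0] and a[1] <= b[1]:
        return 2, -2       # rule one: a is a proper subset of b
    if a[0] <= b[0] and b[1] <= a[1]:
        return -2, 2       # rule one with the roles swapped
    return 0, 0            # rule three: importance cannot be distinguished


def importance_matrix(periods: dict) -> dict:
    """periods: {alarm id: (start, end)} -> nested dict important[x][y]."""
    ids = list(periods)
    return {
        x: {y: pair_importance(periods[x], periods[y])[0] for y in ids}
        for x in ids
    }


# Hypothetical periods in minutes, for illustration only.
PERIODS = {"P": (10, 12), "Q": (10, 18), "R": (10, 18), "S": (15, 20)}
important = importance_matrix(PERIODS)
```

Because every period equals itself, the diagonal of the resulting matrix is 1, which matches the diagonal of the rule importance table in the text.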

According to the above rules, in combination with the alarm time profile table of table 2, it is assumed that the importance matrix is formed as follows:

TABLE 8 Rule importance table

	A	B	C	D
A	1	2	0	1
B	-2	1	2	0
C	0	-2	1	1
D	1	0	1	1

A rule importance summary is formed for each alarm by summing its row in the rule importance table of Table 8:

TABLE 9 Rule importance summary table

Alarm	A	B	C	D
Rule importance summary	4	1	0	3

Because different alarms have different impact (for example, machine downtime and a service abnormality do not have the same degree of impact), the invention, in combination with the Weiying cloud, assigns different weights to the alarm levels.

TABLE 10 alarm level importance Table

Assume that the levels of alarms A, B, C and D are as follows:

TABLE 11 Level importance of each alarm

Alarm	Level	Score
A	0-recovery event	0
B	1-information level event	1
C	3-failure level event	3
D	4-accident level event	4

Combining the rule importance summary of Table 9 with the level scores of Table 11 gives the final importance of each alarm (here, the sum of the two values):

TABLE 12 Importance of each alarm

Alarm	A	B	C	D
Importance	4	2	3	7
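The numbers in Tables 9 and 12 are consistent with two simple sums: each alarm's rule importance summary equals the sum of its row in Table 8, and its final importance equals that summary plus its level score from Table 11. That reading is inferred from the table values rather than stated in the text; the short check below reproduces it, with variable names chosen for the example.

```python
# Rule importance matrix from Table 8 (row = alarm, column = compared alarm).
RULE_IMPORTANCE = {
    "A": {"A": 1, "B": 2, "C": 0, "D": 1},
    "B": {"A": -2, "B": 1, "C": 2, "D": 0},
    "C": {"A": 0, "B": -2, "C": 1, "D": 1},
    "D": {"A": 1, "B": 0, "C": 1, "D": 1},
}
# Level scores from Table 11.
LEVEL_SCORE = {"A": 0, "B": 1, "C": 3, "D": 4}

# Inferred: the summary is the row sum, the final importance adds the level score.
summary = {k: sum(row.values()) for k, row in RULE_IMPORTANCE.items()}
final_importance = {k: summary[k] + LEVEL_SCORE[k] for k in RULE_IMPORTANCE}
```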

Intelligent noise reduction and root cause output (master-slave alarm):

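The corresponding confidence is calculated from the classification results and the importance of each alarm:

$$\text{confidence} = \frac{\text{time correlation coefficient} + 1}{4} + \frac{\text{master alarm importance} - \text{slave alarm importance}}{2 \times (\text{maximum importance in class} - \text{minimum importance in class})}$$

In special cases: if the maximum importance value in the class is the same as the minimum importance value in the class, the maximum importance value in the class is set to 0; the confidence of a class containing a single alarm is set directly to 1.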

The final output results obtained are:

TABLE 13 Master-slave alarm output table

Master alarm	Slave alarm	Confidence
A		100.00%
B		100.00%
D	C	85.00%

When alarms are actually sent to operation and maintenance personnel and alarms D and C occur together, only alarm D needs to be sent, which achieves intelligent noise reduction. Alarm C is merged into alarm D, so the operation and maintenance personnel only need to resolve alarm D, and alarm C is resolved along with it.
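The 85.00% confidence of the (D, C) pair in Table 13 can be checked numerically. The sketch assumes the confidence formula groups as (time correlation coefficient + 1) / 4 + (master importance - slave importance) / (2 × (class maximum - class minimum)); that grouping is an assumption, adopted because it is the reading that reproduces the table value. The function name is hypothetical, and the special cases described in the text are deliberately left outside the function.

```python
def confidence(time_corr, master_imp, slave_imp, class_max, class_min):
    """Confidence of a master-slave pair under the assumed grouping."""
    spread = class_max - class_min
    if spread == 0:
        # The text treats equal maximum and minimum as a special case;
        # this sketch does not model it.
        raise ValueError("special case: maximum equals minimum importance")
    return (time_corr + 1) / 4 + (master_imp - slave_imp) / (2 * spread)


# D is the master and C the slave: time correlation 0.4 (Table 5),
# importances 7 and 3 (Table 12), which are also the class maximum and minimum.
dc_confidence = confidence(0.4, 7, 3, 7, 3)
```

Here the first term is 1.4 / 4 = 0.35 and the second is 4 / 8 = 0.5, giving 0.85.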

Step 2: referring to fig. 3, the relation matrix and the importance matrix are updated on the basis of the initial training, and the information contained in the newly added samples is absorbed through incremental training. Incremental training follows essentially the same route as initial training: the data passes through preprocessing, feature processing, model training and result output in sequence. What changes is mainly the updating of the relation matrix and the importance matrix, whose purpose is to absorb the information contained in the newly added samples. For example, if A and B are more closely related and their master-slave relationship is clearer in the new samples, the model needs to iterate toward a closer relationship and a clearer master-slave distinction. Only the time relation matrix is adjustable, since adjusting the relation matrices of the other attributes would distort their original meaning, so the update rules (applied to the time relation matrix only) are set as follows:

1. if the alarm is new, similarities for which data is available are calculated and filled in directly, similarities for which no data is available are filled with 0, and the importance matrix is handled in the same way;

2. if the alarm is an original alarm, its new similarity is calculated and averaged with the original similarity, and the importance matrix is handled in the same way.

In this embodiment, the original time correlation coefficient matrix is:

TABLE 14 Original time correlation coefficient table

	A	B	C	D
A	1.0	0.1	0.1	0.1
B	0.1	1.0	0.7	0.6
C	0.5	0.7	1.0	0.4
D	0.8	0.6	0.4	1.0

The time relation matrix formed by the new alarms is:

TABLE 15 New time correlation coefficient table

	A	B	E
A	1	0.5	0.5
B	0.5	1	0.7
E	0.5	0.7	1

The updated time relationship matrix is:

TABLE 16 Final Table of time correlation coefficients
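The two update rules can be sketched as a merge of the matrices in Tables 14 and 15. This is an illustration under stated assumptions, not a reproduction of the patent's final table: entries that appear only in the original matrix are assumed to stay unchanged, and the function name `merge` is hypothetical.

```python
def merge(old: dict, new: dict) -> dict:
    """Merge an original and a newly computed similarity matrix."""
    ids = list(old) + [k for k in new if k not in old]
    merged = {}
    for a in ids:
        merged[a] = {}
        for b in ids:
            in_old = b in old.get(a, {})
            in_new = b in new.get(a, {})
            if in_old and in_new:
                # Original alarm pair: average old and new similarity.
                merged[a][b] = (old[a][b] + new[a][b]) / 2
            elif in_old:
                # Assumption: no new data, keep the original value.
                merged[a][b] = old[a][b]
            elif in_new:
                # New alarm with supporting data: fill in directly.
                merged[a][b] = new[a][b]
            else:
                # New alarm without supporting data: fill with 0.
                merged[a][b] = 0.0
    return merged


OLD = {
    "A": {"A": 1.0, "B": 0.1, "C": 0.1, "D": 0.1},
    "B": {"A": 0.1, "B": 1.0, "C": 0.7, "D": 0.6},
    "C": {"A": 0.5, "B": 0.7, "C": 1.0, "D": 0.4},
    "D": {"A": 0.8, "B": 0.6, "C": 0.4, "D": 1.0},
}
NEW = {
    "A": {"A": 1.0, "B": 0.5, "E": 0.5},
    "B": {"A": 0.5, "B": 1.0, "E": 0.7},
    "E": {"A": 0.5, "B": 0.7, "E": 1.0},
}
updated = merge(OLD, NEW)
```

Under these assumptions the A-B similarity becomes the average of 0.1 and 0.5, the new alarm E takes its similarities to A and B from the new matrix, and its similarities to C and D are 0.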

The importance matrix is updated in the same way as the time correlation coefficient matrix. The remaining steps follow the subsequent flow of the initial training flowchart, yielding the updated relationships and confidences between the alarms.

Step 3: referring to fig. 4, the operation and maintenance personnel's handling of the algorithm results is used as user feedback information, which is written back to the database and used as newly added data for iteration.

Because what is absorbed is the feedback of the operation and maintenance personnel, the two steps of data preprocessing and feature processing are not needed, unlike in incremental training. The algorithm mainly updates the time correlation coefficient matrix and the importance matrix. A user can judge a master-slave relationship in one of two ways, approval or disapproval, and the model iterates in a different direction for each, so the update rules (applied to the time relation matrix) are as follows:

Assume A is the master and B is the slave.

1. On approval, the time correlation coefficients coe[A][B] and coe[B][A] are increased by 0.1 and kept within [-1, 1], taking the boundary value when the interval is exceeded; the importance coefficient important[A][B] is raised by 2, and important[B][A] is lowered by 2;

2. On disapproval, coe[A][B] and coe[B][A] are reduced by 0.1 and kept within [-1, 1], taking the boundary value when the interval is exceeded; the importance coefficient important[B][A] is raised by 2, and important[A][B] is lowered by 2.

In this embodiment, the specific steps are as follows:

original time relation matrix:

TABLE 17 original time correlation coefficient table

Assuming that "A master, B slave" is approved and "C master, D slave" is disapproved, the matrix is updated to:

TABLE 18 Final time correlation coefficient table

	A	B	C	D
A	1.0	0.2	0.1	0.1
B	0.2	1.0	0.7	0.6
C	0.5	0.7	1.0	0.3
D	0.8	0.6	0.3	1.0

Comparing the values before and after the change, the A-B distance is reduced (the alarms move closer) and the C-D distance is increased (the alarms move apart), which meets the business requirement.

Importance matrix:

TABLE 19 Original rule importance table

	A	B	C	D
A	1	2	0	1
B	-2	1	2	0
C	0	-2	1	1
D	1	0	1	1

TABLE 20 Final rule importance table

	A	B	C	D
A	1	4	0	1
B	-4	1	2	0
C	0	-2	1	-1
D	1	0	3	1

Comparing the values before and after the change: for the pair A and B, the importance of A is increased and that of B is reduced; for the pair C and D, the importance of C is reduced and that of D is increased, which meets the business requirement. Finally, the output follows the subsequent flow of the initial training flowchart, yielding the updated relationships and confidences between the alarms.
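The feedback rules of step 3 can be replayed on the example matrices to confirm the values of Tables 18 and 20. This is a minimal sketch: the function names are hypothetical, the starting time correlation matrix is taken from Table 14, and the matrices are updated in place.

```python
import copy


def clamp(x, low=-1.0, high=1.0):
    """Keep a coefficient inside [-1, 1], taking the boundary when exceeded."""
    return max(low, min(high, x))


def apply_feedback(coe, important, master, slave, approved):
    """Update both matrices in place for one approve/disapprove judgement."""
    step = 0.1 if approved else -0.1
    for a, b in ((master, slave), (slave, master)):
        coe[a][b] = clamp(coe[a][b] + step)
    delta = 2 if approved else -2
    important[master][slave] += delta
    important[slave][master] -= delta


COE = {
    "A": {"A": 1.0, "B": 0.1, "C": 0.1, "D": 0.1},
    "B": {"A": 0.1, "B": 1.0, "C": 0.7, "D": 0.6},
    "C": {"A": 0.5, "B": 0.7, "C": 1.0, "D": 0.4},
    "D": {"A": 0.8, "B": 0.6, "C": 0.4, "D": 1.0},
}
IMPORTANT = {
    "A": {"A": 1, "B": 2, "C": 0, "D": 1},
    "B": {"A": -2, "B": 1, "C": 2, "D": 0},
    "C": {"A": 0, "B": -2, "C": 1, "D": 1},
    "D": {"A": 1, "B": 0, "C": 1, "D": 1},
}

coe, important = copy.deepcopy(COE), copy.deepcopy(IMPORTANT)
apply_feedback(coe, important, "A", "B", approved=True)    # A master, B slave: approved
apply_feedback(coe, important, "C", "D", approved=False)   # C master, D slave: disapproved
```

After the two calls, the A-B coefficients rise to about 0.2 and the C-D coefficients fall to about 0.3 (up to floating-point rounding), while the importance entries change by 2 in the directions shown in Table 20.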

Without labeled samples, the method can learn the features of the samples, classify them, analyze their importance and locate the root cause alarm, and it can absorb the operating experience of operation and maintenance personnel for continuous optimization and iteration, improving the prediction accuracy of the algorithm.

In summary, the invention first uses the correlation among alarms to cluster them without supervision, then applies business rules and expert experience to perform root cause analysis, and finally absorbs the feedback of operation and maintenance personnel automatically, so that the algorithm becomes more accurate the more it is used. Operation and maintenance personnel can thereby focus on the actual problem and solve it, and the method has the property of self-learning improvement.

The present invention is not limited to the above preferred embodiments, and any modifications, equivalent substitutions and improvements made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. An unsupervised intelligent noise reduction processing method is characterized by comprising the following steps:

2. The unsupervised intelligent noise reduction processing method according to claim 1, wherein the alarm data in the data requirement in step 1 is collected by a cloud platform and stored in a database.

3. The unsupervised intelligent noise reduction processing method according to claim 1, wherein the data preprocessing in step 1 comprises the following specific steps:

4. The unsupervised intelligent noise reduction processing method according to claim 1, wherein the specific steps of the feature processing in the step 1 are as follows:

5. The unsupervised intelligent noise reduction processing method according to claim 4, wherein specific processing modes are as follows for time attributes:

a. extracting the maximum time and the minimum time in all samples;

6. The unsupervised intelligent noise reduction processing method according to claim 4, wherein for non-time attributes, if two alarms belong to the same tenant, the relationship is marked 1.0; if they do not, it is marked 0, and a relation matrix is thereby obtained.

7. The unsupervised intelligent noise reduction processing method according to claim 1, wherein the specific steps of the category generation and the alarm importance generation in the step 1 are as follows:

8. The unsupervised intelligent denoising processing method according to claim 1, wherein the intelligent denoising and root cause output in step 1 comprises the following specific steps:

9. The unsupervised intelligent noise reduction processing method according to claim 1, wherein the update rule in the incremental training of step 2 is as follows: