CN110837953A - Automatic abnormal entity positioning analysis method

Info

Publication number: CN110837953A
Application number: CN201911019400.6A
Authority: CN
Inventors: 刘大鹏; 聂晓辉; 朱晶; 王耀
Original assignee: Beijing Bishi Technology Co Ltd
Current assignee: Beijing Bishi Technology Co Ltd
Priority date: 2019-10-24
Filing date: 2019-10-24
Publication date: 2020-02-25

Abstract

The invention discloses an automatic abnormal entity positioning analysis method that triggers a positioning system to perform analysis when a service index becomes abnormal. The method comprises index abnormal degree judgment, similar abnormal entity clustering, and positioning result ranking. The method is scientifically sound and safe and convenient to use. It automates the otherwise manual fault location of abnormal entities and modules, where entities are software and hardware equipment such as physical machines, virtual machines, middleware, routers, and switches. It effectively helps operation and maintenance personnel find problems, greatly reduces fault recovery time, and helps ensure system stability. A similar abnormal entity clustering algorithm groups entities with similar abnormal indexes, and the resulting clusters of abnormal entities are then ranked, which spares operation and maintenance personnel the tedious one-by-one investigation and effectively reduces their troubleshooting burden.

Description

Automatic abnormal entity positioning analysis method

Technical Field

The invention relates to the technical field of data exception handling, in particular to an automatic abnormal entity positioning analysis method.

Background

The IT infrastructure environment in a data center is complex. Each service is composed of a series of modules or middleware (such as Web servers and databases), the modules are deployed on different servers in different machine rooms, and each server has a large number of different monitoring indexes. When a service has problems, an abnormality in a single basic component of the system can act like the flap of a butterfly's wings and cause an alarm storm across multiple core transaction systems. Once a service alarm is generated, managers must devote their full effort to analyzing it and manually locate the entity and module where the abnormality lies. The basic flow is as follows:

1. performing intelligent anomaly detection on each monitoring index to obtain the current set of abnormal indexes;

2. manually sorting the abnormal indexes according to experience and judging how severe each one is;

3. checking the abnormalities one by one according to the sorted order, finding the root cause of the problem, and executing a fault recovery scheme.

However, because the monitoring indexes are numerous, manual troubleshooting is difficult, the above analysis costs the service system a large amount of time, and the fault recovery time increases greatly. Turning the manual fault location process into an automatic one faces the following challenges:

1. anomaly detection is the input to fault location, and most existing anomaly detection algorithms can only report whether an index is abnormal, not how abnormal it is;

2. an actual IT program module is deployed on multiple entities, and each entity has the same monitoring indexes. When a fault occurs, many indexes become abnormal at the same time and some of them are similar, so how to cluster them further to make troubleshooting easier for operation and maintenance personnel is a problem to be solved;

3. after a clustering algorithm is applied, the abnormal entities and abnormal indexes are grouped into several classes, each reflecting a different type of fault. Operation and maintenance personnel still have to investigate the classes one by one. Although there are fewer results than the original number of abnormal indexes, a reasonable method is still needed to rank them by severity of abnormality.

Therefore, an automatic abnormal entity location analysis method is urgently needed to solve the above problems.

Disclosure of Invention

The invention aims to provide an automatic abnormal entity positioning analysis method to solve the problems in the prior art.

To achieve the above purpose, the invention provides the following technical solution: an automatic abnormal entity positioning analysis method in which a positioning system is triggered to perform analysis when a service index becomes abnormal;

the abnormal entity positioning analysis method comprises the following steps:

S1, judging the degree of abnormality of the indexes;

S2, clustering similar abnormal entities;

and S3, ranking the positioning results.

According to the above technical solution, in step S1, index abnormal degree judgment means that when a service index fails, the degrees of abnormality of a large number of related indexes are judged at the same time;

the index data of all entities and modules over the most recent period of time is collected, and an anomaly detection algorithm is executed to detect the degree of abnormality of every index.

According to the above technical solution, in step S2, similar abnormal entity clustering refers to clustering entities that have similar abnormal service indexes;

after the degrees of abnormality of all indexes are obtained, entities with similar index abnormalities are clustered by a clustering algorithm.

According to the above technical solution, in step S3, positioning result ranking refers to ranking the obtained results according to their degree of abnormality;

an intelligent ranking algorithm is run to rank all clustering results by degree of abnormality, and the ranked results are finally displayed to operation and maintenance personnel.

According to the above technical solution, the purpose of index abnormal degree judgment is to detect abnormal patterns in the time series curve of an index by using an anomaly detection algorithm;

index abnormal degree judgment converts the anomaly detection problem into a statistical probability observation model, observes the probability that the curve has changed abruptly, and uses a traditional kernel density estimation algorithm to give the degree of abnormality of the curve, which solves the problem of quantifying the degrees of different indexes on a uniform scale;

suppose a failure occurs at time t. For a single index, the algorithm collects the data set {x_i} from the window [t - w₁, t) before the failure and the data set {x_j} from the window [t, t + w₂) after the failure;

the anomaly detection problem is converted into computing, for the given observed data sets {x_i} and {x_j}, the observation probability P({x_i}|{x_j}). Finally, for each index, the anomaly detection algorithm calculates the probability of a sudden increase P_o({x_j}|{x_i}) and the probability of a sudden decrease P_u({x_j}|{x_i}), which describe the degree of abnormality of the current index;

the anomaly detection algorithm is as follows:

where o and u represent the degrees of abnormality of the sudden increase and the sudden decrease, and m represents the size of {x_j}. Because different indexes are sampled differently, the value of m differs between indexes and the degrees of abnormality are not directly comparable, so the exponential average of the degree of abnormality is calculated.
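The following is a minimal sketch of how such degrees could be computed, not the formula of the invention. It assumes a Gaussian kernel density estimate fitted on the pre-failure window {x_i}, and it assumes that o and u are the average negative log tail probabilities over the m points of {x_j}, so that indexes with different m stay comparable. The function names, the bandwidth rule, and the clipping constant `eps` are all hypothetical.

```python
import math


def kde_cdf(samples, bandwidth):
    """Return a function x -> estimated P(X <= x) under a Gaussian KDE."""
    scale = bandwidth * math.sqrt(2.0)

    def cdf(x):
        total = sum(0.5 * (1.0 + math.erf((x - s) / scale)) for s in samples)
        return total / len(samples)

    return cdf


def abnormal_degrees(before, after, bandwidth=None, eps=1e-12):
    """Return (o, u): assumed sudden-increase and sudden-decrease degrees.

    before: values in [t - w1, t), after: values in [t, t + w2).
    """
    if bandwidth is None:
        # Normal-reference rule of thumb (an assumption, not from the source).
        n = len(before)
        mean = sum(before) / n
        var = sum((x - mean) ** 2 for x in before) / max(n - 1, 1)
        std = math.sqrt(var)
        bandwidth = 1.06 * std * n ** (-0.2) if std > 0 else 1.0
    cdf = kde_cdf(before, bandwidth)
    m = len(after)

    def clip(p):
        # Keep probabilities inside [eps, 1] so the logarithm is finite.
        return min(max(p, eps), 1.0)

    # o: how unlikely it is to see values at least this large.
    o = -sum(math.log(clip(1.0 - cdf(x))) for x in after) / m
    # u: how unlikely it is to see values at least this small.
    u = -sum(math.log(clip(cdf(x))) for x in after) / m
    return o, u
```

With a pre-failure window around 10 and a post-failure window around 20, this sketch returns a large o and a u close to zero. A drop gives the opposite pattern.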

According to the above technical solution, the similar abnormal entity clustering algorithm aims to group entities with similar abnormal indexes and so reduce the troubleshooting burden of operation and maintenance personnel. The core of the algorithm is to turn the abnormal index information of each entity into a vector and to design a suitable clustering algorithm for clustering the entities;

the clustering algorithm mainly comprises three parts: the input vector, the distance function, and the clustering algorithm itself;

the input vector is set to (o₀, u₀, o₂, u₂, ..., o_n, u_n), where o_n and u_n represent the degrees of abnormality of the sudden increase and the sudden decrease of the nth KPI, respectively;

the distance function is used to calculate the distance between two vectors for clustering;

the distance function is based on the correlation coefficient, which measures the correlation between two variables X' and Y'. Its value lies between -1 and 1, and it is the quotient of the covariance of the two variables and the product of their standard deviations;

the calculation formula is as follows:

the above formula defines the population correlation coefficient. Estimating the covariance and the standard deviations from a sample gives the sample correlation coefficient, denoted r, whose calculation formula is as follows:
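Written out, and assuming the description refers to the standard Pearson correlation coefficient (the symbols ρ, cov, σ, the sample size n, and the bar notation for sample means are assumptions here), the population and sample forms would read:

```latex
\rho_{X',Y'} = \frac{\operatorname{cov}(X', Y')}{\sigma_{X'}\,\sigma_{Y'}}
\qquad
r = \frac{\sum_{i=1}^{n} (X'_i - \bar{X'})(Y'_i - \bar{Y'})}
         {\sqrt{\sum_{i=1}^{n} (X'_i - \bar{X'})^{2}}\;\sqrt{\sum_{i=1}^{n} (Y'_i - \bar{Y'})^{2}}}
```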

r = 1 means that X' and Y' are perfectly positively correlated and Y' increases as X' increases, r = -1 means that X' and Y' are perfectly negatively correlated and Y' decreases as X' increases, and r = 0 means that there is no linear relationship between the two variables. The distance function of the algorithm is therefore as follows:

Distance = 1 - r;
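As an illustration, the distance between two entity vectors can be sketched as follows. The function names are hypothetical, the correlation is assumed to be the standard sample Pearson coefficient, and treating a constant vector as uncorrelated (r = 0) is an added assumption to avoid division by zero.

```python
import math


def entity_vector(kpi_degrees):
    """Flatten [(o, u), (o, u), ...] per-KPI degrees into one flat vector."""
    return [value for pair in kpi_degrees for value in pair]


def pearson_r(x, y):
    """Sample Pearson correlation coefficient of two equal-length vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("vectors must have the same length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        # Assumption: a constant vector carries no linear relationship.
        return 0.0
    return cov / (spread_x * spread_y)


def correlation_distance(x, y):
    """Distance = 1 - r, so the result lies between 0 and 2."""
    return 1.0 - pearson_r(x, y)
```

Two entities whose abnormal degrees rise and fall together across KPIs get a distance near 0 even if one is uniformly more abnormal than the other, which is why a correlation-based distance groups entities by the shape of their abnormality rather than by its magnitude.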

the clustering algorithm divides regions of sufficient density into clusters, finds clusters of arbitrary shape in a noisy spatial database, and defines a cluster as the maximal set of density-connected points. The algorithm is density-based: it requires that the number of objects (points or other spatial objects) contained in a given region of the clustering space be no less than a given threshold. Its notable advantages are that it clusters quickly, handles noise points effectively, and finds spatial clusters of arbitrary shape. The algorithm requires the user to input two parameters: the radius (Eps), which represents the extent of the circular neighborhood centered at a given point P, and the minimum number of points (MinPts) in the neighborhood centered at point P. When the number of points in the neighborhood of radius Eps centered at point P is no less than MinPts, point P is called a core point. The radius Eps is determined as follows: the k-distance set E of all points is sorted in ascending order to obtain the set E', the curve of the k-distances in the sorted set E' is fitted and plotted, and the k-distance value at the position where the curve changes sharply is taken as the value of Eps.
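The density-based procedure described here matches the widely known DBSCAN algorithm. The sketch below is a compact illustration under stated assumptions: the function names are hypothetical, the distance function is passed in by the caller, and choosing Eps at the largest jump in the sorted k-distance list is an assumed stand-in for "the position where the curve changes sharply".

```python
def sorted_k_distances(points, k, dist):
    """Ascending list of each point's distance to its k-th nearest neighbor."""
    result = []
    for i, p in enumerate(points):
        others = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(others[k - 1])
    return sorted(result)


def pick_eps(sorted_kd):
    """Assumed heuristic: the k-distance just before the largest jump."""
    jumps = [b - a for a, b in zip(sorted_kd, sorted_kd[1:])]
    return sorted_kd[jumps.index(max(jumps))]


def region_query(points, idx, eps, dist):
    """Indexes of all points within eps of points[idx], itself included."""
    return [j for j in range(len(points)) if dist(points[idx], points[j]) <= eps]


def dbscan(points, eps, min_pts, dist):
    """Return one label per point; -1 marks noise."""
    unvisited, noise = None, -1
    labels = [unvisited] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not unvisited:
            continue
        neighbors = region_query(points, i, eps, dist)
        if len(neighbors) < min_pts:
            labels[i] = noise  # may later be claimed as a border point
            continue
        cluster += 1  # points[i] is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not unvisited:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps, dist)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # points[j] is also a core point
    return labels
```

In this method the points would be the per-entity abnormality vectors and `dist` a correlation-based distance of the form Distance = 1 - r. The sketch keeps the distance pluggable so it can be tried on any data.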

According to the above technical solution, positioning result ranking takes the entities with different manifestations obtained by the clustering algorithm and ranks the entity indexes of the different classes. Using logistic regression, a supervised learning method, a suitable ranking model Y₁ = f(X₁) is trained by learning from a portion of manually labeled results, where X₁ is the feature part and Y₁ is the ranking score;

the features employed by the algorithm are defined as follows:

(1) maximum outlier: the largest degree of abnormality among all KPIs;

(2) sum of outliers: the sum of the abnormal degree values of all KPIs;

(3) number of abnormal KPIs: the number of KPIs whose degree of abnormality is higher than an abnormality threshold;

(4) proportion of entities: the number of abnormal entities divided by the number of entities in the current module.
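The four features and a logistic scoring step can be sketched as follows. This is an illustration only: the data layout (a cluster as a mapping from entity name to its list of per-KPI abnormal degrees), the function names, and the weights and bias are hypothetical. In practice the weights would come from a logistic regression model trained on manually labeled results, which is omitted here.

```python
import math


def cluster_features(cluster, module_entity_count, threshold):
    """Build the feature vector X1 for one cluster.

    cluster: dict mapping entity name -> list of per-KPI abnormal degrees.
    Assumption: "all KPIs" means all KPIs of all entities in the cluster.
    """
    degrees = [d for kpis in cluster.values() for d in kpis]
    return [
        max(degrees),                              # (1) maximum outlier
        sum(degrees),                              # (2) sum of outliers
        sum(1 for d in degrees if d > threshold),  # (3) number of abnormal KPIs
        len(cluster) / module_entity_count,        # (4) proportion of entities
    ]


def logistic_score(features, weights, bias):
    """Ranking score Y1 = f(X1) as a logistic function of a weighted sum."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)  # numerically stable branch for negative z
    return exp_z / (1.0 + exp_z)


def rank_clusters(clusters, module_entity_count, threshold, weights, bias):
    """Return cluster names ordered from highest to lowest score."""
    scored = [
        (logistic_score(cluster_features(c, module_entity_count, threshold),
                        weights, bias), name)
        for name, c in clusters.items()
    ]
    return [name for _, name in sorted(scored, reverse=True)]
```

A cluster containing many strongly abnormal KPIs across a large share of a module's entities scores higher and is shown to operation and maintenance personnel first.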

Compared with the prior art, the invention has the following beneficial effects: the method automates the otherwise manual fault location of abnormal entities and modules, effectively helps operation and maintenance personnel find problems, greatly reduces fault recovery time, and helps ensure system stability. The similar abnormal entity clustering algorithm groups entities with similar abnormal indexes, and the resulting clusters of abnormal entities are ranked, which spares operation and maintenance personnel the tedious one-by-one investigation and effectively reduces their troubleshooting burden.

Drawings

FIG. 1 is a schematic diagram illustrating steps of an automated abnormal entity location analysis method according to the present invention;

FIG. 2 is a diagram of an abnormal entity and module positioning architecture of an automated abnormal entity positioning analysis method according to the present invention;

FIG. 3 is a diagram of a service failure caused by a CPU failure according to an automated abnormal entity location analysis method of the present invention;

FIG. 4 is a flow chart of a clustering algorithm of an automated abnormal entity localization analysis method according to the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Embodiment: as shown in FIGS. 1-4, an automatic abnormal entity positioning analysis method triggers a positioning system to perform analysis when a service index becomes abnormal;

the abnormal entity positioning analysis method comprises the following steps:

S1, judging the degree of abnormality of the indexes;

S2, clustering similar abnormal entities;

S3, ranking the positioning results;

The method first judges the degree of abnormality of each index and detects whether the index shows a sudden increase or a sudden decrease, and the algorithm presents the sudden increase or sudden decrease in a form that is easy to observe. A clustering algorithm then clusters the similar abnormal indexes so that operation and maintenance personnel can check them conveniently. Finally, the abnormal indexes and abnormal entities are ranked according to the clustering result: using current advanced engine ranking technology, the final abnormal indexes and abnormal entities are scored according to the ranking model and the ranking features, and operation and maintenance personnel identify the most abnormal indexes and entities from the ranking scores.

the anomaly detection algorithm is as follows:

the distance function is used to calculate the distance between two vectors for clustering;

the calculation algorithm is as follows:

Distance = 1 - r;

the features employed by the algorithm are defined as follows:

(2) sum of outliers: the sum of the abnormal degree values of all KPIs;

(4) proportion of entities: the number of abnormal entities divided by the number of entities in the current module.

The entire clustering result is ranked according to the ranking score computed from the features, and operation and maintenance personnel check the fault indexes in ranked order, which speeds up troubleshooting, shortens troubleshooting time, and helps ensure system stability.

It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.

Claims

1. An automatic abnormal entity positioning analysis method, characterized in that: when a service index becomes abnormal, the method triggers a positioning system to perform analysis;

the abnormal entity positioning analysis method comprises the following steps:

S1, judging the degree of abnormality of the indexes;

S2, clustering similar abnormal entities;

and S3, ranking the positioning results.

2. The automated abnormal entity localization analysis method of claim 1, wherein: in step S1, the index abnormal degree evaluation means that when a service index fails, the abnormal degree of a large number of related indexes is evaluated at the same time;

3. The automated abnormal entity localization analysis method of claim 1, wherein: in step S2, the clustering of similar abnormal entities refers to clustering entities with similar abnormal service indexes;

4. The automated abnormal entity localization analysis method of claim 1, wherein: in step S3, the ranking of the positioning result refers to ranking the obtained various results according to the degree of abnormality;

5. The automated abnormal entity localization analysis method according to claim 2, wherein: the purpose of the index abnormal degree evaluation is to detect an abnormal mode of an index time series curve by using an abnormal detection algorithm;

the index abnormal degree evaluation is to convert the abnormal detection problem into a statistical probability observation model, observe the probability of the sudden change of the curve and give the abnormal degree of the curve by using an abnormal detection algorithm;

the anomaly detection algorithm is as follows:

where o and u represent the degrees of abnormality of the sudden increase and the sudden decrease, and m represents the size of {x_j}.

6. The automated abnormal entity localization analysis method of claim 3, wherein: the core of the similar abnormal entity clustering algorithm is to turn the abnormal index information of each entity into a vector and to design a suitable clustering algorithm for clustering the entities;

the input vector is set to (o₀, u₀, o₂, u₂, ..., o_n, u_n), where o_n and u_n represent the degrees of abnormality of the sudden increase and the sudden decrease of the nth KPI, respectively;

the distance function is used to calculate the distance between two vectors for clustering;

the calculation algorithm is as follows:

Distance = 1 - r;

the clustering algorithm divides regions of sufficient density into clusters, finds arbitrarily shaped clusters in a noisy spatial database, and defines clusters as the largest set of density-connected points.

7. The automated abnormal entity localization analysis method of claim 4, wherein: positioning result ranking takes the entities with different manifestations obtained by the clustering algorithm and ranks the entity indexes of the different classes, and, using logistic regression, a supervised learning method, trains a suitable ranking model Y₁ = f(X₁) by learning from a portion of manually labeled results;

where X₁ is the feature part and Y₁ is the ranking score;

the features employed by the algorithm are defined as follows:

(2) sum of outliers: the sum of the abnormal degree values of all KPIs;