CN117149500A

CN117149500A - Abnormal root cause obtaining method and system based on index data and log data

Info

Publication number: CN117149500A
Application number: CN202311417601.8A
Authority: CN
Inventors: 张竞超; 张泽锟; 余螯
Original assignee: Anhui Sigao Intelligent Technology Co ltd
Current assignee: Anhui Sigao Intelligent Technology Co ltd
Priority date: 2023-10-30
Filing date: 2023-10-30
Publication date: 2023-12-01
Anticipated expiration: 2043-10-30
Also published as: CN117149500B

Abstract

The invention provides a method for obtaining an abnormal root cause based on index data and log data, comprising the following steps: S1: acquiring index data and log data of a micro-service system; S2: calculating an index anomaly score sequence set MASS of the index data with the BIRCH clustering algorithm; S3: calculating a log anomaly score sequence LAS of the log data with the DeepLog algorithm; S4: performing association analysis between the clustering result of each index data in the index anomaly score sequence set MASS and the log anomaly score sequence LAS to obtain association degrees; S5: obtaining the abnormal root cause index by sorting the association degrees. Because the method analyzes the association degree between the clustering results of the index data and the log anomaly score sequence, the abnormal root cause can be quantified by association degree ranking, which helps operation and maintenance personnel quickly locate the root cause of a problem and reduces the operation and maintenance losses of enterprises.

Description

Abnormal root cause obtaining method and system based on index data and log data

Technical Field

The invention relates to the field of intelligent operation and maintenance, in particular to a method and a system for obtaining an abnormal root cause based on index data and log data.

Background

The rapid growth of the internet has dramatically expanded the scale and complexity of micro-service systems. Most internet enterprises rely on too narrow a set of operation and maintenance methods and remain at the stage of manual analysis. This traditional manual mode is falling behind and cannot cope with large scale and high complexity.

In recent years, with the development of artificial intelligence, data-driven automation algorithms have been applied successfully in a variety of complex scenarios, which provides an opportunity to solve these problems. Such algorithms are built on data, and logs and metrics (index data) are important components of operation and maintenance observability for micro-service systems. Logs are an important data source for detecting anomalies in a micro-service system: they record detailed runtime information, including event timestamps, the methods involved, parameters and so on. Inspecting logs helps maintenance personnel understand the behavior of the system and find possible anomalies. System operation indexes are time-series data collected at fixed intervals, such as CPU usage and response latency, and are commonly collected in the form (timestamp, value). When a value behaves abnormally, for example a sudden rise or drop, the micro-service associated with it is experiencing some anomaly, and operation and maintenance personnel need to locate the root cause promptly and take effective measures.

However, existing automatic detection methods still have problems: some micro-service root cause analysis methods apply only at the index level, some apply only to root cause analysis of log alarms, and methods that analyze log data together with index data do not consider the anomaly attributes of the log operation process.

Disclosure of Invention

In order to solve the technical problems, the invention provides a method for obtaining an abnormal root cause based on index data and log data, comprising the following steps:

S1: acquiring index data and log data of a micro-service system;

S2: calculating an index anomaly score sequence set MASS of the index data with the BIRCH clustering algorithm;

S3: calculating a log anomaly score sequence LAS of the log data with the DeepLog algorithm;

S4: performing association analysis between the clustering result of each index data in the index anomaly score sequence set MASS and the log anomaly score sequence LAS to obtain association degrees;

S5: obtaining the abnormal root cause index by sorting the association degrees.

Preferably, step S2 specifically includes:

S21: normalizing the N acquired index data so that they are converted into an index vector M = {m_1, m_2, ..., m_N} within the range [0, 1];

S22: clustering each index data m_u in the index vector M separately with the BIRCH clustering algorithm, and taking the set of clustering results of all index data as the index anomaly score sequence set MASS = {MAS_1, MAS_2, ..., MAS_N}, where MAS_u is the clustering result of the u-th index data and u ranges from 1 to N.
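
Steps S21 and S22 can be sketched in a few lines of Python. The sketch is an illustration under stated assumptions, not the patented implementation: min-max scaling is assumed for the normalization, and `cf_cluster` is a hypothetical, flat single-pass simplification of BIRCH's clustering-feature rule (no CF tree and no global clustering phase). The function names, the `threshold` value and the sample CPU series are invented for the example.

```python
import math

def min_max(values):
    """Scale a list of numbers into [0, 1]; a constant series maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def cf_cluster(series, threshold=0.1):
    """Single-pass 1-D clustering using BIRCH-style clustering features.

    Each cluster keeps (n, linear_sum, square_sum). A point joins the nearest
    cluster if the merged cluster radius stays within `threshold`; otherwise
    it starts a new cluster. Returns one cluster label per point.
    ASSUMPTION: simplified stand-in for BIRCH, not the full algorithm.
    """
    clusters = []  # each entry is [n, linear_sum, square_sum]
    labels = []
    for x in series:
        best, best_dist = None, math.inf
        for idx, (n, ls, ss) in enumerate(clusters):
            dist = abs(x - ls / n)  # distance to the cluster centroid
            if dist < best_dist:
                best, best_dist = idx, dist
        if best is not None:
            n, ls, ss = clusters[best]
            n2, ls2, ss2 = n + 1, ls + x, ss + x * x
            radius = math.sqrt(max(ss2 / n2 - (ls2 / n2) ** 2, 0.0))
            if radius <= threshold:
                clusters[best] = [n2, ls2, ss2]
                labels.append(best)
                continue
        clusters.append([1, x, x * x])
        labels.append(len(clusters) - 1)
    return labels

# Hypothetical CPU usage samples with a short spike.
cpu = [20.0, 21.0, 19.0, 20.0, 95.0, 96.0, 20.0]
mas_u = cf_cluster(min_max(cpu), threshold=0.1)
print(mas_u)  # -> [0, 0, 0, 0, 1, 1, 0]
```

In this reading, the label sequence for one index plays the role of MAS_u. A production system would more likely rely on a full BIRCH implementation such as scikit-learn's `sklearn.cluster.Birch`.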

Preferably, the step S3 specifically includes:

S31: parsing the log data into a log key sequence and a parameter vector sequence according to log category;

S32: analyzing the log key sequence with DeepLog to obtain a log key anomaly score sequence LAS_t;

S33: analyzing the parameter vector sequence with DeepLog to obtain a parameter vector anomaly score sequence LAS_p;

S34: calculating the log anomaly score sequence LAS of the log data from the log key anomaly score sequence LAS_t and the parameter vector anomaly score sequence LAS_p.

Preferably, step S32 specifically includes:

S321: setting a first time window, and acquiring the log key set window_h = {k_{h-H}, k_{h-H+1}, ..., k_h} of the log key sequence within the first time window, where h is the time, H is the length of the first time window, and k_h is the h-th log key; predicting the log key k_{h+1} at time h+1 from the log key set with DeepLog;

S322: calculating, with a standard multinomial logistic (softmax) function, the probability distribution set P = {k_1: p_1, k_2: p_2, ..., k_i: p_i, ..., k_g: p_g} of the log key k_{h+1}, where i is the number of a log key, p_i is the probability that log key k_{h+1} is log key k_i, and g is the number of log key types;

S323: if the true log key at time h+1 is k_i and p_i is smaller than Threshold, judging that the log has an execution path anomaly and setting the log key anomaly score at time h+1 to AS_th = Threshold - p_i; if p_i is not smaller than Threshold, judging that the log is normal and setting the log key anomaly score at time h+1 to AS_th = 0;

S324: let h = h + 1;

S325: repeating steps S321-S324, and constructing the log key anomaly score sequence LAS_t from the anomaly scores of all log keys in the log key sequence.
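
The scoring loop of steps S321-S325 can be sketched as follows. DeepLog itself uses an LSTM to predict the next log key; to keep the example self-contained, a bigram frequency table is swapped in as a stand-in predictor. `train_bigram`, `predict_proba`, the log key names, `H=2` and `threshold=0.2` are all assumptions made for illustration only.

```python
from collections import Counter, defaultdict

def train_bigram(normal_keys):
    """Stand-in for DeepLog's LSTM: counts of (previous key -> next key)."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(normal_keys, normal_keys[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_proba(counts, window):
    """Return {key: probability} for the key that follows `window`.

    Only the last key of the window is used by this simplified predictor.
    """
    following = counts.get(window[-1], Counter())
    total = sum(following.values())
    return {k: c / total for k, c in following.items()} if total else {}

def log_key_scores(keys, counts, H=2, threshold=0.2):
    """Slide the window over `keys` and score each next key (S321-S325)."""
    las_t = []
    for h in range(H, len(keys) - 1):
        window = keys[h - H:h + 1]  # window_h = {k_{h-H}, ..., k_h}
        p_i = predict_proba(counts, window).get(keys[h + 1], 0.0)
        # Below the threshold: execution path anomaly, score Threshold - p_i.
        las_t.append(threshold - p_i if p_i < threshold else 0.0)
    return las_t

normal = ["open", "read", "close"] * 50      # hypothetical normal log keys
model = train_bigram(normal)
observed = ["open", "read", "close", "open", "close", "open", "read"]
las_t = log_key_scores(observed, model)
print(las_t)  # -> [0.0, 0.2, 0.0, 0.0]
```

The only non-zero score falls on the `open` followed directly by `close`, a transition never seen in the normal sequence, so its predicted probability is 0 and its score equals the threshold.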

Preferably, step S33 specifically includes:

S331: setting a second time window, and acquiring the parameter vector set e_q = {v_{q-Q}, v_{q-Q+1}, ..., v_q} of the parameter vector sequence within the second time window, where q is the number of a parameter vector, Q is the length of the second time window, and v_q is the q-th parameter vector; predicting from the parameter vector set e_q with DeepLog to obtain a predicted parameter vector set ê_{q+1}, and calculating the parameter vector error z_{q+1} between ê_{q+1} and e_{q+1};

S332: modeling the parameter vector error z_{q+1} as a Gaussian distribution; if z_{q+1} lies within the high confidence interval of the Gaussian distribution, judging that the parameter vector v_q is normal and setting its anomaly score AS_pq = 0; otherwise, judging that the parameter vector v_q is abnormal and setting its anomaly score AS_pq = 1;

S333: let q = q + 1;

S334: repeating steps S331-S333, and constructing the parameter vector anomaly score sequence LAS_p from the anomaly scores of all parameter vectors in the parameter vector sequence.
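
A minimal sketch of steps S331-S334 follows, with two loudly labeled assumptions. First, `predict_next` (the per-dimension mean of the window) is a stand-in for DeepLog's LSTM predictor. Second, the Gaussian is fitted to the prediction errors of a hypothetical set of known-normal vectors, and `z_crit=2.576` (a two-sided 99% interval) is an illustrative choice for the "high confidence interval". All names and sample values are invented.

```python
import statistics

def predict_next(window):
    """Stand-in for DeepLog's LSTM: per-dimension mean of the window."""
    return [statistics.fmean(col) for col in zip(*window)]

def squared_error(pred, actual):
    """Mean squared error between a predicted and an observed vector."""
    return sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(actual)

def errors(vectors, Q):
    """Error z_{q+1} for every position with a full window e_q before it."""
    return [squared_error(predict_next(vectors[q - Q:q + 1]), vectors[q + 1])
            for q in range(Q, len(vectors) - 1)]

def parameter_scores(vectors, normal_vectors, Q=2, z_crit=2.576):
    """Score 0 inside the Gaussian confidence interval, 1 outside it."""
    ref = errors(normal_vectors, Q)
    mu, sigma = statistics.fmean(ref), statistics.pstdev(ref)
    lo, hi = mu - z_crit * sigma, mu + z_crit * sigma
    return [0 if lo <= z <= hi else 1 for z in errors(vectors, Q)]

# Hypothetical vectors: (elapsed time, payload size).
normal_vectors = [[10.0 + (i % 3) * 0.1, 5.0] for i in range(30)]
observed = [[10.0, 5.0], [10.1, 5.0], [10.2, 5.0],
            [10.0, 5.0], [50.0, 5.0], [10.1, 5.0]]
las_p = parameter_scores(observed, normal_vectors)
print(las_p)  # -> [0, 1, 1]
```

The last score is also 1 because the spike at 50.0 is still inside the window used for that prediction; a window-mean predictor is sensitive to this in a way a trained sequence model may not be.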

Preferably, the calculation formula of the log anomaly score sequence LAS is as follows:

wherein w is a hyper-parameter.
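
The formula itself is not reproduced in this text. One plausible reading, and it is only an assumption, is a weighted sum in which w balances the log key scores against the parameter vector scores. The sketch below illustrates that reading; `combine_scores` is a hypothetical name, and the default w = 0.6 is the value given in the embodiment.

```python
def combine_scores(las_t, las_p, w=0.6):
    """ASSUMED combination: w * LAS_t + (1 - w) * LAS_p, element by element."""
    if len(las_t) != len(las_p):
        raise ValueError("score sequences must cover the same time steps")
    return [w * t + (1 - w) * p for t, p in zip(las_t, las_p)]

# Hypothetical, already aligned score sequences.
las = combine_scores([0.0, 0.2, 0.0], [0, 1, 1], w=0.6)
print(las)
```

Both sequences have to be aligned to the same time steps before they are combined; how the original method performs that alignment is not stated here.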

Preferably, the calculation formula of the association degree is:

$$MI(MAS_u, LAS) = \sum_{x \in MAS_u} \sum_{y \in LAS} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}$$

where MI(MAS_u, LAS) is the association degree between MAS_u and LAS, x is a clustering result in MAS_u, y is a log anomaly score in LAS, p(x, y) is the joint probability distribution function of x and y, p(x) is the marginal probability distribution function of x, and p(y) is the marginal probability distribution function of y.
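
The symbols defined above match the standard mutual information between two discrete variables, which can be estimated from co-occurrence counts. The sketch below assumes that reading. Because LAS holds real-valued scores, it also assumes the scores are first reduced to a binary flag (score > 0); that discretization, the function name and the sample data are illustrative choices, not part of the original text.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Discrete mutual information in nats, estimated from paired samples.

    Computes sum over (x, y) of p(x, y) * log(p(x, y) / (p(x) * p(y))).
    """
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    # p(x,y) / (p(x) p(y)) simplifies to c * n / (count_x * count_y).
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())

mas_u = [0, 0, 0, 0, 1, 1, 0]                 # hypothetical cluster labels
las = [0.0, 0.0, 0.0, 0.0, 0.52, 0.4, 0.0]    # hypothetical log scores
las_flags = [1 if s > 0 else 0 for s in las]  # ASSUMED discretization
mi_cpu = mutual_information(mas_u, las_flags)
print(round(mi_cpu, 4))  # -> 0.5983
```

An index whose cluster labels change exactly when the log scores become non-zero gets a high value, while an index that is statistically independent of the log scores gets a value of 0.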

An abnormal root cause acquisition system based on index data and log data, comprising:

a data acquisition module, used for acquiring index data and log data of the micro-service system;

an index anomaly score calculation module, used for calculating the index anomaly score sequence set MASS of the index data with the BIRCH clustering algorithm;

a log anomaly score calculation module, used for calculating the log anomaly score sequence LAS of the log data with the DeepLog algorithm;

an association degree analysis module, used for performing association analysis between the clustering result of each index data in the index anomaly score sequence set MASS and the log anomaly score sequence LAS to obtain association degrees;

an abnormal root cause index acquisition module, used for obtaining the abnormal root cause index by sorting the association degrees.

The invention has the following beneficial effects:

Anomaly analysis uses both the index data and the log data of the micro-service system, so more types of anomalies are covered and fewer anomalies go unreported than with a single data source; in addition, the association degree between the clustering results of the index data and the log anomaly score sequence is analyzed, and the abnormal root cause can be quantified by association degree ranking, which helps operation and maintenance personnel quickly locate the root cause of a problem and reduces the operation and maintenance losses of an enterprise.

Drawings

FIG. 1 is a flow chart of a method according to an embodiment of the present invention;

The achievement of the objects, the functional features and the advantages of the present invention will be further described in conjunction with the embodiments and with reference to the accompanying drawings.

Detailed Description

It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.

Referring to FIG. 1, the invention provides a method for obtaining an abnormal root cause based on index data and log data. It addresses the poor compatibility of observability data (index + log) in micro-service systems, reduces missed anomaly reports caused by relying on a single data source, and helps operation and maintenance personnel quickly locate the root cause of a problem.

The method comprises the following steps:

S1: acquiring index data and log data of a micro-service system;

S5: obtaining the abnormal root cause index by sorting the association degrees.

Further, the step S1 specifically includes:

Step S11: setting an oversampling parameter, and oversampling the anomaly data (log data + index data) by extending the length of the abnormal period: if the abnormal period has length L, it is extended to (1 + α)L during data collection, where α = 0.4;

Step S12: collecting, with oversampling, the log data Raw_Logs of the micro-service system, where each log record comprises a log timestamp, a cmdb_id, a log file name and the log content; the log data Raw_Logs are stored in an Elasticsearch database;

Step S13: collecting, with oversampling at a time interval of 5 s, the index data raw_metrics of the micro-service system, including performance index data and business index data; the performance index data record state information of server components, such as CPU utilization, memory utilization and network packet loss rate; the business index data include system response rate, success rate, average response time and the like; the index data raw_metrics are stored in the Elasticsearch database.
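
The window extension in step S11 can be illustrated with a short sketch. The text only states that the abnormal period of length L becomes (1 + α)L; how the extra αL is distributed is not specified, so the sketch assumes it is split evenly before and after the period. `expand_window` and the timestamps are hypothetical.

```python
from datetime import datetime, timedelta

def expand_window(start, end, alpha=0.4):
    """Extend [start, end] to length (1 + alpha) * L.

    ASSUMPTION: the extra alpha * L is split evenly on both sides.
    """
    length = end - start
    pad = length * alpha / 2
    return start - pad, end + pad

start = datetime(2023, 10, 30, 12, 0, 0)
end = datetime(2023, 10, 30, 12, 10, 0)      # L = 10 minutes
new_start, new_end = expand_window(start, end)
print(new_start.time(), new_end.time())      # -> 11:58:00 12:12:00
```

With α = 0.4 a 10-minute abnormal period becomes a 14-minute collection window, so some samples from just outside the period are collected together with the abnormal ones.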

Further, the step S2 specifically includes:

Specifically, the calculation formula of the normalization process is:

where x' represents the normalization result and x represents the source index data; the normalization converts the index data into index vectors so that different index data are comparable;
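
The formula itself is not reproduced in this text. A common choice that maps values into the range [0, 1], assumed here purely for illustration, is min-max scaling, where x_min and x_max (symbols that are not defined in the original) denote the smallest and largest observed values of the index data:

```latex
x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}
```

Under this assumption a CPU usage series of 19, 20 and 96 maps to 0, 1/77 and 1, so indexes measured in different units end up on the same scale.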

Further, the step S3 specifically includes:

Specifically, the Drain log parsing tool is applied to parse the log data Raw_Logs into the form 'log key + parameter vector' according to log category;
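
To make the 'log key + parameter vector' form concrete, the sketch below splits a raw log line into a template and its variable parts. It is not Drain: Drain builds a fixed-depth parse tree to learn templates online, whereas this stand-in simply masks numeric tokens with a regular expression. `parse_line`, the `<*>` placeholder convention and the sample line are assumptions for illustration.

```python
import re

# Matches integers and dotted numbers such as 12, 4031 or 10.0.0.7.
NUMERIC = re.compile(r"\b\d+(?:\.\d+)*\b")

def parse_line(line):
    """Return (log_key, parameters) for one raw log line.

    ASSUMPTION: numeric tokens are the only variable parts of the message.
    """
    params = NUMERIC.findall(line)
    key = NUMERIC.sub("<*>", line)
    return key, params

key, params = parse_line("Took 12 ms to send block 4031 to 10.0.0.7")
print(key)     # -> Took <*> ms to send block <*> to <*>
print(params)  # -> ['12', '4031', '10.0.0.7']
```

Lines that share a template map to the same log key, so a log stream turns into a key sequence plus one parameter vector per line.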

Further, the step S32 specifically includes:

S324: let h=h+1;

S325: repeating steps S321-S324, and constructing the log key anomaly score sequence LAS_t from the anomaly scores of all log keys in the log key sequence.

Further, step S33 specifically includes:

S333: let q = q + 1;

Further, the calculation formula of the log anomaly score sequence LAS is as follows:

where w is a hyperparameter (w is set to 0.6).

Further, the calculation formula of the association degree is:

Further, the step S5 specifically includes:

After the association degrees between the clustering results of all index data and the log anomaly score sequence have been calculated, they are sorted from high to low; the larger the association degree, the higher the index data ranks in the list and the more likely it is to be the abnormal root cause.
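
The ranking in step S5 amounts to a descending sort. A minimal sketch, in which the index names and association values are made up for illustration:

```python
def rank_root_causes(association):
    """Return index names sorted by association degree, highest first."""
    return sorted(association, key=association.get, reverse=True)

# Hypothetical association degrees, one per index.
association = {"cpu_usage": 0.41, "memory_usage": 0.07, "packet_loss": 0.23}
ranking = rank_root_causes(association)
print(ranking)  # -> ['cpu_usage', 'packet_loss', 'memory_usage']
```

The first entry is the most likely root cause candidate; the remaining entries give operation and maintenance personnel an ordered list to check next.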

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.

The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the terms first, second, third, etc. do not denote any order, but rather the terms first, second, third, etc. are used to interpret the terms as labels.

The foregoing description is only of the preferred embodiments of the present invention, and is not intended to limit the scope of the invention, but rather is intended to cover any equivalents of the structures or equivalent processes disclosed herein or in the alternative, which may be employed directly or indirectly in other related arts.

Claims

1. An abnormal root cause obtaining method based on index data and log data, comprising the steps of:

S1: acquiring index data and log data of a micro-service system;

S5: obtaining the abnormal root cause index by sorting the association degrees.

2. The method for obtaining an abnormal root cause based on index data and log data according to claim 1, wherein step S2 is specifically:

S22: clustering each index data m_u in the index vector M separately with the BIRCH clustering algorithm, and taking the set of clustering results of all index data as the index anomaly score sequence set MASS = {MAS_1, MAS_2, ..., MAS_N}, where MAS_u is the clustering result of the u-th index data and u ranges from 1 to N.

3. The method for obtaining an abnormal root cause based on index data and log data according to claim 1, wherein step S3 is specifically:

4. The method for obtaining an abnormal root cause based on index data and log data according to claim 3, wherein step S32 is specifically:

S323: if the true log key at time h+1 is k_i and p_i is smaller than Threshold, judging that the log has an execution path anomaly and setting the log key anomaly score at time h+1 to AS_th = Threshold - p_i; if p_i is not smaller than Threshold, judging that the log is normal and setting the log key anomaly score at time h+1 to AS_th = 0;

S324: let h=h+1;

5. The method for obtaining an abnormal root cause based on index data and log data according to claim 3, wherein step S33 is specifically:

S333: let q = q + 1;

S334: repeating steps S331-S333, and constructing the parameter vector anomaly score sequence LAS_p from the anomaly scores of all parameter vectors in the parameter vector sequence.

6. The method for obtaining an abnormal root cause based on index data and log data according to claim 3, wherein the calculation formula of the log abnormal score sequence LAS is:

wherein w is a hyper-parameter.

7. The method for obtaining an abnormal root cause based on index data and log data according to claim 1, wherein the calculation formula of the degree of association is:

8. An abnormal root cause acquisition system based on index data and log data, comprising: