CN114416423A

CN114416423A - Root cause positioning method and system based on machine learning

Info

Publication number: CN114416423A
Application number: CN202210089130.1A
Authority: CN
Inventors: 唐卓; 向婷; 李肯立; 李虹宇; 伍祚瑶; 王啸; 罗文明; 程欣威
Original assignee: Hunan University
Current assignee: Hunan University
Priority date: 2022-01-25
Filing date: 2022-01-25
Publication date: 2022-04-29
Anticipated expiration: 2042-01-25
Also published as: CN114416423B

Abstract

The invention discloses a root cause positioning method based on machine learning, which comprises the following steps: acquiring call chain data consisting of data of a call process in the micro-service application system, and acquiring service index data, container index, middleware index, host index and database index data of the micro-service application system; inputting the timestamp, the average calling time, the traffic, the success quantity and the success rate in the acquired service index data into a trained SVM network to obtain a detection result, judging whether the detection result is abnormal or not, and if so, performing root cause detection on the obtained detection result to obtain a node where the fault occurs and a performance index which causes the fault. The method can solve the technical problems that the existing root cause detection method based on static threshold setting is low in accuracy rate and the existing root cause detection method based on the sliding window is difficult to identify the periodic characteristics of actual data indexes.

Description

Root cause positioning method and system based on machine learning

Technical Field

The invention belongs to the technical field of intelligent operation and maintenance, and particularly relates to a root cause positioning method and system based on machine learning.

Background

Large Internet companies provide services to the outside through service clusters. As product requirements grow, business services become bloated, so large services are split architecturally into small, independent services, each managed by its own process and exposed to the outside; these are known as 'micro-services'.

The microservice application system uses a microservice architecture to build applications as independent components and run each application process as a service. These services communicate over well-defined interfaces using lightweight APIs. These services are built around business functions, each performing a function independently.

Once a micro-service architecture is adopted, services become distributed: after the split, a user request may be processed by several different service nodes before the result is returned to the user. If any node in the call chain has a problem, the final result may be abnormal. In such a complex environment, it is not easy to locate the specific faulty service node accurately and efficiently. For this reason, call chains are used: each node through which a request passes is recorded to form a complete call chain monitoring system, and faulty links are checked against the call chain log.

Most of the root cause detection methods of the existing micro-service application systems adopt a threshold alarm setting method, which is specifically divided into a root cause detection method based on static threshold setting and a root cause detection method based on a sliding window.

The root cause detection method based on static threshold setting reports an abnormality whenever an index exceeds a fixed threshold. Its accuracy is low, however, because the anomaly threshold may change over time and a fixed threshold cannot cover all scenarios, so anomalies in some specific scenarios go undetected. The root cause detection method based on a sliding window addresses the fixed-threshold problem by framing the time series in windows of a specified unit length and computing statistical indexes within each window, but it has difficulty identifying the periodic characteristics of actual data indexes. Furthermore, such unsupervised threshold-alarm methods can only identify single-index anomalies and are not well interpretable.
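The two threshold-alarm approaches described above can be illustrated with a minimal sketch. The function names, the window length and the multiplier `k` are hypothetical choices made for illustration only; the text does not specify which in-window statistic is used, so a mean and standard deviation are assumed here.

```python
def static_threshold_alerts(series, threshold):
    """Flag every index whose value exceeds one fixed threshold."""
    return [i for i, value in enumerate(series) if value > threshold]


def sliding_window_alerts(series, window, k=3.0):
    """Flag a point that deviates from the mean of the preceding window
    by more than k standard deviations (assumed statistic)."""
    alerts = []
    for i in range(window, len(series)):
        frame = series[i - window:i]
        mean = sum(frame) / window
        std = (sum((v - mean) ** 2 for v in frame) / window) ** 0.5
        if abs(series[i] - mean) > k * std:
            alerts.append(i)
    return alerts


latency = [10, 11, 10, 11, 10, 30, 10, 11]
static_hits = static_threshold_alerts(latency, threshold=25)
window_hits = sliding_window_alerts(latency, window=4)
```

Both detectors look at one index at a time, which is the single-index limitation the paragraph above points out.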

Disclosure of Invention

In view of the above defects or improvement requirements of the prior art, the present invention provides a root cause positioning method and system based on machine learning. It aims to solve the technical problems that the existing root cause detection method based on static threshold setting has low accuracy, that the existing root cause detection method based on a sliding window has difficulty identifying the periodic characteristics of actual data indexes, and that the existing sliding-window-based method can only identify single-index anomalies and lacks good interpretability.

To achieve the above object, according to one aspect of the present invention, there is provided a root cause positioning method based on machine learning, including the steps of:

(1) acquiring call chain data consisting of data of a call process in the micro-service application system;

(2) acquiring service index data, container indexes, middleware indexes, host indexes and database index data of the micro-service application system;

(3) inputting the timestamp, the average calling time, the traffic, the success quantity and the success rate in the service index data acquired in the step (2) into a trained SVM network to obtain a detection result, judging whether the detection result is abnormal or not, if so, entering the step (4), otherwise, ending the process;

(4) carrying out root cause detection on the detection result obtained in the step (3) to obtain the node where the fault occurs and the performance index causing the fault.

Preferably, the data of the calling procedure includes a timestamp and a call chain id of the data, a call type, a service execution time, a caller id, a data id, and a name of the micro service application system.

The service index data of the micro-service application system comprises the name, the timestamp, the average calling time, the service volume, the success number and the success rate of the micro-service system; the purpose of obtaining container index, middleware index, host index and database index data is to locate specific abnormal performance index.
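As a minimal sketch, the two record types listed above could be modeled as follows. The class names, field names and the millisecond unit are hypothetical; only the list of fields comes from the text.

```python
from dataclasses import dataclass


@dataclass
class CallRecord:
    """One record of a calling process (fields as listed in the text)."""
    timestamp: int
    call_chain_id: str
    call_type: str
    execution_time_ms: float  # unit is an assumption
    caller_id: str
    data_id: str
    system_name: str


@dataclass
class ServiceMetric:
    """One service index sample of the micro-service application system."""
    system_name: str
    timestamp: int
    avg_call_time_ms: float  # unit is an assumption
    traffic: int
    success_count: int
    success_rate: float

    def feature_vector(self):
        """The five values that step (3) feeds to the classifier."""
        return [self.timestamp, self.avg_call_time_ms, self.traffic,
                self.success_count, self.success_rate]


sample = ServiceMetric("shop", 1643068800, 12.5, 200, 198, 0.99)
features = sample.feature_vector()
```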

Preferably, the SVM network is trained by the following substeps:

(3-1) acquiring service index data of the micro-service application system, sequentially carrying out normalization and data annotation processing on the acquired service index data, and dividing the processed service index data serving as a data set into a training set and a test set;

(3-2) initializing the parameters of the SVM network to obtain the initialized SVM network;

(3-3) inputting the training set obtained in the step (3-1) into the SVM network initialized in the step (3-2) for training to obtain a preliminarily trained SVM network;

and (3-4) testing the SVM network preliminarily trained in the step (3-3) by using the test set obtained in the step (3-1) to obtain a finally trained SVM network.
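The data preparation in sub-step (3-1) can be sketched as follows. Min-max scaling is assumed as the normalization method because the text does not name one, the 70% training fraction is an illustrative default, and the helper names and the fixed seed are hypothetical.

```python
import random


def min_max_normalize(rows):
    """Scale every column of rows into [0, 1] (assumed normalization)."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def split_dataset(samples, labels, train_fraction=0.7, seed=0):
    """Randomly divide labelled samples into a training set and a test set."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    cut = round(len(indices) * train_fraction)
    train = [(samples[i], labels[i]) for i in indices[:cut]]
    test = [(samples[i], labels[i]) for i in indices[cut:]]
    return train, test


normalized = min_max_normalize([[100, 50], [200, 100], [300, 150]])
train_set, test_set = split_dataset(list(range(10)), [-1] * 8 + [1] * 2)
```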

Preferably, the step (3-2) is specifically to set the penalty coefficient C of the SVM network to 1.0, set the kernel function parameter kernel to the linear kernel function linear, and set the weighting parameter class_weight to the proportion of positive and negative samples in the data set.
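To make these settings concrete, the following is a minimal pure-Python sketch of a class-weighted linear SVM trained by subgradient descent on the hinge loss. The training procedure, the learning rate, the epoch count, the toy data and the class-weight formula (inverse class frequency) are all assumptions made for illustration; the text specifies only C = 1.0, a linear kernel and a class weighting derived from the positive and negative sample counts. The parameter names C, kernel and class_weight also happen to match those of scikit-learn's `SVC`, although the text does not name a library.

```python
def train_linear_svm(samples, labels, C=1.0, class_weight=None,
                     lr=0.01, epochs=500):
    """Minimize 0.5*||w||^2 + C * sum(weight_y * hinge loss) by subgradient descent."""
    weights = class_weight or {1: 1.0, -1: 1.0}
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = list(w)  # gradient of the regularization term
        grad_b = 0.0
        for x, y in zip(samples, labels):
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            if margin < 1:  # sample violates the margin
                scale = C * weights[y]
                for j, xj in enumerate(x):
                    grad_w[j] -= scale * y * xj
                grad_b -= scale * y
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return 1 (abnormal) or -1 (not abnormal)."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Toy normalized features: [average calling time, success rate].
X = [[0.1, 0.9], [0.2, 0.8], [0.15, 0.95], [0.25, 0.85], [0.9, 0.2], [0.8, 0.1]]
y = [-1, -1, -1, -1, 1, 1]
# Assumed weighting: inversely proportional to class frequency.
cw = {label: len(y) / (2 * y.count(label)) for label in (1, -1)}
w, b = train_linear_svm(X, y, C=1.0, class_weight=cw)
```

Weighting the rarer abnormal class more heavily is one common way to compensate for the scarcity of abnormal samples.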

Preferably, step (4) comprises the sub-steps of:

(4-1) using the timestamp in the detection result obtained in the step (3) to look up the corresponding call chain id in the call chain data obtained in the step (1), retrieving all records in the call chain data whose call chain id equals the call chain id found, and, for each retrieved record in turn, establishing a directed edge from its data id to the call chain id to obtain a directed graph;

(4-2) carrying out frequent subgraph mining on the directed graph obtained in the step (4-1) to obtain a frequently-appearing substructure in the directed graph;

(4-3) establishing a graph database by using frequently-occurring substructures in the directed graph obtained in the step (4-2);

(4-4) acquiring an intersection set between the frequently-occurring substructure in the directed graph obtained in the step (4-2) and all nodes in the substructure marked as having abnormal nodes in the graph database as a first set, acquiring an intersection set between the frequently-occurring substructure in the directed graph obtained in the step (4-2) and all nodes in the substructure marked as null in the graph database as a second set, and then taking the intersection set between the first set and the second set as an abnormal root cause node;

and (4-5) taking each abnormal root cause node obtained in the step (4-4) as an input node, querying the container index, middleware index, host index and database index data obtained in the step (2) for the four values corresponding to the input node, calculating the change rate, first-order difference, second-order difference and sliding-window average change rate of the four values as feature values, keeping the feature values that exceed a set threshold by means of the Robust Random Cut Forest (RRCF) algorithm and setting the others to zero to form a relatively sparse feature vector, and classifying the feature vector by the K-nearest-neighbor method with K = 1 to determine the final root cause performance index (namely, which one of the container, the middleware, the host or the database is abnormal), wherein the threshold is set manually, based on experience, according to the size of the index data set in the system.
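The feature construction and classification in sub-step (4-5) can be sketched as follows. Several points are assumptions: the exact definitions of the four features are not given in the text, a plain absolute-value threshold stands in for the RRCF scoring, and the function names, window length, threshold and reference vectors are hypothetical.

```python
import math


def series_features(xs, window=3):
    """Change rate, first difference, second difference and sliding-window
    average change rate at the last point (needs len(xs) >= 2 * window)."""
    first = xs[-1] - xs[-2]
    second = xs[-1] - 2 * xs[-2] + xs[-3]
    rate = first / xs[-2] if xs[-2] else 0.0
    recent = sum(xs[-window:]) / window
    previous = sum(xs[-2 * window:-window]) / window
    window_rate = (recent - previous) / previous if previous else 0.0
    return [rate, first, second, window_rate]


def sparsify(values, threshold):
    """Keep values whose magnitude exceeds the threshold, zero the rest.
    Stand-in for the RRCF-based filtering, which is not reproduced here."""
    return [v if abs(v) > threshold else 0.0 for v in values]


def nearest_label(vector, labelled):
    """1-nearest-neighbour (K = 1) classification with Euclidean distance."""
    return min(labelled, key=lambda item: math.dist(vector, item[0]))[1]


ORDER = ["container", "middleware", "host", "database"]
metrics = {
    "container": [10, 10, 10, 10, 10, 30],
    "middleware": [5, 5, 5, 5, 5, 5],
    "host": [20, 20, 20, 20, 20, 21],
    "database": [8, 8, 8, 8, 8, 8],
}
vector = []
for name in ORDER:
    vector += sparsify(series_features(metrics[name]), threshold=0.5)

# Hypothetical labelled reference vectors: one active block per class.
block = [1.0, 10.0, 10.0, 0.5]
references = [
    ([0.0] * (4 * i) + block + [0.0] * (4 * (3 - i)), name)
    for i, name in enumerate(ORDER)
]
root_cause_index = nearest_label(vector, references)
```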

Preferably, step (4-2) comprises the sub-steps of:

(4-2-1) initializing each node in the directed graph as a sub-structure;

(4-2-2) calculating scores of all current substructures, and selecting the substructures with the lowest scores as the optimal substructures;

(4-2-3) adding a vertex (namely an edge adjacent to all nodes in all the substructures in the step (4-2-2)), and expanding the optimal substructures obtained in the step (4-2-2) by using the vertex to obtain expanded substructures as current substructures;

(4-2-4) repeating the step (4-2-2) and the step (4-2-3) until all the optimal substructures are obtained, wherein all the optimal substructures form the frequently-appearing substructures in the directed graph.

Preferably, the score of a substructure is I(S) + I(G|S), where S denotes a substructure in the directed graph G, (G|S) denotes the graph data obtained by replacing the substructure S with a single vertex in the directed graph G, I(S) denotes the description length of the directed graph where the substructure S is located, and I(G|S) denotes the description length of the directed graph obtained by replacing the substructure S with a single vertex in the directed graph G.

Preferably, the description length I(S) of the directed graph in which the substructure S lies is equal to:

I(S) = v + r + e

where v represents the number of bits required to encode the vertex labels of the directed graph in which the substructure S lies:

v = lg v + v lg(l_u)

where l_u represents the set of all vertices of the directed graph. r represents the number of bits required for each row of the adjacency matrix A when the directed graph is converted into the adjacency matrix A, and has:

where b = max(k_i), max denotes taking the maximum value, and k_i represents the number of 1s in the i-th row of the adjacency matrix A;

e represents the number of bits required for the edges represented by A[i, j] = 1 in the adjacency matrix A, that is, the number of bits required to store all the edges in the graph, and there are:

where m denotes the size of the edge of the adjacency matrix A represented by A[i, j] = 1, and u denotes the vertex of the adjacency matrix A.
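The following sketch computes a description length of the form I(S) = v + r + e. The formulas for r and e are not reproduced in this text, so the sketch assumes the encoding commonly used in SUBDUE-style minimum description length scoring: r = (n + 1) lg(b + 1) + sum of lg C(n, k_i), and e = E (1 + lg l_u) + (K + 1) lg m, where n is the number of vertices, E the total number of edges, K the number of non-zero adjacency entries and m the largest number of edges between any vertex pair. The function name and arguments are hypothetical, and l_u is passed as a count (`num_labels`).

```python
import math


def description_length(num_vertices, num_labels, edge_counts):
    """Return v + r + e in bits under the assumed SUBDUE-style encoding.

    edge_counts maps (i, j) to the number of edges from vertex i to vertex j.
    """
    n = num_vertices
    # v: vertex bits, lg n + n * lg(l_u), as given in the text.
    v_bits = math.log2(n) + n * math.log2(num_labels)

    # r: row bits (assumed formula); k_i is the number of 1s in row i.
    row_ones = [0] * n
    for (i, _j), count in edge_counts.items():
        if count > 0:
            row_ones[i] += 1
    b = max(row_ones)
    r_bits = (n + 1) * math.log2(b + 1) + sum(
        math.log2(math.comb(n, k)) for k in row_ones
    )

    # e: edge bits (assumed formula).
    m = max(1, max(edge_counts.values(), default=1))
    total_edges = sum(edge_counts.values())
    nonzero_entries = sum(1 for c in edge_counts.values() if c > 0)
    e_bits = total_edges * (1 + math.log2(num_labels)) + (
        nonzero_entries + 1
    ) * math.log2(m)

    return v_bits + r_bits + e_bits


# Three vertices with edges 0->1, 0->2 and 1->2, three distinct labels.
bits = description_length(3, 3, {(0, 1): 1, (0, 2): 1, (1, 2): 1})
```

Under this encoding, a substructure whose replacement shortens I(S) + I(G|S) the most is the one selected as the best substructure in sub-step (4-2-2).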

According to another aspect of the present invention, there is provided a root cause localization system based on machine learning, including:

the first module is used for acquiring call chain data consisting of data of a call process in the micro-service application system;

the second module is used for acquiring service index data, container indexes, middleware indexes, host indexes and database index data of the micro-service application system;

the third module is used for inputting the timestamp, the average calling time, the traffic, the success quantity and the success rate in the service index data acquired by the second module into the trained SVM network to obtain a detection result, judging whether the detection result is abnormal or not, if so, entering the fourth module, otherwise, ending the process;

and the fourth module is used for carrying out root cause detection on the detection result obtained by the third module so as to obtain the node with the fault and the performance index causing the fault.

In general, compared with the prior art, the above technical solution contemplated by the present invention can achieve the following beneficial effects:

(1) because step (3) adopts a machine learning method, features can be input directly to determine whether a fault has occurred, which solves the technical problem that the existing root cause detection method based on static threshold setting has low accuracy;

(2) because steps (4-2) and (4-4) use frequent subgraph mining and require no other parameters to be set, the technical problem that the sliding-window-based root cause detection method has difficulty identifying the periodic characteristics of actual data indexes can be solved;

(3) because steps (1) and (3) collect multiple indexes of the micro-service application system and use them to obtain a trained network, the technical problem that the sliding-window-based root cause detection method can only identify single-index anomalies and lacks good interpretability can be solved;

(4) because steps (4-3) and (4-4) use a frequent subgraph mining algorithm to mine the frequently appearing substructures and compare them with the other substructures in the graph database to obtain the root cause node, the technical problem that a model cannot be trained well when abnormal data is too scarce can be solved.

Drawings

FIG. 1 is a flow chart of the root cause location method based on machine learning according to the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.

Root cause positioning is an important and difficult area of intelligent operation and maintenance (AIOps for short). It combines inductive analysis with deductive reasoning, drawing both on the law of large numbers and on logically complete chains of inference. The massive data of a micro-service architecture lays a foundation for correlation analysis, but abnormal business cases are very scarce, so strong AI (artificial intelligence) capability is needed to move from correlation to causality: deductive reasoning is carried out based on operation and maintenance domain knowledge, and the process and conclusions of causal reasoning must be interpretable so as to facilitate repeated analysis and continuous optimization. The invention adopts a method based on frequent subgraph mining, so that non-abnormal data can also be used in the root cause positioning process, which enlarges the analyzed data source and gives better interpretability.

As shown in fig. 1, the present invention provides a root cause localization method based on machine learning, which comprises the following steps:

Specifically, in step (1), the data of the calling process includes a timestamp and a call chain id of the data (which are in one-to-one correspondence), a call type, a service execution time, a caller id, a data id, and a name of the micro-service application system.

Specifically, in step (2), the service index data of the micro-service application system comprises the name, the timestamp, the average calling time, the service volume, the success number and the success rate of the micro-service system; the purpose of acquiring container index, middleware index, host index and database index data is to locate the specific abnormal performance index;

(3) inputting the timestamp, the average calling time, the traffic volume, the success number and the success rate in the service index data acquired in the step (2) into a trained Support Vector Machine (SVM) network to obtain a detection result, judging whether the detection result is abnormal or not, entering the step (4) if the detection result is abnormal, and ending the process if the detection result is not abnormal;

The advantage of step (3) is that detection uses multiple indexes in combination, which improves accuracy.

Specifically, the SVM network in this step is obtained by training the following substeps:

Specifically, the data set used in step (3-1) is the service index data collected in the same micro-service application system, as described in step (2); it is split 7:3 into a training set and a test set, i.e., 70% is randomly selected as the training set and the remaining 30% is used as the test set.

In the labeling process of this step, the result is labeled as a real-number vector, for example, an abnormal condition is labeled as 1 and a non-abnormal condition is labeled as -1.

Specifically, in step (3-2), the penalty coefficient C of the SVM network is set to 1.0, the kernel function parameter kernel is set to the linear kernel function linear, and the weighting parameter class_weight is set to the ratio of positive and negative samples in the data set, which is represented by a list of the sample counts of the two classes (abnormal and non-abnormal);

The steps (3-1) to (3-4) have the advantages that the SVM network classification is adopted, the multi-index problem is solved, and the effect is relatively good when the data volume is small.

Specifically, step (4) includes the following substeps:

Specifically, step (4-2) includes the following substeps:

(4-2-1) initializing each node in the directed graph as a sub-structure;

Specifically, the score of a substructure is I(S) + I(G|S), where S denotes a substructure in the directed graph G, (G|S) denotes the graph data obtained by replacing the substructure S with a single vertex in the directed graph G, I(S) denotes the description length of the directed graph in which the substructure S is located, and I(G|S) denotes the description length of the directed graph obtained by replacing the substructure S with a single vertex in the directed graph G.

The calculation of the description length I(S) of the directed graph in which the substructure S is located can be divided into three parts, namely:

I(S) = v + r + e

v = lg v + v lg(l_u)

where l_u represents the set of all vertices of the directed graph. r represents the number of bits required for each row of the adjacency matrix A when the directed graph is converted into the adjacency matrix A (another representation of the graph), and has:

where b = max(k_i), max denotes taking the maximum value, and k_i indicates the number of 1s in the i-th row of the adjacency matrix A.

e represents the number of bits required for the edges represented by A[i, j] = 1 in the adjacency matrix A (that is, the number of bits required to store all edges in the graph).

The above steps (4-2-1) to (4-2-4) have advantages in that the interpretability of the root cause localization is increased and the use of call chain data without abnormality is increased.

Specifically, in step (4-3), the frequently appearing substructures in the directed graph obtained in step (4-2) are numbered sequentially; each substructure that contains abnormal nodes is then manually marked with those abnormal nodes, and a substructure that contains no abnormal node is marked as null.

(4-5) taking each abnormal root cause node obtained in the step (4-4) as an input node, querying the container index, middleware index, host index and database index data obtained in the step (2) for the four values corresponding to the input node, calculating the change rate, first-order difference, second-order difference and sliding-window average change rate of the four values as feature values, keeping the feature values that exceed a set threshold (set manually, based on experience, according to the size of the index data set in the system) by means of the Robust Random Cut Forest (RRCF) algorithm and setting the others to zero to form a relatively sparse feature vector, and classifying the feature vector by the K-Nearest Neighbor (KNN) method with K = 1 to determine the final root cause performance index (i.e., which of the container, middleware, host, or database is abnormal).

Specifically, to make the change of the feature anomaly value more obvious, the RRCF algorithm is selected to convert the feature value into an RRCF score.

It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A root cause positioning method based on machine learning is characterized by comprising the following steps:

2. The machine-learning based root cause localization method of claim 1,

the data of the calling process comprises a time stamp and a calling chain id of the data, a calling type, a service execution time, a caller id, a data id and a name of the micro-service application system.

3. The root cause localization method based on machine learning of claim 1 or 2, characterized in that the SVM network is trained by the following sub-steps:

4. The root cause localization method based on machine learning of any one of claims 1 to 3, wherein the step (3-2) is specifically to set the penalty coefficient C of the SVM network to 1.0, set the kernel function parameter kernel to the linear kernel function linear, and set the weighting parameter class_weight to the ratio of positive and negative samples in the data set.

5. The root cause localization method based on machine learning of claim 4, wherein the step (4) comprises the following sub-steps:

6. The root cause localization method based on machine learning of claim 5, wherein the step (4-2) comprises the following sub-steps:

(4-2-1) initializing each node in the directed graph as a sub-structure;

7. The root cause localization method based on machine learning according to claim 6, wherein the score of a substructure is I(S) + I(G|S), where S represents a substructure in the directed graph G, (G|S) represents the graph data obtained by replacing the substructure S with a single vertex in the directed graph G, I(S) represents the description length of the directed graph where the substructure S is located, and I(G|S) represents the description length of the directed graph obtained by replacing the substructure S with a single vertex in the directed graph G.

8. The machine-learning based root cause localization method of claim 7,

the description length I(S) of the directed graph in which the substructure S lies is equal to:

I(S) = v + r + e

v = lg v + v lg(l_u)

where b = max(k_i), max denotes taking the maximum value, and k_i represents the number of 1s in the i-th row of the adjacency matrix A;

9. A root cause location system based on machine learning, comprising: