CN114416423A - Root cause positioning method and system based on machine learning - Google Patents
- Publication number
- CN114416423A (application number CN202210089130.1A)
- Authority
- CN
- China
- Prior art keywords
- data
- index
- root cause
- directed graph
- service
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/079—Root cause analysis, i.e. error or fault diagnosis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/34—Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
- G06F11/3409—Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/10—Machine learning using kernel methods, e.g. support vector machines [SVM]
Abstract
The invention discloses a root cause positioning method based on machine learning, which comprises the following steps: acquiring call chain data consisting of data of a call process in the micro-service application system, and acquiring service index data, container index, middleware index, host index and database index data of the micro-service application system; inputting the timestamp, the average calling time, the traffic, the success quantity and the success rate in the acquired service index data into a trained SVM network to obtain a detection result, judging whether the detection result is abnormal or not, and if so, performing root cause detection on the obtained detection result to obtain a node where the fault occurs and a performance index which causes the fault. The method can solve the technical problems that the existing root cause detection method based on static threshold setting is low in accuracy rate and the existing root cause detection method based on the sliding window is difficult to identify the periodic characteristics of actual data indexes.
Description
Technical Field
The invention belongs to the technical field of intelligent operation and maintenance, and particularly relates to a root cause positioning method and system based on machine learning.
Background
Large Internet companies provide services to the outside through service clusters. As product requirements grow, business services become bloated, so large services are split architecturally into small, independent services. Each small service is managed by an independent process and provides its service to the outside; these are "micro-services".
The microservice application system uses a microservice architecture to build applications as independent components and run each application process as a service. These services communicate over well-defined interfaces using lightweight APIs. These services are built around business functions, each performing a function independently.
After a micro-service architecture is adopted, services become distributed: once the services are split, a user request is processed by several different service nodes before the result is returned to the user. If any node in the call chain has a problem, the final result may be abnormal. In such a complex environment, it is not easy to find the specific faulty service node accurately and efficiently. Therefore, each node through which a request passes is recorded to form a complete call chain monitoring system, and faulty links are checked according to the call chain log.
Most of the root cause detection methods of the existing micro-service application systems adopt a threshold alarm setting method, which is specifically divided into a root cause detection method based on static threshold setting and a root cause detection method based on a sliding window.
The root cause detection method based on static threshold setting reports an abnormality when a fixed threshold is exceeded. Its accuracy is low, because the anomaly threshold may change over time and a fixed threshold cannot cover all scenarios, so abnormalities in some specific scenarios go undetected. The root cause detection method based on a sliding window addresses the fixed-threshold problem: it frames the time series in windows of a specified unit length and calculates a statistical index within each frame. However, it has difficulty identifying the periodic characteristics of actual data indexes. Furthermore, such unsupervised threshold alarm settings can only identify single-index anomalies and are not well interpretable.
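The two threshold approaches described above can be illustrated with a minimal sketch. The function names, the z-score rule used inside the window, and the latency values are assumptions made for illustration; the source does not specify which statistic is computed within a frame.

```python
from statistics import mean, pstdev


def static_threshold_alerts(series, threshold):
    """Flag every index whose value exceeds one fixed threshold."""
    return [i for i, x in enumerate(series) if x > threshold]


def sliding_window_alerts(series, window, k=3.0):
    """Flag indices lying more than k standard deviations above the
    mean of the preceding `window` points (an assumed statistic)."""
    alerts = []
    for i in range(window, len(series)):
        frame = series[i - window:i]
        mu, sigma = mean(frame), pstdev(frame)
        if sigma > 0 and (series[i] - mu) / sigma > k:
            alerts.append(i)
    return alerts


# Hypothetical average call times in milliseconds; index 6 is a spike.
latency = [100, 102, 98, 101, 99, 100, 180, 101]
static_hits = static_threshold_alerts(latency, threshold=150)
window_hits = sliding_window_alerts(latency, window=5)
```

Both detectors look at a single index at a time, which is the limitation the background section points out.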
Disclosure of Invention
In view of the above defects or improvement requirements of the prior art, the present invention provides a root cause positioning method and system based on machine learning. It aims to solve the following technical problems: the existing root cause detection method based on static threshold setting has low accuracy; the existing root cause detection method based on a sliding window has difficulty identifying the periodic characteristics of actual data indexes; and the existing root cause detection method based on a sliding window can only identify single-index abnormalities and lacks good interpretability.
To achieve the above object, according to one aspect of the present invention, there is provided a root cause positioning method based on machine learning, including the steps of:
(1) acquiring call chain data consisting of data of a call process in the micro-service application system;
(2) acquiring service index data, container indexes, middleware indexes, host indexes and database index data of the micro-service application system;
(3) inputting the timestamp, the average calling time, the traffic, the success quantity and the success rate in the service index data acquired in the step (2) into a trained SVM network to obtain a detection result, judging whether the detection result is abnormal or not, if so, entering the step (4), otherwise, ending the process;
(4) carrying out root cause detection on the detection result obtained in the step (3) to obtain the node where the fault occurs and the performance index causing the fault.
Preferably, the data of the calling procedure includes a timestamp and a call chain id of the data, a call type, a service execution time, a caller id, a data id, and a name of the micro service application system.
The service index data of the micro-service application system comprises the name, the timestamp, the average calling time, the service volume, the success number and the success rate of the micro-service system; the purpose of obtaining container index, middleware index, host index and database index data is to locate specific abnormal performance index.
Preferably, the SVM network is trained by the following substeps:
(3-1) acquiring service index data of the micro-service application system, sequentially carrying out normalization and data annotation processing on the acquired service index data, and dividing the processed service index data serving as a data set into a training set and a test set;
(3-2) initializing the parameters of the SVM network to obtain the initialized SVM network;
(3-3) inputting the training set obtained in the step (3-1) into the SVM network initialized in the step (3-2) for training to obtain a preliminarily trained SVM network;
(3-4) testing the SVM network preliminarily trained in the step (3-3) by using the test set obtained in the step (3-1) to obtain a finally trained SVM network.
Preferably, the step (3-2) specifically sets the penalty coefficient C of the SVM network to 1.0, sets the kernel function kernel to the linear kernel function linear, and sets the weighting parameter class_weight to the proportion of positive and negative samples in the data set.
Preferably, step (4) comprises the sub-steps of:
(4-1) using the timestamp in the detection result obtained in the step (3), inquiring a calling chain id corresponding to the timestamp in the calling chain data obtained in the step (1), using the calling chain id to obtain all the calling chain ids in the calling chain data which are equal to the inquired calling chain id, and sequentially establishing a directed edge from the data id in each piece of obtained data to the calling chain id to obtain a directed graph;
(4-2) carrying out frequent subgraph mining on the directed graph obtained in the step (4-1) to obtain a frequently-appearing substructure in the directed graph;
(4-3) establishing a graph database by using frequently-occurring substructures in the directed graph obtained in the step (4-2);
(4-4) acquiring an intersection set between the frequently-occurring substructure in the directed graph obtained in the step (4-2) and all nodes in the substructure marked as having abnormal nodes in the graph database as a first set, acquiring an intersection set between the frequently-occurring substructure in the directed graph obtained in the step (4-2) and all nodes in the substructure marked as null in the graph database as a second set, and then taking the intersection set between the first set and the second set as an abnormal root cause node;
(4-5) taking each abnormal root cause node obtained in the step (4-4) as an input node; querying, in the container index, middleware index, host index and database index data sets obtained in the step (2), the four numerical values corresponding to the input node; calculating the change rate, the first-order difference, the second-order difference and the sliding-window average change rate of the four numerical values as feature values; retaining, by means of the Robust Random Cut Forest (RRCF) algorithm, the feature values that exceed a set threshold and setting the others to zero to form a relatively sparse feature vector; and classifying the feature vector by the K-nearest-neighbor method with K = 1 to determine the final root cause performance index (namely, which one of the container, the middleware, the host or the database is abnormal), wherein the threshold is set manually, based on experience, according to the size of the index data set in the system.
Preferably, step (4-2) comprises the sub-steps of:
(4-2-1) initializing each node in the directed graph as a sub-structure;
(4-2-2) calculating scores of all current substructures, and selecting the substructures with the lowest scores as the optimal substructures;
(4-2-3) adding a vertex (namely an edge adjacent to all nodes in all the substructures in the step (4-2-2)), and expanding the optimal substructures obtained in the step (4-2-2) by using the vertex to obtain expanded substructures as current substructures;
(4-2-4) repeating the step (4-2-2) and the step (4-2-3) until all the optimal substructures are obtained, wherein all the optimal substructures form the frequently-appearing substructures in the directed graph.
Preferably, the score of a substructure is I(S) + I(G|S), where S denotes a substructure in the directed graph G, (G|S) denotes the graph data obtained by replacing the substructure S with a single vertex in the directed graph G, I(S) denotes the description length of the directed graph in which the substructure S is located, and I(G|S) denotes the description length of the directed graph obtained by replacing the substructure S with a single vertex in the directed graph G.
Preferably, the description length I(S) of the directed graph in which the substructure S lies is equal to:
I(S) = v + r + e
where v represents the number of bits required to encode the vertex labels of the directed graph in which the substructure S lies:
v = lg v + v lg(l_u)
where l_u represents the set of all vertices of the directed graph. r represents the number of bits required for each row of the adjacency matrix A when the directed graph is converted into the adjacency matrix A, and has:
where b = max(k_i), max denotes taking the maximum value, and k_i represents the number of 1s in the i-th row of the adjacency matrix A;
e represents the number of bits required for the edges represented by A[i, j] = 1 in the adjacency matrix A, that is, the number of bits required to store all the edges in the graph, and has:
where m denotes the size of the edge represented by A[i, j] = 1 in the adjacency matrix A, and u denotes the vertex of the adjacency matrix A.
According to another aspect of the present invention, there is provided a root cause localization system based on machine learning, including:
the first module is used for acquiring call chain data consisting of data of a call process in the micro-service application system;
the second module is used for acquiring service index data, container indexes, middleware indexes, host indexes and database index data of the micro-service application system;
the third module is used for inputting the timestamp, the average calling time, the traffic, the success quantity and the success rate in the service index data acquired by the second module into the trained SVM network to obtain a detection result, judging whether the detection result is abnormal or not, if so, entering the fourth module, otherwise, ending the process;
and the fourth module is used for carrying out root cause detection on the detection result obtained by the third module so as to obtain the node with the fault and the performance index causing the fault.
In general, compared with the prior art, the above technical solution contemplated by the present invention can achieve the following beneficial effects:
(1) because the step (3) is adopted, and a machine learning method is adopted, the characteristics can be directly input to obtain the result of whether the fault occurs, and the technical problem of low accuracy of the existing root cause detection method based on static threshold setting can be solved;
(2) because the step (4-2) and the step (4-4) are adopted, the frequent subgraph mining method is used, and other parameters are not required to be set, the technical problem that the periodic characteristic of the actual data index is difficult to identify by the root cause detection method based on the sliding window can be solved;
(3) according to the method, the step (1) and the step (3) are adopted, so that a plurality of indexes of the micro-service application system are collected and trained to obtain a trained network, and the technical problem that a sliding window-based root cause detection method can only identify single index abnormity and has no good interpretability can be solved;
(4) because the invention adopts the step (4-3) and the step (4-4), the invention uses the frequent subgraph mining algorithm to mine the frequently appearing substructure, and compares the frequently appearing substructure with other substructure in the graph database to obtain the root cause node, thereby solving the technical problem that the model can not be well trained when the abnormal data is too little.
Drawings
FIG. 1 is a flow chart of the root cause location method based on machine learning according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
Root cause positioning is an important and difficult field of intelligent operation and maintenance (AIOps for short). It combines inductive analysis with deductive reasoning, and applies reasoning comprehensively from the law of large numbers through to a logically complete chain. The massive data of a micro-service architecture lays a foundation for correlation analysis, but abnormal business cases are very scarce, so strong AI (artificial intelligence) capability is needed to move from correlation to causality: deductive reasoning is carried out based on operation and maintenance domain knowledge, and the process and conclusion of the causal reasoning must be interpretable to facilitate repeated analysis and continuous optimization. The present method is based on frequent subgraph mining, so that non-abnormal data can also be used in the root cause positioning process, which enlarges the analyzed data source and gives better interpretability.
As shown in fig. 1, the present invention provides a root cause localization method based on machine learning, which comprises the following steps:
(1) acquiring call chain data consisting of data of a call process in the micro-service application system;
Specifically, the data of the calling process includes a timestamp and a call chain id of the data (which are in one-to-one correspondence), a call type, a service execution time, a caller id, a data id, and the name of the micro-service application system.
(2) Acquiring service index data, container indexes, middleware indexes, host indexes and database index data of the micro-service application system;
specifically, the service index data of the micro-service application system comprises the name, the timestamp, the average calling time, the service volume, the success number and the success rate of the micro-service system; the purpose of acquiring container indexes, middleware indexes, host indexes and database index data is to locate specific abnormal performance indexes;
(3) inputting the timestamp, the average calling time, the traffic volume, the success number and the success rate in the service index data acquired in the step (2) into a trained Support Vector Machine (SVM) network to obtain a detection result, judging whether the detection result is abnormal or not, entering the step (4) if the detection result is abnormal, and ending the process if the detection result is not abnormal;
The advantage of this step is that detection is carried out on multiple indexes considered together, which improves accuracy.
Specifically, the SVM network in this step is obtained by training the following substeps:
(3-1) acquiring service index data of the micro-service application system, sequentially carrying out normalization and data annotation processing on the acquired service index data, and dividing the processed service index data serving as a data set into a training set and a test set;
Specifically, the data set adopted in this step is the service index data collected in the same micro-service application system, as described in step (2). The data set is divided into a training set and a test set at a ratio of 7:3, that is, 70% of the data is randomly selected as the training set and the remaining 30% is used as the test set.
In the labeling process of this step, each result is labeled with a real number; for example, an abnormal sample is labeled 1 and a non-abnormal sample is labeled -1.
(3-2) initializing the parameters of the SVM network to obtain the initialized SVM network;
Specifically, in this step, the penalty coefficient C of the SVM network is set to 1.0, the kernel function kernel is set to the linear kernel function linear, and the weighting parameter class_weight is set to the ratio of positive and negative samples in the data set, which is represented by a list: [number of abnormal samples, number of non-abnormal samples];
(3-3) inputting the training set obtained in the step (3-1) into the SVM network initialized in the step (3-2) for training to obtain a preliminarily trained SVM network;
(3-4) testing the SVM network preliminarily trained in the step (3-3) by using the test set obtained in the step (3-1) to obtain a finally trained SVM network.
The steps (3-1) to (3-4) have the advantages that the SVM network classification is adopted, the multi-index problem is solved, and the effect is relatively good when the data volume is small.
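The data preparation in steps (3-1) and (3-2) can be sketched as follows. This is a minimal illustration, not the patented implementation: the choice of min-max scaling for the normalization, the row layout, the helper names and the sample values are all assumptions. The parameter names C, kernel and class_weight match those of scikit-learn's `SVC`, although the source does not name a library, so no classifier is fitted here.

```python
import random


def min_max_normalize(rows):
    """Scale each feature column to [0, 1]; constant columns map to 0.0."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [(x - lo) / (hi - lo) if hi > lo else 0.0
         for x, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def split_7_3(samples, labels, seed=0):
    """Randomly assign 70% of the samples to training and 30% to testing."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    cut = len(idx) * 7 // 10
    train, test = idx[:cut], idx[cut:]
    return ([samples[i] for i in train], [labels[i] for i in train],
            [samples[i] for i in test], [labels[i] for i in test])


def class_counts(labels):
    """Return [number of abnormal samples, number of non-abnormal samples]."""
    return [labels.count(1), labels.count(-1)]


# Hyperparameters stated in step (3-2).
svm_params = {"C": 1.0, "kernel": "linear"}

# Hypothetical rows:
# [timestamp, average call time, traffic, success count, success rate]
rows = [[t, 50 + t, 100, 100, 1.0] for t in range(8)]
rows += [[8, 400, 100, 60, 0.60], [9, 420, 100, 55, 0.55]]
labels = [-1] * 8 + [1] * 2  # 1 = abnormal, -1 = not abnormal

X = min_max_normalize(rows)
X_train, y_train, X_test, y_test = split_7_3(X, labels)
svm_params["class_weight"] = class_counts(labels)
```

Reading the class-weight list as raw sample counts is one interpretation of the text; a library such as scikit-learn expects a mapping from class label to weight instead.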
(4) carrying out root cause detection on the detection result obtained in the step (3) to obtain the node where the fault occurs and the performance index causing the fault.
Specifically, step (4) includes the following substeps:
(4-1) using the timestamp in the detection result obtained in the step (3), inquiring a calling chain id corresponding to the timestamp in the calling chain data obtained in the step (1), using the calling chain id to obtain all the calling chain ids in the calling chain data which are equal to the inquired calling chain id, and sequentially establishing a directed edge from the data id in each piece of obtained data to the calling chain id to obtain a directed graph;
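Step (4-1) can be sketched as below. The record field names (`timestamp`, `chain_id`, `data_id`, `caller_id`) are hypothetical. The text says each edge runs from the data id to the "calling chain id"; this sketch assumes the intended edge target is the record's caller id, since pointing every edge at the shared chain id would produce a star graph. That is an interpretation, so the target field is left as a parameter.

```python
def build_call_graph(records, anomaly_ts, target_field="caller_id"):
    """Return adjacency sets for the call chain observed at anomaly_ts.

    target_field is an assumption: the record field that each directed
    edge points to, starting from the record's data id.
    """
    chain_id = next(
        (r["chain_id"] for r in records if r["timestamp"] == anomaly_ts),
        None,
    )
    graph = {}
    if chain_id is None:
        return graph
    for r in records:
        if r["chain_id"] != chain_id:
            continue
        graph.setdefault(r["data_id"], set())
        target = r[target_field]
        if target is not None:
            graph[r["data_id"]].add(target)   # edge: data id -> target
            graph.setdefault(target, set())   # make sure the target is a node
    return graph


# Hypothetical call chain data; one timestamp per chain id.
records = [
    {"timestamp": 1000, "chain_id": "c1", "data_id": "a", "caller_id": None},
    {"timestamp": 1000, "chain_id": "c1", "data_id": "b", "caller_id": "a"},
    {"timestamp": 1000, "chain_id": "c1", "data_id": "c", "caller_id": "b"},
    {"timestamp": 2000, "chain_id": "c2", "data_id": "x", "caller_id": None},
]
graph = build_call_graph(records, anomaly_ts=1000)
```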
(4-2) carrying out frequent subgraph mining on the directed graph obtained in the step (4-1) to obtain a frequently-appearing substructure in the directed graph;
specifically, this step includes the following substeps:
(4-2-1) initializing each node in the directed graph as a sub-structure;
(4-2-2) calculating scores of all current substructures, and selecting the substructures with the lowest scores as the optimal substructures;
Specifically, the score of a substructure is I(S) + I(G|S), where S denotes a substructure in the directed graph G, (G|S) denotes the graph data obtained by replacing the substructure S with a single vertex in the directed graph G, I(S) denotes the description length of the directed graph in which the substructure S is located, and I(G|S) denotes the description length of the directed graph obtained by replacing the substructure S with a single vertex in the directed graph G.
The calculation of the description length I(S) of the directed graph in which the substructure S is located can be divided into three parts, namely:
I(S) = v + r + e
where v represents the number of bits required to encode the vertex labels of the directed graph in which the substructure S lies:
v = lg v + v lg(l_u)
where l_u represents the set of all vertices of the directed graph. r represents the number of bits required for each row of the adjacency matrix A when the directed graph is converted into the adjacency matrix A (another representation of the graph), and has:
where b = max(k_i), max denotes taking the maximum value, and k_i indicates the number of 1s in the i-th row of the adjacency matrix A.
e represents the number of bits required for the edges represented by A[i, j] = 1 in the adjacency matrix A (that is, the number of bits required to store all the edges in the graph).
where m denotes the size of the edge represented by A[i, j] = 1 in the adjacency matrix A, and u denotes the vertex of the adjacency matrix A.
(4-2-3) adding a vertex (namely an edge adjacent to all nodes in all the substructures in the step (4-2-2)), and expanding the optimal substructures obtained in the step (4-2-2) by using the vertex to obtain expanded substructures as current substructures;
(4-2-4) repeating the step (4-2-2) and the step (4-2-3) until all the optimal substructures are obtained, wherein all the optimal substructures form the frequently-appearing substructures in the directed graph.
The above steps (4-2-1) to (4-2-4) have advantages in that the interpretability of the root cause localization is increased and the use of call chain data without abnormality is increased.
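The three-part description length in step (4-2-2) has the same shape as the minimum description length encoding used by the SUBDUE substructure discovery system. Assuming the method follows that encoding (the source does not say so), the three terms would read as below. Here n is the number of vertices (the text writes v for both the term and the vertex count), l_u is the number of distinct labels in SUBDUE's definition, |E| is the number of edges, K is the number of 1s in the adjacency matrix A, and m is the maximum number of edges between any two vertices. The symbols n, |E| and K are introduced for this sketch only.

```latex
\begin{aligned}
I(S) &= v + r + e \\
v &= \lg n + n \lg l_u \\
r &= (n + 1)\,\lg(b + 1) + \sum_{i=1}^{n} \lg \binom{n}{k_i},
     \qquad b = \max_i k_i \\
e &= |E|\,(1 + \lg l_u) + (K + 1)\,\lg m
\end{aligned}
```

Under this reading, a substructure that occurs often yields a small I(S) + I(G|S), which is why step (4-2-2) selects the lowest score.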
(4-3) establishing a graph database by using frequently-occurring substructures in the directed graph obtained in the step (4-2);
Specifically, in this step, the frequently-appearing substructures in the directed graph obtained in step (4-2) are numbered sequentially. Each substructure that contains abnormal nodes is then manually labeled with those abnormal nodes, and each substructure that contains no abnormal node is labeled null.
(4-4) acquiring an intersection set between the frequently-occurring substructure in the directed graph obtained in the step (4-2) and all nodes in the substructure marked as having abnormal nodes in the graph database as a first set, acquiring an intersection set between the frequently-occurring substructure in the directed graph obtained in the step (4-2) and all nodes in the substructure marked as null in the graph database as a second set, and then taking the intersection set between the first set and the second set as an abnormal root cause node;
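The set operations in steps (4-3) and (4-4) can be sketched as follows, reading the text literally. The graph database is represented here as a plain list of dictionaries, and the field names `nodes` and `label` are hypothetical; a label of `None` stands for "null".

```python
def root_cause_nodes(frequent_nodes, graph_db):
    """Intersect the frequent-substructure nodes with the nodes of the
    abnormal-labeled and null-labeled substructures, then intersect
    the two results, as step (4-4) describes."""
    abnormal_nodes, null_nodes = set(), set()
    for entry in graph_db:
        if entry["label"] is None:
            null_nodes |= entry["nodes"]
        else:
            abnormal_nodes |= entry["nodes"]
    first_set = frequent_nodes & abnormal_nodes
    second_set = frequent_nodes & null_nodes
    return first_set & second_set


# Hypothetical graph database built in step (4-3).
graph_db = [
    {"id": 1, "nodes": {"a", "b"}, "label": "b"},   # contains abnormal node b
    {"id": 2, "nodes": {"b", "c"}, "label": None},  # no abnormal node
]
candidates = root_cause_nodes({"a", "b", "c"}, graph_db)
```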
(4-5) taking each abnormal root cause node obtained in the step (4-4) as an input node; querying, in the container index, middleware index, host index and database index data sets obtained in the step (2), the four numerical values corresponding to the input node; calculating the change rate, the first-order difference, the second-order difference and the sliding-window average change rate of the four numerical values as feature values; retaining, by means of the Robust Random Cut Forest (RRCF) algorithm, the feature values that exceed a set threshold (set manually, based on experience, according to the size of the index data set in the system) and setting the others to zero to form a relatively sparse feature vector; and classifying the feature vector by K-Nearest Neighbor (KNN) with K = 1 to determine the final root cause performance index (i.e., which of the container, middleware, host, or database is abnormal).
Specifically, to make the change of the feature anomaly value more obvious, the RRCF algorithm is selected to convert the feature value into an RRCF score.
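Step (4-5) can be sketched as below. This is a simplified illustration under stated assumptions: the exact feature definitions, the window length, the reference vectors and all data values are hypothetical, and the RRCF scoring is replaced by a plain absolute value so that the example stays self-contained. A real implementation would substitute an RRCF anomaly score for `score`.

```python
SOURCES = ("container", "middleware", "host", "database")


def features(series, window=3):
    """Change rate, first difference, second difference and
    sliding-window average change rate at the latest point."""
    d1 = series[-1] - series[-2]
    d2 = d1 - (series[-2] - series[-3])
    rate = d1 / series[-2] if series[-2] else 0.0
    prev_avg = sum(series[-window - 1:-1]) / window
    win_rate = (series[-1] - prev_avg) / prev_avg if prev_avg else 0.0
    return [rate, d1, d2, win_rate]


def sparse_vector(metrics, threshold, score=abs):
    """Keep scores above the threshold and zero out the rest.
    `score` is a stand-in for the RRCF score (assumption)."""
    vec = []
    for name in SOURCES:
        for f in features(metrics[name]):
            s = score(f)
            vec.append(s if s > threshold else 0.0)
    return vec


def nearest_label(vec, labelled):
    """1-nearest-neighbour by squared Euclidean distance."""
    def dist(item):
        return sum((a - b) ** 2 for a, b in zip(vec, item[0]))
    return min(labelled, key=dist)[1]


# Hypothetical metric histories for one root cause node.
metrics = {
    "container": [10, 10, 10, 10, 10],
    "middleware": [5, 5, 5, 5, 5],
    "host": [20, 20, 20, 20, 60],   # sudden jump on the host metric
    "database": [7, 7, 7, 7, 7],
}
vec = sparse_vector(metrics, threshold=1.0)

# Hypothetical labelled reference vectors: one block of ones per source.
refs = [([1.0 if i // 4 == k else 0.0 for i in range(16)], name)
        for k, name in enumerate(SOURCES)]
root_cause_metric = nearest_label(vec, refs)
```

With K = 1 the classifier simply returns the label of the closest reference vector, so its quality depends entirely on how representative the labelled vectors are.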
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.
Claims (9)
1. A root cause positioning method based on machine learning is characterized by comprising the following steps:
(1) acquiring call chain data consisting of data of a call process in the micro-service application system;
(2) acquiring service index data, container indexes, middleware indexes, host indexes and database index data of the micro-service application system;
(3) inputting the timestamp, the average calling time, the traffic, the success quantity and the success rate in the service index data acquired in the step (2) into a trained SVM network to obtain a detection result, judging whether the detection result is abnormal or not, if so, entering the step (4), otherwise, ending the process;
(4) carrying out root cause detection on the detection result obtained in the step (3) to obtain the node where the fault occurs and the performance index causing the fault.
2. The machine-learning based root cause localization method of claim 1,
the data of the calling process comprises a time stamp and a calling chain id of the data, a calling type, a service execution time, a caller id, a data id and a name of the micro-service application system.
The service index data of the micro-service application system comprises the name, the timestamp, the average calling time, the service volume, the success number and the success rate of the micro-service system; the purpose of obtaining container index, middleware index, host index and database index data is to locate specific abnormal performance index.
3. The root cause localization method based on machine learning of claim 1 or 2, characterized in that the SVM network is trained by the following sub-steps:
(3-1) acquiring service index data of the micro-service application system, sequentially carrying out normalization and data annotation processing on the acquired service index data, and dividing the processed service index data serving as a data set into a training set and a test set;
(3-2) initializing the parameters of the SVM network to obtain the initialized SVM network;
(3-3) inputting the training set obtained in the step (3-1) into the SVM network initialized in the step (3-2) for training to obtain a preliminarily trained SVM network;
and (3-4) testing the SVM network preliminarily trained in the step (3-3) by using the test set obtained in the step (3-1) to obtain a finally trained SVM network.
4. The root cause localization method based on machine learning of any one of claims 1 to 3, wherein the step (3-2) specifically sets the penalty coefficient C of the SVM network to 1.0, sets the kernel function parameter kernel to the linear kernel function linear, and sets the weighting parameter class_weight to the ratio of positive to negative samples in the data set.
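The data preparation of step (3-1) and the parameter values of claim 4 can be sketched as follows. This is a sketch under stated assumptions: min-max scaling is assumed as the normalization, the 75/25 split and the reading of class_weight as negatives divided by positives are assumptions, the helper names are hypothetical, and the sample values are made up. No SVM is trained here.

```python
import random
from typing import Dict, List, Sequence

Row = Sequence[float]  # timestamp, avg calling time, traffic, success quantity, success rate


def min_max_normalize(rows: Sequence[Row]) -> List[List[float]]:
    """Scale every column to [0, 1]; a constant column maps to 0."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    spans = [(max(col) - min(col)) or 1.0 for col in columns]
    return [[(v - lo) / sp for v, lo, sp in zip(row, lows, spans)] for row in rows]


def split(rows, labels, test_fraction=0.25, seed=0):
    """Shuffle indices reproducibly and cut them into a training set and a test set."""
    order = list(range(len(rows)))
    random.Random(seed).shuffle(order)
    cut = int(len(order) * (1 - test_fraction))
    pick = lambda idx: ([rows[i] for i in idx], [labels[i] for i in idx])
    return pick(order[:cut]), pick(order[cut:])


def class_weights(labels: Sequence[int]) -> Dict[int, float]:
    """Assumed reading of claim 4: weight the abnormal class by negatives / positives."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    return {0: 1.0, 1: negatives / positives}


# Toy annotated data: label 1 marks an abnormal sample (values are made up).
raw = [
    (1, 20.0, 100, 100, 1.00), (2, 22.0, 110, 110, 1.00),
    (3, 21.0, 105, 104, 0.99), (4, 19.0, 98, 98, 1.00),
    (5, 23.0, 102, 102, 1.00), (6, 20.0, 101, 100, 0.99),
    (7, 900.0, 100, 40, 0.40), (8, 850.0, 95, 38, 0.40),
]
labels = [0, 0, 0, 0, 0, 0, 1, 1]

data = min_max_normalize(raw)
(train_x, train_y), (test_x, test_y) = split(data, labels)

# Parameter values of step (3-2). The names mirror constructor arguments of
# scikit-learn's SVC, which is one library that accepts them; the claim names none.
svm_params = {"C": 1.0, "kernel": "linear", "class_weight": class_weights(labels)}
```

With six normal and two abnormal samples, the abnormal class receives a weight of 3.0, so misclassifying an abnormal sample costs three times as much during training.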
5. The root cause localization method based on machine learning of claim 4, wherein the step (4) comprises the following sub-steps:
(4-1) using the timestamp in the detection result obtained in the step (3), inquiring a calling chain id corresponding to the timestamp in the calling chain data obtained in the step (1), using the calling chain id to obtain all the calling chain ids in the calling chain data which are equal to the inquired calling chain id, and sequentially establishing a directed edge from the data id in each piece of obtained data to the calling chain id to obtain a directed graph;
(4-2) carrying out frequent subgraph mining on the directed graph obtained in the step (4-1) to obtain a frequently-appearing substructure in the directed graph;
(4-3) establishing a graph database by using frequently-occurring substructures in the directed graph obtained in the step (4-2);
(4-4) acquiring an intersection set between the frequently-occurring substructure in the directed graph obtained in the step (4-2) and all nodes in the substructure marked as having abnormal nodes in the graph database as a first set, acquiring an intersection set between the frequently-occurring substructure in the directed graph obtained in the step (4-2) and all nodes in the substructure marked as null in the graph database as a second set, and then taking the intersection set between the first set and the second set as an abnormal root cause node;
and (4-5) taking each abnormal root cause node obtained in the step (4-4) as an input node, querying the container index, middleware index, host index and database index data obtained in the step (2) for the four numerical values corresponding to the input node, calculating the change rate, the first-order difference, the second-order difference and the sliding window average change rate of the four numerical values as characteristic values, keeping the characteristic values exceeding a set threshold value through the Robust Random Cut Forest (RRCF) algorithm, setting the other values to zero to form a relatively sparse characteristic vector, and classifying the characteristic vector through a K nearest neighbor method with K being 1 to determine the final root cause performance index (namely, which one of the container, the middleware, the host or the database is abnormal), wherein the threshold value is set manually, based on experience, according to the size of the index data set in the system.
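The feature construction and classification of step (4-5) can be sketched as follows. Several parts are assumptions: the exact definitions of the four characteristic values, the window size, the threshold, and the labelled reference vectors are hypothetical, and a plain magnitude threshold is swapped in where the claim uses RRCF scores.

```python
import math
from statistics import mean
from typing import List, Sequence, Tuple

KINDS = ("container", "middleware", "host", "database")


def features(series: Sequence[float], window: int = 3) -> List[float]:
    """Change rate, first and second difference, sliding-window average change rate.

    Needs at least 2 * window samples; the exact definitions are assumptions.
    """
    x0, x1, x2 = series[-3], series[-2], series[-1]
    prev_avg = mean(series[-2 * window:-window])
    curr_avg = mean(series[-window:])
    return [
        (x2 - x1) / x1 if x1 else 0.0,
        x2 - x1,
        x2 - 2 * x1 + x0,
        (curr_avg - prev_avg) / prev_avg if prev_avg else 0.0,
    ]


def sparse_vector(values: Sequence[float], threshold: float) -> List[float]:
    """Keep values whose magnitude exceeds the threshold and zero the rest.

    The claim applies the threshold to RRCF scores; a plain magnitude test is
    swapped in here to keep the sketch short.
    """
    return [v if abs(v) > threshold else 0.0 for v in values]


def nearest_label(vector, references: Sequence[Tuple[Sequence[float], str]]) -> str:
    """K nearest neighbour with K = 1, using Euclidean distance."""
    return min(references, key=lambda ref: math.dist(vector, ref[0]))[1]


# Made-up index series for one root cause node: only the container index jumps.
series_by_kind = {
    "container": [10.0, 10.0, 10.0, 10.0, 10.0, 40.0],
    "middleware": [5.0] * 6,
    "host": [5.0] * 6,
    "database": [5.0] * 6,
}
vector = sparse_vector(
    [f for kind in KINDS for f in features(series_by_kind[kind])], threshold=0.5
)

# Hypothetical labelled reference vectors, one per index kind.
references = []
for i, kind in enumerate(KINDS):
    ref = [0.0] * 16
    ref[4 * i:4 * i + 4] = [1.0] * 4
    references.append((ref, kind))

root_cause_index = nearest_label(vector, references)
```

Because the three flat series contribute only zeros, the sparse vector is non-zero in the container positions alone, and the nearest reference is the container one.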
6. The root cause localization method based on machine learning of claim 5, wherein the step (4-2) comprises the following sub-steps:
(4-2-1) initializing each node in the directed graph as a sub-structure;
(4-2-2) calculating scores of all current substructures, and selecting the substructures with the lowest scores as the optimal substructures;
(4-2-3) adding a vertex (namely an edge adjacent to all nodes in all the substructures in the step (4-2-2)), and expanding the optimal substructures obtained in the step (4-2-2) by using the vertex to obtain expanded substructures as current substructures;
(4-2-4) repeating the step (4-2-2) and the step (4-2-3) until all the optimal substructures are obtained, wherein all the optimal substructures form the frequently-appearing substructures in the directed graph.
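The loop of steps (4-2-1) to (4-2-4) has roughly the following shape. This is a heavily simplified sketch: a substructure is reduced to a set of connected nodes, repeated instances and graph isomorphism are not handled, the stopping rule (no further adjacent vertex, or `max_rounds`) is an assumption, and the toy score `boundary_edges` is a hypothetical stand-in for the description-length score of claim 7.

```python
from typing import Callable, Dict, FrozenSet, List, Set

Graph = Dict[str, Set[str]]  # node -> successors; every node appears as a key
Sub = FrozenSet[str]


def neighbours(graph: Graph, sub: Sub) -> Set[str]:
    """Nodes outside `sub` that share an edge, in either direction, with it."""
    outgoing = {m for n in sub for m in graph.get(n, set())}
    incoming = {n for n, succ in graph.items() if succ & sub}
    return (outgoing | incoming) - sub


def boundary_edges(graph: Graph, sub: Sub) -> int:
    """Toy score: number of edges with exactly one endpoint inside `sub`."""
    return sum(1 for n, succ in graph.items() for m in succ if (n in sub) != (m in sub))


def mine(graph: Graph, score: Callable[[Graph, Sub], float], max_rounds: int = 10) -> List[Sub]:
    current = [frozenset([n]) for n in graph]            # (4-2-1) one substructure per node
    best_found: List[Sub] = []
    for _ in range(max_rounds):                          # (4-2-4) repeat
        if not current:
            break
        # (4-2-2) lowest score wins; ties are broken by node names for determinism
        best = min(current, key=lambda s: (score(graph, s), sorted(s)))
        best_found.append(best)
        # (4-2-3) expand the best substructure by one adjacent vertex
        current = [best | {n} for n in sorted(neighbours(graph, best))]
    return best_found


chain = {"a": {"b"}, "b": {"c"}, "c": {"d"}, "d": set()}
found = mine(chain, boundary_edges)
```

On the four-node chain the loop grows one substructure from `a` until it covers the whole graph and no adjacent vertex remains.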
7. The root cause localization method based on machine learning according to claim 6, wherein the score of a substructure is $I(S) + I(G|S)$, where $S$ represents a substructure in the directed graph $G$, $(G|S)$ represents the graph data obtained by replacing the substructure $S$ with a single vertex in the directed graph $G$, $I(S)$ represents the description length of the directed graph where the substructure $S$ is located, and $I(G|S)$ represents the description length of the directed graph obtained by replacing the substructure $S$ with a single vertex in the directed graph $G$.
8. The machine-learning based root cause localization method of claim 7,
the description length $I(S)$ of the directed graph in which the substructure $S$ lies is equal to:
$I(S) = v + r + e$
where $v$ represents the number of bits required to construct the vertex labels of the directed graph in which the substructure $S$ lies:
$v = \lg v + v \lg(l_u)$
where $l_u$ represents the set of all vertices of the directed graph; $r$ represents the number of bits required for each row of the adjacency matrix $A$ when converting the directed graph into the adjacency matrix $A$, and has:
[formula not reproduced in this text]
where $b = \max(k_i)$, max denotes taking the maximum value, and $k_i$ represents the number of 1s in the $i$-th row of the adjacency matrix $A$;
$e$ represents the number of bits required for the edges represented by $A[i, j] = 1$ in the adjacency matrix $A$, that is, the number of bits required to store all the edges in the graph, and has:
[formula not reproduced in this text]
where $m$ denotes the size of the edge represented by $A[i, j] = 1$ in the adjacency matrix $A$, and $u$ denotes a vertex of the adjacency matrix $A$.
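A description length of this form can be computed as sketched below. The formulas for r and e do not appear in this text, so the sketch assumes they follow the minimum description length encoding used by the SUBDUE substructure discovery system, which defines the same quantities v, r, e, b and k_i. It further assumes that l_u is the number of distinct labels (the claim describes l_u differently) and that the graph has at most one edge per ordered vertex pair, so that m = 1. The function name and the sample graph are hypothetical.

```python
from math import comb, log2
from typing import Dict, Set


def description_length(graph: Dict[str, Set[str]], num_labels: int) -> float:
    """I(S) = v + r + e for a non-empty directed graph given as node -> successors.

    Assumed formulas (SUBDUE-style encoding, not taken from the claim text):
      v = lg(n) + n * lg(l_u)
      r = (n + 1) * lg(b + 1) + sum_i lg C(n, k_i)
      e = E * (1 + lg(l_u)) + (K + 1) * lg(m)
    with n vertices, E edges, K ones in the adjacency matrix, b = max(k_i).
    """
    n = len(graph)
    row_ones = [len(succ) for succ in graph.values()]  # k_i for each matrix row
    b = max(row_ones)
    edges = sum(row_ones)  # at most one edge per vertex pair, so K equals E
    m = 1                  # maximum number of edges between any two vertices
    v_bits = log2(n) + n * log2(num_labels)
    r_bits = (n + 1) * log2(b + 1) + sum(log2(comb(n, k)) for k in row_ones)
    e_bits = edges * (1 + log2(num_labels)) + (edges + 1) * log2(m)
    return v_bits + r_bits + e_bits


# Three-vertex chain x -> y -> z with two distinct labels (made-up example).
total_bits = description_length({"x": {"y"}, "y": {"z"}, "z": set()}, num_labels=2)
```

Under these assumptions the score of claim 7 would be the sum of this value for the substructure and for the graph in which the substructure has been replaced by a single vertex.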
9. A root cause location system based on machine learning, comprising:
the first module is used for acquiring call chain data consisting of data of a call process in the micro-service application system;
the second module is used for acquiring service index data, container indexes, middleware indexes, host indexes and database index data of the micro-service application system;
the third module is used for inputting the timestamp, the average calling time, the traffic, the success quantity and the success rate in the service index data acquired by the second module into the trained SVM network to obtain a detection result, judging whether the detection result is abnormal or not, if so, entering the fourth module, otherwise, ending the process;
and the fourth module is used for carrying out root cause detection on the detection result obtained by the third module so as to obtain the node with the fault and the performance index causing the fault.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210089130.1A CN114416423B (en) | 2022-01-25 | 2022-01-25 | Root cause positioning method and system based on machine learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114416423A (en) | 2022-04-29 |
CN114416423B (en) | 2024-08-23 |
Family
ID=81277289
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210089130.1A (Active; CN114416423B (en)) | Root cause positioning method and system based on machine learning | 2022-01-25 | 2022-01-25 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114416423B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115048998A (en) * | 2022-06-13 | 2022-09-13 | 大连理工大学 | Cable-stayed bridge group cable force abnormity identification and positioning method based on monitoring data |
CN115118574A (en) * | 2022-06-07 | 2022-09-27 | 马上消费金融股份有限公司 | Data processing method, device and storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020140945A1 (en) * | 2019-01-02 | 2020-07-09 | 中国移动通信有限公司研究院 | Container-based virtual resource management method, apparatus, and system |
CN113014421A (en) * | 2021-02-08 | 2021-06-22 | 武汉大学 | Micro-service root cause positioning method for cloud native system |
CN113282635A (en) * | 2021-04-12 | 2021-08-20 | 国电南瑞科技股份有限公司 | Micro-service system fault root cause positioning method and device |
Non-Patent Citations (1)
Title |
---|
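Chen Lizhong: "Intelligent automated operation and maintenance based on machine learning" (基于机器学习的智能化自动化运维), China New Telecommunications (中国新通信), no. 14, 20 July 2020 (2020-07-20) * |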
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||