CN114416423A - Root cause positioning method and system based on machine learning - Google Patents

Root cause positioning method and system based on machine learning

Info

Publication number
CN114416423A
Authority
CN
China
Prior art keywords
data
index
root cause
directed graph
service
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210089130.1A
Other languages
Chinese (zh)
Other versions
CN114416423B (en)
Inventor
唐卓
向婷
李肯立
李虹宇
伍祚瑶
王啸
罗文明
程欣威
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University
Original Assignee
Hunan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University filed Critical Hunan University
Priority to CN202210089130.1A priority Critical patent/CN114416423B/en
Publication of CN114416423A publication Critical patent/CN114416423A/en
Application granted granted Critical
Publication of CN114416423B publication Critical patent/CN114416423B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079 Root cause analysis, i.e. error or fault diagnosis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3409 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/10 Machine learning using kernel methods, e.g. support vector machines [SVM]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computer Hardware Design (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a root cause positioning method based on machine learning, which comprises the following steps: acquiring call chain data consisting of data of a call process in the micro-service application system, and acquiring service index data, container index, middleware index, host index and database index data of the micro-service application system; inputting the timestamp, the average calling time, the traffic, the success quantity and the success rate in the acquired service index data into a trained SVM network to obtain a detection result, judging whether the detection result is abnormal or not, and if so, performing root cause detection on the obtained detection result to obtain a node where the fault occurs and a performance index which causes the fault. The method can solve the technical problems that the existing root cause detection method based on static threshold setting is low in accuracy rate and the existing root cause detection method based on the sliding window is difficult to identify the periodic characteristics of actual data indexes.

Description

Root cause positioning method and system based on machine learning
Technical Field
The invention belongs to the technical field of intelligent operation and maintenance, and particularly relates to a root cause positioning method and system based on machine learning.
Background
Large Internet companies serve external users through service clusters. As product requirements grow, business services become bloated, so large services are split structurally into small, independent services, each managed by its own process and exposed externally. These small services are known as "micro-services".
The microservice application system uses a microservice architecture to build applications as independent components and run each application process as a service. These services communicate over well-defined interfaces using lightweight APIs. These services are built around business functions, each performing a function independently.
Once a micro-service architecture is adopted, services become distributed: after splitting, a user request may be processed by several different service nodes before the result is returned to the user. If any node in the call chain has a problem, the final result may be abnormal. In such a complex environment it is difficult to locate the specific faulty service node accurately and efficiently. Call chain monitoring addresses this by recording every node a request passes through, forming a complete call chain monitoring system in which faulty links can be traced from the call chain log.
Most existing root cause detection methods for micro-service application systems rely on threshold alarms, and fall into two groups: methods based on a static threshold and methods based on a sliding window.
A static-threshold method reports an anomaly when an index exceeds a fixed threshold. Its accuracy is low, because the appropriate threshold may change over time and a fixed threshold cannot cover all scenarios, so anomalies in some scenarios go undetected. A sliding-window method addresses the fixed-threshold problem by framing the time series into windows of a specified unit length and computing statistical indexes within each window, but it has difficulty recognizing the periodic characteristics of real index data. Furthermore, such unsupervised threshold alarms can only identify single-index anomalies and are not readily interpretable.
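The limitation of both threshold approaches on periodic data can be illustrated with a minimal Python sketch. The series values, the limit of 30, the window length of 4 and the 3-sigma rule are all hypothetical choices made for illustration; the text does not specify any of them.

```python
from statistics import mean, pstdev

def static_threshold_alarms(series, limit):
    """Flag every index whose value exceeds one fixed limit."""
    return [i for i, x in enumerate(series) if x > limit]

def sliding_window_alarms(series, window, k=3.0):
    """Flag index i when series[i] deviates from the mean of the previous
    `window` points by more than k standard deviations of that window."""
    alarms = []
    for i in range(window, len(series)):
        frame = series[i - window:i]
        mu, sigma = mean(frame), pstdev(frame)
        if sigma > 0 and abs(series[i] - mu) > k * sigma:
            alarms.append(i)
    return alarms

# Hypothetical response times (ms): a normal daily cycle that alternates
# between a quiet period (about 10) and a busy period (about 50).
quiet, busy = [10, 11, 10, 9], [50, 51, 49, 50]
series = quiet + busy + quiet + busy

static_alarms = static_threshold_alarms(series, limit=30)
window_alarms = sliding_window_alarms(series, window=4)
print(static_alarms)   # every busy-period point
print(window_alarms)   # every transition between periods
```

In this sketch the fixed limit reports every point of the busy period, and the sliding window reports each transition between periods, although the whole series is an ordinary periodic pattern.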
Disclosure of Invention
In view of the above defects or improvement requirements of the prior art, the present invention provides a root cause positioning method and system based on machine learning. It aims to solve the technical problems that existing root cause detection methods based on a static threshold have low accuracy, that existing methods based on a sliding window have difficulty recognizing the periodic characteristics of real index data, and that sliding-window methods can identify only single-index anomalies and lack good interpretability.
To achieve the above object, according to one aspect of the present invention, there is provided a root cause positioning method based on machine learning, including the steps of:
(1) acquiring call chain data consisting of data of a call process in the micro-service application system;
(2) acquiring service index data, container indexes, middleware indexes, host indexes and database index data of the micro-service application system;
(3) inputting the timestamp, the average calling time, the traffic, the success quantity and the success rate in the service index data acquired in the step (2) into a trained SVM network to obtain a detection result, judging whether the detection result is abnormal or not, if so, entering the step (4), otherwise, ending the process;
(4) carrying out root cause detection on the detection result obtained in the step (3) to obtain the node where the fault occurs and the performance index causing the fault.
Preferably, the data of the calling procedure includes a timestamp and a call chain id of the data, a call type, a service execution time, a caller id, a data id, and a name of the micro service application system.
The service index data of the micro-service application system comprises the name, the timestamp, the average calling time, the service volume, the success number and the success rate of the micro-service system; the purpose of obtaining container index, middleware index, host index and database index data is to locate specific abnormal performance index.
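The two record types listed above can be pictured as simple data structures. The following Python sketch is only illustrative: the class and field names are hypothetical, chosen to mirror the fields named in the text, and the units are assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CallRecord:
    """One piece of call-process data in a call chain (names are hypothetical)."""
    timestamp: int
    trace_id: str       # call chain id
    call_type: str
    elapsed_ms: float   # service execution time (unit assumed)
    caller_id: str
    span_id: str        # the "data id" of this record
    service_name: str

@dataclass(frozen=True)
class ServiceMetric:
    """One row of service index data (names are hypothetical)."""
    service_name: str
    timestamp: int
    avg_time_ms: float  # average calling time (unit assumed)
    num_calls: int      # traffic (service volume)
    num_success: int
    success_rate: float

    def features(self):
        """The five values that step (3) feeds to the classifier."""
        return [self.timestamp, self.avg_time_ms, self.num_calls,
                self.num_success, self.success_rate]

metric = ServiceMetric("order", 1000, 12.5, 200, 198, 0.99)
print(metric.features())
```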
Preferably, the SVM network is trained by the following substeps:
(3-1) acquiring service index data of the micro-service application system, sequentially carrying out normalization and data annotation processing on the acquired service index data, and dividing the processed service index data serving as a data set into a training set and a test set;
(3-2) initializing the parameters of the SVM network to obtain the initialized SVM network;
(3-3) inputting the training set obtained in the step (3-1) into the SVM network initialized in the step (3-2) for training to obtain a preliminarily trained SVM network;
(3-4) testing the SVM network preliminarily trained in the step (3-3) by using the test set obtained in the step (3-1) to obtain a finally trained SVM network.
Preferably, the step (3-2) is specifically to set the penalty coefficient C of the SVM network to 1.0, set the kernel function parameter kernel to the linear kernel function (linear), and set the weighting parameter class_weight to the proportion of positive and negative samples in the data set.
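A minimal Python sketch of this parameter setup follows. The parameter names C, kernel and class_weight coincide with those accepted by scikit-learn's `SVC`, but the text does not name a library, so that mapping is an assumption. The label list is hypothetical, and whether the class proportions are applied directly as weights or inverted is not stated in the text; the sketch uses them directly.

```python
from collections import Counter

def class_proportions(labels):
    """Share of each class label in the data set."""
    counts = Counter(labels)
    total = len(labels)
    return {label: counts[label] / total for label in sorted(counts)}

# Hypothetical labels: 8 normal samples (-1) and 2 abnormal samples (1).
labels = [-1] * 8 + [1] * 2

svm_params = {
    "C": 1.0,                                   # penalty coefficient
    "kernel": "linear",                         # linear kernel function
    "class_weight": class_proportions(labels),  # assumption: proportions used as-is
}
print(svm_params)
```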
Preferably, step (4) comprises the sub-steps of:
(4-1) using the timestamp in the detection result obtained in the step (3), inquiring a calling chain id corresponding to the timestamp in the calling chain data obtained in the step (1), using the calling chain id to obtain all the calling chain ids in the calling chain data which are equal to the inquired calling chain id, and sequentially establishing a directed edge from the data id in each piece of obtained data to the calling chain id to obtain a directed graph;
(4-2) carrying out frequent subgraph mining on the directed graph obtained in the step (4-1) to obtain a frequently-appearing substructure in the directed graph;
(4-3) establishing a graph database by using frequently-occurring substructures in the directed graph obtained in the step (4-2);
(4-4) acquiring an intersection set between the frequently-occurring substructure in the directed graph obtained in the step (4-2) and all nodes in the substructure marked as having abnormal nodes in the graph database as a first set, acquiring an intersection set between the frequently-occurring substructure in the directed graph obtained in the step (4-2) and all nodes in the substructure marked as null in the graph database as a second set, and then taking the intersection set between the first set and the second set as an abnormal root cause node;
(4-5) taking each abnormal root cause node obtained in the step (4-4) as an input node, querying the container index, middleware index, host index and database index data obtained in the step (2) for the four values corresponding to the input node, calculating the change rate, the first-order difference, the second-order difference and the sliding-window average change rate of the four values as feature values, retaining through the Robust Random Cut Forest (RRCF) algorithm the feature values that exceed a set threshold and setting the others to zero to form a relatively sparse feature vector, and classifying the feature vector by the K-nearest-neighbor method with K = 1 to determine the final root cause performance index (namely, which one of the container, the middleware, the host or the database is abnormal), wherein the threshold is set manually from experience according to the size of the index data set in the system.
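As a rough illustration of step (4-1), the following Python sketch looks up the call chain id for a timestamp and links the records of that call chain into a directed graph. The dictionary keys (`trace_id`, `caller_id`, `span_id`) are hypothetical names for the call chain id, caller id and data id, and drawing each edge from the caller id to the data id is an assumption about the intended edge direction, not something the text states in this form.

```python
from collections import defaultdict

def find_trace_id(records, timestamp):
    """Return the call chain id of the record with the given timestamp."""
    for record in records:
        if record["timestamp"] == timestamp:
            return record["trace_id"]
    return None

def build_call_graph(records, trace_id):
    """Collect all records of one call chain into an adjacency mapping.
    Assumption: one directed edge per record, from caller id to data id."""
    graph = defaultdict(set)
    for record in records:
        if record["trace_id"] == trace_id and record["caller_id"] is not None:
            graph[record["caller_id"]].add(record["span_id"])
    return dict(graph)

# Hypothetical call chain data: chain t1 has a root span "a" calling "b" and "c".
records = [
    {"timestamp": 100, "trace_id": "t1", "caller_id": None, "span_id": "a"},
    {"timestamp": 101, "trace_id": "t1", "caller_id": "a", "span_id": "b"},
    {"timestamp": 102, "trace_id": "t1", "caller_id": "a", "span_id": "c"},
    {"timestamp": 103, "trace_id": "t2", "caller_id": None, "span_id": "x"},
]
trace = find_trace_id(records, 101)
graph = build_call_graph(records, trace)
print(trace, graph)
```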
Preferably, step (4-2) comprises the sub-steps of:
(4-2-1) initializing each node in the directed graph as a sub-structure;
(4-2-2) calculating scores of all current substructures, and selecting the substructures with the lowest scores as the optimal substructures;
(4-2-3) adding a vertex (namely an edge adjacent to all nodes in all the substructures in the step (4-2-2)), and expanding the optimal substructures obtained in the step (4-2-2) by using the vertex to obtain expanded substructures as current substructures;
(4-2-4) repeating the step (4-2-2) and the step (4-2-3) until all the optimal substructures are obtained, wherein all the optimal substructures form the frequently-appearing substructures in the directed graph.
Preferably, the score of the substructure is $I(S) + I(G|S)$, where S denotes the substructure in the directed graph G, (G|S) denotes the graph data obtained by replacing the substructure S with a single vertex in the directed graph G, I(S) denotes the description length of the directed graph in which the substructure S lies, and I(G|S) denotes the description length of the directed graph obtained by replacing the substructure S with a single vertex in the directed graph G.
Preferably, the description length I(S) of the directed graph in which the substructure S lies is equal to:
$I(S) = v + r + e$
where v represents the number of bits required to construct the vertex labels of the directed graph in which the substructure S lies:
$v = \lg v + v \lg(l_u)$
wherein $l_u$ represents the set of all vertices of the directed graph. r represents the number of bits required for each row of the adjacency matrix A when converting the directed graph into the adjacency matrix A, and is given by:
[Equation image BDA0003486746230000051: expression for r, not reproduced in this text]
where $b = \max_i(k_i)$, max denotes taking the maximum value, and $k_i$ represents the number of 1s in the i-th row of the adjacency matrix A;
e represents the number of bits required for the edges represented by $A[i,j] = 1$ in the adjacency matrix A, that is, the number of bits required to store all the edges in the graph, and is given by:
[Equation image BDA0003486746230000052: expression for e, not reproduced in this text]
where m denotes the size of the edge of the adjacency matrix A represented by $A[i,j] = 1$, and u denotes the vertex of the adjacency matrix A.
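The expressions for r and e are not reproduced in this text, so the description length cannot be computed from the text alone. Purely as an illustration, the Python sketch below assumes the graph encoding published for the SUBDUE substructure-discovery system, whose vertex term has the same form as the v term given above. Under that assumption, $l_u$ is taken to be the number of distinct labels, $r = (n+1)\lg(b+1) + \sum_i \lg\binom{n}{k_i}$ and $e = E(1 + \lg l_u) + (K+1)\lg m$, where n is the number of vertices, E the number of edges, K the number of 1s in A, and m the largest number of edges between any ordered vertex pair. These formulas and the function name are assumptions, not a reconstruction of the missing equation images.

```python
from math import comb, log2

def description_length(adj, num_labels):
    """Bits needed to encode a directed graph, assuming a SUBDUE-style encoding.
    adj[i][j] is the number of directed edges from vertex i to vertex j;
    num_labels is the assumed number of distinct labels (l_u)."""
    n = len(adj)
    # v: vertex count plus one label per vertex
    v_bits = log2(n) + n * log2(num_labels)
    # r: which entries of each adjacency-matrix row are non-zero
    ks = [sum(1 for cell in row if cell > 0) for row in adj]
    b = max(ks)
    r_bits = (n + 1) * log2(b + 1) + sum(log2(comb(n, k)) for k in ks)
    # e: the edges behind every non-zero matrix entry
    num_edges = sum(sum(row) for row in adj)
    ones = sum(ks)
    m = max(1, max(max(row) for row in adj))  # guard against an edgeless graph
    e_bits = num_edges * (1 + log2(num_labels)) + (ones + 1) * log2(m)
    return v_bits + r_bits + e_bits

# Hypothetical three-vertex chain a -> b -> c with two distinct labels.
chain = [[0, 1, 0],
         [0, 0, 1],
         [0, 0, 0]]
print(description_length(chain, num_labels=2))
```

For this chain the three terms are $\lg 3 + 3$, $4 + 2\lg 3$ and $4$ bits, so the total is $11 + 3\lg 3$ bits.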
According to another aspect of the present invention, there is provided a root cause localization system based on machine learning, including:
the first module is used for acquiring call chain data consisting of data of a call process in the micro-service application system;
the second module is used for acquiring service index data, container indexes, middleware indexes, host indexes and database index data of the micro-service application system;
the third module is used for inputting the timestamp, the average calling time, the traffic, the success quantity and the success rate in the service index data acquired by the second module into the trained SVM network to obtain a detection result, judging whether the detection result is abnormal or not, if so, entering the fourth module, otherwise, ending the process;
and the fourth module is used for carrying out root cause detection on the detection result obtained by the third module so as to obtain the node with the fault and the performance index causing the fault.
In general, compared with the prior art, the above technical solution contemplated by the present invention can achieve the following beneficial effects:
(1) because the step (3) uses a machine learning method, features can be input directly to determine whether a fault has occurred, which solves the technical problem that existing root cause detection methods based on a static threshold have low accuracy;
(2) because the step (4-2) and the step (4-4) use frequent subgraph mining, which requires no additional parameters to be set, the invention solves the technical problem that root cause detection methods based on a sliding window have difficulty recognizing the periodic characteristics of real index data;
(3) because the step (1) and the step (3) collect a plurality of indexes of the micro-service application system and use them to obtain a trained network, the invention solves the technical problem that root cause detection methods based on a sliding window can identify only single-index anomalies and lack good interpretability;
(4) because the step (4-3) and the step (4-4) mine frequently appearing substructures with a frequent subgraph mining algorithm and compare them with the other substructures in the graph database to obtain the root cause node, the invention solves the technical problem that a model cannot be trained well when abnormal data are scarce.
Drawings
FIG. 1 is a flow chart of the root cause location method based on machine learning according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
Root cause positioning is an important and difficult area of intelligent operation and maintenance (AIOps for short). It combines inductive analysis with deductive reasoning, and spans reasoning from the law of large numbers through to a logically complete chain. The massive data of a micro-service architecture lays a foundation for correlation analysis, but abnormal business cases are scarce, so strong AI (artificial intelligence) capability is needed to move from correlation to causality: deductive reasoning is carried out on the basis of operation and maintenance domain knowledge, and the process and conclusions of causal reasoning must be interpretable to allow repeated analysis and continuous optimization. The invention adopts a method based on frequent subgraph mining, so that non-abnormal data can also be used in the root cause positioning process, which enlarges the data available for analysis and improves interpretability.
As shown in fig. 1, the present invention provides a root cause localization method based on machine learning, which comprises the following steps:
(1) acquiring call chain data consisting of data of a call process in the micro-service application system;
specifically, the data of the calling procedure specifically includes a timestamp and a call chain id (which are in one-to-one correspondence) of the data, a call type, a service execution time, a caller id, a data id, and a name of the micro service application system.
(2) Acquiring service index data, container indexes, middleware indexes, host indexes and database index data of the micro-service application system;
specifically, the service index data of the micro-service application system comprises the name, the timestamp, the average calling time, the service volume, the success number and the success rate of the micro-service system; the purpose of acquiring container indexes, middleware indexes, host indexes and database index data is to locate specific abnormal performance indexes;
(3) inputting the timestamp, the average calling time, the traffic volume, the success number and the success rate in the service index data acquired in the step (2) into a trained Support Vector Machine (SVM) network to obtain a detection result, judging whether the detection result is abnormal or not, entering the step (4) if the detection result is abnormal, and ending the process if the detection result is not abnormal;
The advantage of this step is that detection uses multiple indexes together, which improves accuracy.
Specifically, the SVM network in this step is obtained by training the following substeps:
(3-1) acquiring service index data of the micro-service application system, sequentially carrying out normalization and data annotation processing on the acquired service index data, and dividing the processed service index data serving as a data set into a training set and a test set;
Specifically, the data set used in this step is service index data collected from a single micro-service application system, as described in step (2). The data set is divided into a training set and a test set in a 7:3 ratio, that is, 70% of the data is randomly selected as the training set and the remaining 30% is used as the test set.
In the labeling process of this step, the result is labeled as a real-valued vector; for example, an abnormal sample is labeled 1 and a normal sample is labeled -1.
(3-2) initializing the parameters of the SVM network to obtain the initialized SVM network;
Specifically, in this step the penalty coefficient C of the SVM network is set to 1.0, the kernel function parameter kernel is set to the linear kernel function (linear), and the weighting parameter class_weight is set to the ratio of positive and negative samples in the data set, expressed as a list of the sample counts of the two classes;
(3-3) inputting the training set obtained in the step (3-1) into the SVM network initialized in the step (3-2) for training to obtain a preliminarily trained SVM network;
(3-4) testing the SVM network preliminarily trained in the step (3-3) by using the test set obtained in the step (3-1) to obtain a finally trained SVM network.
The advantage of steps (3-1) to (3-4) is that SVM classification handles multiple indexes together and performs relatively well when the amount of data is small.
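Steps (3-1) to (3-4) can be sketched end to end in plain Python. The sketch below is a stand-in, not the trained SVM network of the text: it uses min-max normalization (the text does not say which normalization is used), a random 7:3 split, and a small linear soft-margin SVM trained by subgradient descent instead of a library solver. The sample values, learning rate, epoch count and the reduction to two features (average calling time and success rate) are hypothetical.

```python
import random

def min_max_normalize(rows):
    """Scale every feature column to the range [0, 1]."""
    cols = list(zip(*rows))
    lo, hi = [min(c) for c in cols], [max(c) for c in cols]
    return [[(x - l) / (h - l) if h > l else 0.0
             for x, l, h in zip(row, lo, hi)] for row in rows]

def split_7_3(samples, seed=0):
    """Randomly divide samples into 70% training and 30% test data."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(0.7 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

def train_linear_svm(train, C=1.0, epochs=200, lr=0.01):
    """Subgradient descent on 0.5*||w||^2 + C * sum(hinge loss)."""
    w, b = [0.0] * len(train[0][0]), 0.0
    for _ in range(epochs):
        for x, y in train:
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            w = [wi - lr * wi / len(train) for wi in w]   # regularization step
            if margin < 1:                                # hinge-loss step
                w = [wi + lr * C * y * xi for wi, xi in zip(w, x)]
                b += lr * C * y
    return w, b

def predict(model, x):
    """Return 1 (abnormal) or -1 (normal)."""
    w, b = model
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Hypothetical samples: [average calling time, success rate], label 1 = abnormal.
raw = [([10 + i, 0.99 - 0.01 * i], -1) for i in range(5)] + \
      [([90 + i, 0.50 + 0.01 * i], 1) for i in range(5)]
features = min_max_normalize([x for x, _ in raw])
samples = list(zip(features, [y for _, y in raw]))

train, test = split_7_3(samples)
model = train_linear_svm(train)
accuracy = sum(predict(model, x) == y for x, y in test) / len(test)
print(len(train), len(test), accuracy)
```

Because the two hypothetical clusters are far apart, any reasonable linear separator classifies the held-out samples correctly; real service index data would need a proper solver and evaluation.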
(4) Carrying out root cause detection on the detection result obtained in the step (3) to obtain the node where the fault occurs and the performance index causing the fault.
Specifically, step (4) includes the following substeps:
(4-1) using the timestamp in the detection result obtained in the step (3), inquiring a calling chain id corresponding to the timestamp in the calling chain data obtained in the step (1), using the calling chain id to obtain all the calling chain ids in the calling chain data which are equal to the inquired calling chain id, and sequentially establishing a directed edge from the data id in each piece of obtained data to the calling chain id to obtain a directed graph;
(4-2) carrying out frequent subgraph mining on the directed graph obtained in the step (4-1) to obtain a frequently-appearing substructure in the directed graph;
specifically, this step includes the following substeps:
(4-2-1) initializing each node in the directed graph as a sub-structure;
(4-2-2) calculating scores of all current substructures, and selecting the substructures with the lowest scores as the optimal substructures;
Specifically, the score of the substructure is $I(S) + I(G|S)$, where S denotes the substructure in the directed graph G, (G|S) denotes the graph data obtained by replacing the substructure S with a single vertex in the directed graph G, I(S) denotes the description length of the directed graph in which the substructure S lies, and I(G|S) denotes the description length of the directed graph obtained by replacing the substructure S with a single vertex in the directed graph G.
The calculation of the description length I(S) of the directed graph in which the substructure S lies can be divided into three parts, namely:
$I(S) = v + r + e$
where v represents the number of bits required to construct the vertex labels of the directed graph in which the substructure S lies:
$v = \lg v + v \lg(l_u)$
where $l_u$ represents the set of all vertices of the directed graph. r represents the number of bits required for each row of the adjacency matrix A when converting the directed graph into the adjacency matrix A (another representation of the graph), and is given by:
[Equation image BDA0003486746230000091: expression for r, not reproduced in this text]
where $b = \max_i(k_i)$, max denotes taking the maximum value, and $k_i$ indicates the number of 1s in the i-th row of the adjacency matrix A.
e represents the number of bits required for the edges represented by $A[i,j] = 1$ in the adjacency matrix A (that is, the number of bits required to store all the edges in the graph), and is given by:
[Equation image BDA0003486746230000092: expression for e, not reproduced in this text]
where m denotes the size of the edge of the adjacency matrix A represented by $A[i,j] = 1$, and u denotes the vertex of the adjacency matrix A.
(4-2-3) adding a vertex (namely an edge adjacent to all nodes in all the substructures in the step (4-2-2)), and expanding the optimal substructures obtained in the step (4-2-2) by using the vertex to obtain expanded substructures as current substructures;
(4-2-4) repeating the step (4-2-2) and the step (4-2-3) until all the optimal substructures are obtained, wherein all the optimal substructures form the frequently-appearing substructures in the directed graph.
The advantage of steps (4-2-1) to (4-2-4) is that they improve the interpretability of root cause localization and make use of call chain data that contains no anomalies.
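The replacement of a substructure by a single vertex, written (G|S) in the score, can be sketched in Python as follows. The edge list, the node names and the idea of passing the instances of the substructure as explicit node sets are hypothetical simplifications; finding those instances (subgraph matching) is not shown.

```python
def collapse(edges, instances, label="S"):
    """Return the edge set of G|S: every instance of the substructure
    (given as a set of nodes) is merged into one new vertex."""
    owner = {}
    for index, nodes in enumerate(instances):
        for node in nodes:
            owner[node] = f"{label}{index}"
    compressed = set()
    for src, dst in edges:
        new_src, new_dst = owner.get(src, src), owner.get(dst, dst)
        if new_src != new_dst:          # edges inside an instance disappear
            compressed.add((new_src, new_dst))
    return compressed

# Hypothetical call graph: a gateway calls the same a -> b -> c chain twice.
edges = {("gw", "a1"), ("a1", "b1"), ("b1", "c1"),
         ("gw", "a2"), ("a2", "b2"), ("b2", "c2")}
instances = [{"a1", "b1", "c1"}, {"a2", "b2", "c2"}]

compressed = collapse(edges, instances)
print(sorted(compressed))
```

Here six edges shrink to two, which is why a substructure that occurs often yields a short description of (G|S) and therefore a low score.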
(4-3) establishing a graph database by using frequently-occurring substructures in the directed graph obtained in the step (4-2);
Specifically, in this step the frequently-appearing substructures in the directed graph obtained in step (4-2) are numbered sequentially; each substructure that contains abnormal nodes is then manually labeled with those abnormal nodes, and a substructure that contains no abnormal node is labeled null.
(4-4) acquiring an intersection set between the frequently-occurring substructure in the directed graph obtained in the step (4-2) and all nodes in the substructure marked as having abnormal nodes in the graph database as a first set, acquiring an intersection set between the frequently-occurring substructure in the directed graph obtained in the step (4-2) and all nodes in the substructure marked as null in the graph database as a second set, and then taking the intersection set between the first set and the second set as an abnormal root cause node;
(4-5) taking each abnormal root cause node obtained in the step (4-4) as an input node, querying the container index, middleware index, host index and database index data obtained in the step (2) for the four values corresponding to the input node, and calculating the change rate, the first-order difference, the second-order difference and the sliding-window average change rate of the four values as feature values. The Robust Random Cut Forest (RRCF) algorithm retains the feature values that exceed a set threshold (set manually from experience according to the size of the index data set in the system) and sets the others to zero, forming a relatively sparse feature vector. The feature vector is then classified by the K-Nearest Neighbor (KNN) method with K = 1 to determine the final root cause performance index (i.e., which of the container, middleware, host, or database is abnormal).
Specifically, to make changes in anomalous feature values more pronounced, the RRCF algorithm is used to convert each feature value into an RRCF score.
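The feature construction and nearest-neighbor classification of step (4-5) can be sketched in Python as follows. The RRCF scoring is not implemented here: a plain absolute-value threshold stands in for it, which is a deliberate simplification. The series values, the threshold of 0.5, the window length of 3 and the labeled reference vectors are all hypothetical.

```python
from math import dist

INDEXES = ["container", "middleware", "host", "database"]

def series_features(xs, window=3):
    """Change rate, first-order difference, second-order difference and
    sliding-window average change rate at the latest point of one series."""
    first = xs[-1] - xs[-2]
    second = first - (xs[-2] - xs[-3])
    rate = first / xs[-2] if xs[-2] else 0.0
    recent = xs[-window - 1:]
    rates = [(b - a) / a if a else 0.0 for a, b in zip(recent, recent[1:])]
    return [rate, first, second, sum(rates) / len(rates)]

def sparsify(features, threshold):
    """Keep values above the threshold, zero the rest (stand-in for RRCF scoring)."""
    return [f if abs(f) > threshold else 0.0 for f in features]

def feature_vector(metrics, threshold):
    """Concatenate the sparse features of the four index series."""
    vector = []
    for name in INDEXES:
        vector += sparsify(series_features(metrics[name]), threshold)
    return vector

def nearest_label(vector, labelled):
    """K-nearest-neighbor classification with K = 1."""
    return min(labelled, key=lambda item: dist(vector, item[0]))[1]

flat = [10.0, 10.0, 10.0, 10.0, 10.0]
spike = [10.0, 10.0, 10.0, 10.0, 30.0]

def incident(faulty):
    """Hypothetical reference incident in which one index spikes."""
    return {name: (spike if name == faulty else flat) for name in INDEXES}

labelled = [(feature_vector(incident(name), 0.5), name) for name in INDEXES]

observed = incident("host")
observed["host"] = [10.0, 10.0, 10.0, 11.0, 28.0]
root_cause = nearest_label(feature_vector(observed, 0.5), labelled)
print(root_cause)
```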
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (9)

1. A root cause positioning method based on machine learning is characterized by comprising the following steps:
(1) acquiring call chain data consisting of data of a call process in the micro-service application system;
(2) acquiring service index data, container indexes, middleware indexes, host indexes and database index data of the micro-service application system;
(3) inputting the timestamp, the average calling time, the traffic, the success quantity and the success rate in the service index data acquired in the step (2) into a trained SVM network to obtain a detection result, judging whether the detection result is abnormal or not, if so, entering the step (4), otherwise, ending the process;
(4) carrying out root cause detection on the detection result obtained in the step (3) to obtain the node where the fault occurs and the performance index causing the fault.
2. The machine-learning based root cause localization method of claim 1,
the data of the calling process comprises a time stamp and a calling chain id of the data, a calling type, a service execution time, a caller id, a data id and a name of the micro-service application system.
The service index data of the micro-service application system comprises the name, the timestamp, the average calling time, the service volume, the success number and the success rate of the micro-service system; the purpose of obtaining container index, middleware index, host index and database index data is to locate specific abnormal performance index.
3. The root cause localization method based on machine learning of claim 1 or 2, characterized in that the SVM network is trained by the following sub-steps:
(3-1) acquiring service index data of the micro-service application system, sequentially carrying out normalization and data annotation processing on the acquired service index data, and dividing the processed service index data serving as a data set into a training set and a test set;
(3-2) initializing the parameters of the SVM network to obtain the initialized SVM network;
(3-3) inputting the training set obtained in the step (3-1) into the SVM network initialized in the step (3-2) for training to obtain a preliminarily trained SVM network;
(3-4) testing the SVM network preliminarily trained in the step (3-3) by using the test set obtained in the step (3-1) to obtain a finally trained SVM network.
4. The root cause localization method based on machine learning of any one of claims 1 to 3, wherein the step (3-2) is specifically to set the penalty coefficient C of the SVM network to 1.0, set the kernel function parameter kernel to the linear kernel function (linear), and set the weighting parameter class_weight to the ratio of positive and negative samples in the data set.
5. The root cause localization method based on machine learning of claim 4, wherein the step (4) comprises the following sub-steps:
(4-1) using the timestamp in the detection result obtained in step (3) to query the corresponding call chain id in the call chain data obtained in step (1), using that call chain id to obtain all records in the call chain data whose call chain id equals the queried one, and establishing in turn a directed edge from the data id in each obtained record to the call chain id to obtain a directed graph;
(4-2) carrying out frequent subgraph mining on the directed graph obtained in the step (4-1) to obtain a frequently-appearing substructure in the directed graph;
(4-3) establishing a graph database by using frequently-occurring substructures in the directed graph obtained in the step (4-2);
(4-4) taking, as a first set, the intersection between the frequently occurring substructures in the directed graph obtained in step (4-2) and all nodes of the substructures marked in the graph database as containing abnormal nodes; taking, as a second set, the intersection between those frequently occurring substructures and all nodes of the substructures marked as null in the graph database; and then taking the intersection of the first set and the second set as the abnormal root cause nodes;
and (4-5) taking each abnormal root cause node obtained in step (4-4) as an input node; querying, in the container index, middleware index, host index and database index obtained in step (2), the four values corresponding to the input node; calculating the change rate, the first-order difference, the second-order difference and the sliding-window average change rate of the four values as feature values; retaining, by means of the Robust Random Cut Forest (RRCF) algorithm, the feature values that exceed a set threshold and setting the others to zero, so as to form a relatively sparse feature vector; and classifying the feature vector by the K-nearest-neighbor method with K = 1 to determine the final root cause performance index (namely, which of the container, the middleware, the host or the database is abnormal), wherein the threshold is set manually, based on experience, according to the size of the index data set in the system.
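Step (4-5) can be illustrated with a small sketch. Several simplifications are assumptions rather than the claimed method: each of the four index sources is represented by one short time series, the four feature values are computed at the latest sample, a plain absolute-value threshold stands in for the RRCF filtering step, and the labelled reference vectors passed to the 1-nearest-neighbor classifier are hypothetical. All function names are made up for the example.

```python
import math
from typing import Dict, List, Sequence, Tuple

SOURCES = ("container", "middleware", "host", "database")


def metric_features(series: Sequence[float], window: int = 3) -> List[float]:
    """Change rate, first difference, second difference and
    sliding-window average change rate at the latest sample."""
    if len(series) < max(3, 2 * window):
        raise ValueError("series too short for the chosen window")
    a, b, c = series[-3], series[-2], series[-1]
    first_diff = c - b
    second_diff = (c - b) - (b - a)
    change_rate = first_diff / abs(b) if b else 0.0
    prev_mean = sum(series[-2 * window:-window]) / window
    curr_mean = sum(series[-window:]) / window
    window_rate = (curr_mean - prev_mean) / abs(prev_mean) if prev_mean else 0.0
    return [change_rate, first_diff, second_diff, window_rate]


def build_vector(metrics: Dict[str, Sequence[float]],
                 threshold: float) -> List[float]:
    """Concatenate the features of all four sources and zero the small ones.
    The simple threshold is a stand-in for RRCF, not an RRCF implementation."""
    vec: List[float] = []
    for name in SOURCES:
        vec.extend(metric_features(metrics[name]))
    return [x if abs(x) > threshold else 0.0 for x in vec]


def classify_1nn(vec: Sequence[float],
                 labelled: Sequence[Tuple[Sequence[float], str]]) -> str:
    """K-nearest-neighbor with K = 1 under Euclidean distance."""
    return min(labelled, key=lambda pair: math.dist(vec, pair[0]))[1]
```

With a series that jumps only in the container metrics, the sparse vector is non-zero only in the container positions, so the nearest labelled reference is expected to be a container one.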
6. The root cause localization method based on machine learning of claim 5, wherein the step (4-2) comprises the following sub-steps:
(4-2-1) initializing each node in the directed graph as a sub-structure;
(4-2-2) calculating scores of all current substructures, and selecting the substructures with the lowest scores as the optimal substructures;
(4-2-3) adding a vertex (namely an edge adjacent to all nodes in all the substructures in the step (4-2-2)), and expanding the optimal substructures obtained in the step (4-2-2) by using the vertex to obtain expanded substructures as current substructures;
(4-2-4) repeating step (4-2-2) and step (4-2-3) until all the optimal substructures are obtained, wherein all the optimal substructures form the frequently occurring substructures in the directed graph.
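The loop in steps (4-2-1) to (4-2-4) can be read as a greedy expansion. The sketch below is one possible reading, not the patented procedure: the scoring function is passed in by the caller, the `max_rounds` stopping rule is an assumption (the claim only says to repeat until all optimal substructures are obtained), and a substructure is represented simply as a set of node names.

```python
from typing import Callable, FrozenSet, Iterable, List, Tuple

Edge = Tuple[str, str]


def mine_substructures(edges: Iterable[Edge],
                       score: Callable[[FrozenSet[str]], float],
                       max_rounds: int) -> List[FrozenSet[str]]:
    """Greedy expansion: pick the lowest-scoring substructure, then grow it
    by one adjacent vertex, and repeat for at most max_rounds rounds."""
    edge_list = list(edges)
    nodes = sorted({n for e in edge_list for n in e})
    # (4-2-1) every node starts as its own substructure
    current = [frozenset([n]) for n in nodes]
    selected: List[FrozenSet[str]] = []
    for _ in range(max_rounds):
        if not current:
            break
        # (4-2-2) lowest score wins; ties are broken by node names
        best = min(current, key=lambda s: (score(s), sorted(s)))
        selected.append(best)
        # (4-2-3) candidate vertices are those joined to `best` by an edge
        neighbours = ({v for u, v in edge_list if u in best}
                      | {u for u, v in edge_list if v in best}) - best
        current = [best | {n} for n in sorted(neighbours)]
    return selected
```

In the claims, the score would be the description-length measure of claim 7; any callable with the same shape can be plugged in for experimentation.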
7. The root cause localization method based on machine learning according to claim 6, wherein the score of the substructure is I(S) + I(G|S), where S represents the substructure in the directed graph G, (G|S) represents the graph data obtained by replacing the substructure S with a single vertex in the directed graph G, I(S) represents the description length of the directed graph where the substructure S is located, and I(G|S) represents the description length of the directed graph obtained by replacing the substructure S with a single vertex in the directed graph G.
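The operation behind (G|S), replacing a substructure with a single vertex, can be sketched on an edge-set representation. The function names, the placeholder vertex name `"S"` and the pluggable `description_length` callable are assumptions; in this sketch the substructure S is represented by the edges that lie entirely inside it.

```python
from typing import Callable, Iterable, Set, Tuple

Edge = Tuple[str, str]


def compress(edges: Iterable[Edge], sub: Set[str],
             new_vertex: str = "S") -> Set[Edge]:
    """Return G|S: every node of `sub` is merged into one vertex."""
    out: Set[Edge] = set()
    for u, v in edges:
        u2 = new_vertex if u in sub else u
        v2 = new_vertex if v in sub else v
        if u2 == new_vertex and v2 == new_vertex:
            continue  # an edge inside S disappears after the merge
        out.add((u2, v2))
    return out


def substructure_score(edges: Iterable[Edge], sub: Set[str],
                       description_length: Callable[[Set[Edge]], float]) -> float:
    """I(S) + I(G|S), with the description length supplied by the caller."""
    edge_set = set(edges)
    inside = {(u, v) for u, v in edge_set if u in sub and v in sub}
    return description_length(inside) + description_length(compress(edge_set, sub))
```

A substructure that occurs often and compresses the graph well yields a small total, which is why the lowest score is preferred in step (4-2-2).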
8. The machine-learning based root cause localization method of claim 7,
the length of the description of the directed graph in which the substructure S lies, I (S), is equal to:
I(S)=v+r+e
where v represents the number of bits required to construct the vertex labels of the directed graph in which the substructure S lies:
v = lg v + v lg(l_u)
where l_u represents the set of all vertices of the directed graph; r represents the number of bits required for each row of the adjacency matrix A when the directed graph is converted into the adjacency matrix A, and has:
Figure FDA0003486746220000041
where b = max(k_i), max denotes taking the maximum value, and k_i represents the number of 1s in the i-th row of the adjacency matrix A;
e represents the number of bits required for the edges represented by A[i, j] = 1 in the adjacency matrix A, that is, the number of bits required to store all the edges in the graph, and has:
Figure FDA0003486746220000042
where m denotes the size of the edges represented by A[i, j] = 1 in the adjacency matrix A, and u denotes a vertex of the adjacency matrix A.
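The parts of claim 8 that survive in text form can be checked numerically. In the sketch below, lg is taken to be the base-2 logarithm and l_u is passed in as a count (`label_count`); both are assumptions. The formulas for r and e appear only as figures in the source, so they are not reconstructed here: the caller supplies those two values.

```python
import math
from typing import List, Sequence, Tuple


def vertex_bits(num_vertices: int, label_count: int) -> float:
    """v = lg v + v * lg(l_u), with lg assumed to be log base 2."""
    return math.log2(num_vertices) + num_vertices * math.log2(label_count)


def row_stats(adj: Sequence[Sequence[int]]) -> Tuple[List[int], int]:
    """Return (k, b): k[i] is the number of 1s in row i, b = max(k_i)."""
    k = [sum(1 for x in row if x == 1) for row in adj]
    return k, max(k)


def description_length(adj: Sequence[Sequence[int]], label_count: int,
                       r_bits: float, e_bits: float) -> float:
    """I(S) = v + r + e; r_bits and e_bits come from the caller because
    their formulas are not reproduced in the text."""
    return vertex_bits(len(adj), label_count) + r_bits + e_bits
```

For a graph with 4 vertices and a label count of 2, the vertex term is lg 4 + 4 lg 2 = 2 + 4 = 6 bits.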
9. A root cause location system based on machine learning, comprising:
the first module is used for acquiring call chain data consisting of data of a call process in the micro-service application system;
the second module is used for acquiring service index data, container indexes, middleware indexes, host indexes and database index data of the micro-service application system;
the third module is used for inputting the timestamp, the average calling time, the traffic, the success quantity and the success rate in the service index data acquired by the second module into the trained SVM network to obtain a detection result, judging whether the detection result is abnormal or not, if so, entering the fourth module, otherwise, ending the process;
and the fourth module is used for carrying out root cause detection on the detection result obtained by the third module so as to obtain the node with the fault and the performance index causing the fault.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210089130.1A CN114416423B (en) 2022-01-25 2022-01-25 Root cause positioning method and system based on machine learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210089130.1A CN114416423B (en) 2022-01-25 2022-01-25 Root cause positioning method and system based on machine learning

Publications (2)

Publication Number Publication Date
CN114416423A true CN114416423A (en) 2022-04-29
CN114416423B CN114416423B (en) 2024-08-23

Family

ID=81277289

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210089130.1A Active CN114416423B (en) 2022-01-25 2022-01-25 Root cause positioning method and system based on machine learning

Country Status (1)

Country Link
CN (1) CN114416423B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115048998A (en) * 2022-06-13 2022-09-13 大连理工大学 Cable-stayed bridge group cable force abnormity identification and positioning method based on monitoring data
CN115118574A (en) * 2022-06-07 2022-09-27 马上消费金融股份有限公司 Data processing method, device and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020140945A1 (en) * 2019-01-02 2020-07-09 中国移动通信有限公司研究院 Container-based virtual resource management method, apparatus, and system
CN113014421A (en) * 2021-02-08 2021-06-22 武汉大学 Micro-service root cause positioning method for cloud native system
CN113282635A (en) * 2021-04-12 2021-08-20 国电南瑞科技股份有限公司 Micro-service system fault root cause positioning method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
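Chen Lizhong (陈立忠): "Intelligent automated operation and maintenance based on machine learning" (基于机器学习的智能化自动化运维), China New Telecommunications (中国新通信), no. 14, 20 July 2020 (2020-07-20) *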

Also Published As

Publication number Publication date
CN114416423B (en) 2024-08-23

Similar Documents

Publication Publication Date Title
US8630962B2 (en) Error detection method and its system for early detection of errors in a planar or facilities
CN107003992B (en) Perceptual associative memory for neural language behavior recognition systems
CN114416423B (en) Root cause positioning method and system based on machine learning
CN107111610B (en) Mapper component for neuro-linguistic behavior recognition systems
US20240070388A1 (en) Lexical analyzer for a neuro-linguistic behavior recognition system
EP1958034B1 (en) Use of sequential clustering for instance selection in machine condition monitoring
CN114861788A (en) Load abnormity detection method and system based on DBSCAN clustering
CN108470022A (en) A kind of intelligent work order quality detecting method based on operation management
CN114296975A (en) Distributed system call chain and log fusion anomaly detection method
CN117421994A (en) Edge application health monitoring method and system
CN112465045A (en) Supply chain exception event detection method based on twin neural network
Ali et al. Fake accounts detection on social media using stack ensemble system
CN114897085A (en) Clustering method based on closed subgraph link prediction and computer equipment
CN116756225B (en) Situation data information processing method based on computer network security
Stržinar et al. Soft sensor for non-invasive detection of process events based on Eigenresponse Fuzzy Clustering
CN113535522A (en) Abnormal condition detection method, device and equipment
Gias et al. Samplehst: Efficient on-the-fly selection of distributed traces
Rahman et al. Performance analysis of the imbalanced data method on increasing the classification accuracy of the machine learning hybrid method
CN117170922A (en) Log data analysis method, device, terminal equipment and storage medium
CN116708152A (en) Method and system for positioning fault root cause of wireless network equipment based on machine learning
CN114528906A (en) Fault diagnosis method, device, equipment and medium for rotary machine
CN115278752A (en) AI (Artificial intelligence) detection method for abnormal logs of 5G (third generation) communication system
Ip et al. ML-assisted monitoring and characterization of IoT sensor networks
CN114124676B (en) Fault root positioning method and system for network intelligent operation and maintenance system
CN118211154B (en) Class increment service identification method and system based on continuous learning improvement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant