CN113900844A - Service code level-based fault root cause positioning method, system and storage medium - Google Patents
- Publication number
- CN113900844A (application CN202111127982.7A)
- Authority
- CN
- China
- Prior art keywords
- fault
- root cause
- heterogeneous
- node
- calling
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/079—Root cause analysis, i.e. error or fault diagnosis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0766—Error or fault reporting or storing
- G06F11/0775—Content or structure details of the error report, e.g. specific table structure, specific error fields
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Quality & Reliability (AREA)
- Software Systems (AREA)
- Artificial Intelligence (AREA)
- Biomedical Technology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Medical Informatics (AREA)
- Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Debugging And Monitoring (AREA)
- Data Exchanges In Wide-Area Networks (AREA)
Abstract
The invention provides a fault root cause positioning method, system and storage medium based on service code level. The method comprises the following steps: constructing a global heterogeneous topological graph comprising the intersystem calling relation and the service code calling relation; constructing a time series anomaly detection model based on multi-dimensional indexes and performing anomaly detection on each calling edge of the global heterogeneous topological graph; generating a heterogeneous fault graph based on the anomaly detection result of each calling edge; and performing fault root cause positioning on the obtained heterogeneous fault graph based on a random walk object level sorting algorithm. The heterogeneous topological graph shows the finer-grained calling and membership relations of service codes simply and clearly; fusing the correlation features of the multi-dimensional indexes improves the accuracy of index anomaly detection on calling edges in the heterogeneous topology; and the node sorting algorithm for heterogeneous graphs improves the accuracy of fault root cause positioning.
Description
Technical Field
The invention relates to fault root cause location, in particular to fault root cause location based on service code level.
Background
With the rapid development of technologies such as cloud computing and service computing and the increasing demand of social production for business, more and more modern enterprises deploy application programs and system services in a cloud computing environment, which are called distributed cloud application programs or micro-services. Compared with the traditional centralized architecture, the distributed architecture has better component expansibility, higher development productivity and lower cost.
To ensure high availability and reliability of the system, application providers must deploy link monitoring systems to collect key performance metrics for each service, such as network response time, service response rate, success rate, etc., to handle complex distributed environments to meet availability constraints and stringent service level objectives. However, with increasingly complex business requirements and increasing micro-service scale, when a fault occurs, a large number of index alarms are generated due to the existence of a cross-system multiple-call dependency relationship, and at this time, a system administrator faces massive alarm index information and is difficult to quickly find a key alarm index and a corresponding fault root cause system thereof only by relying on manual analysis, so that monitoring index data and a system topological relationship need to be automatically processed and analyzed by using a machine learning algorithm, so that a fault root cause system is quickly positioned.
However, most existing link tracing and monitoring systems collect only the call relation data between systems and perform fault root cause location on system-level call relations, without considering the service code information carried by system calls. As a result, existing schemes struggle to locate fine-grained fault root causes, and anomalies are easily hidden by coarse-grained, system-level data aggregation.
In addition, due to complexity and periodicity of services, the existing simple anomaly detection strategy based on a fixed threshold or k-sigma has more false alarms or false negatives, for example, the effect of an alarm rule that the response rate is lower than 90% and the time exceeds 3 minutes in different services is not satisfactory, and an ideal effect is difficult to achieve. Most of the current anomaly detection algorithms only perform anomaly detection triggering alarm aiming at a single index, do not consider the complex dependency relationship existing among a plurality of key performance indexes, are easy to cause false alarm, and have high false alarm rate particularly in the scene of index anomaly detection of a fine-grained calling side in a heterogeneous topological structure.
Finally, for a data scene after combining system and service codes, currently, academic circles and industrial circles mostly adopt the same level of call data for analysis, but most of actual scenes involve multiple different levels of call data, and the situation is often more complicated. Therefore, a fault root cause positioning scheme for a converged system and service code needs to be provided.
Disclosure of Invention
In order to solve the above problems in the prior art, the present invention provides:
A fault root cause positioning method based on service code level mainly comprises the following steps:
S1, constructing a global heterogeneous topological graph comprising an intersystem calling relation and a service code calling relation;
S2, constructing a time series anomaly detection model based on multi-dimensional indexes, and carrying out anomaly detection on each calling edge of the global heterogeneous topological graph;
S3, generating a heterogeneous fault graph based on the anomaly detection result of each calling edge;
S4, performing fault root cause positioning on the obtained heterogeneous fault graph based on a random walk object level sorting algorithm.
A fault root cause positioning system based on service code level mainly comprises the following modules:
the global heterogeneous topological graph generating module is used for constructing a global heterogeneous topological graph comprising an intersystem calling relation and a service code calling relation;
the anomaly detection module is used for constructing a time series anomaly detection model based on multi-dimensional indexes and carrying out anomaly detection on each calling edge of the global heterogeneous topological graph;
the heterogeneous fault map generation module is used for generating a heterogeneous fault map based on the abnormal detection result of each calling edge;
and the fault root cause positioning module is used for positioning the fault root cause of the obtained heterogeneous fault graph based on a random walk object level sorting algorithm.
A storage medium storing a computer program; when the computer program is executed by a processor in a computer device, the computer device performs the method as described in any one of the above.
By constructing a heterogeneous topological graph, the invention simply and clearly shows the calling relation and the membership relation of the service codes with finer granularity; by fusing the correlation characteristics of the multi-dimensional indexes, a time series abnormity detection model based on the multi-dimensional indexes is constructed, the abnormity detection of the calling edge of the global heterogeneous topological graph is realized, and compared with the technical problem of high false alarm rate caused by carrying out abnormity detection only aiming at a single index in the prior art, the accuracy of abnormity detection of the index of the calling edge in the heterogeneous topological structure is effectively improved; further, a heterogeneous fault graph and a root cause system corresponding to the current alarm are obtained through a node sorting algorithm of the heterogeneous graph and combined with automatic processing of a machine learning algorithm, and are simply displayed to the system for subsequent analysis and processing in a form of visual graph and root cause recommendation, so that an administrator can be assisted to efficiently locate the fault root cause, and the accuracy of fault root cause location is effectively improved.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the invention, and it is obvious for those skilled in the art that other drawings can be obtained based on these drawings without creative efforts.
FIG. 1 is a flow chart of the method of the present invention
FIG. 2 is a heterogeneous topology diagram of the system and service code invocation relationship of the present invention
FIG. 3 is a time series anomaly detection model based on multi-dimensional indexes
FIG. 4 is a schematic diagram of the index abnormality detection result of the present invention
FIG. 5 is a heterogeneous fault graph of the present invention
FIG. 6 is a visual interface for fault root cause location in accordance with the present invention
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings. It should be noted that the embodiments and features of the embodiments of the present application may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced otherwise than as specifically described herein, and thus the scope of the present invention is not limited by the specific embodiments disclosed below.
Example one
In order to solve the problems in the prior art, the present embodiment provides a method for locating a fault root cause based on a service code level, where a flowchart is shown in fig. 1, and the method mainly includes the following steps:
S1 Construct a global heterogeneous topological graph including intersystem call relationships and service code call relationships.
In order to locate anomalies and root causes at the finer-grained service code level, the invention provides a graph construction strategy that mixes service codes and application systems. In addition, if system calls are forwarded through the enterprise service bus system ESB_F5, the service code calling relationships and the service code memberships in the upstream and downstream systems can be obtained by collating the CMDB service call lookup table. The construction process of the heterogeneous topological graph is described below with actual sample data.
The service monitoring system records the call data and the status of each service transaction in its logs. For example, the call log at a certain alarm time, after parsing, is shown in Table 1 below:
TABLE 1 parsed transaction detail data
It can be seen that the data at this time includes the system nodes S1, S2, S3, S4 and the service code nodes T1, T2 that they call. Taking all call relationships among these nodes into account, a heterogeneous topological graph covering the call relationships of both system nodes and service code nodes is constructed, as shown in FIG. 2, which reflects the global call relationships between systems and service codes. Each calling edge carries time series index data obtained by aggregating the transaction detail data at a set time granularity. The indexes adopted by the invention comprise: transaction amount, success amount, response amount, failure amount, non-response amount, success rate, and response time. Compared with the prior art, which only uses the call topology between systems, the heterogeneous topological graph used by the invention includes the call relationships among both systems and service codes, so it captures the finer-grained calling and membership relationships of service codes in a concise and clear representation.
Because traffic in actual services is large and complex, the resulting global heterogeneous topological graph is often complex. However, a fault in an actual production environment affects only local services, so the invention proposes performing anomaly detection on the calling edges of the global heterogeneous topological graph in order to obtain the local heterogeneous topological graph in which the fault occurs.
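The aggregation of transaction details into per-edge time series described in S1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the record layout (caller, callee, minute, success flag, response time), the sample values, and the helper names `node_type` and `build_topology` are all assumptions, since the layout of Table 1 is not available in this text.

```python
from collections import defaultdict

# Hypothetical parsed transaction records:
# (caller, callee, minute bucket, succeeded, response time in ms).
records = [
    ("S1", "T1", 0, True, 120.0),
    ("S1", "T1", 0, False, 900.0),
    ("T1", "S2", 0, True, 80.0),
    ("S3", "T2", 1, True, 60.0),
    ("T2", "S4", 1, True, 75.0),
]

def node_type(name):
    """Assumed naming convention: 'S*' is a system node, 'T*' a service code node."""
    return "system" if name.startswith("S") else "service_code"

def build_topology(records):
    """Aggregate transaction details into per-edge, per-time-bucket indexes."""
    buckets = defaultdict(
        lambda: defaultdict(lambda: {"total": 0, "success": 0, "rt_sum": 0.0})
    )
    for caller, callee, minute, ok, rt in records:
        cell = buckets[(caller, callee)][minute]
        cell["total"] += 1
        cell["success"] += int(ok)
        cell["rt_sum"] += rt
    graph = {}
    for edge, series in buckets.items():
        # Each calling edge maps to a time-ordered series of index values.
        graph[edge] = {
            minute: {
                "transaction_amount": c["total"],
                "success_amount": c["success"],
                "failure_amount": c["total"] - c["success"],
                "success_rate": c["success"] / c["total"],
                "response_time": c["rt_sum"] / c["total"],
            }
            for minute, c in sorted(series.items())
        }
    return graph

graph = build_topology(records)
```

Keying the graph by (caller, callee) pairs keeps system nodes and service code nodes in one structure, while `node_type` recovers the object type when it is needed later.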
S2 Construct a time series anomaly detection model based on the multi-dimensional indexes and perform anomaly detection on the calling edges of the global heterogeneous topological graph. The time series anomaly detection model based on the multi-dimensional indexes is built on a graph attention mechanism, as shown in FIG. 3. S2 specifically includes the following steps:
S2.1 Normalize the time series within the time window corresponding to each of the n indexes.
Here n is the number of KPI indexes collected on each calling edge. To capture the correlation among the indexes, the n KPI indexes are represented as nodes, i.e. the i-th index corresponds to node $v_i$. Min-max normalization is applied to obtain the input features $\{v_1, v_2, \dots, v_n\}$ of the n KPI indexes, where node $v_i \in \mathbb{R}^w$ is the w-dimensional feature vector of the i-th KPI index and the dimension w equals the length of the time window.
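The min-max normalization of S2.1 can be sketched as below. The KPI names and window values are made up for illustration, and mapping a constant window to zeros is an assumed convention for the degenerate case that the source does not discuss.

```python
def min_max_normalize(window):
    """Scale one KPI window to [0, 1]; a constant window maps to all zeros (assumed convention)."""
    lo, hi = min(window), max(window)
    if hi == lo:
        return [0.0 for _ in window]
    return [(x - lo) / (hi - lo) for x in window]

# Hypothetical windows of length w = 4 for n = 2 KPI indexes on one calling edge.
kpi_windows = {
    "success_rate": [0.99, 0.98, 0.97, 0.60],
    "response_time": [100.0, 110.0, 105.0, 400.0],
}
features = {name: min_max_normalize(series) for name, series in kpi_windows.items()}
```

After this step every index lives on the same scale, so indexes with large raw magnitudes such as response time do not dominate the attention weights.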
S2.2 Learn the fusion feature of each node through the graph attention mechanism.
The fusion feature $h_i$ of node $v_i$ is calculated by the following formula:
$$h_i = \sigma\Big(\sum_{j \in N(i)} a_{ij}\, v_j\Big)$$
where $N(i)$ is the set of neighbor nodes of node $v_i$, $v_j$ is a neighbor node of $v_i$ and represents the w-dimensional feature vector of the j-th index, $\sigma$ is the sigmoid activation function, and $a_{ij}$ is the association weight between node $v_i$ and node $v_j$, calculated by the following formula:
$$a_{ij} = \frac{\exp(e_{ij})}{\sum_{l=1}^{L} \exp(e_{il})}, \qquad e_{ij} = \mathrm{LeakyReLU}\big(W^{\top}(v_i \oplus v_j)\big)$$
where $e_{ij}$ is the attention value of the edge between node $v_i$ and node $v_j$, $e_{il}$ is the attention value of the edge between node $v_i$ and node $v_l$, $\oplus$ is the feature concatenation operation, LeakyReLU is an activation function, $W$ is a learnable parameter matrix, $L$ is the number of neighbor nodes of node $v_i$, and $l$ indexes the neighbor nodes of node $v_i$.
The fusion features of all nodes obtained in this way are denoted $H_i$.
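A graph attention fusion step of the kind described in S2.2 can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the patent's model: the nodes are assumed to be fully connected (every index is a neighbor of every index, including itself), the learnable parameter is treated as a fixed vector of length 2w rather than a trained matrix, and the function names are hypothetical.

```python
import math

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def attention_fuse(features, weight):
    """features: n vectors of length w; weight: 2w floats (assumed vector form of W)."""
    n = len(features)
    w = len(features[0])
    fused = []
    for i in range(n):
        # Attention value e_ij from the concatenated pair (v_i, v_j).
        scores = []
        for j in range(n):
            concat = features[i] + features[j]
            scores.append(leaky_relu(sum(p * x for p, x in zip(weight, concat))))
        # Softmax over the neighbors of v_i, shifted by the max for numerical stability.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        alphas = [e / total for e in exps]
        # Weighted sum of neighbor features followed by the sigmoid activation.
        fused.append(
            [sigmoid(sum(alphas[j] * features[j][k] for j in range(n))) for k in range(w)]
        )
    return fused

node_features = [[0.0, 0.5, 1.0], [1.0, 0.5, 0.0]]
fused = attention_fuse(node_features, weight=[0.1] * 6)
```

In a trained model the weight would be learned together with the downstream LSTM and MLP; it is fixed here only so that the arithmetic of the attention step is visible.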
S2.3 Based on the obtained fusion features $H_i$ of all nodes, learn the embedded features of the time series corresponding to the different indexes.
After the graph attention layer, the fusion features $H_i$ of all nodes have output dimension n × w. They are concatenated with the original sequence features to obtain features of dimension n × 2w, which are then fed into an LSTM module that encodes the long-term temporal dependencies; the embedded features of the time series corresponding to the different indexes are obtained by learning.
S2.4 Based on the obtained embedded features of the time series corresponding to the different indexes, obtain the predicted values $\hat{x}_i(t)$ of the time series of all indexes at time t.
Specifically, the embedded features of all indexes are input into a multi-layer perceptron (MLP) to obtain the predicted values $\hat{x}_i(t)$ of all time series at time t, and the mean square error (MSE) loss function is used as the optimization function:
$$\mathrm{Loss}_{\mathrm{MSE}} = \frac{1}{n}\sum_{i=1}^{n}\big(x_i(t) - \hat{x}_i(t)\big)^2$$
where n is the number of predicted indexes, $x_i(t)$ is the observed value of the i-th index at time t, and $\hat{x}_i(t)$ is its predicted value.
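The mean square error over the n predicted indexes can be sketched as below; the function name and the sample values are illustrative assumptions.

```python
def mse_loss(actual, predicted):
    """Mean square error between observed and predicted index values at time t."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical observed and predicted values for n = 2 indexes.
loss = mse_loss([1.0, 2.0], [1.0, 4.0])
```

During training this quantity would be minimized over the parameters of the attention layer, the LSTM, and the MLP.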
S2.5 Based on the obtained predicted values $\hat{x}_i(t)$ of the time series of all indexes at time t, calculate the anomaly score $score_i(t)$ representing the degree of index deviation.
The deviation value of the i-th index is calculated by the following formula:
$$d_i(t) = \lvert x_i(t) - \hat{x}_i(t) \rvert$$
The deviation value of the index is normalized by the following formula:
$$score_i(t) = \frac{d_i(t) - \tilde{\mu}_i}{\tilde{\sigma}_i}$$
where $score_i(t)$ is the anomaly score, $x_i(t)$ is the observed value of the i-th index at time t, and $\tilde{\mu}_i$ and $\tilde{\sigma}_i$ are the median and the quartile (interquartile) range of the deviation values, used in place of the mean and the standard deviation; according to the experiments, this normalization performed best. With the multi-index time series anomaly detection model, the degree of deviation of each index can be observed more intuitively.
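The robust normalization in S2.5 can be sketched as follows. Reading "the quartile" as the interquartile range is an assumption, as are the function name, the handling of a zero range, and the sample history of past deviation values.

```python
import statistics

def anomaly_score(history, observed, predicted):
    """Deviation of one index at time t, normalized by the median and
    interquartile range of past deviation values (assumed interpretation)."""
    deviation = abs(observed - predicted)
    median = statistics.median(history)
    q1, _, q3 = statistics.quantiles(history, n=4)
    iqr = q3 - q1
    if iqr == 0:
        # Degenerate history: any departure from the median is treated as extreme.
        return 0.0 if deviation == median else float("inf")
    return (deviation - median) / iqr

# Hypothetical past deviation values of one index on one calling edge.
history = [1, 2, 3, 4, 5, 6, 7]
score = anomaly_score(history, observed=112.0, predicted=100.0)
```

Median and quartile statistics are less sensitive to the outliers that anomalies themselves produce, which is the usual motivation for preferring them over the mean and standard deviation.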
S2.6 Based on the obtained anomaly score $score_i(t)$, determine whether the calling edge is abnormal. Specifically, the anomaly score $score_i(t)$ representing the degree of index deviation is compared with a preset threshold; when $score_i(t)$ is greater than the threshold, the detection result of the calling edge is judged to be abnormal. The detection result is shown in FIG. 4, where red edges indicate abnormal and black edges indicate normal.
Compared with the traditional time series anomaly detection method, the time series anomaly detection model based on the multi-dimensional indexes, which is constructed by the invention, does not depend on any hypothesis of data distribution, and takes the correlation dependence characteristics among the multi-dimensional indexes called by the service into consideration, so that the anomaly detection is more accurate and efficient.
S3 Generate a heterogeneous fault graph based on the anomaly detection result of each calling edge.
Specifically, the abnormal calling edges in the heterogeneous topological graph are obtained from S2, and the calling edges whose detection result is normal are filtered out of the global heterogeneous topological graph, yielding a heterogeneous fault graph that displays only the faulty portion. For example, filtering the global heterogeneous topological graph of FIG. 2 results in the heterogeneous fault graph shown in FIG. 5.
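The filtering step of S3 can be sketched as below. The threshold value, the per-edge score dictionary, and the rule that an edge is abnormal when any one of its index scores exceeds the threshold are assumptions made for illustration.

```python
def build_fault_graph(edge_scores, threshold=3.0):
    """Keep only calling edges with at least one index score above the threshold."""
    fault_edges = {}
    for edge, scores in edge_scores.items():
        abnormal = {name: s for name, s in scores.items() if s > threshold}
        if abnormal:
            fault_edges[edge] = abnormal
    return fault_edges

# Hypothetical anomaly scores per calling edge and index.
edge_scores = {
    ("S1", "T1"): {"success_rate": 5.2, "response_time": 0.4},
    ("T1", "S2"): {"success_rate": 4.1, "response_time": 6.3},
    ("S3", "T2"): {"success_rate": 0.2, "response_time": 0.1},
}
fault_graph = build_fault_graph(edge_scores)
```

Keeping the abnormal indexes alongside each retained edge preserves the evidence that a visual interface could later display.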
S4, based on the random walk object level sorting algorithm, the fault root cause positioning is carried out on the obtained heterogeneous fault graph.
Specifically, S4 includes the following steps:
S4.1 Based on the heterogeneous fault graph generated in S3, determine the object set V and the object type set A.
Specifically, the heterogeneous fault graph generated in S3 can be formally expressed as $G = (V, E)$, where $V$ and $E$ denote the object set and the relationship set, respectively. Because the heterogeneous graph contains multiple types of objects, an object type mapping function $\phi: V \to A$ is defined, where A is the set of distinct object types after mapping; different object instances of the same type are mapped to their corresponding object type by the mapping function.
S4.2 Assign corresponding anomaly propagation factors to the different object types based on the obtained object type set A.
The anomaly propagation factors are assigned according to the importance of the different object types in the heterogeneous fault graph. Specifically, they can be assigned from expert knowledge, or learned from historical data with a search optimization algorithm such as simulated annealing.
Whereas the prior art does not consider differences in anomaly propagation between object types, the invention sets anomaly propagation factors between object types to express the differences in their anomaly propagation weights, which improves the accuracy and pertinence of the subsequent root cause score calculation.
S4.3 Based on the obtained object set V, iteratively calculate the pivot value of each object with the PageRank algorithm and use it as the initial root cause score $R_{ea}$ of that object, where a denotes any object in the object set V.
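The iterative PageRank computation in S4.3 can be sketched in plain Python as follows. The damping factor, the iteration limit, the edge direction, and the uniform redistribution of rank from nodes without outgoing edges are conventional choices assumed here; the source does not specify them.

```python
def pagerank(edges, damping=0.85, iterations=100, tol=1e-10):
    """Power-iteration PageRank on a directed edge list; returns a score per node."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    count = len(nodes)
    rank = {n: 1.0 / count for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread uniformly.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        new_rank = {n: (1.0 - damping) / count + damping * dangling / count for n in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        converged = sum(abs(new_rank[n] - rank[n]) for n in nodes) < tol
        rank = new_rank
        if converged:
            break
    return rank

# Hypothetical abnormal calling edges of a small heterogeneous fault graph.
fault_edges = [("S1", "T1"), ("T1", "S2"), ("S3", "T1")]
initial_scores = pagerank(fault_edges)
```

Whether rank should flow along the call direction or against it depends on how faults are assumed to propagate, so the edge list may need to be reversed before scoring.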
S4.4 Determine the root cause score $R_x$ of each object based on the obtained anomaly propagation factors and the initial root cause scores.
Specifically, the root cause score $R_x$ of object x is obtained by the following formula:
where X and Y denote the object sets of type X and type Y in the object type set A, x is an object in the object set of type X, and y is an object in the object set of type Y; $R_x$ and $R_y$ are the root cause scores of object x and object y, respectively; $M_{xY}$ is an adjacency matrix whose element $m_{xY}$ equals $num(x, Y)$ if a relationship exists between object x and object type Y and 0 otherwise; $num(x, Y)$ is the total number of relationships between object x and all objects in the object set of type Y; $\gamma_{XY}$ is the anomaly propagation factor between object type X and object type Y; and $\varepsilon$ is the attenuation factor, selected based on expert knowledge.
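To illustrate how type-dependent propagation factors and an attenuation factor can combine with initial scores, the sketch below applies one pass of a type-weighted propagation step. It is not the formula of the source, which is not recoverable from this text: the update rule, the undirected treatment of relationships, the division by neighbor degree, and all names and values are assumptions.

```python
from collections import defaultdict

def propagate(initial, node_types, edges, gamma, eps=0.5):
    """One assumed update: eps * own initial score plus (1 - eps) * type-weighted neighbor scores."""
    neighbours = defaultdict(list)
    for a, b in edges:
        neighbours[a].append(b)
        neighbours[b].append(a)
    scores = {}
    for x, own in initial.items():
        spread = 0.0
        for y in neighbours[x]:
            # gamma is looked up by the (type of x, type of y) pair; missing pairs contribute nothing.
            factor = gamma.get((node_types[x], node_types[y]), 0.0)
            spread += factor * initial[y] / len(neighbours[y])
        scores[x] = eps * own + (1.0 - eps) * spread
    return scores

# Hypothetical initial scores, object types and propagation factors.
initial = {"S1": 0.2, "T1": 0.5, "S2": 0.3}
node_types = {"S1": "system", "T1": "service_code", "S2": "system"}
edges = [("S1", "T1"), ("T1", "S2")]
gamma = {("system", "service_code"): 1.0, ("service_code", "system"): 0.5}
root_scores = propagate(initial, node_types, edges, gamma)
```

A full implementation would iterate such an update until the scores stabilize and would obtain the propagation factors from expert knowledge or from a search over historical data.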
The invention effectively solves the problem that the initial root factor score does not consider the relation between different object types by combining the object sorting algorithm of the heterogeneous graph.
S4.5 Based on the obtained root cause score of each object, select the objects with the top-K root cause scores as the fault root cause positioning result.
Here the top-K root cause scores are the K largest root cause scores.
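Selecting the K highest-scoring objects can be sketched with the standard library as below; the function name and the sample scores are illustrative.

```python
import heapq

def top_k_root_causes(scores, k):
    """Return the k (object, score) pairs with the largest root cause scores."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])

# Hypothetical root cause scores.
candidates = top_k_root_causes({"S1": 0.225, "T1": 0.375, "S2": 0.275}, k=2)
```

The result is ordered from the highest score downward, which matches how a ranked recommendation list would be presented to an administrator.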
Specifically, the obtained fault root cause positioning result is displayed in a visual form, as shown in fig. 6, for reference by a system administrator.
By adopting the heterogeneous topological graph, the invention simply and clearly shows the calling relation and the membership relation of the service codes with finer granularity; by fusing the correlation characteristics of the multi-dimensional indexes, the accuracy of index abnormality detection of the calling edge in the heterogeneous topological structure is effectively improved; the method has the advantages that root cause positioning is carried out by adopting a node sorting algorithm of a heterogeneous graph, not only are pivot values of abnormal propagation of objects in the heterogeneous graph considered, but also abnormal propagation causes among different object types are considered, after system monitoring data pass through the algorithm processing framework, heterogeneous fault graphs and root cause systems corresponding to current alarms are obtained by combining automatic processing of a machine learning algorithm, and are simply displayed to the system for analysis and processing in a visual form and a root cause recommending form, so that an administrator can be assisted to efficiently position a fault root cause, and the accuracy of fault root cause positioning is effectively improved.
Example two
The embodiment provides a fault root cause positioning system based on service code level, which mainly comprises the following modules:
The global heterogeneous topological graph generating module is used for constructing a global heterogeneous topological graph comprising an intersystem calling relation and a service code calling relation.
In order to locate anomalies and root causes at the finer-grained service code level, the invention provides a graph construction strategy that mixes service codes and application systems. In addition, if system calls are forwarded through the enterprise service bus system ESB_F5, the service code calling relationships and the service code memberships in the upstream and downstream systems can be obtained by collating the CMDB service call lookup table.
The anomaly detection module is used for constructing a time series anomaly detection model based on the multi-dimensional indexes and carrying out anomaly detection on the calling edges of the global heterogeneous topological graph. Here, the anomaly detection model is built on a graph attention mechanism, as shown in FIG. 3. The anomaly detection module is used for realizing the following functions:
firstly, the time series of the time windows corresponding to the n indexes are normalized.
Here n is the number of KPI indexes collected on each calling edge. To capture the correlation among the indexes, the n KPI indexes are represented as nodes, i.e. the i-th index corresponds to node $v_i$. Min-max normalization is applied to obtain the input features $\{v_1, v_2, \dots, v_n\}$ of the n KPI indexes, where node $v_i \in \mathbb{R}^w$ is the w-dimensional feature vector of the i-th KPI index and the dimension w equals the length of the time window.
The fusion features of the different nodes are then learned through the graph attention mechanism.
Specifically, the fusion feature $h_i$ of node $v_i$ is calculated by the following formula:
$$h_i = \sigma\Big(\sum_{j \in N(i)} a_{ij}\, v_j\Big)$$
where $N(i)$ is the set of neighbor nodes of node $v_i$, $v_j$ is a neighbor node of $v_i$ and represents the w-dimensional feature vector of the j-th index, $\sigma$ is the sigmoid activation function, and $a_{ij}$ is the association weight between node $v_i$ and node $v_j$, calculated by the following formula:
$$a_{ij} = \frac{\exp(e_{ij})}{\sum_{l=1}^{L} \exp(e_{il})}, \qquad e_{ij} = \mathrm{LeakyReLU}\big(W^{\top}(v_i \oplus v_j)\big)$$
where $e_{ij}$ is the attention value of the edge between node $v_i$ and node $v_j$, $e_{il}$ is the attention value of the edge between node $v_i$ and node $v_l$, $\oplus$ is the feature concatenation operation, LeakyReLU is an activation function, $W$ is a learnable parameter matrix, $L$ is the number of neighbor nodes of node $v_i$, and $l$ indexes the neighbor nodes of node $v_i$. The fusion features of all nodes obtained in this way are denoted $H_i$.
Based on the obtained fusion features $H_i$ of all nodes, the embedded features of the time series corresponding to the different indexes are learned.
After the graph attention layer, the fusion features $H_i$ of all nodes have output dimension n × w. They are concatenated with the original sequence features to obtain features of dimension n × 2w, which are then fed into an LSTM module that encodes the long-term temporal dependencies; the embedded features of the time series corresponding to the different indexes are obtained by learning.
Based on the obtained embedded features of the time series corresponding to the different indexes, the predicted values of the time series of all indexes at time t are obtained.
Specifically, the embedded features of all indexes are input into a multi-layer perceptron (MLP) to obtain the predicted values of all time series at time t, with the mean square error (MSE) loss function taken as the optimization objective:
where n denotes the number of predicted indexes. Because the invention adopts a time-series anomaly detection model based on multiple indexes, the degree of deviation of each index can be observed more intuitively.
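Written out in its standard form, and assuming that x_i(t) denotes the observed value and \hat{x}_i(t) the predicted value of the i-th index at time t (symbol names chosen here for illustration, since the formula is not reproduced in this text), the MSE objective would read:

```latex
\mathrm{Loss}_{\mathrm{MSE}} = \frac{1}{n}\sum_{i=1}^{n}\bigl(x_i(t) - \hat{x}_i(t)\bigr)^{2}
```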
Based on the obtained predicted values of the time series of all indexes at time t, the anomaly score score_i(t), which represents the degree of deviation of each index, is calculated.
The deviation value of the i-th index is calculated by the following formula:
The deviation value of the index is then normalized by the following formula:
where score_i(t) is the anomaly score, and the two normalization statistics are the median and the quartile, used in place of the mean and the standard deviation; experiments show that this normalization gives the best results.
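A minimal sketch of such a robust score follows. It assumes that the deviation is the absolute prediction error and that the "quartile" statistic is the inter-quartile range of past errors; both are assumptions, because the formulas are not reproduced in this text. The small constant in the denominator only guards against division by zero.

```python
import statistics
from typing import List


def anomaly_scores(actual: List[float], predicted: List[float],
                   history_errors: List[List[float]]) -> List[float]:
    """Robustly normalized deviation for each of the n indexes at time t.

    history_errors[i] holds past absolute errors of index i; their median and
    inter-quartile range stand in for the mean and the standard deviation.
    """
    scores = []
    for x, x_hat, errs in zip(actual, predicted, history_errors):
        deviation = abs(x - x_hat)
        median = statistics.median(errs)
        q1, _, q3 = statistics.quantiles(errs, n=4)
        iqr = q3 - q1
        scores.append((deviation - median) / (iqr + 1e-9))
    return scores


# One hypothetical index: observed 10, predicted 4, past errors 1..5.
scores = anomaly_scores([10.0], [4.0], [[1.0, 2.0, 3.0, 4.0, 5.0]])
```

Using the median and a quartile-based spread keeps a few past outliers from inflating the scale, which is the usual motivation for preferring them over the mean and the standard deviation.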
Whether the calling edge is abnormal is judged based on the obtained anomaly score score_i(t).
Specifically, the obtained anomaly score score_i(t), which represents the degree of deviation of the index, is compared with a preset threshold; when score_i(t) is greater than the threshold, the detection result of the calling edge is judged to be abnormal. The detection result is shown in fig. 4, where red edges indicate an abnormal result and black edges indicate a normal one.
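The threshold comparison can be sketched as follows. Flagging an edge as soon as any one of its index scores exceeds the threshold is an assumption about how the per-index scores are combined; the edge names and the threshold value are hypothetical.

```python
from typing import Dict, List


def detect_abnormal_edges(edge_scores: Dict[str, List[float]],
                          threshold: float) -> Dict[str, bool]:
    """Flag a calling edge as abnormal when any index score exceeds the threshold."""
    return {edge: max(scores) > threshold
            for edge, scores in edge_scores.items()}


results = detect_abnormal_edges(
    {"A->B": [0.3, 4.2, 0.1], "B->C": [0.2, 0.5, 0.4]},
    threshold=3.0,
)
```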
Compared with traditional time-series anomaly detection methods, the time-series anomaly detection model based on multi-dimensional indexes constructed by the invention makes no assumption about the data distribution and takes into account the correlation and dependency among the multi-dimensional indexes of service calls, so that anomaly detection is more accurate and efficient.
The heterogeneous fault graph generation module is used to generate a heterogeneous fault graph based on the anomaly detection result of each calling edge.
Specifically, the abnormal calling edges in the heterogeneous topological graph are obtained from the anomaly detection module, and the data of calling edges whose detection result is normal are filtered out of the global heterogeneous topological graph, yielding a heterogeneous fault graph that displays only the faulty part. For example, filtering the global heterogeneous topological graph of fig. 2 yields the heterogeneous fault graph shown in fig. 5.
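The filtering step amounts to keeping only the edges flagged as abnormal. The sketch below assumes an edge-list representation and an adjacency-dictionary output; both are illustrative choices, as are the object names.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def build_fault_graph(edges: List[Edge],
                      is_abnormal: Dict[Edge, bool]) -> Dict[str, List[str]]:
    """Keep only abnormal calling edges and the objects they touch."""
    graph: Dict[str, List[str]] = {}
    for caller, callee in edges:
        if is_abnormal.get((caller, callee), False):
            graph.setdefault(caller, []).append(callee)
            graph.setdefault(callee, [])
    return graph


edges = [("A", "B"), ("B", "C"), ("A", "C")]
flags = {("A", "B"): True, ("B", "C"): False, ("A", "C"): True}
fault_graph = build_fault_graph(edges, flags)
```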
The fault root cause positioning module is used to perform fault root cause positioning on the obtained heterogeneous fault graph based on a random-walk object-level ranking algorithm.
Specifically, the fault root cause positioning module implements the following functions:
An object set V and an object type set A are determined based on the heterogeneous fault graph generated by the heterogeneous fault graph generation module.
Specifically, the heterogeneous fault graph generated by the heterogeneous fault graph generation module can be formally expressed as a pair (V, E), where V and E denote the object set and the relationship set, respectively. Because the heterogeneous graph contains multiple types of objects, an object type mapping function from V to A is defined, where A denotes the set of distinct object types after mapping; the multiple instances of the same type are mapped to their common object type by this function.
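The type mapping can be pictured with a small lookup. The object names and the two type labels below are hypothetical and serve only to show that many object instances collapse onto a few types.

```python
from typing import Dict, List, Set

# Hypothetical objects of a heterogeneous fault graph and their types.
TYPE_OF: Dict[str, str] = {
    "payment_system": "system",
    "core_banking": "system",
    "svc_transfer": "service_code",
    "svc_query": "service_code",
}


def object_type(obj: str) -> str:
    """Mapping function from the object set V to the object type set A."""
    return TYPE_OF[obj]


def type_set(objects: List[str]) -> Set[str]:
    """Set A: the distinct object types reached by the mapping."""
    return {object_type(o) for o in objects}


A = type_set(list(TYPE_OF))
```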
Corresponding anomaly propagation factors are assigned to the different object types based on the obtained object type set A.
The factors are assigned according to the importance of each object type in the heterogeneous fault graph. Specifically, the anomaly propagation factors of the different object types can be assigned from expert knowledge, or learned from historical data with a search optimization algorithm such as simulated annealing.
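Learning the factors by simulated annealing could look like the following sketch. The `evaluate` callback is hypothetical: it stands for any function that scores a set of factors against historical incidents (higher is better). The step size, cooling schedule, and the restriction of factors to [0, 1] are likewise assumptions.

```python
import math
import random
from typing import Callable, Dict, Tuple

Factors = Dict[Tuple[str, str], float]


def anneal_factors(initial: Factors, evaluate: Callable[[Factors], float],
                   steps: int = 200, temperature: float = 1.0,
                   cooling: float = 0.98, seed: int = 0) -> Factors:
    """Search for propagation factors that maximize evaluate()."""
    rng = random.Random(seed)
    current, current_score = dict(initial), evaluate(initial)
    best, best_score = dict(current), current_score
    for _ in range(steps):
        candidate = dict(current)
        key = rng.choice(sorted(candidate))
        # Perturb one factor and clamp it to [0, 1].
        candidate[key] = min(1.0, max(0.0, candidate[key] + rng.uniform(-0.1, 0.1)))
        score = evaluate(candidate)
        # Always accept improvements; accept worse candidates with a
        # probability that shrinks as the temperature cools.
        if score >= current_score or rng.random() < math.exp(
                (score - current_score) / temperature):
            current, current_score = candidate, score
            if score > best_score:
                best, best_score = dict(candidate), score
        temperature *= cooling
    return best


# Toy objective: pretend the ideal factor is 0.7.
def toy_evaluate(factors: Factors) -> float:
    return -abs(factors[("system", "service_code")] - 0.7)


start = {("system", "service_code"): 0.2}
learned = anneal_factors(start, toy_evaluate)
```

In practice the objective would measure how well the resulting ranking recovers known root causes from past incidents; the toy objective here only demonstrates the search loop.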
The prior art does not consider the differences in anomaly propagation among different object types. By setting anomaly propagation factors between object types, the invention expresses the differences in anomaly propagation weight among them, which effectively improves the accuracy and pertinence of the subsequent root cause score calculation.
Based on the obtained object set V, the PageRank algorithm is used to iteratively calculate the pivot value of each object, which serves as the initial root cause score R_ea of that object.
Here, a denotes any object in the object set V.
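A plain power-iteration PageRank over the fault graph can be sketched as below. The damping value of 0.85, the uniform redistribution of rank from objects without outgoing edges, and the edge direction (caller to callee) are conventional choices assumed here, not parameters stated in the patent.

```python
from typing import Dict, List


def pagerank(graph: Dict[str, List[str]], damping: float = 0.85,
             iterations: int = 100, tol: float = 1e-10) -> Dict[str, float]:
    """Iteratively compute a PageRank value for every object in the graph.

    graph maps each object to the objects it points to; every target must
    also appear as a key.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        # Rank held by objects without outgoing edges is spread uniformly.
        dangling = sum(rank[u] for u in nodes if not graph[u])
        new = {u: (1.0 - damping) / n + damping * dangling / n for u in nodes}
        for u in nodes:
            for v in graph[u]:
                new[v] += damping * rank[u] / len(graph[u])
        delta = sum(abs(new[u] - rank[u]) for u in nodes)
        rank = new
        if delta < tol:
            break
    return rank


initial_scores = pagerank({"a": ["c"], "b": ["c"], "c": []})
```

In this toy graph both callers point at object c, so c accumulates the largest value and would enter the next step with the highest initial root cause score.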
The root cause score R_x of each object is determined based on the obtained anomaly propagation factors and the initial root cause scores.
Specifically, the root cause score R_x of object x is obtained by the following formula:
where X and Y respectively denote the set of objects of type X and the set of objects of type Y in the object type set A; x denotes an object in the set of objects of type X, and y denotes an object in the set of objects of type Y; R_x and R_y denote the root cause scores of object x and object y, respectively; M_xY is an adjacency matrix whose elements are denoted m_xY: if there is a relationship between object x and object type Y, then m_xY = Num(x, Y), and if there is no such relationship, then m_xY = 0; Num(x, Y) denotes the total number of relationships between object x and all objects in the set of objects of type Y; γ_XY denotes the anomaly propagation factor between object type X and object type Y; and ε denotes the attenuation factor, which is selected based on expert knowledge.
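Because the formula is not reproduced in this text, the following is only a sketch of a type-weighted score propagation in the spirit of object-level ranking on heterogeneous graphs; it may differ from the patent's exact formula. It assumes the update R_x = ε · R_ex + (1 − ε) · Σ_Y γ_XY · (Σ_y c_xy · R_y) / Num(x, Y), where R_ex is the initial score of x and c_xy is the number of relationships between x and y. All names and values are hypothetical.

```python
from typing import Dict, List, Tuple


def root_cause_scores(initial: Dict[str, float], type_of: Dict[str, str],
                      relations: Dict[Tuple[str, str], int],
                      gamma: Dict[Tuple[str, str], float],
                      epsilon: float = 0.5,
                      iterations: int = 50) -> Dict[str, float]:
    """Blend each object's initial score with scores propagated from related objects.

    initial   : initial root cause score per object (for example from PageRank)
    relations : (x, y) -> number of relationships between objects x and y
    gamma     : (type of x, type of y) -> anomaly propagation factor
    epsilon   : attenuation factor
    """
    scores = dict(initial)
    for _ in range(iterations):
        updated = {}
        for x in initial:
            # Group the objects related to x by their type Y.
            per_type: Dict[str, List[Tuple[str, int]]] = {}
            for (a, b), count in relations.items():
                if a == x:
                    per_type.setdefault(type_of[b], []).append((b, count))
            propagated = 0.0
            for y_type, related in per_type.items():
                num_x_y = sum(c for _, c in related)      # Num(x, Y)
                factor = gamma.get((type_of[x], y_type), 0.0)
                propagated += factor * sum(c * scores[y] for y, c in related) / num_x_y
            updated[x] = epsilon * initial[x] + (1.0 - epsilon) * propagated
        scores = updated
    return scores


final_scores = root_cause_scores(
    initial={"a": 0.2, "b": 0.8},
    type_of={"a": "system", "b": "service_code"},
    relations={("a", "b"): 2},
    gamma={("system", "service_code"): 1.0},
)
```

Dividing by Num(x, Y) keeps the propagated term a weighted average, so the scores stay bounded as long as the factors for each type sum to at most one.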
By combining the object ranking algorithm of the heterogeneous graph, the invention effectively solves the problem that the initial root cause score does not consider the relationships between different object types.
Based on the obtained root cause score of each object, the objects corresponding to the top-K root cause scores are selected as the fault root cause positioning result.
Here, the top-K root cause scores are the K largest root cause scores.
Specifically, the obtained fault root cause positioning result is displayed in visual form, as shown in fig. 6, for reference by the system administrator.
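Selecting the top-K objects is a simple ranking step; a minimal sketch with hypothetical object names and scores:

```python
import heapq
from typing import Dict, List, Tuple


def top_k_root_causes(scores: Dict[str, float], k: int) -> List[Tuple[str, float]]:
    """Return the k objects with the largest root cause scores, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


candidates = top_k_root_causes(
    {"svc_transfer": 0.42, "core_banking": 0.31, "svc_query": 0.08}, k=2)
```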
Example three:
The present embodiment provides a storage medium storing a computer program; when the computer program is executed by a processor in a computer device, the computer device performs any one of the methods described above.
By adopting the heterogeneous topological graph, the invention presents the calling relations and membership relations of service codes simply and clearly at a finer granularity. By fusing the correlation features of the multi-dimensional indexes, it effectively improves the accuracy of index anomaly detection on the calling edges of the heterogeneous topology. Root cause positioning uses a node ranking algorithm for heterogeneous graphs that considers not only the pivot values of anomaly propagation among objects in the heterogeneous graph but also the anomaly propagation factors among different object types. After the system monitoring data pass through this algorithmic processing framework, the heterogeneous fault graph and the root cause systems corresponding to the current alarm are obtained automatically by the machine learning algorithms and are presented, in visual form and as root cause recommendations, for analysis and handling. This helps the administrator locate the fault root cause efficiently and effectively improves the accuracy of fault root cause positioning.
The foregoing description, for purposes of explanation, used specific nomenclature to provide a thorough understanding of the embodiments. It will be apparent, however, to one skilled in the art that the embodiments may be practiced without the specific details. Thus, the foregoing descriptions of specific embodiments described herein are presented for purposes of illustration and description. They are not intended to be exhaustive or to limit the embodiments to the precise forms disclosed. It will be apparent to those skilled in the art that many modifications and variations are possible in light of the above teaching. Further, as used herein to refer to the position of a component, the terms above and below, or their synonyms, do not necessarily refer to an absolute position relative to an external reference, but rather to a relative position of the component with reference to the drawings.
Moreover, the foregoing drawings and description include many concepts and features that may be combined in various ways to achieve various benefits and advantages. Thus, features, components, elements and/or concepts from various different figures may be combined to produce embodiments or implementations not necessarily shown or described in this specification. Furthermore, not all features, components, elements and/or concepts shown in a particular figure or description are necessarily required to be in any particular embodiment and/or implementation. It is to be understood that such embodiments and/or implementations fall within the scope of the present description.
Claims (10)
1. A fault root cause positioning method based on the service code level, characterized by comprising the following steps:
S1, constructing a global heterogeneous topological graph comprising inter-system calling relations and service code calling relations;
S2, constructing a time-series anomaly detection model based on multi-dimensional indexes, and performing anomaly detection on each calling edge of the global heterogeneous topological graph;
S3, generating a heterogeneous fault graph based on the anomaly detection result of each calling edge;
S4, performing fault root cause positioning on the obtained heterogeneous fault graph based on a random-walk object-level ranking algorithm.
2. The method according to claim 1, wherein each calling edge of the global heterogeneous topological graph carries time-series index data generated by aggregating transaction detail data at a set time granularity, and the index data comprise at least a combination of two or more of transaction amount, success amount, response amount, failure amount, non-response amount, success rate, response rate, and response time.
3. The fault root cause positioning method based on the service code level according to claim 1, wherein S2 further comprises the following steps:
S2.1, normalizing the time series of the time window corresponding to the n indexes;
S2.2, learning the fusion features of the nodes through a graph attention mechanism;
S2.3, learning the embedded features of the time series corresponding to the different indexes based on the fusion features H_i obtained for all nodes;
S2.4, obtaining the predicted values of the time series of all indexes at time t based on the obtained embedded features of the time series corresponding to the different indexes;
S2.5, calculating the anomaly score score_i(t), which represents the degree of deviation of each index, based on the obtained predicted values of the time series of all indexes at time t;
S2.6, judging whether the calling edge is abnormal based on the obtained anomaly score score_i(t).
4. The method according to claim 3, wherein learning the fusion features of the nodes through the graph attention mechanism in S2.2 comprises:
the fusion feature h_i of node v_i is calculated by the following formula:
where N(i) denotes the set of neighbor nodes of node v_i, v_j denotes a neighbor node of node v_i, σ denotes the sigmoid activation function, a_ij denotes the association weight between node v_i and node v_j, and node v_j is represented by the w-dimensional feature vector of the j-th KPI index;
the association weight a_ij is calculated by the following formula:
where e_ij denotes the attention value of the edge between node v_i and node v_j, e_il denotes the attention value of the edge between node v_i and node v_l, the two feature vectors are combined by a feature concatenation operation, LeakyReLU is an activation function, W denotes a learnable parameter matrix, L denotes the number of neighbor nodes of node v_i, and l indexes the neighbor nodes of node v_i.
5. The fault root cause positioning method based on the service code level according to claim 1, wherein S4 comprises the following steps:
S4.1, determining an object set V and an object type set A based on the heterogeneous fault graph generated in S3;
S4.2, assigning corresponding anomaly propagation factors to the different object types based on the obtained object type set A;
S4.3, based on the obtained object set V, iteratively calculating the pivot value of each object with the PageRank algorithm as the initial root cause score R_ea of that object;
S4.4, determining the root cause score R_x of each object based on the obtained anomaly propagation factors and the initial root cause scores;
S4.5, selecting the objects corresponding to the top-K root cause scores as the fault root cause positioning result based on the obtained root cause score of each object.
6. The fault root cause positioning method based on the service code level according to claim 5, wherein S4.2 comprises: assigning the anomaly propagation factors from expert knowledge, or learning them from historical data with a search optimization algorithm.
7. The fault root cause positioning method based on the service code level according to claim 5, wherein S4.4 comprises:
the root cause score R_x of object x is calculated by the following formula:
where X and Y respectively denote the set of objects of type X and the set of objects of type Y in the object type set A; x denotes an object in the set of objects of type X, and y denotes an object in the set of objects of type Y; R_x and R_y denote the root cause scores of object x and object y, respectively; M_xY is an adjacency matrix whose elements are denoted m_xY: if there is a relationship between object x and object type Y, then m_xY = Num(x, Y), and if there is no such relationship, then m_xY = 0; Num(x, Y) denotes the total number of relationships between object x and all objects in the set of objects of type Y; γ_XY denotes the anomaly propagation factor between object type X and object type Y; and ε denotes the attenuation factor.
8. The fault root cause positioning method based on the service code level according to claim 5, wherein the top-K root cause scores are the K largest root cause scores; and S4.5 further comprises: displaying the obtained fault root cause positioning result in visual form.
9. A fault root cause positioning system based on the service code level, characterized in that the system mainly comprises the following modules:
a global heterogeneous topological graph generation module, used to construct a global heterogeneous topological graph comprising inter-system calling relations and service code calling relations;
an anomaly detection module, used to construct a time-series anomaly detection model based on multi-dimensional indexes and to perform anomaly detection on each calling edge of the global heterogeneous topological graph;
a heterogeneous fault graph generation module, used to generate a heterogeneous fault graph based on the anomaly detection result of each calling edge;
and a fault root cause positioning module, used to perform fault root cause positioning on the obtained heterogeneous fault graph based on a random-walk object-level ranking algorithm.
10. A storage medium, characterized in that it stores a computer program; when the computer program is executed by a processor in a computer device, the computer device performs the method of any one of claims 1-8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111127982.7A CN113900844B (en) | 2021-09-26 | 2021-09-26 | Fault root cause positioning method, system and storage medium based on service code level |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113900844A | 2022-01-07 |
CN113900844B | 2024-07-09 |
Family
ID=79029270
Family Applications (1)
Application Number | Status | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN202111127982.7A | Active | CN113900844B (en) | 2021-09-26 | 2021-09-26 | Fault root cause positioning method, system and storage medium based on service code level |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113900844B (en) |
Also Published As
Publication number | Publication date |
---|---|
CN113900844B (en) | 2024-07-09 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||