CN113900844B - Fault root cause positioning method, system and storage medium based on service code level - Google Patents

Info

Publication number
CN113900844B
Authority
CN
China
Prior art keywords
fault
heterogeneous
root cause
node
calling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111127982.7A
Other languages
Chinese (zh)
Other versions
CN113900844A (en
Inventor
沈梦家
曹立
隋楷心
刘大鹏
王继斌
张文池
吴楠
陈恒茂
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Bishi Technology Co ltd
Original Assignee
Beijing Bishi Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Bishi Technology Co ltd
Priority to CN202111127982.7A
Publication of CN113900844A
Application granted
Publication of CN113900844B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079 Root cause analysis, i.e. error or fault diagnosis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0766 Error or fault reporting or storing
    • G06F11/0775 Content or structure details of the error report, e.g. specific table structure, specific error fields
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Debugging And Monitoring (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention provides a fault root cause positioning method, system and storage medium based on service code level. The method comprises the following steps: constructing a global heterogeneous topology graph comprising the calling relations among systems and the calling relations among service codes; constructing a time series anomaly detection model based on multidimensional indexes, and performing anomaly detection on each calling edge of the global heterogeneous topology graph; generating a heterogeneous fault graph based on the anomaly detection result of each calling edge; and performing fault root cause positioning on the obtained heterogeneous fault graph based on a random-walk object-level ranking algorithm. The heterogeneous topology graph presents the finer-grained service code calling relations and membership relations concisely and clearly; fusing the association features of the multidimensional indexes improves the accuracy of index anomaly detection on the calling edges of the heterogeneous topology; and the node ranking algorithm for heterogeneous graphs improves the accuracy of fault root cause positioning.

Description

Fault root cause positioning method, system and storage medium based on service code level
Technical Field
The invention relates to fault root cause positioning, in particular to fault root cause positioning based on service code level.
Background
With the rapid development of technologies such as cloud computing and service computing, and with the growing business demands of social production, more and more modern enterprises deploy their applications and system services in cloud computing environments as distributed cloud applications, or microservices. Compared with a traditional centralized architecture, a distributed architecture offers better component scalability, higher development productivity and lower cost.
To ensure high availability and reliability, application providers must deploy link monitoring systems that collect key performance indicators for each service, such as network response time, service response rate and success rate, so that complex distributed environments can meet availability constraints and stringent service level objectives. However, as business requirements grow more complex and microservice deployments grow larger, the multiple cross-system calling dependencies cause a single fault to generate a large number of index alarms. Faced with this mass of alarm information, a system administrator relying on manual analysis alone can hardly identify the key alarm indexes and the corresponding fault root cause system quickly. Machine learning algorithms are therefore needed to process and analyze the monitoring index data and the system topology automatically, so that the fault root cause system can be located quickly.
However, most existing link tracing and monitoring systems collect only call relation data between systems and perform fault root cause positioning on the system-level call relations, without considering the service codes through which the systems call each other. Existing schemes therefore struggle to locate fine-grained fault root causes, and anomalies are easily hidden by the coarse-grained aggregation of system-level data.
In addition, because services are complex and periodic, the existing simple anomaly detection strategies based on a fixed threshold or k-sigma produce many false positives or false negatives. For example, an alarm rule that fires when the response rate stays below 90% for more than 3 minutes performs unsatisfactorily across different services. Moreover, most conventional anomaly detection algorithms detect anomalies and trigger alarms on a single index only, without considering the complex dependencies among multiple key performance indicators, so false alarms are common; the false alarm rate is even higher when detecting index anomalies on the finer-grained calling edges of a heterogeneous topology.
Finally, for data that combine systems and service codes, academia and industry currently mostly analyze call data of a single level, whereas most real scenarios involve call data of different levels and are more complex. A fault root cause positioning solution that fuses systems and service codes is therefore needed.
Disclosure of Invention
In order to solve the above problems existing in the prior art, the present invention provides:
A fault root cause positioning method based on service code level mainly comprises the following steps:
S1, constructing a global heterogeneous topology graph comprising calling relations among systems and calling relations among service codes;
S2, constructing a time series anomaly detection model based on multidimensional indexes, and performing anomaly detection on each calling edge of the global heterogeneous topology graph;
S3, generating a heterogeneous fault graph based on the anomaly detection result of each calling edge;
S4, performing fault root cause positioning on the obtained heterogeneous fault graph based on a random-walk object-level ranking algorithm.
A fault root cause positioning system based on service code level mainly comprises the following modules:
a global heterogeneous topology graph generation module, used for constructing a global heterogeneous topology graph comprising calling relations among systems and calling relations among service codes;
an anomaly detection module, used for constructing a time series anomaly detection model based on multidimensional indexes and performing anomaly detection on each calling edge of the global heterogeneous topology graph;
a heterogeneous fault graph generation module, used for generating a heterogeneous fault graph based on the anomaly detection result of each calling edge;
a fault root cause positioning module, used for performing fault root cause positioning on the obtained heterogeneous fault graph based on a random-walk object-level ranking algorithm.
A storage medium storing a computer program; when the computer program is executed by a processor in a computer device, the computer device performs the method of any of the above.
The invention constructs a heterogeneous topology graph so that the finer-grained service code calling relations and membership relations are presented concisely and clearly. By fusing the association features of the multidimensional indexes, it constructs a time series anomaly detection model based on multidimensional indexes and performs anomaly detection on the calling edges of the global heterogeneous topology graph; compared with the prior art, in which detecting anomalies on a single index leads to a high false alarm rate, this improves the accuracy of index anomaly detection on the calling edges of the heterogeneous topology. By combining a node ranking algorithm for heterogeneous graphs with a machine learning algorithm, the heterogeneous fault graph and the root cause system corresponding to the current alarm are obtained automatically and presented as a visual graph and root cause recommendations for subsequent analysis and handling, which helps the administrator locate the fault source efficiently and improves the accuracy of fault root cause positioning.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are used in the description of the embodiments or the prior art will be briefly described, it being obvious that the drawings in the description below are only some embodiments of the invention, and that other drawings can be obtained from these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 is a heterogeneous topology graph of the system and service code calling relations of the present invention.
FIG. 3 shows the time series anomaly detection model based on multidimensional indexes of the present invention.
FIG. 4 is a diagram showing the index anomaly detection results of the present invention.
FIG. 5 is a heterogeneous fault graph of the present invention.
FIG. 6 is a visual interface for fault root cause positioning of the present invention.
Detailed Description
In order that the above-recited objects, features and advantages of the present application will be more clearly understood, a more particular description of the application will be rendered by reference to the appended drawings and appended detailed description. It should be noted that, without conflict, the embodiments of the present application and features in the embodiments may be combined with each other.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced otherwise than as described herein, and therefore the scope of the present invention is not limited to the specific embodiments disclosed below.
Example 1
In order to solve the problems in the prior art, this embodiment provides a fault root cause positioning method based on service code level. Its flow chart is shown in FIG. 1, and it mainly includes the following steps:
S1, constructing a global heterogeneous topology graph comprising calling relations among systems and calling relations among service codes.
In order to locate anomalies and root causes at the finer-grained service code level, the invention proposes a graph construction strategy for the mixed relations between service codes and application systems. In addition, if system calls are forwarded through the enterprise service bus system ESB_F5, the service code calling relations and the service code membership relations in the upstream and downstream systems can be obtained by collating the CMDB service call comparison table. The construction of the heterogeneous topology graph is described below using actual sample data.
The service monitoring system records the call data and status of each service transaction in its logs. For example, after the call log at a certain alarm time is parsed, the call data and status are as shown in Table 1 below:
Table 1. Transaction detail data after parsing
It can be seen that the data contain the system nodes S1, S2, S3 and S4 and the service code nodes T1 and T2 called by the system nodes S1 and S2. Taking the call relations among all nodes into account, the constructed heterogeneous topology graph, which includes the call relations of both system nodes and service code nodes, is shown in FIG. 2. FIG. 2 thus gives a call relation graph of the global systems and service codes, in which each calling edge carries time series index data obtained by aggregating the transaction detail data at a set time granularity. The indexes adopted by the invention comprise: transaction amount, work amount, response amount, failure amount, unresponsive amount, success rate, and response time. Compared with the prior art, which only involves call topology graphs among systems, the heterogeneous topology graph comprising the call relations among both systems and service codes captures the finer-grained calling relations and membership relations of service codes in a concise and clear form.
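As a rough illustration of this graph construction strategy, the following Python sketch aggregates parsed transaction records into typed nodes and per-edge KPI time series. The record fields (`caller`, `callee`, `service_code`, `ts`, `ok`, `rt_ms`), the function names, the sample values, and the choice to model each transaction as a call edge (system to service code) plus a membership edge (service code to owning system) are all assumptions made for this sketch; they are not the columns of Table 1.

```python
from collections import defaultdict

def build_hetero_graph(records, bucket_s=60):
    """Aggregate parsed transaction records into typed nodes and call edges.

    Each edge maps a time bucket to KPI counters, i.e. a per-edge time series.
    """
    nodes = {}  # node id -> "system" or "service_code"
    edges = defaultdict(
        lambda: defaultdict(lambda: {"total": 0, "failed": 0, "rt_sum": 0.0})
    )
    for r in records:
        nodes[r["caller"]] = "system"
        nodes[r["callee"]] = "system"
        nodes[r["service_code"]] = "service_code"
        bucket = r["ts"] // bucket_s * bucket_s  # start of the time bucket
        # Assumed modelling: caller system -> service code (call),
        # service code -> callee system (membership).
        for edge in ((r["caller"], r["service_code"]),
                     (r["service_code"], r["callee"])):
            kpi = edges[edge][bucket]
            kpi["total"] += 1
            kpi["failed"] += 0 if r["ok"] else 1
            kpi["rt_sum"] += r["rt_ms"]
    return nodes, edges

def success_rate(kpi):
    """Success rate of one edge in one time bucket."""
    return (kpi["total"] - kpi["failed"]) / kpi["total"]

# Hypothetical parsed records (not the data of Table 1).
records = [
    {"caller": "S1", "callee": "S3", "service_code": "T1", "ts": 5, "ok": True, "rt_ms": 120.0},
    {"caller": "S1", "callee": "S3", "service_code": "T1", "ts": 42, "ok": False, "rt_ms": 900.0},
    {"caller": "S2", "callee": "S4", "service_code": "T2", "ts": 61, "ok": True, "rt_ms": 80.0},
]
nodes, edges = build_hetero_graph(records)
```

Keeping the node type next to each node id is what makes the graph heterogeneous; the later ranking step can then treat system nodes and service code nodes differently.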
Because the business volume in real services is huge and highly complex, the resulting global heterogeneous topology graph tends to be very complex. However, a fault in a production environment affects only local business, so the invention performs anomaly detection on the calling edges of the global heterogeneous topology graph in order to obtain the local heterogeneous topology graph in which the fault occurs.
S2, constructing a time series anomaly detection model based on multidimensional indexes, and performing anomaly detection on the calling edges of the global heterogeneous topology graph. The model is constructed with a graph attention mechanism, as shown in FIG. 3. S2 specifically comprises the following steps:
S2.1, normalizing the time series of the time window corresponding to the n indexes.
Here n represents the number of KPI indexes counted on each calling edge. To take the association features among all indexes into account, the n KPI indexes are represented as nodes, that is, the i-th index corresponds to node $v_i$. The input features $\{v_1, v_2, \ldots, v_n\}$ corresponding to the n KPI indexes are obtained by min-max normalization, where $v_i \in \mathbb{R}^{w}$ is the w-dimensional feature vector corresponding to the i-th KPI index and the dimension w equals the length of the time window.
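A minimal sketch of the normalization step, assuming each KPI window is a plain list of numbers. The KPI names and values below are illustrative only, and mapping a constant window to zeros is a choice made for this sketch to avoid division by zero.

```python
def min_max_normalize(window):
    """Scale one KPI window to [0, 1]; a constant window maps to all zeros."""
    lo, hi = min(window), max(window)
    if hi == lo:
        return [0.0 for _ in window]
    return [(x - lo) / (hi - lo) for x in window]

# n = 3 KPIs observed over a window of w = 4 time steps (illustrative values).
kpi_windows = {
    "transaction_volume": [100, 120, 110, 140],
    "success_rate": [0.99, 0.98, 0.99, 0.90],
    "response_time_ms": [200, 200, 200, 200],
}
# Each normalized window plays the role of one node feature vector v_i.
features = {name: min_max_normalize(w) for name, w in kpi_windows.items()}
```

Normalizing each index separately puts indexes with very different scales, such as counts and rates, on a common footing before they are compared by the attention mechanism.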
S2.2, learning the fusion features of the nodes through a graph attention mechanism.
The fusion feature $h_i$ of node $v_i$ is calculated by the following formula:
$$h_i = \sigma\Big(\sum_{j \in N(i)} a_{ij}\, v_j\Big)$$
where $N(i)$ represents the set of neighbor nodes of node $v_i$, $v_j$ represents a neighbor node of node $v_i$ and is the w-dimensional feature vector corresponding to the j-th index, $\sigma$ represents the sigmoid activation function, and $a_{ij}$ represents the association weight of node $v_i$ and node $v_j$. The association weight $a_{ij}$ is calculated by the following formulas:
$$e_{ij} = \mathrm{LeakyReLU}\big(W\,(v_i \oplus v_j)\big)$$
$$a_{ij} = \frac{\exp(e_{ij})}{\sum_{l=1}^{L} \exp(e_{il})}$$
where $e_{ij}$ represents the attention value of the edge between node $v_i$ and node $v_j$, $e_{il}$ represents the attention value of the edge between node $v_i$ and node $v_l$, $\oplus$ represents the feature concatenation operation, LeakyReLU is an activation function, $W$ represents a learnable parameter matrix, $L$ represents the number of neighbor nodes of node $v_i$, and $l$ is the index of a neighbor node of node $v_i$.
The fusion features of all nodes are calculated in this way and are denoted $H_i$.
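The attention computation can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the patented model: the graph over KPI nodes is taken to be complete (every node, itself included, is a neighbor of every node), `W` is taken to be a weight vector of length 2w so that each attention value is a scalar, and the weights are untrained illustrative numbers.

```python
import math

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def attention_fuse(v, W):
    """Fuse n KPI feature vectors (each of length w) with one attention layer.

    v: list of n vectors; W: weight vector of length 2*w (assumed shape).
    Returns the fused features and the attention weights a_ij.
    """
    n = len(v)
    fused, weights = [], []
    for i in range(n):
        # e_ij = LeakyReLU(W . (v_i concatenated with v_j))
        e = [leaky_relu(sum(wk * xk for wk, xk in zip(W, v[i] + v[j])))
             for j in range(n)]
        m = max(e)  # subtract the maximum for numerical stability
        exp_e = [math.exp(x - m) for x in e]
        total = sum(exp_e)
        a = [x / total for x in exp_e]  # softmax: a_ij sums to 1 over j
        # h_i = sigmoid(sum_j a_ij * v_j), applied element-wise
        h = [sigmoid(sum(a[j] * v[j][k] for j in range(n)))
             for k in range(len(v[i]))]
        fused.append(h)
        weights.append(a)
    return fused, weights

v = [[0.0, 0.5, 1.0], [1.0, 0.5, 0.0], [0.2, 0.2, 0.2]]  # n = 3, w = 3
W = [0.1, -0.2, 0.3, 0.05, 0.0, -0.1]                    # illustrative values
H, A = attention_fuse(v, W)
```

In a trained model `W` would be learned jointly with the downstream LSTM and MLP; here it only shows how the softmax turns pairwise attention values into weights that sum to one for each node.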
S2.3, learning the embedded features of the time series corresponding to the different indexes, based on the obtained fusion features $H_i$ of all the nodes.
After learning by the graph attention mechanism, the fusion features $H_i$ of all nodes have output dimension n × w. They are concatenated with the original sequence features to obtain features of dimension n × 2w, which are then input into an LSTM module to encode long-term temporal dependencies; the embedded features of the time series corresponding to the different indexes are obtained through this learning.
S2.4, obtaining the predicted values $\hat{x}_i(t)$ of the time series of all indexes at time t, based on the obtained embedded features of the time series corresponding to the different indexes.
Specifically, the embedded features of all the indexes are input into a multi-layer perceptron (MLP) to obtain the predicted values $\hat{x}_i(t)$ of all time series at time t. The mean square error (MSE) loss is used as the optimization objective:
$$\mathrm{Loss}_{\mathrm{MSE}} = \frac{1}{n} \sum_{i=1}^{n} \big(x_i(t) - \hat{x}_i(t)\big)^2$$
where n represents the number of predicted indexes and $x_i(t)$ is the observed value of the i-th index at time t.
S2.5, calculating an anomaly score $score_i(t)$ representing the degree of deviation of each index, based on the obtained predicted values of the time series of all the indexes at time t.
Wherein the deviation value for the i-th index is calculated by the following formula:
The deviation value of the index is normalized by the following formula:
where $score_i(t)$ is the anomaly score value; experiments show that this normalization gives the best results. With the multi-index time series anomaly detection model, the degree of deviation of each index can be observed more intuitively.
S2.6, judging whether the calling edge is abnormal based on the obtained anomaly score $score_i(t)$. Specifically, the anomaly score, which represents the degree of deviation of the index, is compared with a preset threshold; when it is larger than the threshold, the detection result of the calling edge is judged to be abnormal. As shown in FIG. 4, red edges indicate abnormal calling edges and black edges indicate normal ones.
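The scoring and thresholding steps might look like the sketch below. Several choices here are assumptions of this sketch and are not taken from the method itself: the deviation is measured as the absolute prediction error, it is normalized with the median and inter-quartile range of past deviations, and an edge is flagged when any one of its indexes exceeds the threshold. The sample values and the threshold of 3.0 are illustrative.

```python
import statistics

def anomaly_scores(actual, predicted, history_dev):
    """Return a normalized deviation score per KPI at one time step.

    history_dev[i] holds past deviation values of KPI i (assumed available).
    """
    scores = []
    for x, x_hat, past in zip(actual, predicted, history_dev):
        dev = abs(x - x_hat)                         # assumed deviation measure
        q1, _, q3 = statistics.quantiles(past, n=4)  # quartiles of past deviations
        iqr = (q3 - q1) or 1e-9                      # guard against zero spread
        scores.append((dev - statistics.median(past)) / iqr)
    return scores

def edge_is_abnormal(scores, threshold):
    """Flag the call edge when any KPI score exceeds the preset threshold."""
    return max(scores) > threshold

actual = [0.40, 210.0]       # observed success rate and response time (ms)
predicted = [0.98, 205.0]    # model predictions for the same time step
history_dev = [
    [0.01, 0.02, 0.02, 0.03, 0.04],
    [4.0, 5.0, 6.0, 5.0, 7.0],
]
scores = anomaly_scores(actual, predicted, history_dev)
abnormal = edge_is_abnormal(scores, threshold=3.0)
```

With these numbers the success rate deviates far more than it has historically, so the edge is flagged even though the response time is in line with its usual prediction error.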
Compared with the traditional time sequence anomaly detection method, the time sequence anomaly detection model based on the multi-dimensional indexes, which is constructed by the invention, does not depend on any data distribution assumption, and takes correlation dependence characteristics among the multi-dimensional indexes of service call into consideration, so that anomaly detection is more accurate and efficient.
S3, generating a heterogeneous fault graph based on the anomaly detection result of each calling edge.
Specifically, the abnormal calling edges in the heterogeneous topology graph are obtained from S2, the calling edges whose detection result is normal are filtered out of the global heterogeneous topology graph, and a heterogeneous fault graph showing only the faulty part is obtained. For example, filtering the global heterogeneous topology graph of FIG. 2 yields the heterogeneous fault graph shown in FIG. 5.
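The filtering step is simple enough to sketch directly. The node and edge names below are hypothetical, and `edge_status` stands in for the per-edge results of S2.

```python
def build_fault_graph(nodes, edge_status):
    """Keep only abnormal call edges and the nodes they touch."""
    fault_edges = [e for e, abnormal in edge_status.items() if abnormal]
    touched = {n for e in fault_edges for n in e}
    fault_nodes = {n: t for n, t in nodes.items() if n in touched}
    return fault_nodes, fault_edges

nodes = {"S1": "system", "S2": "system", "S3": "system", "S4": "system",
         "T1": "service_code", "T2": "service_code"}
# True marks an edge whose detection result is abnormal (illustrative values).
edge_status = {("S1", "T1"): True, ("T1", "S3"): True,
               ("S2", "T2"): False, ("T2", "S4"): False}
fault_nodes, fault_edges = build_fault_graph(nodes, edge_status)
```

Nodes that are only connected through normal edges drop out as well, which is what reduces the global graph to the local faulty part.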
S4, performing fault root cause positioning on the obtained heterogeneous fault graph based on a random-walk object-level ranking algorithm.
Specifically, S4 includes the following steps:
S4.1, determining an object set V and an object type set A based on the heterogeneous fault graph generated in S3.
In particular, the heterogeneous fault graph generated in S3 may be formally represented as $G = (V, \mathcal{E})$, where $V$ and $\mathcal{E}$ represent the object set and the relationship set, respectively. Since the heterogeneous graph contains objects of multiple types, an object type mapping function $\phi: V \rightarrow A$ is defined, where A represents the set of distinct object types after mapping; multiple different instances of the same type are mapped to their common object type by the mapping function.
S4.2, assigning corresponding anomaly propagation factors to different object types, based on the obtained object type set A.
Anomaly propagation factors are assigned to the different object types according to their importance in the heterogeneous fault graph. Specifically, the factors may be assigned based on expert knowledge, or learned from historical data by a combinatorial search optimization algorithm such as simulated annealing.
Compared with prior-art methods, which do not consider how anomaly propagation differs among object types, setting anomaly propagation factors between object types expresses the different weights of anomaly propagation among them, which improves the accuracy and specificity of the subsequent root cause score calculation.
S4.3, based on the obtained object set V, using the PageRank algorithm to obtain the pivot value of each object through iterative computation, and taking the pivot values as the initial root cause scores $R_{ea}$ of the objects,
where a represents any object in the object set V.
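For reference, a plain power-iteration PageRank is sketched below. The damping value of 0.85, the fixed iteration count, the handling of dangling nodes, and the edge direction (from caller towards callee, so that score accumulates downstream) are assumptions of this sketch, as are the node names.

```python
def pagerank(nodes, edges, damping=0.85, iters=100):
    """Plain power-iteration PageRank over directed edges (src, dst)."""
    n = len(nodes)
    out = {u: [] for u in nodes}
    for src, dst in edges:
        out[src].append(dst)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        new = {u: (1.0 - damping) / n for u in nodes}
        for u in nodes:
            targets = out[u] or list(nodes)  # dangling node: spread evenly
            share = damping * rank[u] / len(targets)
            for t in targets:
                new[t] += share
        rank = new
    return rank

nodes = ["S1", "T1", "S3"]
edges = [("S1", "T1"), ("T1", "S3")]
initial_scores = pagerank(nodes, edges)
```

On this small chain the most downstream node receives the highest value, which matches the intuition that a failing callee is a likelier root cause than its callers.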
S4.4, determining the root cause score $R_x$ of each object based on the obtained anomaly propagation factors and initial root cause scores.
Specifically, the root cause score $R_x$ of object x is obtained by:
where X and Y represent the object set of type X and the object set of type Y in the object type set A, respectively; x represents an object in the object set of type X, and y represents an object in the object set of type Y; $R_x$ and $R_y$ represent the root cause scores of object x and object y, respectively; $M_{xY}$ is an adjacency matrix whose elements are denoted $m_{xY}$: if there is a relationship between object x and object type Y, then $m_{xY} = \mathrm{num}(x, Y)$, and otherwise $m_{xY} = 0$; $\mathrm{num}(x, Y)$ represents the total number of relationships between object x and all objects in the object set of type Y; $\gamma_{XY}$ denotes the anomaly propagation factor between object type X and object type Y; and $\varepsilon$ represents the attenuation factor, which is selected based on expert knowledge.
By incorporating the object ranking algorithm for heterogeneous graphs, the invention resolves the problem that the initial root cause scores do not consider the relations among different object types.
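To make the roles of the propagation factors and the attenuation factor concrete, here is a sketch of a type-aware score update in the spirit of object-level ranking. The update rule itself is an assumed form chosen for illustration and should not be read as the formula of this method: each object keeps a share `eps` of its initial score and receives the rest from its linked objects, weighted by a per-type-pair factor `gamma` and divided by the sender's number of links. All names and numbers are hypothetical.

```python
def propagate_scores(node_type, edges, initial, gamma, eps=0.5, iters=50):
    """Type-aware score propagation (assumed form, for illustration only):

        R_x = eps * R_init_x
              + (1 - eps) * sum_y gamma[type(y), type(x)] * R_y / deg(y)

    where the sum runs over the objects y linked to x.
    """
    neighbors = {x: [] for x in node_type}
    for a, b in edges:  # relationships are treated as undirected here
        neighbors[a].append(b)
        neighbors[b].append(a)
    score = dict(initial)
    for _ in range(iters):
        new = {}
        for x in node_type:
            flow = sum(
                gamma[(node_type[y], node_type[x])] * score[y] / len(neighbors[y])
                for y in neighbors[x]
            )
            new[x] = eps * initial[x] + (1.0 - eps) * flow
        score = new
    return score

node_type = {"S1": "system", "T1": "service_code", "S3": "system"}
edges = [("S1", "T1"), ("T1", "S3")]
initial = {"S1": 0.2, "T1": 0.3, "S3": 0.5}  # e.g. PageRank pivot values
gamma = {  # hypothetical anomaly propagation factors per type pair
    ("system", "service_code"): 1.0,
    ("service_code", "system"): 0.5,
    ("system", "system"): 1.0,
    ("service_code", "service_code"): 1.0,
}
root_scores = propagate_scores(node_type, edges, initial, gamma)
```

Because `1 - eps` is below one and each sender's score is split across its links, the iteration settles to a fixed point; changing `gamma` shifts how much score crosses between system nodes and service code nodes.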
S4.5, selecting the objects corresponding to the top-K root cause scores as the fault root cause positioning result, based on the obtained root cause score of each object,
where the top-K root cause scores are the K largest root cause scores.
Specifically, the obtained fault root cause positioning result is displayed in visual form, as shown in FIG. 6, for the system administrator's reference.
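Selecting the top-K objects is a one-liner with the standard library. The scores below are made-up values for illustration.

```python
import heapq

def top_k_root_causes(scores, k):
    """Return the k objects with the largest root cause scores, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])

scores = {"S1": 0.12, "T1": 0.41, "S3": 0.35, "T2": 0.05}
result = top_k_root_causes(scores, 2)
print(result)  # [('T1', 0.41), ('S3', 0.35)]
```

The returned list can hold both system nodes and service code nodes, so a recommendation may point at a specific service code rather than only at a system.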
The heterogeneous topology graph presents the finer-grained service code calling relations and membership relations concisely and clearly; fusing the association features of the multidimensional indexes improves the accuracy of index anomaly detection on the calling edges of the heterogeneous topology. Root cause positioning uses a node ranking algorithm for heterogeneous graphs, which considers both the pivot values of anomaly propagation of objects in the heterogeneous graph and the anomaly propagation factors among different object types. After the system monitoring data pass through this algorithmic framework, the heterogeneous fault graph and the root cause system corresponding to the current alarm are obtained automatically by the machine learning algorithm and presented as a visual graph and root cause recommendations for analysis and handling, which helps the administrator locate the fault root cause efficiently and improves the accuracy of fault root cause positioning.
Example 2
The embodiment provides a fault root cause positioning system based on service code level, which mainly comprises the following modules:
The global heterogeneous topological graph generation module is used for constructing a global heterogeneous topological graph comprising call relations among systems and call relations among service codes.
In order to locate anomalies and root causes at the finer-grained service code level, the invention proposes a graph construction strategy for the mixed relations between service codes and application systems. In addition, if system calls are forwarded through the enterprise service bus system ESB_F5, the service code calling relations and the service code membership relations in the upstream and downstream systems can be obtained by collating the CMDB service call comparison table.
The anomaly detection module is used for constructing a time series anomaly detection model based on multidimensional indexes and performing anomaly detection on the calling edges of the global heterogeneous topology graph. The anomaly detection model is constructed with a graph attention mechanism, as shown in FIG. 3. The anomaly detection module realizes the following functions:
First, the time series of the time window corresponding to the n indexes are normalized.
Here n represents the number of KPI indexes counted on each calling edge. To take the association features among all indexes into account, the n KPI indexes are represented as nodes, that is, the i-th index corresponds to node $v_i$. The input features $\{v_1, v_2, \ldots, v_n\}$ corresponding to the n KPI indexes are obtained by min-max normalization, where $v_i \in \mathbb{R}^{w}$ is the w-dimensional feature vector corresponding to the i-th KPI index and the dimension w equals the length of the time window.
The fusion features of the different nodes are learned through a graph attention mechanism.
Specifically, the fusion feature $h_i$ of node $v_i$ is calculated by the following formula:
$$h_i = \sigma\Big(\sum_{j \in N(i)} a_{ij}\, v_j\Big)$$
where $N(i)$ represents the set of neighbor nodes of node $v_i$, $v_j$ represents a neighbor node of node $v_i$ and is the w-dimensional feature vector corresponding to the j-th index, $\sigma$ represents the sigmoid activation function, and $a_{ij}$ represents the association weight of node $v_i$ and node $v_j$. The association weight $a_{ij}$ is calculated by the following formulas:
$$e_{ij} = \mathrm{LeakyReLU}\big(W\,(v_i \oplus v_j)\big)$$
$$a_{ij} = \frac{\exp(e_{ij})}{\sum_{l=1}^{L} \exp(e_{il})}$$
where $e_{ij}$ represents the attention value of the edge between node $v_i$ and node $v_j$, $e_{il}$ represents the attention value of the edge between node $v_i$ and node $v_l$, $\oplus$ represents the feature concatenation operation, LeakyReLU is an activation function, $W$ represents a learnable parameter matrix, $L$ represents the number of neighbor nodes of node $v_i$, and $l$ is the index of a neighbor node of node $v_i$. The fusion features of all nodes are calculated in this way and are denoted $H_i$.
Based on the obtained fusion features $H_i$ of all the nodes, the embedded features of the time series corresponding to the different indexes are learned.
After learning by the graph attention mechanism, the fusion features $H_i$ of all nodes have output dimension n × w. They are concatenated with the original sequence features to obtain features of dimension n × 2w, which are then input into an LSTM module to encode long-term temporal dependencies; the embedded features of the time series corresponding to the different indexes are obtained through this learning.
Based on the obtained embedded features of the time series corresponding to the different indexes, the predicted values $\hat{x}_i(t)$ of the time series of all indexes at time t are obtained.
Specifically, the embedded features of all the indexes are input into a multi-layer perceptron (MLP) to obtain the predicted values $\hat{x}_i(t)$ of all time series at time t. The mean square error (MSE) loss is used as the optimization objective:
$$\mathrm{Loss}_{\mathrm{MSE}} = \frac{1}{n} \sum_{i=1}^{n} \big(x_i(t) - \hat{x}_i(t)\big)^2$$
where n represents the number of predicted indexes and $x_i(t)$ is the observed value of the i-th index at time t. With the multi-index time series anomaly detection model, the degree of deviation of each index can be observed more intuitively.
Based on the obtained predicted values of the time series of all the indexes at time t, an anomaly score $score_i(t)$ representing the degree of deviation of each index is calculated.
Wherein the deviation value for the i-th index is calculated by the following formula:
The deviation value of the index is normalized by the following formula:
where $score_i(t)$ is the anomaly score value; experiments show that this normalization gives the best results. With the multi-index time series anomaly detection model, the degree of deviation of each index can be observed more intuitively.
Based on the obtained anomaly score $score_i(t)$, it is determined whether the calling edge is abnormal.
Specifically, the anomaly score, which represents the degree of deviation of the index, is compared with a preset threshold; when it is larger than the threshold, the detection result of the calling edge is judged to be abnormal. As shown in FIG. 4, red edges indicate abnormal calling edges and black edges indicate normal ones.
Compared with the traditional time sequence anomaly detection method, the time sequence anomaly detection model based on the multi-dimensional indexes, which is constructed by the invention, does not depend on any data distribution assumption, and takes correlation dependence characteristics among the multi-dimensional indexes of service call into consideration, so that anomaly detection is more accurate and efficient.
The heterogeneous fault graph generation module is used for generating a heterogeneous fault graph based on the anomaly detection result of each calling edge.
Specifically, the abnormal calling edges in the heterogeneous topological graph are obtained from the anomaly detection module, the calling edges whose detection result is normal are filtered out of the global heterogeneous topological graph, and a heterogeneous fault graph showing only the faulty part is obtained. For example, filtering the global heterogeneous topological graph of FIG. 2 yields the heterogeneous fault graph shown in FIG. 5.
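The filtering step can be illustrated with a short Python sketch. It assumes the global graph is given as a list of (caller, callee) pairs and that the detection result is a dictionary from edge to a boolean flag; the node names and the function `build_fault_graph` are hypothetical.

```python
def build_fault_graph(edges, abnormal):
    """Keep only the calling edges flagged as abnormal, plus the nodes they touch."""
    fault_edges = [edge for edge in edges if abnormal.get(edge, False)]
    fault_nodes = {node for edge in fault_edges for node in edge}
    return fault_nodes, fault_edges


# Hypothetical global topology: systems and service codes as nodes.
edges = [("sysA", "sysB"), ("sysB", "code1"), ("sysA", "code2")]
abnormal = {("sysA", "sysB"): True, ("sysB", "code1"): True, ("sysA", "code2"): False}

fault_nodes, fault_edges = build_fault_graph(edges, abnormal)
```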
The fault root cause positioning module is used for positioning the fault root cause in the obtained heterogeneous fault graph based on a random walk object level ordering algorithm.
Specifically, the fault root cause positioning module implements the following functions:
An object set V and an object type set A are determined based on the heterogeneous fault graph generated by the heterogeneous fault graph generation module.
Specifically, the heterogeneous fault graph generated by the heterogeneous fault graph generation module may be formally represented as $G = (V, \mathcal{E})$, where $V$ and $\mathcal{E}$ represent the object set and the relationship set, respectively. Since the heterogeneous graph contains multiple types of objects, an object type mapping function $\phi: V \rightarrow A$ is defined, where A represents the set of distinct object types after mapping; multiple instances of the same type are mapped to their corresponding object type through the mapping function.
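A small Python sketch may help make the object type mapping concrete. It assumes the mapping is stored as a dictionary from object to type; the object names, the two type labels and the helper names are hypothetical.

```python
def object_type_set(type_of):
    """The set A of distinct object types produced by the mapping."""
    return set(type_of.values())


def group_by_type(type_of):
    """Group object instances under the type they are mapped to."""
    groups = {}
    for obj, obj_type in type_of.items():
        groups.setdefault(obj_type, set()).add(obj)
    return groups


# Hypothetical objects of the fault graph and their types.
type_of = {
    "sysA": "system",
    "sysB": "system",
    "code1": "service_code",
}

A = object_type_set(type_of)
objects_by_type = group_by_type(type_of)
```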
Based on the obtained object type set A, corresponding exception propagation factors are allocated for different object types.
Corresponding anomaly propagation factors are assigned to the different object types based on their importance in the heterogeneous fault graph. Specifically, the anomaly propagation factors may be assigned based on expert knowledge, or learned from historical data through a combinatorial search optimization algorithm such as simulated annealing.
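As one possible sketch of the simulated annealing option, the Python code below searches for propagation factors that maximize an objective function. Everything concrete here is an assumption: the factors are treated as values in [0, 1] per ordered type pair, the `evaluate` callback stands in for some measure of localization quality on historical incidents, and the toy objective only exists to make the example runnable.

```python
import math
import random


def anneal_factors(type_pairs, evaluate, steps=200, temp=1.0, cooling=0.98, seed=0):
    """Search propagation factors in [0, 1] that maximize evaluate(factors)."""
    rng = random.Random(seed)
    current = {pair: 0.5 for pair in type_pairs}
    current_score = evaluate(current)
    best, best_score = dict(current), current_score
    for _ in range(steps):
        candidate = dict(current)
        pair = rng.choice(type_pairs)
        step = rng.uniform(-0.1, 0.1)
        candidate[pair] = min(1.0, max(0.0, candidate[pair] + step))
        score = evaluate(candidate)
        # Always accept improvements; accept worse moves with a temperature-dependent probability.
        if score >= current_score or rng.random() < math.exp((score - current_score) / temp):
            current, current_score = candidate, score
            if score > best_score:
                best, best_score = dict(candidate), score
        temp *= cooling
    return best, best_score


# Toy objective (hypothetical): closeness to made-up target factors.
target = {("system", "service_code"): 0.8, ("service_code", "system"): 0.3}


def toy_objective(factors):
    return -sum((factors[p] - target[p]) ** 2 for p in target)


best, best_score = anneal_factors(list(target), toy_objective)
```

In practice the objective would have to be defined from labeled historical faults, which this text does not specify.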
Unlike prior-art methods, which do not consider the differences in anomaly propagation among object types, the present invention expresses the differences in anomaly propagation weight by setting anomaly propagation factors between different object types, which effectively improves the accuracy and specificity of the subsequent root cause score calculation.
Based on the obtained object set V, the PageRank algorithm is used to obtain the pivot value of each object through iterative calculation, and this value is used as the initial root cause score R_ea of each object,
where a represents any object in the object set V.
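The iterative PageRank computation can be sketched in plain Python as below. This is a generic textbook power iteration, not necessarily the exact variant used by the invention; the damping value 0.85, the iteration count, the edge direction (caller to callee) and the node names are assumptions.

```python
def pagerank(nodes, edges, damping=0.85, iterations=50):
    """Plain power-iteration PageRank on a directed graph given as (src, dst) pairs."""
    nodes = list(nodes)
    count = len(nodes)
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    rank = {node: 1.0 / count for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / count + damping * dangling / count
        new_rank = {node: base for node in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


# Hypothetical fault graph: two callers pointing at the same callee.
initial_scores = pagerank(["A", "B", "C"], [("A", "C"), ("B", "C")])
```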
Based on the obtained anomaly propagation factors and the initial root cause scores, the root cause score R_x of each object is determined.
Specifically, the root cause score R_x of the object x is obtained by:
where X and Y represent the set of objects of type X and the set of objects of type Y in the object type set A, respectively; x represents an object in the set of type X, and y represents an object in the set of type Y; R_x and R_y represent the root cause scores of object x and object y, respectively; M_xY is an adjacency matrix whose elements are denoted m_xY: if there is a relationship between object x and object type Y, then m_xY = num(x, Y), and otherwise m_xY = 0; num(x, Y) represents the total number of relationships between object x and all objects in the set of type Y; γ_XY denotes the anomaly propagation factor between object type X and object type Y; ε represents the attenuation factor and is selected based on expert knowledge.
By combining the object ordering algorithm for heterogeneous graphs, the present invention effectively solves the problem that the initial root cause score does not take into account the relationships among different object types.
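Because the score formula itself is not reproduced in this text, the following Python sketch shows only one plausible reading of it. The assumed update mixes each object's initial score with the scores of its neighbors, weighted by the propagation factor of the type pair and averaged over num(x, Y); the mixing form ε·initial + (1 − ε)·propagated, the averaging, and all names and numbers are assumptions rather than the patented formula.

```python
def root_cause_scores(initial, type_of, neighbors, gamma, epsilon=0.5, iterations=20):
    """Assumed update: R_x = eps * R_ex + (1 - eps) * sum_Y gamma_XY * mean of neighbor scores of type Y."""
    scores = dict(initial)
    for _ in range(iterations):
        updated = {}
        for x in initial:
            by_type = {}
            for y in neighbors.get(x, ()):
                by_type.setdefault(type_of[y], []).append(y)
            propagated = 0.0
            for y_type, ys in by_type.items():
                factor = gamma.get((type_of[x], y_type), 0.0)
                m_xy = len(ys)  # num(x, Y): relationships between x and objects of type Y
                propagated += factor * sum(scores[y] for y in ys) / m_xy
            updated[x] = epsilon * initial[x] + (1.0 - epsilon) * propagated
        scores = updated
    return scores


# Hypothetical two-object graph: one system and one service code.
type_of = {"sysA": "system", "code1": "service_code"}
neighbors = {"sysA": ["code1"], "code1": ["sysA"]}
gamma = {("system", "service_code"): 1.0, ("service_code", "system"): 1.0}
initial = {"sysA": 1.0, "code1": 0.0}

one_step = root_cause_scores(initial, type_of, neighbors, gamma, iterations=1)
```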
Based on the obtained root cause score of each object, the objects corresponding to the top-K root cause scores are selected as the fault root cause positioning result,
where the top-K root cause scores are the K largest root cause scores.
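Selecting the top-K objects is a simple ranking step. A minimal Python sketch, assuming the root cause scores are held in a dictionary (the object names and scores are hypothetical):

```python
import heapq


def top_k_root_causes(scores, k):
    """Return the k (object, score) pairs with the largest root cause scores."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Hypothetical root cause scores.
candidates = top_k_root_causes({"sysA": 0.1, "sysB": 0.7, "code1": 0.4}, k=2)
```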
Specifically, the obtained fault root cause positioning result is displayed in visual form, as shown in FIG. 6, for the system administrator's reference.
Embodiment III:
This embodiment provides a storage medium storing a computer program; when the computer program is executed by a processor in a computer device, the computer device performs the method of any of the above claims.
The present invention adopts a heterogeneous topological graph, so that the finer-grained service code calling relations and membership relations are displayed concisely and clearly. By fusing the correlation features of the multidimensional indexes, the accuracy of index anomaly detection on the calling edges of the heterogeneous topology is effectively improved. Root cause positioning uses a node ordering algorithm for heterogeneous graphs, which considers both the pivot value of anomaly propagation of the objects in the heterogeneous graph and the anomaly propagation factors among different object types. After the system monitoring data passes through this algorithm processing framework, the heterogeneous fault graph and the root cause system corresponding to the current alarm are obtained automatically by the machine learning algorithms and are presented, in visual form and as root cause recommendations, for analysis and handling. This assists the administrator in positioning the fault root cause efficiently and effectively improves the accuracy of fault root cause positioning.
The foregoing description, for purposes of explanation, used specific nomenclature to provide a thorough understanding of the embodiments. However, it will be apparent to one skilled in the art that the embodiments may be practiced without the specific details. Thus, the foregoing descriptions of specific embodiments described herein are presented for purposes of illustration and description. It is not intended to be exhaustive or to limit the embodiments to the precise form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art in light of the above teachings. Additionally, as used herein to refer to the position of a component, the terms above and below or their synonyms do not necessarily refer to absolute positions relative to external references, but rather to relative positions of components with reference to the figures.
Furthermore, the foregoing figures and description include many concepts and features that can be combined in various ways to achieve various benefits and advantages. Thus, features, components, elements, and/or concepts from the various figures may be combined to produce embodiments or implementations that are not necessarily shown or described in this specification. Furthermore, not all of the features, components, elements, and/or concepts illustrated in the drawings or description may be required in any particular embodiment and/or implementation. It should be understood that such embodiments and/or implementations fall within the scope of the present description.

Claims (9)

1. The fault root cause positioning method based on the service code level is characterized by comprising the following steps:
S1, constructing a global heterogeneous topological graph comprising calling relations among systems and calling relations among service codes;
S2, constructing a time series anomaly detection model based on multidimensional indexes, and carrying out anomaly detection on each calling edge of the global heterogeneous topological graph;
S3, generating a heterogeneous fault graph based on the anomaly detection result of each calling edge;
S4, positioning the fault root cause in the obtained heterogeneous fault graph based on a random walk object level ordering algorithm;
the step S2 further comprises the following steps:
S2.1, normalizing the time series of the time window corresponding to the n indexes;
S2.2, learning the fusion features of the nodes through a graph attention mechanism;
S2.3, learning the embedded features of the time series corresponding to different indexes based on the obtained fusion features H_i of all the nodes;
S2.4, obtaining predicted values of the time series of all indexes at time t based on the obtained embedded features of the time series corresponding to the different indexes;
S2.5, calculating an anomaly score value score(t) representing the degree of deviation of the indexes based on the obtained predicted values of the time series of all indexes at time t;
S2.6, judging whether the calling edge is abnormal based on the obtained anomaly score value score(t).
2. The service code level based fault cause localization method of claim 1, wherein each of the call edges of the global heterogeneous topology is time series index data generated by aggregation of transaction detail data and a set time granularity, the index data comprising at least a combination of two or more of transaction amount, effort amount, response amount, failure amount, unresponsive amount, success rate, response time.
3. The service code level based fault root location method as defined in claim 1, wherein the S2.2 learning the fusion feature of the node through a graph attention mechanism comprises:
the fusion feature h_i of node i is calculated by the following formula:
$$h_i = \sigma\left(\sum_{j \in N(i)} \alpha_{ij} v_j\right)$$
where N(i) represents the set of neighbor nodes of node v_i, v_j represents a neighbor node of node v_i, σ represents the sigmoid activation function, α_ij represents the association weight between node v_i and node v_j, and node v_j represents the w-dimensional feature vector corresponding to the j-th KPI index;
the association weight α_ij is calculated by the following formulas:
$$e_{ij} = \mathrm{LeakyReLU}\left(W \cdot (v_i \oplus v_j)\right)$$
$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{l=1}^{L} \exp(e_{il})}$$
where e_ij represents the attention value of the calling edge between node v_i and node v_j, e_il represents the attention value of the calling edge between node v_i and node v_l, ⊕ represents the feature concatenation operation, LeakyReLU is an activation function, W represents a learnable parameter matrix, L represents the number of neighbor nodes of node v_i, and l represents the sequence number of a neighbor node of node v_i.
4. The service code level based fault root location method as defined in claim 1, wherein the S4 comprises the steps of:
S4.1, determining an object set V and an object type set A based on the heterogeneous fault graph generated in the step S3;
S4.2, assigning corresponding anomaly propagation factors to different object types based on the obtained object type set A;
S4.3, based on the obtained object set V, adopting the PageRank algorithm to obtain the pivot value of each object through iterative calculation as the initial root cause score R_ea of each object;
S4.4, determining the root cause score R_x of each object based on the obtained anomaly propagation factors and the initial root cause scores;
S4.5, selecting the objects corresponding to the top-K root cause scores as the fault root cause positioning result based on the obtained root cause score of each object.
5. The service code level based fault root location method as defined in claim 4, wherein S4.2 comprises: the anomaly propagation factors are assigned by expert knowledge or by a combinatorial search optimization algorithm based on historical data.
6. The service code level based fault root location method as defined in claim 4, wherein S4.4 comprises:
the root cause score R_x of object x is calculated by the following formula:
where X and Y represent the set of objects of type X and the set of objects of type Y in the object type set A, respectively; x represents an object in the set of type X, and y represents an object in the set of type Y; R_x and R_y represent the root cause scores of object x and object y, respectively, and R_ex is the initial root cause score of node x; M_xY is an adjacency matrix whose elements are denoted m_xY: m_xY = num(x, Y) if there is a relationship between object x and object type Y, and m_xY = 0 if there is no relationship between object x and object type Y; num(x, Y) represents the total number of relationships between object x and all objects in the set of type Y; γ_XY denotes the anomaly propagation factor between object type X and object type Y; ε represents the attenuation factor.
7. The service code level based fault root cause localization method as claimed in claim 4, wherein the root cause score of top-K represents the first K largest root cause scores; the S4.5 further comprises: and displaying the obtained fault root cause positioning result in a visual form.
8. The fault root cause positioning system based on the service code level is characterized by mainly comprising the following modules:
a global heterogeneous topological graph generation module, used for constructing a global heterogeneous topological graph comprising calling relations among systems and calling relations among service codes;
an anomaly detection module, used for constructing a time series anomaly detection model based on multidimensional indexes and carrying out anomaly detection on each calling edge of the global heterogeneous topological graph;
a heterogeneous fault graph generation module, used for generating a heterogeneous fault graph based on the anomaly detection result of each calling edge;
a fault root cause positioning module, used for positioning the fault root cause in the obtained heterogeneous fault graph based on a random walk object level ordering algorithm;
wherein constructing the time series anomaly detection model based on multidimensional indexes and carrying out anomaly detection on each calling edge of the global heterogeneous topological graph further comprises the following steps:
S2.1, normalizing the time series of the time window corresponding to the n indexes;
S2.2, learning the fusion features of the nodes through a graph attention mechanism;
S2.3, learning the embedded features of the time series corresponding to different indexes based on the obtained fusion features H_i of all the nodes;
S2.4, obtaining predicted values of the time series of all indexes at time t based on the obtained embedded features of the time series corresponding to the different indexes;
S2.5, calculating an anomaly score value score(t) representing the degree of deviation of the indexes based on the obtained predicted values of the time series of all indexes at time t;
S2.6, judging whether the calling edge is abnormal based on the obtained anomaly score value score(t).
9. A storage medium, characterized in that it stores a computer program; when the computer program is executed by a processor in a computer device, the computer device performs the method according to any one of claims 1-7.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111127982.7A CN113900844B (en) 2021-09-26 2021-09-26 Fault root cause positioning method, system and storage medium based on service code level

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111127982.7A CN113900844B (en) 2021-09-26 2021-09-26 Fault root cause positioning method, system and storage medium based on service code level

Publications (2)

Publication Number Publication Date
CN113900844A CN113900844A (en) 2022-01-07
CN113900844B true CN113900844B (en) 2024-07-09

Family

ID=79029270

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111127982.7A Active CN113900844B (en) 2021-09-26 2021-09-26 Fault root cause positioning method, system and storage medium based on service code level

Country Status (1)

Country Link
CN (1) CN113900844B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114615019B (en) * 2022-02-15 2024-01-16 北京云集智造科技有限公司 Anomaly detection method based on micro-service topological relation generation
CN114598539B (en) * 2022-03-16 2024-03-01 京东科技信息技术有限公司 Root cause positioning method and device, storage medium and electronic equipment
CN114896093A (en) * 2022-05-10 2022-08-12 国泰君安证券股份有限公司 Method, device, processor and storage medium for realizing fault root recommendation processing of multi-component software system based on index correlation
CN115333921B (en) * 2022-08-20 2024-03-29 海南大学 Micro-service abnormal root cause positioning method and device
CN115514617B (en) * 2022-09-13 2024-06-21 上海驻云信息科技有限公司 Universal abnormal root cause positioning and analyzing method and device
CN115509789B (en) * 2022-09-30 2023-08-11 中国科学院重庆绿色智能技术研究院 Method and system for predicting faults of computing system based on component call analysis

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105160181A (en) * 2015-09-02 2015-12-16 华中科技大学 Detection method of abnormal data of numerical control system instruction field sequence
CN110888755A (en) * 2019-11-15 2020-03-17 亚信科技(中国)有限公司 Method and device for searching abnormal root node of micro-service system

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113395108B (en) * 2020-03-12 2022-12-27 华为技术有限公司 Fault processing method, device and system
CN111597070B (en) * 2020-07-27 2020-11-27 北京必示科技有限公司 Fault positioning method and device, electronic equipment and storage medium
CN112698975B (en) * 2020-12-14 2022-09-27 北京大学 Fault root cause positioning method and system of micro-service architecture information system

Also Published As

Publication number Publication date
CN113900844A (en) 2022-01-07

Similar Documents

Publication Publication Date Title
CN113900844B (en) Fault root cause positioning method, system and storage medium based on service code level
WO2022068645A1 (en) Database fault discovery method, apparatus, electronic device, and storage medium
CN111858123B (en) Fault root cause analysis method and device based on directed graph network
WO2021213247A1 (en) Anomaly detection method and device
US11348023B2 (en) Identifying locations and causes of network faults
CN110032463B (en) System fault positioning method and system based on Bayesian network
CN113935497A (en) Intelligent operation and maintenance fault processing method, device and equipment and storage medium thereof
CN115237717A (en) Micro-service abnormity detection method and system
CN114785666B (en) Network troubleshooting method and system
CN112415331B (en) Power grid secondary system fault diagnosis method based on multi-source fault information
CN113962273B (en) Multi-index-based time series anomaly detection method and system and storage medium
CN112559237B (en) Operation and maintenance system troubleshooting method and device, server and storage medium
CN115514627B (en) Fault root cause positioning method and device, electronic equipment and readable storage medium
CN114579407B (en) Causal relationship inspection and micro-service index prediction alarm method
CN115421950B (en) Automatic system operation and maintenance management method and system based on machine learning
CN112463892A (en) Early warning method and system based on risk situation
CN111027591B (en) Node fault prediction method for large-scale cluster system
CN114637649A (en) Alarm root cause analysis method and device based on OLTP database system
CN116820826B (en) Root cause positioning method, device, equipment and storage medium based on call chain
CN116668264A (en) Root cause analysis method, device, equipment and storage medium for alarm clustering
CN111144720A (en) Association analysis method and device of operation and maintenance scene and computer readable storage medium
CN115454787A (en) Alarm classification method and device, electronic equipment and storage medium
CN114676021A (en) Job log monitoring method and device, computer equipment and storage medium
CN110738326B (en) Selection method and device of artificial intelligence service system model
CN115408182A (en) Service system fault positioning method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant