CN115640159A - Micro-service fault diagnosis method and system
- Publication number
- CN115640159A CN202211368449.4A
- Authority
- CN
- China
- Prior art keywords
- micro
- service
- modal
- fault diagnosis
- representations
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Debugging And Monitoring (AREA)
Abstract
The invention discloses a micro-service fault diagnosis method comprising the following steps: S1, aggregating historical trace files and constructing a directed micro-service dependency graph from the call chains; S2, given multi-source monitoring data within the same time window, modeling the data to extract multi-modal representations; S3, fusing the multi-modal representations into latent node embeddings that represent the intra-service behavior of each micro-service; S4, feeding the node embeddings of the micro-services and the directed dependency graph into a graph neural network to obtain a global system-level state representation; and S5, performing anomaly detection and root cause localization based on the system-level state representation. The invention integrates anomaly detection and root cause localization into one end-to-end model, so both results are obtained by inputting the relevant monitoring data only once, which makes micro-service fault diagnosis more automated and intelligent and allows faults to be resolved more conveniently and quickly.
Description
Technical Field
The invention relates to the technical field of micro-service fault diagnosis, in particular to a micro-service fault diagnosis method and system.
Background
Due to the complexity and dynamics of microservice systems, anomalies are inevitable. An anomaly in one microservice may propagate to other microservices, causing a hub to fail and even the whole system to crash. Developers must therefore closely monitor microservice status through runtime information such as traces, system logs, and Key Performance Indicators (KPIs) in order to discover and resolve potential failures early.
Past fault diagnosis techniques have focused on either anomaly detection or root cause localization. Anomaly detection automatically monitors whether the system is anomalous, and root cause localization identifies the specific faulty microservice once an anomaly exists. Previous methods typically apply statistical models or machine learning techniques to mine information from traces, because traces record microservice execution and important inter-service information. The prior art has two main limitations:
1. Insufficient utilization of system monitoring information
Unlike operations teams, which pay close attention to runtime information from different sources, the prior art relies heavily on trace records and makes insufficient use of other data sources. This is mainly due to the complexity of multi-source data analysis, which is much harder than single-source data analysis because multi-source data is heterogeneous, often interdependent, and large in scale. However, traces, while containing important information for fault diagnosis, are not sufficient to reveal all typical types of anomalies. Meanwhile, other types of data such as logs and KPIs can jointly reveal anomalous conditions and provide more clues for mining potential faults.
2. Neglecting the connection between the two tasks involved
Typically, root cause localization follows anomaly detection, because an anomaly must be discovered before it can be analyzed. Current research on microservice reliability treats these two phases as independent even though they share inputs and knowledge about microservice status, so existing methods often process the same inputs (i.e., monitoring information) redundantly and waste the rich correlation between the two tasks. Furthermore, the trade-off between computational efficiency and accuracy makes it impractical to simply chain a state-of-the-art anomaly detector with a root cause localizer: running an advanced, complex anomaly detector and then analyzing the root cause is often inefficient, whereas a simple anomaly detection method may produce many false detections that degrade the accuracy of the downstream task, i.e., root cause localization.
These two limitations are major reasons for the low efficiency, accuracy, and degree of automation of anomaly detection and root cause localization in prior-art micro-service fault diagnosis.
Disclosure of Invention
The invention aims to solve the technical problem that anomaly detection and root cause localization in micro-service fault diagnosis suffer from low efficiency, low accuracy, and a low degree of automation, and provides a micro-service fault diagnosis method and system.
The technical problem of the invention is solved by the following technical scheme:
A micro-service fault diagnosis method comprises the following steps:
S1, aggregating historical trace files and constructing a directed micro-service dependency graph from the call chains;
S2, given multi-source monitoring data within the same time window, modeling the data to extract multi-modal representations;
S3, fusing the multi-modal representations into latent node embeddings that represent the intra-service behavior of each micro-service;
S4, feeding the node embeddings of the micro-services and the directed dependency graph into a graph neural network to obtain a global system-level state representation;
S5, performing anomaly detection and root cause localization based on the system-level state representation: first determining whether the micro-service system is anomalous and, if not, directly outputting the result; if it is anomalous, outputting the probability that each micro-service is the root cause of the anomaly, ranking the micro-services by that probability, and outputting the ranking.
In some embodiments, the following technical features are also included:
Step S1 specifically includes: extracting call chains from historical trace files recorded in the normal state, treating each micro-service as a node and each call as a directed edge, thereby constructing the directed dependency graph.
In step S2, the multi-source monitoring data comprises at least one of logs, key performance indicators (KPIs), and traces; the multi-modal representations comprise at least one of a log representation, a KPI representation, and a trace representation.
Further:
Modeling the logs to extract a representation specifically includes: first parsing the logs into events, then counting the occurrences of each event per time unit to form an event occurrence time series, then modeling the series with a Hawkes model to obtain an intensity vector, and finally embedding the intensity vector through a fully connected layer to obtain the log representation.
Modeling the KPIs to extract a representation specifically includes: applying one-dimensional causal convolution to extract the temporal dependencies and cross-series correlations of the KPIs, obtaining the KPI representation.
Modeling the traces to extract a representation specifically includes: first extracting latency information from the traces to form a latency time series, and then applying one-dimensional causal convolution to extract its temporal dependencies, obtaining the trace representation.
In other embodiments, the following technical features are also included:
Step S3 specifically includes: concatenating the multi-modal representations, which lie in the same feature space, and feeding the concatenated result into a gated linear unit to fuse the representations thoroughly and retain the meaningful information, obtaining the intra-service behavior representation of each micro-service.
Step S4 specifically includes: feeding the node embeddings of the micro-services and the directed dependency graph into a graph attention network to learn the dependencies among the micro-services, and finally applying attention pooling to obtain the global system-level state representation.
The technical problem of the invention is also solved by the following technical schemes:
A micro-service fault diagnosis system adopting the above method comprises a modal-by-modal learning module, a dependency-aware state learning module, and a joint detection and localization module;
the modal-by-modal learning module models the multi-source monitoring data to extract multi-modal representations;
the dependency-aware state learning module constructs the directed micro-service dependency graph, fuses the multi-modal representations, and models the directed dependency graph to obtain the global system-level state representation;
the joint detection and localization module comprises a detector and a root cause localizer and performs anomaly detection and root cause localization: the detector first determines whether the micro-service system is anomalous and, if not, the result is output directly; if it is anomalous, the root cause localizer is triggered, outputs the probability that each micro-service is the root cause of the anomaly, ranks the micro-services by that probability, and outputs the ranking.
The detector and the localizer both consist of stacked fully connected layers and are trained jointly by sharing one objective.
A microservice fault diagnosis device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the method as described above when executing the computer program.
A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method as set forth above.
Compared with the prior art, the invention has the advantages that:
The micro-service fault diagnosis method and system provided by the invention consider multi-source monitoring data together, so that monitoring data from different sources and with different emphases complement one another and act synergistically; the intra-service behavior and the dependencies between micro-services can therefore be modeled accurately and at a fine granularity. The invention also integrates anomaly detection and root cause localization into one end-to-end model, so both results are obtained by inputting the relevant monitoring data only once, which makes micro-service fault diagnosis more automated and intelligent. Moreover, the method quickly and accurately provides clear clues about the micro-service state, including whether the whole system is anomalous and a list of micro-services ranked by root cause likelihood, which helps operation and maintenance engineers resolve faults more conveniently and quickly.
In some embodiments, the invention uses joint training to fully exploit the correlation and shared knowledge between anomaly detection and root cause localization, which avoids modeling the same input data repeatedly and prevents inaccurate upstream anomaly detection results from severely degrading the accuracy of downstream root cause localization.
Other advantages of embodiments of the present invention will be further described below.
Drawings
FIG. 1 is a technical schematic diagram of a micro-service fault diagnosis method in an embodiment of the present invention;
FIG. 2 is a flow chart of a micro-service fault diagnosis method in an embodiment of the present invention;
FIG. 3 is a schematic diagram illustrating the presentation of a micro-service fault diagnosis system according to an embodiment of the present invention.
Detailed Description
Before describing the embodiments of the present invention, the idea of the present invention is described as follows:
Similar techniques in the past have focused primarily on either anomaly detection or root cause localization, and none has integrated both into an end-to-end framework to provide more comprehensive, finer-grained fault diagnosis information. This is mainly because multi-task learning has only begun to flourish in artificial intelligence in recent years, and mainly in natural language processing and computer vision. In addition, the concept of intelligent operation and maintenance, which offers a more holistic view of software reliability, has emerged in China only in the last few years; earlier inventions usually addressed a local technical problem and improved its accuracy as far as possible without considering related technical problems from a macroscopic perspective. Building on the rise of intelligent operation and maintenance, the invention views the problem more comprehensively and therefore integrates multiple tasks end to end to fully exploit the correlation and shared knowledge between the two tasks.
The invention aims to achieve accurate and efficient end-to-end micro-service fault diagnosis (including anomaly detection and root cause localization). As micro-service systems become widespread, their flexibility and variability challenge conventional fault diagnosis systems that depend on a single data source such as traces: the data is mined insufficiently and the micro-service state is represented and learned inaccurately. In addition, conventional anomaly detection and root cause localization are always performed separately, which reduces efficiency and lets inaccurate anomaly detection results degrade the accuracy of downstream root cause localization. The invention therefore integrates the two tasks into one end-to-end model through joint training, performing anomaly detection and root cause localization at the same time.
Eadro, provided by the invention, is a novel end-to-end architecture that integrates anomaly detection and root cause analysis to troubleshoot micro-service systems automatically and takes multi-source monitoring data (logs, traces, and KPIs) into account. The invention aims to achieve efficient, accurate, and automatic micro-service anomaly detection and root cause localization, thereby providing end-to-end fault diagnosis, giving operation and maintenance personnel high-quality reference information, and reducing the burden on engineers. The core idea is to learn intra-service behavior from multi-modal data and to capture the dependencies between micro-services, from which an overall view of the system state is inferred. The invention comprises three core modules: modal-by-modal learning, dependency-aware state learning, and joint detection and localization. The invention models multi-source monitoring data with state-of-the-art multi-modal learning techniques and integrates anomaly detection and root cause localization to mine their shared knowledge, so it can provide a richer, more accurate, and finer-grained basis for fault diagnosis.
The invention will be further described with reference to the accompanying drawings and preferred embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
Some embodiments of the present invention provide a micro-service fault diagnosis method, including the following steps:
S1, aggregating historical trace files and constructing a directed micro-service dependency graph from the call chains.
In step S1, call chains are extracted from historical trace files recorded in the normal state; each micro-service is treated as a node and each call as a directed edge, thereby constructing the directed dependency graph.
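The graph construction of step S1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `build_dependency_graph` and the `(caller, callee)` record format are assumptions made for this example.

```python
def build_dependency_graph(call_records):
    """Build a directed dependency graph from (caller, callee) pairs.

    Each micro-service becomes a node and each distinct call becomes a
    directed edge caller -> callee. The record format is hypothetical.
    """
    graph = {}
    for caller, callee in call_records:
        graph.setdefault(caller, set()).add(callee)
        graph.setdefault(callee, set())  # keep leaf services as nodes
    return {node: sorted(edges) for node, edges in graph.items()}


# Hypothetical call chain extracted from historical traces.
calls = [("gateway", "order"), ("order", "payment"), ("gateway", "order")]
graph = build_dependency_graph(calls)
```

Repeated calls between the same pair collapse into one edge, since only the existence of a dependency matters for the topology.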
S2, given multi-source monitoring data within the same time window, modeling the data to extract multi-modal representations.
In step S2, the multi-source monitoring data comprises at least one of logs, key performance indicators (KPIs), and traces; the multi-modal representations comprise at least one of a log representation, a KPI representation, and a trace representation. Modeling the logs to extract a representation specifically includes: first parsing the logs into events, then counting the occurrences of each event per time unit to form an event occurrence time series, then modeling the series with a Hawkes model to obtain an intensity vector, and finally embedding the intensity vector through a fully connected layer to obtain the log representation. Modeling the KPIs to extract a representation specifically includes: applying one-dimensional causal convolution to extract the temporal dependencies and cross-series correlations of the KPIs, obtaining the KPI representation. Modeling the traces to extract a representation specifically includes: first extracting latency information from the traces to form a latency time series, and then applying one-dimensional causal convolution to extract its temporal dependencies, obtaining the trace representation.
S3, fusing the multi-modal representations into latent node embeddings that represent the intra-service behavior of each micro-service.
In step S3, the multi-modal representations, which lie in the same feature space, are concatenated and fed into a gated linear unit to fuse them thoroughly and retain the meaningful information, obtaining the intra-service behavior representation of each micro-service.
S4, feeding the node embeddings of the micro-services and the directed dependency graph into a graph neural network to obtain a global system-level state representation.
In step S4, the node embeddings of the micro-services and the directed dependency graph are fed into a graph attention network to learn the dependencies among the micro-services, and attention pooling is finally applied to obtain the global system-level state representation.
S5, performing anomaly detection and root cause localization based on the system-level state representation: first determining whether the micro-service system is anomalous and, if not, directly outputting the result; if it is anomalous, outputting the probability that each micro-service is the root cause of the anomaly, ranking the micro-services by that probability, and outputting the ranking.
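The decision flow of step S5 at detection time might look like the sketch below. The function name `diagnose`, the 0.5 threshold, and the dictionary output format are assumptions for illustration; the text only specifies detecting first and ranking by probability when an anomaly exists.

```python
def diagnose(anomaly_prob, root_cause_probs, threshold=0.5):
    """Detect first; rank micro-services only if the system is anomalous.

    anomaly_prob: detector output for the current window (assumed in [0, 1]).
    root_cause_probs: {service_name: probability of being the root cause}.
    threshold: assumed decision threshold, not taken from the source.
    """
    if anomaly_prob < threshold:
        return {"anomalous": False, "ranking": []}
    ranking = sorted(root_cause_probs.items(), key=lambda kv: kv[1], reverse=True)
    return {"anomalous": True, "ranking": ranking}


result = diagnose(0.9, {"order": 0.1, "payment": 0.7, "gateway": 0.2})
```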
Some embodiments of the present invention further provide a micro-service fault diagnosis system using the method described above, including a modal-by-modal learning module, a dependency-aware state learning module, and a joint detection and localization module. The modal-by-modal learning module models the multi-source monitoring data to extract multi-modal representations. The dependency-aware state learning module constructs the directed micro-service dependency graph, fuses the multi-modal representations, and models the directed dependency graph to obtain the global system-level state representation. The joint detection and localization module comprises a detector and a root cause localizer and performs anomaly detection and root cause localization: the detector first determines whether the micro-service system is anomalous and, if not, the result is output directly; if it is anomalous, the root cause localizer is triggered, outputs the probability that each micro-service is the root cause of the anomaly, ranks the micro-services by that probability, and outputs the ranking. The detector and the localizer both consist of stacked fully connected layers and are trained jointly by sharing one objective.
Further embodiments of the present invention provide a microservice fault diagnosis apparatus comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the method as described above when executing the computer program.
Further embodiments of the invention also provide a computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, carries out the method as described above.
Example 1
The embodiment of the invention provides a micro-service fault diagnosis system, the technical principle of which is shown in figure 1, and the system specifically comprises the following three parts:
1. Modal-by-modal learning
This module models the monitoring data from each source separately: for each modality, a model suited to that modality's characteristics learns its representation.
For logs, the number of occurrences of each log event per unit time is counted, and the event frequency is then modeled with a self-exciting Hawkes process. This is because a past event increases the likelihood that the event will occur again in the near future, which matches the assumption of a self-exciting process. Specifically, log modeling consists of three steps: 1) given a log file x, the logs are first parsed into events with the Drain log parser, i.e., the variables in each log message are removed and the constants written by programmers are kept; 2) the timestamps of the event occurrences (relative to the start timestamp of the observation window) are then recorded to estimate the parameters of a Hawkes model with an exponential decay kernel within the window, thereby converting the log file x into an intensity vector λ; 3) the intensity vector λ is embedded into a feature space of a specified dimension through a fully connected layer, yielding the log representation HL.
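To make the exponential-kernel Hawkes model concrete, the sketch below evaluates the standard conditional intensity λ(t) = μ + α Σ exp(−β (t − t_i)) over past events t_i < t. The parameter values are placeholders; the parameter estimation mentioned in step 2 is not shown, and the function name `hawkes_intensity` is an assumption.

```python
import math


def hawkes_intensity(t, event_times, mu, alpha, beta):
    """Conditional intensity of a Hawkes process with exponential kernel.

    mu: base rate; alpha: jump added by each past event;
    beta: decay rate of that excitation. Only events before t contribute.
    """
    excitation = sum(math.exp(-beta * (t - ti)) for ti in event_times if ti < t)
    return mu + alpha * excitation


# Hypothetical event timestamps relative to the window start.
events = [0.0, 0.4, 0.5]
just_after = hawkes_intensity(0.6, events, mu=0.5, alpha=0.8, beta=1.0)
much_later = hawkes_intensity(10.0, events, mu=0.5, alpha=0.8, beta=1.0)
```

The intensity is high right after a burst of events and decays back toward the base rate `mu`, which is the self-exciting behavior described above.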
For KPIs, multivariate KPI monitoring data of length T is first collected from each microservice; a lightweight, parallelizable one-dimensional dilated causal convolution layer then learns the temporal dependencies and cross-series relations of the KPIs, and a self-attention mechanism is applied to obtain a more reasonable representation. The KPIs are thereby embedded into the same feature space as the log representation, yielding the KPI representation HK.
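A one-dimensional dilated causal convolution can be illustrated on a single series as below. This sketch covers only the convolution for one channel with fixed weights; the learned multi-channel layer and the self-attention step are omitted, and `dilated_causal_conv1d` is a hypothetical name.

```python
def dilated_causal_conv1d(series, kernel, dilation=1):
    """y[t] = sum_k kernel[k] * series[t - k * dilation].

    Positions before the start of the series are treated as zero, so the
    output at time t never depends on values after t (causality).
    """
    out = []
    for t in range(len(series)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:
                acc += weight * series[idx]
        out.append(acc)
    return out


# Each output mixes the current value with the value two steps earlier.
y = dilated_causal_conv1d([1.0, 2.0, 3.0, 4.0], kernel=[1.0, 1.0], dilation=2)
```

Increasing the dilation widens the receptive field without adding weights, which is why such layers stay lightweight on long series.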
For traces, the invention first extracts latency from the trace files, i.e., the time one microservice takes to invoke another, and converts it into a time series by computing the average latency of each callee in each time slot, so that each microservice (i.e., callee) has a univariate latency time series of length T. As with KPI learning, the latency time series is fed into a one-dimensional dilated causal convolution layer, and a self-attention operation then learns the latent representation, yielding the trace representation HT.
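The conversion of trace spans into per-callee latency series could be sketched as follows. The span format `(callee, timestamp, latency)`, the function name `latency_series`, and filling empty slots with 0.0 are all assumptions for this example; the source does not say how empty slots are handled.

```python
def latency_series(spans, window_start, slot_len, num_slots):
    """Average latency of each callee per time slot.

    spans: iterable of (callee, timestamp, latency) tuples (hypothetical format).
    Returns {callee: [average latency in slot 0, slot 1, ...]}.
    """
    sums, counts = {}, {}
    for callee, ts, latency in spans:
        slot = int((ts - window_start) // slot_len)
        if 0 <= slot < num_slots:
            sums.setdefault(callee, [0.0] * num_slots)[slot] += latency
            counts.setdefault(callee, [0] * num_slots)[slot] += 1
    return {
        callee: [s / n if n else 0.0 for s, n in zip(sums[callee], counts[callee])]
        for callee in sums
    }


spans = [("order", 0.5, 10.0), ("order", 0.7, 20.0), ("order", 2.1, 30.0)]
series = latency_series(spans, window_start=0.0, slot_len=1.0, num_slots=3)
```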
2. Dependency-aware state learning
The goal of this module is to model the state of each microservice and to represent the overall state of the system. The module consists of three steps: dependency graph construction, multi-modal fusion, and dependency graph modeling. Specifically, the relationships between the microservices are first extracted from historical traces and represented as a directed graph; the multi-modal representations obtained in the previous stage are then fused into latent node embeddings that represent the service-level state. Messages are propagated over the constructed directed graph by a graph neural network to learn the neighborhood dependencies expressed by the edge weights, which finally yields a dependency-aware representation of the overall state of the microservice system.
Dependency graph construction extracts all call records from the historical traces and builds a directed graph G in which the microservices involved are nodes and the call relations are directed edges.
In the multi-modal fusion step, the log representation HL, the KPI representation HK, and the trace representation HT obtained in the previous stage are concatenated into one large vector. To reduce the vector dimension and unnecessary computation, the concatenated vector is fed into a Gated Linear Unit (GLU), which fuses the representations nonlinearly and filters out potential redundancy. The GLU controls the bandwidth of the information flow and mitigates the vanishing gradient problem. It is also remarkably resilient to catastrophic forgetting, so it suits the application scenario of the invention, which involves a large amount of data and deeply stacked neural layers. The GLU outputs a multi-modal fused representation for each microservice, denoted HM, which serves as the node feature embedding representing the microservice-level state.
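A gated linear unit over the concatenated representations can be sketched as GLU(x) = (W x + b) ⊗ σ(V x + c). The weights below are placeholders rather than learned values, and the names `linear` and `glu_fuse` are assumptions made for this illustration.

```python
import math


def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(xi * w for xi, w in zip(x, row)) + b for row, b in zip(weights, bias)]


def glu_fuse(h_log, h_kpi, h_trace, w_val, b_val, w_gate, b_gate):
    """Concatenate the three representations and apply a gated linear unit."""
    x = h_log + h_kpi + h_trace  # list concatenation
    values = linear(x, w_val, b_val)
    gates = [1.0 / (1.0 + math.exp(-g)) for g in linear(x, w_gate, b_gate)]
    # Each gate in (0, 1) scales how much of the corresponding value passes.
    return [v * g for v, g in zip(values, gates)]


# Placeholder weights: 3 inputs -> 1 fused feature, gate fixed at 0.5.
fused = glu_fuse([1.0], [2.0], [3.0],
                 w_val=[[1.0, 1.0, 1.0]], b_val=[0.0],
                 w_gate=[[0.0, 0.0, 0.0]], b_gate=[0.0])
```

Choosing fewer output rows than input features is how the fusion reduces the vector dimension.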
Dependency graph modeling aims to obtain an overall system-level state representation by analyzing the microservice-level states together. Because interactions between microservices are naturally described by dependency graphs, the invention models them with a graph neural network to support downstream classification. In particular, the invention uses a Graph Attention Network (GAT) to learn the dependency-aware state of the microservice system. GAT supports learning node and edge representations and dynamically assigns weights to neighbors without performing computationally intensive spectral decomposition. The model can therefore focus on microservices with abnormal behavior or microservices located at communication hubs. After GAT models the dependencies between the microservices, global attention pooling is applied to the multi-modal representations of all nodes to obtain the state representation of the whole microservice system, finally yielding the dependency-aware representation HF of the overall system state.
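The final pooling step can be illustrated with a simple form of global attention pooling: each node gets a scalar score, the scores are normalized with a softmax, and the node embeddings are summed with those weights. The GAT message passing itself is not shown, and the single scoring vector `gate_weights` is an assumed simplification.

```python
import math


def attention_pool(node_embeddings, gate_weights):
    """Pool node embeddings into one graph-level vector.

    score_i = <h_i, gate_weights>; attn = softmax(scores);
    output = sum_i attn_i * h_i.
    """
    scores = [sum(h * w for h, w in zip(emb, gate_weights)) for emb in node_embeddings]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    attn = [e / total for e in exps]
    dim = len(node_embeddings[0])
    return [sum(a * emb[d] for a, emb in zip(attn, node_embeddings)) for d in range(dim)]


# With zero scoring weights every node gets equal attention (plain mean).
pooled = attention_pool([[1.0, 2.0], [3.0, 4.0]], gate_weights=[0.0, 0.0])
```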
3. Joint detection and localization
The last module integrates anomaly detection and root cause localization. It first predicts whether the current observation window is anomalous and, if so, determines which microservice is the root cause; it thereby makes full use of shared knowledge and integrates two closely related tasks into one end-to-end model. Specifically, a detector consisting of fully connected layers and an activation function is first constructed to perform binary classification and determine whether an anomaly is present. If there is no anomaly, the result is output directly; if there is an anomaly, a localizer is triggered to rank the microservices by their probability of being the root cause. Both the detector and the localizer consist of stacked fully connected layers and are trained jointly by sharing one objective. During training, the detector aims to minimize the binary cross-entropy loss L1; all samples predicted to be normal (0) are then masked, and the samples predicted to be anomalous (1) are passed to the localizer. The localizer tries to reduce the distance between the predicted and true probabilities by minimizing the multi-class cross-entropy loss L2. The optimization objective of this module is the weighted sum of the two sub-objectives, i.e., β × L1 + (1 − β) × L2, where β is a manually set hyperparameter between 0 and 1, typically 0.5. Finally, the invention outputs a list of microservices ranked by their predicted probability of being the root cause, which can be checked one by one.
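The joint objective β × L1 + (1 − β) × L2 with masking can be sketched numerically as below. The function name `joint_loss`, the 0.5 prediction threshold, and the choice to skip predicted-anomalous samples that have no root cause label are assumptions; the text states only that samples predicted normal are masked.

```python
import math


def joint_loss(det_probs, det_labels, loc_probs, root_labels, beta=0.5, eps=1e-12):
    """Weighted sum of detector loss L1 and localizer loss L2.

    det_probs: predicted anomaly probability per sample.
    det_labels: 1 for anomalous samples, 0 for normal ones.
    loc_probs: per-sample probability list over micro-services.
    root_labels: index of the true root cause, or None for normal samples.
    """
    n = len(det_probs)
    l1 = -sum(
        y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
        for p, y in zip(det_probs, det_labels)
    ) / n
    # Mask samples predicted normal; also skip samples without a root label (assumption).
    kept = [
        (probs, root)
        for p, probs, root in zip(det_probs, loc_probs, root_labels)
        if p >= 0.5 and root is not None
    ]
    l2 = -sum(math.log(probs[root] + eps) for probs, root in kept) / len(kept) if kept else 0.0
    return beta * l1 + (1 - beta) * l2


loss = joint_loss(
    det_probs=[0.9, 0.1], det_labels=[1, 0],
    loc_probs=[[0.8, 0.2], [0.5, 0.5]], root_labels=[0, None],
)
```

In a real training loop both terms would be computed on tensors so that gradients flow back into the shared representation; this version only shows how the two losses combine.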
Example 2
The embodiment of the invention also provides a micro-service fault diagnosis method, whose flow is shown in FIG. 2. First, historical trace files are collected and a directed micro-service dependency graph is built from the call chains. Then, given multi-source monitoring data in the same time window, including logs, traces, and KPIs, the modal-by-modal learning module models the three kinds of data, which differ in source and modality, to extract representations that serve as features of the internal behavior of each micro-service. The node embeddings of the micro-services and the previously obtained dependency graph are fed into the dependency-aware state learning module to learn the system-level state representation. In training mode, the representation is learned from the gradients backpropagated from the subsequent joint detection and localization module; in detection mode, it is passed to the downstream joint detection and localization module for a decision. The joint detection and localization module first determines from the system-level state representation whether the system is anomalous. If it is not, in training mode the sample is masked and excluded from the localizer loss, and in detection mode the result is output directly. If it is, the localizer is started; it outputs the probability that each micro-service is the root cause of the anomaly, ranks the micro-services by that probability, and outputs the ranking.
Verification example
The Eadro method provided by the embodiment of the invention is demonstrated with a micro-service troubleshooting system, whose presentation is shown in FIG. 3. The two charts in the leftmost column visualize key indicators, including write-in and write-out speed and CPU-utilization-related indicators. The upper part of the middle column displays the current system state and the name of the most likely root cause micro-service, and the lower part visualizes the communication latency among the micro-services extracted from the traces. The upper part of the right column shows the topology of the micro-services extracted from the historical traces, and the lower part shows the micro-service list produced by Eadro, ranked by each micro-service's probability of being the root cause. The two rightmost buttons allow the user to download the raw log files and trace files for local analysis.
To verify the accuracy of the embodiment of the invention in anomaly detection, anomaly detection experiments were carried out on operation data collected from two benchmark microservice systems, TrainTicket (TT) and SocialNetwork (SN); the results are shown in Table 1:
TABLE 1
In Table 1, Pre (Precision), Rec (Recall) and F1 are three indexes used to measure classification accuracy; the higher the index, the better the method performs. They are calculated as: Pre = TP/(TP + FP), Rec = TP/(TP + FN), F1 = 2 × Pre × Rec/(Pre + Rec), where TP is the number of abnormal samples that are correctly detected, FP is the number of normal samples that are erroneously marked as abnormal, and FN is the number of abnormal samples that are not identified. The methods (Approaches) in Table 1 include TraceAnomaly (a trace call-chain anomaly detection model), MultimodalTrace (an anomaly detection model based on traces and KPIs) and Eadro. As can be seen from the results in Table 1, the embodiment of the present invention shows a certain improvement in anomaly detection over the prior comparable techniques.
To further verify the accuracy of the embodiment of the present invention in root cause localization, root cause localization experiments on micro-service faults were carried out on the same data as used for Table 1; the results are shown in Table 2:
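The three indexes defined above can be computed directly from the confusion counts. The sketch below is illustrative only: the function name `precision_recall_f1` and the zero-division convention (returning 0.0 when a denominator is zero) are assumptions, and the counts in the usage note are made-up numbers, not results from Table 1.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute Pre, Rec and F1 from true positives, false positives, false negatives."""
    # Returning 0.0 on an empty denominator is a convention assumed here.
    pre = tp / (tp + fp) if (tp + fp) else 0.0
    rec = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * pre * rec / (pre + rec) if (pre + rec) else 0.0
    return pre, rec, f1
```

For example, with hypothetical counts TP = 8, FP = 2 and FN = 2, both Pre and Rec are 8/10, so F1 is also 0.8 up to floating-point rounding.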
TABLE 2
In Table 2, HR@K and NDCG@K denote the top-K hit rate and the top-K normalized discounted cumulative gain, respectively, and are used to evaluate the performance of the locator, where K = 1, 3, 5. HR@K calculates the overall probability that the root-cause micro-service is among the top-K predicted candidates in the output list, and NDCG@K measures the ranking quality, where p_j is the predicted probability that the j-th micro-service is the root cause, M is the number of micro-services, and N is the number of samples. The higher these two indexes are, the more accurately the root cause is located. The methods (Approaches) in Table 2 include TBAC, NetMedic, MonitorRank, CloudRanger, DyCause and Eadro; the first five are prior art (each performs root cause localization of micro-service faults). As can be seen from the results in Table 2, the embodiment of the present invention achieves a significant improvement in root cause localization over the prior comparable techniques.
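The two ranking indexes can be illustrated with a short sketch. The exact formulas are not reproduced in this text, so the code assumes the common formulation for a single ground-truth root cause per sample: a hit is counted when the true root cause ranks within the top K, and the gain of a sample is 1/log2(rank + 1) when the true root cause ranks within the top K and 0 otherwise. This may differ in detail from the formula used in the experiments. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def rank_of_truth(probs: Sequence[float], truth: int) -> int:
    """1-based rank of the true root cause when services are sorted by probability."""
    order = sorted(range(len(probs)), key=lambda j: probs[j], reverse=True)
    return order.index(truth) + 1


def hr_at_k(all_probs: List[Sequence[float]], truths: List[int], k: int) -> float:
    """Fraction of the N samples whose true root cause is in the top-K candidates."""
    hits = sum(1 for p, t in zip(all_probs, truths) if rank_of_truth(p, t) <= k)
    return hits / len(truths)


def ndcg_at_k(all_probs: List[Sequence[float]], truths: List[int], k: int) -> float:
    """Mean discounted gain; the ideal gain is 1 with one relevant item per sample."""
    total = 0.0
    for p, t in zip(all_probs, truths):
        rank = rank_of_truth(p, t)
        if rank <= k:
            total += 1.0 / math.log2(rank + 1)
    return total / len(truths)
```

Both functions take one probability vector of length M per sample, so the averages run over the N samples in the same way as the indexes described above.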
In some embodiments of the present invention, a micro-service fault diagnosis method is provided, which differs from the above embodiments in that one of the three data sources, namely logs, KPIs and traces, is omitted, because in an actual industrial scenario the data collection function may not yet be fully deployed at the beginning of a project.
In some other embodiments of the present invention, a micro-service fault diagnosis method is provided, which includes the following steps: G1, integrating more data sources, such as alarms, and performing representation learning with a typical alarm modeling method; G2, learning the representation of each data source with modeling approaches other than those of the above embodiment of the invention, for example, obtaining the log representation with a natural language processing technique instead of a Hawkes self-exciting model.
In other embodiments of the present invention, a software system, such as a distributed software system, may also use the foregoing method of embodiments of the present invention for anomaly detection and root cause localization.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing is a more detailed description of the invention in connection with specific or preferred embodiments and is not intended to limit the practice of the invention to those descriptions. It will be apparent to those skilled in the art that numerous alterations and modifications can be made to the described embodiments without departing from the inventive concepts herein, and such alterations and modifications are to be considered as within the scope of the invention. In the description herein, references to the terms "one embodiment," "some embodiments," "a preferred embodiment," "an example," "a specific example" or "some examples" or the like are intended to mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Various embodiments or examples, and features of various embodiments or examples, described in this specification can be combined by one skilled in the art provided that they do not contradict each other. Although embodiments of the present invention and their advantages have been described in detail, it should be understood that various changes, substitutions and alterations can be made herein without departing from the scope of the claims.
Claims (12)
1. A micro-service fault diagnosis method, characterized by comprising the following steps:
S1, collecting historical trace files, and constructing a directed micro-service dependency topological graph according to the call chains;
S2, given multi-source monitoring data within the same time window, modeling the multi-source monitoring data to extract multi-modal representations;
S3, fusing the multi-modal representations to serve as latent node embeddings, so as to obtain the intra-service behavior representation of each micro-service;
S4, inputting the node embeddings of the micro-services and the directed dependency topological graph into a graph neural network for modeling, so as to obtain a global system-level state representation;
S5, performing anomaly detection and root cause localization according to the system-level state representation: first judging whether the micro-service system is abnormal, and if not, directly outputting the result; if the system is abnormal, outputting the probability that each micro-service is the root cause of the anomaly, ranking the micro-services according to the probability, and outputting the ranking result.
2. The micro-service fault diagnosis method according to claim 1, wherein step S1 specifically comprises: extracting call chains from the historical trace files recorded in a normal state, regarding each micro-service as a node and each call as a directed edge, thereby constructing the directed dependency topological graph.
3. The micro-service fault diagnosis method of claim 1, wherein in step S2, the multi-source monitoring data includes at least one of a log, a key index, and a trace; the multi-modal representations include at least one of a log representation, a key index representation, and a trace representation.
4. The micro-service fault diagnosis method of claim 3, wherein modeling the log to extract the representation specifically comprises: first parsing the log into events, then counting the occurrences of the events in each time unit to organize an event occurrence time series, then modeling the series with a Hawkes model to obtain an intensity vector, and then embedding the intensity vector through a fully connected layer to obtain the log representation.
5. The micro-service fault diagnosis method of claim 3, wherein modeling the key index to extract the representation specifically comprises: adopting one-dimensional causal convolution to extract the temporal dependencies of the key indexes and the correlations among the index series, so as to obtain the key index representation.
6. The micro-service fault diagnosis method of claim 3, wherein modeling the trace to extract the representation specifically comprises: first extracting delay information from the trace to organize a delay time series, and then extracting the temporal dependencies by one-dimensional causal convolution to obtain the trace representation.
7. The micro-service fault diagnosis method according to claim 1, wherein step S3 specifically comprises: concatenating the multi-modal representations in the same feature space, and then inputting the concatenated multi-modal representations into a gated linear unit to fully fuse the multi-modal representations and filter for meaningful information, so as to obtain the intra-service behavior representation of each micro-service.
8. The micro-service fault diagnosis method according to claim 1, wherein step S4 specifically comprises: inputting the node embeddings of the micro-services and the directed dependency topological graph into a graph attention neural network, learning the dependency relationships among the micro-services, and finally performing attention pooling to obtain the global system-level state representation.
9. A micro-service fault diagnosis system employing the method of any of claims 1-8, comprising: a modal-by-modal learning module, a dependency-aware state learning module and a joint detection and localization module;
the modal-by-modal learning module is used for modeling multi-source monitoring data to extract multi-modal representations;
the dependency-aware state learning module is used for constructing the directed micro-service dependency topological graph, fusing the multi-modal representations and modeling the directed dependency topological graph, so as to obtain a global system-level state representation;
the joint detection and localization module comprises a detector and a root cause locator and is used for anomaly detection and root cause localization: first, the detector judges whether the micro-service system is abnormal, and if the system is not abnormal, the result is output directly; if the system is abnormal, the root cause locator is triggered, the root cause locator outputs the probability that each micro-service is the root cause of the anomaly, the micro-services are ranked according to the probability, and the ranking result is output.
10. The micro-service fault diagnosis system of claim 9, wherein the detector and the locator each consist of stacked fully connected layers and are jointly trained by sharing a single objective.
11. A microservice fault diagnosis device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1-8 when executing the computer program.
12. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-8.
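The preprocessing steps named in claims 2, 4 and 6 (graph construction from call chains, event counting per time unit, and one-dimensional causal convolution) can be sketched in plain Python as below. This is an illustration under stated assumptions only: the function names, the `(caller, callee)` and `(time_unit, event)` input formats, the example service names and the fixed kernel are all hypothetical, and the Hawkes modeling and learned network weights of the claimed method are deliberately left out.

```python
from collections import Counter
from typing import Dict, Iterable, List, Sequence, Set, Tuple


def build_dependency_graph(calls: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Each micro-service is a node; each (caller, callee) call is a directed edge."""
    graph: Dict[str, Set[str]] = {}
    for caller, callee in calls:
        graph.setdefault(caller, set()).add(callee)
        graph.setdefault(callee, set())  # keep services that call nothing as nodes
    return graph


def event_count_series(events: Iterable[Tuple[int, str]],
                       num_units: int) -> Dict[str, List[int]]:
    """Count the occurrences of each parsed log event in each time unit."""
    counts: Counter = Counter()
    names: Set[str] = set()
    for unit, event in events:
        if 0 <= unit < num_units:  # ignore events outside the time window
            counts[(unit, event)] += 1
            names.add(event)
    return {e: [counts[(u, e)] for u in range(num_units)] for e in sorted(names)}


def causal_conv1d(series: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Causal convolution: the output at step t uses only inputs at steps <= t."""
    out: List[float] = []
    for t in range(len(series)):
        acc = 0.0
        for i, weight in enumerate(kernel):
            if t - i >= 0:  # implicit left zero padding
                acc += weight * series[t - i]
        out.append(acc)
    return out
```

For instance, `build_dependency_graph([("gateway", "order"), ("order", "payment")])` yields a graph in which the hypothetical `gateway` service points to `order` and `order` points to `payment`, while `causal_conv1d` applied to a delay series never looks at future time steps, which is the property that makes the convolution causal.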
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211368449.4A CN115640159A (en) | 2022-11-03 | 2022-11-03 | Micro-service fault diagnosis method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115640159A true CN115640159A (en) | 2023-01-24 |
Family
ID=84946751
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211368449.4A Pending CN115640159A (en) | 2022-11-03 | 2022-11-03 | Micro-service fault diagnosis method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115640159A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||