CN117560275A - Root cause positioning method and device for micro-service system based on graphic neural network model - Google Patents
- Publication number
- CN117560275A CN202311854026.8A
- Authority
- CN
- China
- Prior art keywords
- micro
- neural network
- network model
- service
- root cause
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0677—Localisation of faults
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/079—Root cause analysis, i.e. error or fault diagnosis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/02—Knowledge representation; Symbolic representation
- G06N5/022—Knowledge engineering; Knowledge acquisition
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/16—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks using machine learning or artificial intelligence
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention relates to a root cause positioning method and device for a micro-service system based on a graph neural network model. The method comprises: constructing a graph neural network model; training the graph neural network model with multi-dimensional time-series performance metrics of historical faults to obtain a trained graph neural network model; constructing an instance-level heterogeneous topology graph of the micro-service system from the collected real-time micro-service topology and call relationships; adjusting the anomaly weight of each micro-service node in combination with the service request links; and inputting the root cause candidate set and the real-time metric feature data of the anomaly time window into the graph neural network model, obtaining the final root cause and its anomaly type after feature weighting. The device is used to implement the method. The beneficial effects of the invention are: anomalies in the micro-service system can be detected quickly and accurately, and the positioning granularity is refined to the instance level; and, by combining the machine learning model with a dynamic graph computation method, the method adapts well to dynamic changes of the micro-service system.
Description
Technical Field
The invention relates to the field of fault positioning for server systems, and in particular to a root cause positioning method and device for a micro-service system based on a graph neural network model.
Background
With the development of the internet, cloud computing and the computer industry, more and more systems are designed and built with a micro-service architecture, which is widely used in practical scenarios such as large enterprise applications, Internet of Things applications and cloud services. A micro-service architecture gives a system high availability, high scalability and elastic scaling, so that it can better meet the needs of today's large-scale software applications. In recent years, the cloud-native software architecture has emerged as an approach to building and running applications, one that requires an application to take the cloud runtime environment into account at design time. Micro-services are one of the core elements of the cloud-native software architecture: applications are designed and built as micro-services, the services communicate and interact through RESTful APIs, and the architecture exploits the high availability and high capacity of the cloud so that applications are ultimately hosted and run on the cloud as containers. Built on micro-services, a cloud-native architecture can scale horizontally to a very large size while remaining highly available and secure. However, ensuring the reliability and observability of a large-scale micro-service system, and locating the root-cause service when an anomaly occurs, still face many difficulties. Designing an effective method that automatically helps operations and maintenance personnel locate the root cause of a fault is therefore of great significance.
Currently, the challenges of root cause positioning for micro-services are: 1) The positioning granularity is too coarse: current methods can generally locate the root cause only at micro-service granularity, not at micro-service instance granularity. In real scenarios, however, it is often an anomaly in one micro-service instance, or in the container hosting it, that ultimately causes jitter, and when a micro-service has multiple instances it is hard to know which instance should be inspected or restarted. 2) Monitoring metrics span multiple levels: the metric data that a monitoring system can collect includes not only micro-service-level metrics but also metrics of the micro-service instances, the containers hosting them, and the hosts running those containers. Making full use of this multi-dimensional data allows the anomaly root cause to be located at a finer granularity, namely the service instance level. 3) The root cause anomaly type is unclear: root cause positioning is the first step when a micro-service system becomes abnormal, and identifying the type of the root cause anomaly provides key information for subsequent maintenance and repair, yet current root cause positioning methods rarely address this aspect.
In a micro-service system, a service is a collection of service instances, which are the smallest units that carry and run the actual business processes. After the service receives a request, the request is routed to a designated service instance through one of a variety of load balancing policies. Service instances change frequently and unpredictably; combined with limits on system resources, traffic volume and carrying capacity, the differing resource constraints on different instances are often the root cause of anomalies in one or more service instances.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a root cause positioning method of a micro-service system based on a graph neural network model, which comprises the following steps:
S1, collecting multi-dimensional time-series performance metrics of historical faults of the micro-service system;
S2, constructing a graph neural network model, and training it with the multi-dimensional time-series performance metrics of historical faults to obtain a trained graph neural network model;
S3, constructing an instance-level heterogeneous topology graph of the micro-service system from the collected real-time micro-service topology and call relationships;
S4, adjusting the anomaly weight of each micro-service node in combination with the service request links;
S5, inputting the root cause candidate set and the real-time metric feature data of the anomaly time window into the graph neural network model, and obtaining the final root cause and its anomaly type after feature weighting.
A micro-service system root cause positioning device based on a graph neural network model, comprising: a processor and a storage device; the processor loads and executes instructions and data in the storage device, and the instructions and data are used for realizing the root cause positioning method of the micro-service system based on the graph neural network model.
The beneficial effects of the invention are as follows. The method locates the anomaly root cause of a micro-service system based on a time-series node sampling graph neural network model and a random walk algorithm, so that: 1) anomalies in the micro-service system can be detected quickly and accurately, and the positioning granularity is refined to the instance level; 2) by combining the machine learning model with a dynamic graph computation method, the method adapts well to dynamic changes of the micro-service system.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention;
FIG. 2 is a schematic diagram of the graph neural network model design of an embodiment of the present invention;
FIG. 3 is a schematic diagram of the heterogeneous topology graph of an embodiment of the invention;
FIG. 4 is a schematic structural diagram of the device of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be further described with reference to the accompanying drawings.
Referring to FIG. 1, which is a schematic flow chart of the method, the invention provides a root cause positioning method of a micro-service system based on a graph neural network model, which specifically comprises the following steps:
S1, collecting multi-dimensional time-series performance metrics of historical faults of the micro-service system;
It should be noted that the multi-dimensional time-series performance metrics in step S1 include: micro-service-level metrics, micro-service-instance-level metrics, and host-level metrics.
As an embodiment, the micro-service-level metrics cover service-level faults, namely service mesh network delay faults and network packet loss faults;
the service-instance-level metrics cover: instance-level CPU high-load faults, memory high-load faults, instance network delay faults and abnormal instance termination faults;
the host-level metrics cover: host-level CPU high-load faults, memory high-load faults, file read/write high-load faults, and the like.
S2, constructing a graph neural network model, and training it with the multi-dimensional time-series performance metrics of historical faults to obtain a trained graph neural network model;
Referring to FIG. 2, FIG. 2 is a schematic diagram of the graph neural network model design according to an embodiment of the invention.
Step S2 is specifically as follows:
S21, sampling the multi-dimensional time-series performance metrics at a fixed time interval to obtain sampling points; inputting the sampling points into the encoder of the graph neural network model to obtain the metric data features of the sampling points;
The multi-dimensional metric data of each sampling point is normalized to a value in (0, 1) before being input to the encoder. The encoder borrows the idea of a word embedding model: it receives the metric data of a certain number of nodes, where the number of nodes is analogous to the number of words and the multi-dimensional micro-service metric data of each time-series node is analogous to the feature vector of a word. Because the dimension of the micro-service metric data depends on the micro-service topology at the current moment, a dimension conversion is needed: the data is converted to a unified feature dimension that represents the features of each time-series node for subsequent training of the graph neural network model.
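The normalization and dimension conversion described for step S21 can be sketched as follows. This is only an illustration under stated assumptions: min-max scaling (which yields the closed range [0, 1]) stands in for the unspecified normalization, a fixed random linear projection stands in for the learned encoder, and the names `min_max_normalize`, `make_projection` and `encode` are hypothetical.

```python
import random


def min_max_normalize(samples):
    """Scale every metric column of `samples` (a list of rows) into [0, 1]."""
    columns = list(zip(*samples))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in samples
    ]


def make_projection(in_dim, out_dim, seed=0):
    """Assumed stand-in for the encoder weights: a fixed random matrix."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
            for _ in range(out_dim)]


def encode(row, weights):
    """Project one normalized sampling point to the unified feature dimension."""
    return [sum(w * x for w, x in zip(w_row, row)) for w_row in weights]


# Three sampling points with three raw metrics each (made-up values).
samples = [[10.0, 200.0, 0.5], [20.0, 400.0, 0.7], [15.0, 300.0, 0.9]]
normalized = min_max_normalize(samples)
weights = make_projection(in_dim=3, out_dim=4)
features = [encode(row, weights) for row in normalized]
```

In a trained model the projection weights would be learned together with the rest of the network; the fixed matrix here only shows how inputs of varying width can be mapped to one feature size.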
S22, taking the metric data features of the sampling points in the same anomaly interval as nodes of the graph network; connecting the nodes in the same anomaly interval in time order to form the edges among the nodes of that anomaly interval;
It should be noted that, for nodes in different anomaly time intervals of the same anomaly type, the nodes are connected according to the time elapsed since anomaly injection, forming the graph-network edges between intervals with similar anomaly features;
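A minimal sketch of the edge construction in step S22 is given below. The text does not say exactly how nodes of different intervals are paired, so the sketch assumes that nodes with the same offset since injection are linked between consecutive intervals of the same anomaly type; the function name `build_edges` and the input layout are hypothetical.

```python
from collections import defaultdict


def build_edges(intervals):
    """Build graph edges from anomaly intervals.

    `intervals` is a list of dicts with keys 'type' (anomaly type) and
    'nodes' (node ids of the sampling points, in time order).
    """
    edges = set()
    by_type = defaultdict(list)
    for interval in intervals:
        nodes = interval["nodes"]
        # Edges inside one anomaly interval: consecutive sampling points.
        edges.update(zip(nodes, nodes[1:]))
        by_type[interval["type"]].append(nodes)
    # Assumed pairing rule: link nodes with the same offset since injection
    # between consecutive intervals of the same anomaly type.
    for groups in by_type.values():
        for earlier, later in zip(groups, groups[1:]):
            edges.update(zip(earlier, later))
    return edges


intervals = [
    {"type": "cpu", "nodes": [0, 1, 2]},
    {"type": "mem", "nodes": [3, 4]},
    {"type": "cpu", "nodes": [5, 6, 7]},
]
edges = build_edges(intervals)
```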
S23, inputting the feature vector of each node and the feature vectors of its adjacent nodes into the aggregator of the graph neural network model according to a fixed sampling number, and aggregating them with convolution layers;
The feature vector of each node and the feature vectors of its adjacent nodes are input into the aggregator for convolutional aggregation according to a fixed sampling number n, which means that each node aggregates features from n adjacent nodes; if a node has fewer than n adjacent nodes, the number of adjacent nodes is used as the sampling number. The feature vectors of the node and of its n sampled neighbors are then input into the encoder again for a second round of encoding and aggregation, so that the metric features of the adjacent nodes and of the current node are aggregated recursively. Each recursion is called a convolution layer, and a suitable number of convolution layers can be chosen according to the micro-service cluster scale and the metric data dimension to balance model training time against the degree of node feature aggregation;
In the k-th graph convolution, the micro-service metric data, micro-service instance metric data and host metric data at the same time point V are merged by CONCAT, and the features of the associated metric data are extracted with a method similar to a word embedding representation.
S24, selecting a suitable number of convolution layers, labeling the different anomaly time windows with the corresponding fault types, training the graph neural network model, and outputting the trained graph neural network model when the classification loss function converges to the expected value.
Among all time-series nodes in the same anomaly interval, a fixed number n of adjacent nodes N(V) is randomly selected and their features are aggregated with the MEAN method. The whole process of the k-th graph convolution is expressed by a formula in which CONCAT denotes merging and concatenating the feature dimensions of a node and its adjacent nodes, and the neighbor term denotes the feature set of the adjacent nodes at the k-th graph convolution.
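The aggregation expression is not spelled out in this text, so the following is only a sketch of one plausible reading: a mean aggregator over up to n sampled neighbors, followed by CONCAT with the node's own features and a linear layer. The weight matrix, the ReLU activation and the name `aggregate_layer` are assumptions, not the patent's stated formula.

```python
import random


def mean_vectors(vectors):
    """Element-wise mean of equally sized feature vectors (the MEAN step)."""
    count = len(vectors)
    return [sum(col) / count for col in zip(*vectors)]


def aggregate_layer(features, neighbors, n_samples, weights, rng):
    """One graph-convolution layer over all nodes.

    features:  {node: feature vector}
    neighbors: {node: list of adjacent nodes}
    n_samples: fixed sampling number n; fewer neighbors means all are used
    weights:   assumed layer matrix with len(concat) columns
    """
    new_features = {}
    for node, own in features.items():
        adjacent = neighbors.get(node, [])
        k = min(n_samples, len(adjacent))
        sampled = rng.sample(adjacent, k) if k else []
        if sampled:
            pooled = mean_vectors([features[u] for u in sampled])
        else:
            pooled = [0.0] * len(own)
        concat = own + pooled  # CONCAT of node and neighbor features
        # Linear layer followed by ReLU (both assumed).
        new_features[node] = [
            max(0.0, sum(w * x for w, x in zip(row, concat)))
            for row in weights
        ]
    return new_features


features = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
neighbors = {0: [1], 1: [0, 2], 2: [1]}
weights = [[0.5, 0.0, 0.5, 0.0], [0.0, 0.5, 0.0, 0.5]]
layer1 = aggregate_layer(features, neighbors, 2, weights, random.Random(0))
```

Calling `aggregate_layer` again on `layer1` gives the second convolution layer, which is how the recursion over neighbor features deepens with each layer.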
S3, constructing an instance-level heterogeneous topology graph of the micro-service system from the collected real-time micro-service topology and call relationships;
Step S3 is specifically as follows:
S31, when a fault occurs, constructing a real-time topology graph from the topology of the micro-service system and the call link data;
S32, using the micro-service-level metrics collected in step S1, assigning a weight to the m-th service node s_m; when there is a direct call relationship between service s_m and service s_n, assigning a weight to the edge [s_m - s_n] between the two service nodes based on the service call latency metric;
S33, using the micro-service-instance-level metrics collected in step S1, assigning a weight to the j-th instance node si_mj of the m-th service, the weight being formed from the container CPU load, container memory load, container network load, container throughput and container request-response success rate of instance node si_mj; then computing the correlation of the different metric series with the Pearson correlation coefficient; finally assigning the maximum correlation to the edge of the j-th instance of the m-th service;
S34, using the host-level metrics collected in step S1, assigning a weight to the k-th host, the weight being formed from the host CPU load, host memory load and host network load; then computing the correlation of the different metric series with the Pearson correlation coefficient; finally assigning the maximum correlation to the edges between the host and all instance nodes on it.
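The Pearson correlation coefficient is calculated as follows:
$$ r_{xy} = \frac{\sum_{i=1}^{N}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{N}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{N}(y_i-\bar{y})^2}} $$
where x and y are the two data sequences whose correlation is to be calculated, $\bar{x}$ and $\bar{y}$ are their means, and N is the sequence length.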
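As a sketch of how the Pearson coefficient could drive the edge weights in S33 and S34, the code below computes the standard coefficient and keeps the largest absolute correlation between a reference series and each resource metric series. Which series serves as the reference is not stated in the text; using a service latency series, taking absolute values, and the name `max_correlation_weight` are assumptions.

```python
import math


def pearson(x, y):
    """Standard Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if std_x == 0 or std_y == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / (std_x * std_y)


def max_correlation_weight(reference, metric_series):
    """Edge weight: largest |correlation| between the reference series
    (assumed here to be a latency series) and each metric series."""
    return max(abs(pearson(reference, s)) for s in metric_series.values())


latency = [1.0, 2.0, 4.0, 8.0]            # made-up reference series
container_metrics = {                      # made-up container metrics
    "cpu": [10.0, 20.0, 40.0, 80.0],
    "memory": [5.0, 5.0, 5.0, 5.0],
}
edge_weight = max_correlation_weight(latency, container_metrics)
```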
S4, adjusting the anomaly weight of each micro-service node in combination with the service request links;
Step S4 is specifically as follows:
S41, assigning each service node a personalization value equal to the average weight of all its connected edges, which comprise the direct call edges between service nodes and the membership edges between the service and all of its instances;
S42, assigning each instance node si_mj a personalization value equal to the weight of the edge to the service it belongs to;
S43, assigning each host node n_k a personalization value equal to the average weight of the edges between it and all instances on that host;
S44, running a personalized random walk algorithm on the heterogeneous topology graph and sorting all nodes by anomaly degree in descending order to generate a preliminary root cause candidate set.
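The personalization rules of S41 to S43 can be sketched as below. The node-kind labels, the edge dictionary layout and the function name `personalization_values` are hypothetical, and the weights are made-up numbers.

```python
def personalization_values(kinds, edge_weights):
    """Compute a personalization value per node.

    kinds:        {node: 'service' | 'instance' | 'host'}
    edge_weights: {(node_a, node_b): weight}, treated as undirected
    """
    values = {}
    for node, kind in kinds.items():
        picked = []
        for (a, b), weight in edge_weights.items():
            if node not in (a, b):
                continue
            other = b if a == node else a
            if kind == "service":
                # S41: call edges to services and membership edges to instances.
                picked.append(weight)
            elif kind == "instance" and kinds[other] == "service":
                # S42: the edge to the owning service.
                picked.append(weight)
            elif kind == "host" and kinds[other] == "instance":
                # S43: edges to every instance placed on this host.
                picked.append(weight)
        values[node] = sum(picked) / len(picked) if picked else 0.0
    return values


kinds = {"s1": "service", "s2": "service", "si11": "instance",
         "si12": "instance", "si21": "instance", "n1": "host"}
edge_weights = {
    ("s1", "s2"): 0.4, ("s1", "si11"): 0.8, ("s1", "si12"): 0.6,
    ("s2", "si21"): 0.2, ("n1", "si11"): 0.9, ("n1", "si12"): 0.5,
    ("n1", "si21"): 0.1,
}
values = personalization_values(kinds, edge_weights)
```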
The personalized random walk is calculated by a formula in which v represents the final score of a node, by which the instance root cause positioning results are also ranked; P is the personalization array; c is the probability of continuing the random walk; and u is the score of the next node. After multiple rounds of walk iterations, the score of each node converges, producing the preliminary root cause candidate set.
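Because the update rule itself is not reproduced in this text, the sketch below assumes the standard personalized PageRank iteration: with probability c the walk follows a weighted edge, and with probability 1 - c it restarts according to the normalized personalization array P. The function name, the default c = 0.85 and the example graph are assumptions for illustration.

```python
def personalized_random_walk(adjacency, personalization, c=0.85,
                             iterations=100, tol=1e-9):
    """Rank nodes by a personalized random walk (power iteration).

    adjacency:       {node: {neighbor: edge weight}}, every node is a key
    personalization: {node: non-negative personalization value}
    c:               probability of continuing the walk along an edge
    """
    nodes = list(adjacency)
    total = sum(personalization[n] for n in nodes)
    restart = {n: personalization[n] / total for n in nodes}
    scores = dict(restart)
    for _ in range(iterations):
        nxt = {n: (1 - c) * restart[n] for n in nodes}
        for node in nodes:
            out = adjacency[node]
            out_total = sum(out.values())
            if out_total == 0:
                # Dangling node: hand its mass back via the restart vector.
                for n in nodes:
                    nxt[n] += c * scores[node] * restart[n]
                continue
            for neighbor, weight in out.items():
                nxt[neighbor] += c * scores[node] * weight / out_total
        delta = sum(abs(nxt[n] - scores[n]) for n in nodes)
        scores = nxt
        if delta < tol:
            break
    # Descending order of anomaly score gives the candidate ranking.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


adjacency = {
    "svc-a": {"svc-b": 1.0, "svc-c": 1.0},
    "svc-b": {"svc-a": 1.0, "svc-c": 1.0},
    "svc-c": {"svc-a": 1.0, "svc-b": 1.0},
}
ranking = personalized_random_walk(
    adjacency, {"svc-a": 0.8, "svc-b": 0.1, "svc-c": 0.1})
```

On this symmetric example graph the node with the largest personalization value ends up first, which matches the intent of biasing the walk toward nodes whose metrics look most anomalous.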
S5, inputting the root cause candidate set and the real-time metric feature data of the anomaly time window into the graph neural network model, and obtaining the final root cause and its anomaly type after feature weighting.
Step S5 is specifically as follows:
S51, when the micro-service system fails while running in real time, collecting the multi-dimensional metric data of the whole cluster within the anomaly time window;
S52, inputting the multi-dimensional metric data into the graph neural network model trained in step S2 to obtain the classification weights of the different root cause types in the real-time anomaly interval;
S53, multiplying the root cause candidate set obtained in step S4 by the classification weights from step S52 to obtain the final root cause ranking and anomaly type; the higher a candidate ranks, the more likely it is the root cause.
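The product operation of S53 might look like the sketch below. The text does not define how candidates and anomaly types are paired, so the sketch assumes each candidate is combined only with the fault types listed for its level in step S1; the table `FAULTS_BY_LEVEL`, its fault labels and the function name are hypothetical.

```python
# Assumed mapping from node level to the fault types listed for that level.
FAULTS_BY_LEVEL = {
    "service": ["network_delay", "packet_loss"],
    "instance": ["cpu_high_load", "memory_high_load",
                 "network_delay", "abnormal_termination"],
    "host": ["cpu_high_load", "memory_high_load", "file_io_high_load"],
}


def rank_root_causes(candidates, levels, class_weights):
    """Multiply walk scores by classification weights and sort descending.

    candidates:    list of (node, walk score) from step S4
    levels:        {node: 'service' | 'instance' | 'host'}
    class_weights: {fault type: classification weight} from step S52
    """
    scored = []
    for node, score in candidates:
        for fault in FAULTS_BY_LEVEL[levels[node]]:
            scored.append((node, fault, score * class_weights.get(fault, 0.0)))
    return sorted(scored, key=lambda item: item[2], reverse=True)


candidates = [("si11", 0.5), ("n1", 0.3), ("s1", 0.2)]   # made-up scores
levels = {"si11": "instance", "n1": "host", "s1": "service"}
class_weights = {"cpu_high_load": 0.7, "memory_high_load": 0.2,
                 "network_delay": 0.1}
final_ranking = rank_root_causes(candidates, levels, class_weights)
```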
As an example, the invention is illustrated with Hipster Shop;
Experiment preparation: the experimental environment consists of three Ubuntu physical machines on which Kubernetes, Istio and Prometheus are installed. The Hipster Shop micro-service system is used as the example: Hipster Shop is a micro-service demonstration application comprising 12 micro-services. It is a web-based e-commerce application in which a user can browse goods, add them to a shopping cart and make purchases, and it comprises 8 business micro-services and 4 simulation micro-services that implement the shopping process. The hardware and software information of the environment is shown in Table 1.
The injected fault types and data set sizes are shown in Table 2.
To simulate real user scenarios, this embodiment uses Locust as a concurrent load generator to produce different workloads that simulate concurrent user behavior in different business scenarios. To simulate the performance problems of a real environment, the following common anomalies are injected with the chaos engineering tool Chaos Mesh: 1) delay; 2) container instance CPU load; 3) container instance memory load; 4) container instance network packet loss; 5) container process stop. The historical multi-dimensional metric data of these anomalies is collected;
Table 1 Hardware and software information of the environment of the embodiment
Table 2 Injected fault types and data set sizes
The metric data reported in the embodiment are shown in Table 3.
The training parameters of the graph neural network model trained in the invention are shown in Table 4.
Table 3 Metric data reported in the embodiment
Table 4 Training parameters of the graph neural network model
Referring to FIG. 3, when a real-time fault occurs, a heterogeneous topology graph containing all micro-service nodes, instance nodes and host nodes is constructed, and weights are assigned to its nodes and edges using the cluster's multi-dimensional metric data within that time interval.
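A heterogeneous topology graph of this kind could be assembled as in the sketch below. The input layout (call pairs plus instance placement triples), the function name `build_topology`, and the instance and host names are hypothetical; only the three node kinds and the three edge kinds come from the description.

```python
def build_topology(calls, placements):
    """Build a heterogeneous graph of services, instances and hosts.

    calls:      list of (caller service, callee service)
    placements: list of (instance, owning service, host)
    Returns ({node: kind}, set of (node_a, node_b) edges).
    """
    kinds, edges = {}, set()
    for caller, callee in calls:
        kinds[caller] = "service"
        kinds[callee] = "service"
        edges.add((caller, callee))          # direct call edge
    for instance, service, host in placements:
        kinds[service] = "service"
        kinds[instance] = "instance"
        kinds[host] = "host"
        edges.add((service, instance))       # membership edge
        edges.add((instance, host))          # placement edge
    return kinds, edges


calls = [("frontend", "cartservice"), ("frontend", "checkoutservice")]
placements = [
    ("cartservice-0", "cartservice", "node-1"),      # hypothetical names
    ("cartservice-1", "cartservice", "node-2"),
    ("frontend-0", "frontend", "node-1"),
]
kinds, edges = build_topology(calls, placements)
```

Node and edge weights would then be attached to `kinds` and `edges` from the metric data of the anomaly window.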
The personalization value of each node is calculated, and the personalized random walk algorithm is executed to obtain the final root cause ranking list.
Finally, for the 20 root cause positioning results of the embodiment, Rank1, Rank3 and Rank5 indicate whether the true root cause appears among the top 1, 3 and 5 ranked candidates, respectively; 1 means the root cause was located successfully and 0 means it was not. The results are shown in Table 5.
Table 5 Final results of the invention
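The Rank1, Rank3 and Rank5 indicators can be computed as in this short sketch. The function names and the example rankings are made up for illustration and are not the experiment's data.

```python
def hit_at_k(ranking, true_root_cause, k):
    """Return 1 if the true root cause is among the top k candidates, else 0."""
    return 1 if true_root_cause in ranking[:k] else 0


def rank_accuracy(cases, ks=(1, 3, 5)):
    """Fraction of cases located within the top k, for each k.

    cases: list of (ranked candidate list, true root cause)
    """
    return {
        k: sum(hit_at_k(ranking, truth, k) for ranking, truth in cases) / len(cases)
        for k in ks
    }


cases = [  # made-up rankings, not the patent's results
    (["a", "b", "c"], "a"),
    (["b", "a", "c"], "a"),
    (["b", "c", "d", "e", "a"], "a"),
    (["b", "c", "d", "e", "f"], "a"),
]
accuracy = rank_accuracy(cases)
```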
Referring to FIG. 4, FIG. 4 is a schematic structural diagram of the device of the invention.
The device 401 specifically comprises a processor 402 and a storage device 403.
Micro-service system root cause positioning device 401 based on a graph neural network model: the device 401 implements the root cause positioning method of the micro-service system based on the graph neural network model.
Processor 402: the processor 402 loads and executes the instructions and data in the storage device 403 to implement the root cause positioning method of the micro-service system based on the graph neural network model.
Storage device 403: the storage device 403 stores the instructions and data used to implement the root cause positioning method of the micro-service system based on the graph neural network model.
In summary, the beneficial effects of the invention are as follows. The method locates the anomaly root cause of a micro-service system based on a time-series node sampling graph neural network model and a random walk algorithm, so that: 1) anomalies in the micro-service system can be detected quickly and accurately, and the positioning granularity is refined to the instance level; 2) by combining the machine learning model with a dynamic graph computation method, the method adapts well to dynamic changes of the micro-service system.
The foregoing describes only preferred embodiments of the invention and is not intended to limit it; any modifications, equivalent replacements and improvements made within the spirit and principles of the invention shall fall within its scope of protection.
Claims (8)
1. A root cause positioning method of a micro-service system based on a graph neural network model is characterized by comprising the following steps of: the method comprises the following steps:
s1, collecting historical fault multidimensional time sequence performance indexes of a micro-service system;
s2, constructing a graph neural network model; training the graph neural network model by using the historical fault multidimensional time sequence performance index to obtain a trained graph neural network model;
s3, constructing a heterogeneous topological graph of the micro service system at an instance level through the collected real-time micro service topological structure and the collected calling relationship;
s4, adjusting the abnormal weight of each micro service node by combining the service request link;
and S5, inputting the root cause candidate set and the real-time index characteristic data of the abnormal time window into a graph neural network model, and obtaining final root cause and the root cause abnormal type after characteristic weighting.
2. The method for positioning root cause of micro service system based on graphic neural network model as set forth in claim 1, wherein: the multi-dimensional time sequence performance index in the step S1 comprises the following steps: an index of a micro service level, an index of a micro service instance level, and an index of a host level.
3. The method for positioning root cause of micro service system based on graphic neural network model as set forth in claim 1, wherein: the step S2 specifically comprises the following steps:
s21, carrying out data sampling on the multi-dimensional time sequence performance index according to a fixed time interval to obtain sampling points; inputting the sampling points to an encoder of the graph neural network model to obtain index data characteristics of the sampling points;
s22, taking index data characteristics of the same abnormal interval sampling points as nodes of the graph network; connecting nodes in the same abnormal section according to time sequence to form the edge of the nodes in the same abnormal section of the graph network;
s23, inputting the feature vector of each node and the feature vector of the adjacent node into an aggregator of the graph neural network model according to a fixed sampling number, and aggregating by adopting a convolution layer;
s24, selecting proper number of convolution layers, marking corresponding fault type labels for different abnormal time windows, training the graph neural network model, and outputting the trained graph neural network model when the classification loss function converges to an expected value.
4. The root cause positioning method for a micro-service system based on a graph neural network model according to claim 2, wherein step S3 specifically comprises:
S31, when a fault occurs, constructing a real-time topological graph from the topological structure of the micro-service system and the call link data;
S32, in combination with the micro-service-level indexes collected in step S1, assigning a weight to the m-th service node $s_m$; when service $s_m$ and service $s_n$ have a direct call relationship, assigning a weight to the edge $[s_m - s_n]$ between service nodes $s_m$ and $s_n$ in combination with the service call delay index;
S33, in combination with the micro-service-instance-level indexes collected in step S1, assigning a weight to the j-th instance node $si_{mj}$ of the m-th service, the terms of the weight representing, for instance node $si_{mj}$, the container CPU load, the container memory load, the container network load, the container throughput, and the container request response success rate; then calculating the correlation between the different index sequences according to the Pearson correlation coefficient; and finally assigning the maximum correlation to the edge of the j-th instance of the m-th service;
S34, in combination with the host-level indexes collected in step S1, assigning a weight to the k-th host, the terms of the weight representing the host CPU load, the host memory load, and the host network load; then calculating the correlation between the different index sequences according to the Pearson correlation coefficient; and finally assigning the maximum correlation to the edges between the host and all instance nodes on it.
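The "maximum correlation" edge weighting of steps S33 and S34 can be sketched as follows. The claim does not say which pairs of index sequences are correlated, so this sketch assumes that each resource series of an instance (or host) is compared against a single reference series such as the service latency, and that the absolute value of the coefficient is used. Both choices, and all names and numbers, are illustrative assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Sample Pearson correlation coefficient of two equal-length sequences.
    Returns 0.0 when either sequence is constant (correlation undefined)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def max_correlation_edge_weight(reference, metric_series):
    """Edge weight = largest absolute correlation between the reference series
    and any of the node's index series (assumed interpretation)."""
    return max(abs(pearson(reference, series))
               for series in metric_series.values())


# Hypothetical data: service latency versus two container indexes.
latency = [1.0, 2.0, 3.0, 4.0]
container_metrics = {
    "cpu": [2.0, 4.0, 6.0, 8.0],     # moves with latency
    "memory": [1.0, 1.0, 1.0, 1.0],  # flat, carries no signal
}
edge_weight = max_correlation_edge_weight(latency, container_metrics)
```

Taking the maximum means a single strongly correlated resource is enough to give the edge a high weight, which fits the idea that one saturated resource can drive a fault.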
5. The root cause positioning method for a micro-service system based on a graph neural network model according to claim 4, wherein the Pearson correlation coefficient is calculated as follows:
where x and y are the two data sequences whose correlation is to be calculated.
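The formula referenced in claim 5 is not present in this text. For reference, the standard definition of the Pearson correlation coefficient for two sequences x and y of length n is given below. The symbols n, x_i, y_i and the bar notation for the mean are introduced here, and whether the patent writes the coefficient in this form or in an equivalent covariance form is an assumption.

```latex
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}},
\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,\quad
\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i
```

The coefficient lies in [-1, 1] and is undefined when either sequence is constant, because the denominator is then zero.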
6. The root cause positioning method for a micro-service system based on a graph neural network model according to claim 1, wherein step S4 specifically comprises:
S41, assigning each service node a personalized value equal to the average of the weights of all its connected edges, the connected edges comprising: the direct call edges between service nodes, and the membership edges between the service and all of its instances;
S42, assigning each instance node $si_{mj}$ a personalized value derived from the weight of the edge to the service it belongs to;
S43, assigning each host node $n_k$ a personalized value equal to the average of the weights of the edges between it and all instances on the host;
S44, using a personalized random walk algorithm on the heterogeneous topological graph to sort all nodes in descending order of abnormality degree, generating a preliminary root cause candidate set.
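One common way to realize a personalized random walk such as the one in step S44 is personalized PageRank computed by power iteration. The sketch below assumes that reading. The damping factor, the iteration count, the treatment of edges as undirected, and the node names and weights are all hypothetical; the patent does not specify them in this text.

```python
def personalized_pagerank(edges, personalization, damping=0.85, iterations=100):
    """Rank nodes by a random walk with restart.

    edges: {(u, v): weight} for undirected weighted edges.
    personalization: {node: personalized value}; normalized to form the
    restart distribution. Returns (node, score) pairs in descending order.
    """
    nodes = sorted(personalization)
    adj = {n: {} for n in nodes}
    for (u, v), w in edges.items():
        adj[u][v] = w
        adj[v][u] = w

    total = sum(personalization.values())
    restart = {n: personalization[n] / total for n in nodes}
    score = dict(restart)

    for _ in range(iterations):
        nxt = {n: (1 - damping) * restart[n] for n in nodes}
        for u in nodes:
            out_weight = sum(adj[u].values())
            if out_weight == 0:
                # Isolated node: hand its mass back to the restart distribution.
                for n in nodes:
                    nxt[n] += damping * score[u] * restart[n]
                continue
            for v, w in adj[u].items():
                nxt[v] += damping * score[u] * w / out_weight
        score = nxt

    return sorted(score.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical heterogeneous graph: two services and one instance of each.
edges = {
    ("svc_a", "inst_a1"): 0.9,
    ("svc_a", "svc_b"): 0.2,
    ("svc_b", "inst_b1"): 0.1,
}
# Service values follow S41 (mean of incident edge weights); the instance
# values are placeholders because the S42 formula is not available here.
personalization = {"svc_a": 0.55, "svc_b": 0.15, "inst_a1": 0.9, "inst_b1": 0.1}
candidates = personalized_pagerank(edges, personalization)
```

The returned list plays the role of the preliminary root cause candidate set: nodes that are both strongly personalized and connected by heavily weighted edges accumulate the most visit probability.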
7. The root cause positioning method for a micro-service system based on a graph neural network model according to claim 1, wherein step S5 specifically comprises:
S51, when the micro-service system fails during real-time operation, collecting the multi-dimensional index data of the whole cluster within the abnormal time window;
S52, inputting the multi-dimensional index data into the graph neural network model trained in step S2 to obtain the classification weights of the different root cause types in the real-time abnormal interval;
and S53, multiplying the root cause candidate set obtained in step S4 by the classification weights of step S52 to obtain the final root cause ranking and abnormal type, wherein a higher rank indicates a more likely root cause.
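The product operation of step S53 can be sketched under a simple assumption: every fault type may apply to every candidate node, so each (node, fault type) pair is scored by multiplying the node's random-walk score by the classifier's weight for that type. The claim does not spell out how nodes and fault types are paired, so this pairing rule, the function name, and the sample numbers are illustrative only.

```python
def rank_root_causes(candidates, class_weights, top_k=3):
    """Score each (node, fault type) pair as candidate score x class weight
    and return the top_k pairs in descending order of the product."""
    scored = [
        (node, fault, node_score * weight)
        for node, node_score in candidates.items()
        for fault, weight in class_weights.items()
    ]
    scored.sort(key=lambda item: item[2], reverse=True)
    return scored[:top_k]


# Hypothetical outputs of step S4 (candidate scores) and step S52 (class weights).
candidate_scores = {"inst_a1": 0.5, "svc_a": 0.3, "host_1": 0.2}
fault_type_weights = {"cpu": 0.7, "memory": 0.2, "network": 0.1}
final_ranking = rank_root_causes(candidate_scores, fault_type_weights)
```

If some fault types only make sense for certain node kinds (for example host-level faults on host nodes), the list comprehension is the natural place to filter out inapplicable pairs.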
8. A root cause positioning device for a micro-service system based on a graph neural network model, comprising: a processor and a storage device, wherein the processor loads and executes the instructions and data in the storage device to implement the root cause positioning method for a micro-service system based on a graph neural network model according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311854026.8A CN117560275B (en) | 2023-12-29 | 2023-12-29 | Root cause positioning method and device for micro-service system based on graphic neural network model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117560275A (en) | 2024-02-13 |
CN117560275B (en) | 2024-03-12 |
Family
ID=89813030
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311854026.8A Active CN117560275B (en) | 2023-12-29 | 2023-12-29 | Root cause positioning method and device for micro-service system based on graphic neural network model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117560275B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118503001A (en) * | 2024-07-17 | 2024-08-16 | 安徽思高智能科技有限公司 | RPA task flow-oriented fault diagnosis method and equipment |
CN118708395A (en) * | 2024-08-27 | 2024-09-27 | 深圳开鸿数字产业发展有限公司 | Super equipment fault detection method and system based on multidimensional data analysis |
Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112929223A (en) * | 2021-03-08 | 2021-06-08 | 北京邮电大学 | Method and system for training neural network model based on federal learning mode |
CN113014421A (en) * | 2021-02-08 | 2021-06-22 | 武汉大学 | Micro-service root cause positioning method for cloud native system |
CN113285831A (en) * | 2021-05-24 | 2021-08-20 | 广州大学 | Network behavior knowledge intelligent learning method and device, computer equipment and storage medium |
CN113467421A (en) * | 2021-07-01 | 2021-10-01 | 中国科学院计算技术研究所 | Method for acquiring micro-service health status index and micro-service abnormity diagnosis method |
WO2021217855A1 (en) * | 2020-04-30 | 2021-11-04 | 平安科技(深圳)有限公司 | Abnormal root cause positioning method and apparatus, and electronic device and storage medium |
CN113900845A (en) * | 2021-09-28 | 2022-01-07 | 大唐互联科技(武汉)有限公司 | Method and storage medium for micro-service fault diagnosis based on neural network |
CN114385397A (en) * | 2021-12-31 | 2022-04-22 | 广西大学 | Micro-service fault root cause positioning method based on fault propagation diagram |
CN114615019A (en) * | 2022-02-15 | 2022-06-10 | 北京云集智造科技有限公司 | Anomaly detection method and system based on micro-service topological relation generation |
CN114721860A (en) * | 2022-05-23 | 2022-07-08 | 北京航空航天大学 | Micro-service system fault positioning method based on graph neural network |
CN115640159A (en) * | 2022-11-03 | 2023-01-24 | 香港中文大学深圳研究院 | Micro-service fault diagnosis method and system |
US20230069074A1 (en) * | 2021-08-20 | 2023-03-02 | Nec Laboratories America, Inc. | Interdependent causal networks for root cause localization |
CN115859143A (en) * | 2022-11-14 | 2023-03-28 | 之江实验室 | Graph neural network anomaly detection method and device based on neighborhood node structure coding |
CN115981902A (en) * | 2022-12-16 | 2023-04-18 | 武汉大学 | Fine-grained distributed micro-service system abnormal root cause positioning method and device |
CN116633758A (en) * | 2023-03-21 | 2023-08-22 | 湖北工业大学 | Network fault prediction method and system based on full-heterogeneous element comparison learning model |
Non-Patent Citations (1)
Title |
---|
蒋宗礼;李苗苗;张津丽;: "基于融合元路径图卷积的异质网络表示学习", 计算机科学, no. 07, 31 December 2020 (2020-12-31) * |
Also Published As
Publication number | Publication date |
---|---|
CN117560275B (en) | 2024-03-12 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||