CN117560275B

CN117560275B - Root cause positioning method and device for micro-service system based on graph neural network model

Info

Publication number: CN117560275B
Application number: CN202311854026.8A
Authority: CN
Inventors: 袁水平; 余螯; 朱雨涵; 张泽锟; 王健
Original assignee: Anhui Sigao Intelligent Technology Co ltd
Current assignee: Anhui Sigao Intelligent Technology Co ltd
Priority date: 2023-12-29
Filing date: 2023-12-29
Publication date: 2024-03-12
Anticipated expiration: 2043-12-29
Also published as: CN117560275A

Abstract

The invention relates to a root cause positioning method and device for a micro-service system based on a graph neural network model. The method comprises the following steps: constructing a graph neural network model; training the graph neural network model with multi-dimensional time sequence performance indexes of historical faults to obtain a trained graph neural network model; constructing an instance-level heterogeneous topological graph of the micro-service system from the collected real-time micro-service topological structure and call relationships; adjusting the abnormal weight of each micro-service node in combination with the service request links; and inputting the root cause candidate set and the real-time index feature data of the abnormal time window into the graph neural network model, and obtaining the final root cause and the root cause abnormality type after feature weighting. The device is used for implementing the method. The beneficial effects of the invention are as follows: abnormalities of the micro-service system can be detected quickly and accurately, with the positioning granularity refined to the instance level; and the dynamic changes of the micro-service system are well accommodated by effectively combining a machine learning model with a dynamic graph computation method.

Description

Root cause positioning method and device for micro-service system based on graph neural network model

Technical Field

The invention relates to the field of fault positioning of server systems, in particular to a root cause positioning method and device of a micro-service system based on a graph neural network model.

Background

With the development of the internet, cloud computing, and the computer industry, more and more systems are designed and built on a micro-service architecture, which is widely applied in practical scenarios such as large enterprise applications, Internet of Things applications, and cloud services. A micro-service architecture brings high availability, high scalability, and elastic scaling to a system, so that it can better meet the requirements of today's large-scale software applications. In recent years, the cloud native software architecture has emerged as an approach to building and running applications, and it requires an application to take the cloud runtime environment into account at design time. Micro-services are one of the core elements of the cloud native software architecture: the architecture requires an application to be designed and built as micro-services, with services communicating and interacting through RESTful APIs, so that the high availability and elasticity of the cloud can be fully exploited and the application can ultimately be hosted by the cloud in the form of containers. On the basis of micro-services, a cloud native application can scale horizontally to a very large scale while retaining high availability and security. However, guaranteeing the reliability and observability of a large-scale micro-service system, and locating the root cause service when an abnormality occurs, still face a number of difficulties. Designing an effective method that automatically helps operation and maintenance personnel locate the root cause of a fault is therefore of great significance.

Currently, the micro-service root cause positioning problem faces the following challenges: 1) The positioning granularity is too coarse: current methods can generally locate the root cause only at the micro-service level and not at the micro-service instance level. In real scenarios, however, it is often an abnormality of a particular micro-service instance, or of the container in which that instance runs, that ultimately causes jitter, and when a micro-service has multiple instances it is hard to know which instance should be inspected or restarted. 2) The monitoring indexes are numerous: the index data that a monitoring system can collect includes not only micro-service-level indexes but also indexes of the micro-service instances, of the containers in which the instances run, and of the hosts on which the containers run. This multi-dimensional data can be fully exploited to locate the abnormal root cause at a finer granularity, namely the service instance level. 3) The type of the abnormal root cause is unclear: root cause positioning is the first step when a micro-service system becomes abnormal, and identifying the type of the root cause abnormality provides key information for subsequent maintenance and repair, yet current root cause positioning methods rarely study or discuss this aspect.

In a micro-service system, a service is a collection of service instances, which are the smallest units that carry and run the actual business processes. After a service receives a request, the request is routed to a designated service instance through one of a variety of load balancing policies. Dynamic changes in service instances are frequent and difficult to predict. Combined with constraints on system resources, traffic volume, and carrying capacity, the differing resource constraints on different instances are often the root cause of abnormalities in one or more service instances.

Disclosure of Invention

In order to solve the problems in the prior art, the invention provides a root cause positioning method of a micro-service system based on a graph neural network model, which comprises the following steps:

S1, collecting historical fault multidimensional time sequence performance indexes of a micro-service system;

S2, constructing a graph neural network model; training the graph neural network model by using the historical fault multidimensional time sequence performance indexes to obtain a trained graph neural network model;

S3, constructing an instance-level heterogeneous topological graph of the micro-service system from the collected real-time micro-service topological structure and call relationships;

S4, adjusting the abnormal weight of each micro-service node in combination with the service request links;

S5, inputting the root cause candidate set and the real-time index feature data of the abnormal time window into the graph neural network model, and obtaining the final root cause and the root cause abnormality type after feature weighting.

A micro-service system root cause positioning device based on a graph neural network model, comprising: a processor and a storage device; the processor loads and executes instructions and data in the storage device, and the instructions and data are used for realizing the root cause positioning method of the micro-service system based on the graph neural network model.

The beneficial effects provided by the invention are as follows: with the method for positioning the abnormal root cause of a micro-service system based on a time sequence node sampling graph neural network model and a random walk algorithm, 1) abnormalities of the micro-service system can be detected quickly and accurately, and the positioning granularity is refined to the instance level; 2) the dynamic changes of the micro-service system are well accommodated by effectively combining the machine learning model with a dynamic graph computation method.

Drawings

FIG. 1 is a schematic flow chart of the method of the present invention;

FIG. 2 is a schematic diagram of a neural network model design of an embodiment of the present invention;

FIG. 3 is a schematic diagram of the heterogeneous topological graph of an embodiment of the invention;

FIG. 4 is a schematic structural diagram of the device of the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be further described with reference to the accompanying drawings.

Referring to FIG. 1, which is a schematic flow chart of the method of the present invention, the invention provides a root cause positioning method of a micro-service system based on a graph neural network model, which specifically comprises steps S1 to S5 set out above.

It should be noted that the multi-dimensional time sequence performance index in step S1 includes: micro-service-level indexes, micro-service-instance-level indexes, and host-level indexes.

As an embodiment, the micro-service-level indexes cover service-level faults in the service mesh, namely network delay faults and network packet loss faults;

the service-instance-level indexes cover: instance-level CPU high-load faults, memory high-load faults, instance network delay faults, and instance abnormal termination faults;

the host-level indexes cover: host-level CPU high-load faults, memory high-load faults, file read/write high-load faults, and the like.

Referring to FIG. 2, which is a schematic diagram of the neural network model design according to an embodiment of the invention.

The step S2 is specifically as follows:

S21, sampling the multi-dimensional time sequence performance indexes at a fixed time interval to obtain sampling points; inputting the sampling points to the encoder of the graph neural network model to obtain the index data features of the sampling points;

Before being input to the encoder, the multi-dimensional index data of each sampling point is normalized to a value in the range (0, 1). The encoder borrows the idea of a word embedding model: it receives index data for a certain number of nodes, where the number of nodes is analogous to the number of words and the multi-dimensional micro-service index data of each time sequence node is analogous to the feature vector of a word. Because the dimension of the micro-service index data depends on the micro-service topological structure at the current moment, the data must be converted to a unified feature dimension that represents the features of each time sequence node for the subsequent training of the graph neural network model.
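The normalization mentioned in S21 can be sketched as follows. The text does not say which normalization is used, so min-max scaling per index is an assumption here, and the function name `min_max_normalize` is hypothetical. Note that min-max scaling maps the extremes to exactly 0 and 1, so it only approximates the open range stated above.

```python
from typing import List


def min_max_normalize(samples: List[List[float]]) -> List[List[float]]:
    """Scale every index (column) of the sampling points into [0, 1].

    Assumption: min-max scaling; the source only says the data is normalized.
    """
    if not samples:
        return []
    dims = len(samples[0])
    lows = [min(row[d] for row in samples) for d in range(dims)]
    highs = [max(row[d] for row in samples) for d in range(dims)]
    normalized = []
    for row in samples:
        scaled = []
        for d in range(dims):
            span = highs[d] - lows[d]
            # A constant index carries no signal; map it to 0.0 to avoid dividing by zero.
            scaled.append((row[d] - lows[d]) / span if span > 0 else 0.0)
        normalized.append(scaled)
    return normalized


# Three sampling points with two indexes each (for example CPU load and memory in MB).
points = [[10.0, 200.0], [20.0, 200.0], [30.0, 200.0]]
normalized_points = min_max_normalize(points)
```

Scaling each index independently keeps indexes with large raw ranges, such as memory in bytes, from dominating the encoder input.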

S22, taking the index data features of the sampling points in the same abnormal interval as nodes of the graph network; connecting the nodes in the same abnormal interval in time order to form the edges between nodes of the same abnormal interval in the graph network;

It should be noted that nodes in different abnormal time intervals of the same abnormality type are connected according to the elapsed time since abnormality injection, forming the edges between intervals with similar abnormal characteristics in the graph network;
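The edge construction in S22 might look like the sketch below. The record layout `(interval_id, anomaly_type, offset)` and the function name `build_edges` are hypothetical. Linking points that share the same offset since injection is one possible reading of "according to the elapsed time since abnormality injection", not a rule stated in the text.

```python
from typing import Dict, List, Tuple

# One sampling point: (interval_id, anomaly_type, offset), where offset is the
# number of sampling steps since the anomaly was injected (hypothetical layout).
Sample = Tuple[str, str, int]


def build_edges(samples: List[Sample]) -> List[Tuple[int, int]]:
    """Return edges as pairs of indexes into `samples`."""
    edges: List[Tuple[int, int]] = []

    # Edges inside one abnormal interval: consecutive points in time order.
    by_interval: Dict[str, List[int]] = {}
    for idx, (interval_id, _, _) in enumerate(samples):
        by_interval.setdefault(interval_id, []).append(idx)
    for members in by_interval.values():
        members.sort(key=lambda i: samples[i][2])
        edges.extend(zip(members, members[1:]))

    # Edges across intervals of the same anomaly type: points with the same
    # offset since injection are chained together (assumed interpretation).
    by_type_offset: Dict[Tuple[str, int], List[int]] = {}
    for idx, (_, anomaly_type, offset) in enumerate(samples):
        by_type_offset.setdefault((anomaly_type, offset), []).append(idx)
    for members in by_type_offset.values():
        edges.extend(zip(members, members[1:]))
    return edges


samples = [
    ("a", "cpu", 0), ("a", "cpu", 1),  # first CPU fault interval
    ("b", "cpu", 0), ("b", "cpu", 1),  # second CPU fault interval
    ("c", "mem", 0),                   # a memory fault interval
]
edges = build_edges(samples)
```

The sketch assumes each offset occurs at most once per interval, so every cross-interval edge joins two different intervals.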

S23, inputting the feature vector of each node and the feature vectors of its adjacent nodes into the aggregator of the graph neural network model according to a fixed sampling number, and aggregating them with convolution layers;

The feature vector of each node and the feature vectors of its adjacent nodes are input into the aggregator for convolution aggregation according to a fixed sampling number n, where n means that features are aggregated from n adjacent nodes of each node; if a node has fewer than n adjacent nodes, the number of adjacent nodes is used as the sampling number. The feature vectors of the node and of its n sampled neighbors are then input into the encoder again for a second round of encoding and aggregation, so that the index features of the adjacent nodes and of the current node are aggregated recursively. Each recursion is called a convolution layer, and the number of convolution layers can be chosen according to the micro-service cluster scale and the index data dimension to balance model training time against the degree of node feature aggregation;

In the k-th graph convolution, the micro-service index data, micro-service instance index data, and host index data of the same time node V are merged by CONCAT, and features are extracted from the associated index data using a method similar to word embedding representation.

S24, selecting proper number of convolution layers, marking corresponding fault type labels for different abnormal time windows, training the graph neural network model, and outputting the trained graph neural network model when the classification loss function converges to an expected value.

A fixed number n of adjacent nodes N(V) is randomly selected from all time sequence nodes in the same abnormal interval, and feature aggregation is performed with the MEAN method. The whole process of the k-th graph convolution is expressed in terms of CONCAT, which denotes merging and splicing the feature dimensions of a node and of its adjacent nodes, and of the feature set of the adjacent nodes at the k-th graph convolution.
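One layer of the sampling, MEAN, and CONCAT aggregation described above can be sketched as follows. This is only an illustration under assumptions: the function names are hypothetical, and the learned weight matrix and activation that a trained layer would apply after CONCAT are left out, because their form is not given in the text.

```python
import random
from typing import Dict, List

Vector = List[float]


def sample_neighbors(neighbors: List[str], n: int, rng: random.Random) -> List[str]:
    """Pick n neighbors at random; use all of them when fewer than n exist."""
    if len(neighbors) <= n:
        return list(neighbors)
    return rng.sample(neighbors, n)


def mean_vectors(vectors: List[Vector]) -> Vector:
    """Element-wise MEAN over a non-empty list of equally sized vectors."""
    dims = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dims)]


def aggregate_once(
    features: Dict[str, Vector],
    adjacency: Dict[str, List[str]],
    n: int,
    rng: random.Random,
) -> Dict[str, Vector]:
    """One aggregation layer: CONCAT(own features, MEAN of sampled neighbors)."""
    out: Dict[str, Vector] = {}
    for node, own in features.items():
        picked = sample_neighbors(adjacency.get(node, []), n, rng)
        if picked:
            neighbor_mean = mean_vectors([features[p] for p in picked])
        else:
            neighbor_mean = [0.0] * len(own)  # isolated node: pad with zeros
        out[node] = own + neighbor_mean  # list concatenation plays the role of CONCAT
    return out


features = {"t0": [1.0, 2.0], "t1": [3.0, 4.0], "t2": [5.0, 6.0]}
adjacency = {"t0": ["t1"], "t1": ["t0", "t2"], "t2": ["t1"]}
layer1 = aggregate_once(features, adjacency, n=2, rng=random.Random(0))
```

Calling `aggregate_once` again on `layer1` corresponds to adding a second convolution layer, so each node then also reflects its two-hop neighbors.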

The step S3 is specifically as follows:

S31, when a fault occurs, constructing a real-time topological graph according to the topological structure of the micro-service system and the call link data;

S32, combining the micro-service-level indexes collected in step S1, assigning a weight to the m-th service node s_m; when service m and service n have a direct call relationship, assigning a weight to the edge [s_m - s_n] between service nodes s_m and s_n in combination with the service call delay index;

S33, combining the micro-service-instance-level indexes collected in step S1, assigning a weight to the j-th instance node si_mj of the m-th service, the weight being based on the container CPU load, container memory load, container network load, container throughput, and container request-response success rate of instance node si_mj; then calculating the correlation between the different index sequences according to the Pearson correlation coefficient; and finally assigning the maximum correlation to the edge of the j-th instance of the m-th service;

S34, combining the host-level indexes collected in step S1, assigning a weight to the k-th host, the weight being based on the CPU load, memory load, and network load of the host; then calculating the correlation between the different index sequences according to the Pearson correlation coefficient; and finally assigning the maximum correlation to the edges between the host and all instance nodes on that host.

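The calculation formula of the Pearson correlation coefficient is as follows:

$$ r_{xy} = \frac{\sum_{i=1}^{T} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{T} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{T} (y_i - \bar{y})^2}} $$

where x and y are the two data sequences whose correlation is to be calculated, \bar{x} and \bar{y} are their means, and T is the sequence length.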
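The Pearson correlation coefficient used in S33 and S34 can be computed with the standard definition, as in the sketch below. The helper `max_correlation` is hypothetical: the text does not say which pairs of sequences are correlated, so comparing one reference sequence (for example a service latency series) against each resource index is an assumption.

```python
import math
from typing import Dict, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Standard Pearson correlation coefficient of two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("sequences must have the same length, at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Assumption: a constant sequence is treated as uncorrelated.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def max_correlation(reference: Sequence[float], indexes: Dict[str, Sequence[float]]) -> float:
    """Largest correlation between the reference sequence and any index sequence."""
    return max(pearson(reference, series) for series in indexes.values())


latency = [1.0, 2.0, 3.0]
weight = max_correlation(latency, {"cpu": [2.0, 4.0, 6.0], "memory": [3.0, 2.0, 1.0]})
```

In this example the CPU series rises with the latency series while the memory series falls, so the CPU correlation is the one that becomes the edge weight.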

The step S4 is specifically as follows:

S41, assigning to service node s_m a personalized value equal to the average of the weights of all edges connected to it, these edges comprising the direct call edges between service nodes and the membership edges between the service and all of its instances;

S42, assigning to instance node si_mj a personalized value equal to the weight of the edge to the service to which it belongs;

S43, assigning to host node n_k a personalized value equal to the average of the weights of the edges between it and all instances on the host;

S44, applying a personalized random walk algorithm on the heterogeneous topological graph and sorting all nodes in descending order of abnormality degree to generate a preliminary root cause candidate set.
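The personalized values of S41 to S43 can be sketched as below. The data layout (a dictionary of undirected weighted edges plus a node-kind map) and the function name `personalization` are hypothetical choices for illustration.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def personalization(edge_weights: Dict[Edge, float], node_kind: Dict[str, str]) -> Dict[str, float]:
    """Personalized value per node; node_kind maps a name to 'service', 'instance' or 'host'."""
    incident: Dict[str, List[Tuple[str, float]]] = {node: [] for node in node_kind}
    for (a, b), weight in edge_weights.items():
        incident[a].append((b, weight))
        incident[b].append((a, weight))

    values: Dict[str, float] = {}
    for node, kind in node_kind.items():
        if kind == "service":
            # S41: average over call edges and service-to-instance edges.
            wanted = ("service", "instance")
        elif kind == "instance":
            # S42: weight of the edge to the owning service.
            wanted = ("service",)
        else:
            # S43: average over edges between the host and its instances.
            wanted = ("instance",)
        picked = [w for other, w in incident[node] if node_kind[other] in wanted]
        values[node] = sum(picked) / len(picked) if picked else 0.0
    return values


node_kind = {"s1": "service", "s2": "service", "si_11": "instance", "n_1": "host"}
edge_weights = {("s1", "s2"): 0.4, ("s1", "si_11"): 0.8, ("n_1", "si_11"): 0.6}
preference = personalization(edge_weights, node_kind)
```

An instance normally has exactly one owning service, so the average in the instance branch reduces to that single edge weight.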

The personalized random walk is calculated with the following formula:

where v represents the final scoring result of the nodes, which is also the ranking of the instance root cause positioning results; P is the personalized array; c is the probability of continuing the random walk forward; and u is the scoring result of the next node. After multiple rounds of walk iterations, the scoring result of each node tends to converge, producing a preliminary root cause candidate set.
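A personalized random walk of this kind is commonly implemented as personalized PageRank. The sketch below assumes the update rule "new score = c × (score propagated along weighted edges) + (1 − c) × P", which matches the roles given to c and P above but is not confirmed by the text. The function name, the default c = 0.85, and the handling of isolated nodes are likewise assumptions.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def personalized_random_walk(
    edge_weights: Dict[Edge, float],
    preference: Dict[str, float],
    c: float = 0.85,
    iterations: int = 200,
    tol: float = 1e-12,
) -> List[Tuple[str, float]]:
    """Rank nodes by personalized PageRank score, highest first.

    Every node that appears in edge_weights must also appear in preference.
    """
    total_pref = sum(preference.values())
    if total_pref <= 0:
        raise ValueError("preference values must have a positive sum")
    nodes = list(preference)
    p = {n: preference[n] / total_pref for n in nodes}  # normalized personalized array

    neighbors: Dict[str, Dict[str, float]] = {n: {} for n in nodes}
    for (a, b), w in edge_weights.items():  # edges are treated as undirected
        neighbors[a][b] = neighbors[a].get(b, 0.0) + w
        neighbors[b][a] = neighbors[b].get(a, 0.0) + w

    score = dict(p)
    for _ in range(iterations):
        nxt = {n: (1.0 - c) * p[n] for n in nodes}  # jump back according to P
        for n in nodes:
            out_total = sum(neighbors[n].values())
            if out_total == 0:
                # Isolated node: hand its walking mass back according to P.
                for m in nodes:
                    nxt[m] += c * score[n] * p[m]
                continue
            for m, w in neighbors[n].items():
                nxt[m] += c * score[n] * w / out_total  # continue the walk with probability c
        delta = sum(abs(nxt[n] - score[n]) for n in nodes)
        score = nxt
        if delta < tol:
            break
    return sorted(score.items(), key=lambda kv: kv[1], reverse=True)


ranked = personalized_random_walk(
    {("a", "b"): 1.0, ("b", "c"): 1.0},
    {"a": 0.1, "b": 0.1, "c": 0.8},
)
```

The scores always sum to 1, so they can be compared directly across nodes when forming the preliminary root cause candidate set.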

The step S5 is specifically as follows:

S51, when the micro-service system fails during real-time operation, collecting the multi-dimensional index data of the whole cluster within the abnormal time window;

S52, inputting the multi-dimensional index data into the graph neural network model trained in step S2 to obtain the classification weights of the different root cause types in the real-time abnormal interval;

S53, multiplying the root cause candidate set obtained in step S4 by the classification weights obtained in step S52 to obtain the final root cause ranking and abnormality type, where a higher rank indicates a more likely root cause.
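The product operation of S53 could be sketched as follows. The text does not spell out how node scores and fault type weights are paired. This sketch assumes that each fault type belongs to one node level (service, instance, or host) and is only combined with candidates of that level. The names `final_ranking` and `type_level` and the fault type labels are hypothetical.

```python
from typing import Dict, List, Tuple


def final_ranking(
    candidates: Dict[str, float],
    node_kind: Dict[str, str],
    type_weights: Dict[str, float],
    type_level: Dict[str, str],
) -> List[Tuple[str, str, float]]:
    """Return (node, fault type, combined score) triples, highest score first."""
    ranked: List[Tuple[str, str, float]] = []
    for node, score in candidates.items():
        for fault, weight in type_weights.items():
            # Assumption: a fault type only applies to nodes of its own level.
            if type_level[fault] == node_kind[node]:
                ranked.append((node, fault, score * weight))
    ranked.sort(key=lambda item: item[2], reverse=True)
    return ranked


candidates = {"si_11": 0.5, "n_1": 0.3}  # scores from the random walk in S4
node_kind = {"si_11": "instance", "n_1": "host"}
type_weights = {"instance_cpu": 0.7, "instance_memory": 0.1, "host_cpu": 0.9}  # from S52
type_level = {"instance_cpu": "instance", "instance_memory": "instance", "host_cpu": "host"}
result = final_ranking(candidates, node_kind, type_weights, type_level)
```

The first entry of `result` names both the most likely root cause node and its most likely abnormality type, which is the pair that S5 sets out to produce.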

As an example, the present invention is illustrated on Hipster Shop.

Experiment preparation: the experimental environment consists of three Ubuntu physical machines on which Kubernetes, Istio, and Prometheus are installed. The Hipster Shop micro-service system is used as the example. Hipster Shop is a micro-service demonstration application comprising 12 micro-services. It is a Web-based e-commerce application in which a user can browse goods, add them to a shopping cart, and make purchases; it includes 8 business micro-services and 4 simulated micro-services that together implement the shopping process. The hardware and software information of the environment is shown in Table 1.

The injected fault types and data set sizes are shown in Table 2.

In order to simulate a real user scenario, this embodiment uses Locust as a simulated concurrency generator to produce different workloads that simulate concurrent user behavior in different business scenarios. Meanwhile, in order to simulate the performance problems of a real environment, the following common abnormalities are injected with the chaos engineering tool Chaos Mesh: 1) delay; 2) container instance CPU load; 3) container instance memory load; 4) container instance network packet loss; 5) container process stop. Historical multi-dimensional index data is collected while the abnormalities occur.

Table 1 Hardware and software information of the environment of the embodiment of the present invention

Table 2 Injected fault types and data set sizes

The index data reported in the embodiment are shown in Table 3.

The training parameters of the graph neural network model trained in the present invention are shown in Table 4.

Table 3 Index data reported by the embodiment

Table 4 Training parameters of the graph neural network model

Referring to FIG. 3, when a real-time fault occurs, a heterogeneous topological graph containing all micro-service nodes, instance nodes, and host nodes is constructed, and weights are assigned to its nodes and edges in combination with the multi-dimensional index data of the cluster in that time interval.
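A heterogeneous topological graph with service, instance, and host nodes can be held in a small data structure such as the one below. This is a sketch: the class `HeteroGraph`, the function `build_graph`, and the service, instance, and host names are hypothetical, and all weights start at zero until the index data is applied.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class HeteroGraph:
    node_kind: Dict[str, str] = field(default_factory=dict)      # 'service', 'instance' or 'host'
    node_weight: Dict[str, float] = field(default_factory=dict)
    edge_weight: Dict[Tuple[str, str], float] = field(default_factory=dict)

    def add_node(self, name: str, kind: str, weight: float = 0.0) -> None:
        self.node_kind[name] = kind
        self.node_weight[name] = weight

    def add_edge(self, a: str, b: str, weight: float = 0.0) -> None:
        if a not in self.node_kind or b not in self.node_kind:
            raise KeyError("both endpoints must be added as nodes first")
        self.edge_weight[(a, b)] = weight


def build_graph(
    calls: List[Tuple[str, str]],      # (caller service, callee service)
    instances: Dict[str, List[str]],   # service -> its instances
    placement: Dict[str, str],         # instance -> host it runs on
) -> HeteroGraph:
    graph = HeteroGraph()
    for service, members in instances.items():
        graph.add_node(service, "service")
        for instance in members:
            graph.add_node(instance, "instance")
            graph.add_edge(service, instance)   # membership edge
    for instance, host in placement.items():
        if host not in graph.node_kind:
            graph.add_node(host, "host")
        graph.add_edge(host, instance)          # placement edge
    for caller, callee in calls:
        graph.add_edge(caller, callee)          # direct call edge
    return graph


graph = build_graph(
    calls=[("checkout", "cart")],
    instances={"cart": ["cart-0", "cart-1"], "checkout": ["checkout-0"]},
    placement={"cart-0": "node-a", "cart-1": "node-b", "checkout-0": "node-a"},
)
```

Keeping the node kind next to each node lets later steps apply the level-specific weighting rules without a separate lookup.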

The personalized array value of each node is then calculated, and the personalized random walk algorithm is executed to obtain the final root cause ranking list.

Finally, for the 20 root cause positioning results of the embodiment, Rank1, Rank3, and Rank5 indicate whether the true root cause appears among the top 1, 3, and 5 ranked root causes respectively; 1 indicates that positioning succeeded and 0 that it failed. The results are shown in Table 5.
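The Rank1, Rank3, and Rank5 indicators can be computed as in the sketch below. The function names are hypothetical, and averaging the 0/1 outcomes over several runs is an added convenience that the text does not describe.

```python
from typing import List, Sequence, Tuple


def rank_at_k(ranked: Sequence[str], true_root: str, k: int) -> int:
    """1 if the true root cause is among the top k candidates, otherwise 0."""
    return 1 if true_root in ranked[:k] else 0


def hit_rate_at_k(cases: List[Tuple[Sequence[str], str]], k: int) -> float:
    """Share of runs in which the true root cause is within the top k."""
    return sum(rank_at_k(ranked, truth, k) for ranked, truth in cases) / len(cases)


# Two illustrative runs: (ranked candidate list, true root cause).
cases = [
    (["cart-0", "node-a", "checkout-0"], "cart-0"),
    (["cart-0", "node-a", "checkout-0"], "checkout-0"),
]
rank1 = hit_rate_at_k(cases, 1)
rank3 = hit_rate_at_k(cases, 3)
```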

Table 5 Final results of the invention

Referring to FIG. 4, which is a schematic structural diagram of the device of the present invention.

The apparatus 401 specifically includes: processor 402 and storage device 403.

Micro-service system root cause positioning device 401 based on graph neural network model: the root cause positioning device 401 of the micro service system based on the graph neural network model realizes the root cause positioning method of the micro service system based on the graph neural network model.

Processor 402: the processor 402 loads and executes the instructions and data in the storage device 403 to implement the root cause positioning method of the micro service system based on the graph neural network model.

Storage device 403: the storage device 403 stores instructions and data; the storage device 403 is configured to implement the root cause positioning method of the micro service system based on the graph neural network model.

In summary, the invention has the following beneficial effects: with the method for positioning the abnormal root cause of a micro-service system based on a time sequence node sampling graph neural network model and a random walk algorithm, 1) abnormalities of the micro-service system can be detected quickly and accurately, and the positioning granularity is refined to the instance level; 2) the dynamic changes of the micro-service system are well accommodated by effectively combining the machine learning model with a dynamic graph computation method.

The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within the scope of the invention.

Claims

1. A root cause positioning method of a micro-service system based on a graph neural network model, characterized in that the method comprises the following steps:

the step S2 specifically comprises the following steps:

S24, selecting a proper number of convolution layers, marking corresponding fault type labels for different abnormal time windows, training the graph neural network model, and outputting the trained graph neural network model when the classification loss function converges to an expected value;

the step S4 is specifically as follows:

S42, assigning to instance node si_mj a personalized value equal to the weight of the edge to the service to which it belongs;

S44, applying a personalized random walk algorithm on the heterogeneous topological graph and sorting all nodes in descending order of abnormality degree to generate a preliminary root cause candidate set;

S5, inputting the root cause candidate set and the real-time index feature data of the abnormal time window into the graph neural network model, and obtaining the final root cause and the root cause abnormality type after feature weighting;

the step S5 is specifically as follows:

2. The root cause positioning method of a micro-service system based on a graph neural network model as set forth in claim 1, wherein: the multi-dimensional time sequence performance index in step S1 comprises: micro-service-level indexes, micro-service-instance-level indexes, and host-level indexes.

3. The root cause positioning method of a micro-service system based on a graph neural network model as set forth in claim 1, wherein: the step S3 is specifically as follows:

S33, combining the micro-service-instance-level indexes collected in step S1, assigning a weight to the j-th instance node si_mj of the m-th service, the weight being based on the container CPU load, container memory load, container network load, container throughput, and container request-response success rate of instance node si_mj; then calculating the correlation between the different index sequences according to the Pearson correlation coefficient; and finally assigning the maximum correlation to the edge of the j-th instance of the m-th service;

S34, combining the host-level indexes collected in step S1, assigning a weight to the k-th host, the weight being based on the CPU load, memory load, and network load of the host; then calculating the correlation between the different index sequences according to the Pearson correlation coefficient; and finally assigning the maximum correlation to the edges between the host and all instance nodes on that host.

4. The root cause positioning method of a micro-service system based on a graph neural network model as set forth in claim 3, wherein: the calculation formula of the Pearson correlation coefficient is as follows:

5. A micro-service system root cause positioning device based on a graph neural network model, characterized by comprising: a processor and a storage device; the processor loads and executes instructions and data in the storage device to implement the root cause positioning method of a micro-service system based on a graph neural network model according to any one of claims 1 to 4.