CN115309575A

CN115309575A - Micro-service fault diagnosis method, device and equipment based on graph convolution neural network

Info

Publication number: CN115309575A
Application number: CN202210736465.8A
Authority: CN
Inventors: 张圣林; 金鹏翔; 孙永谦; 张弼铖; 林子涵; 夏思博; 金娃
Original assignee: Nankai University
Current assignee: Nankai University
Priority date: 2022-06-27
Filing date: 2022-06-27
Publication date: 2022-11-08

Abstract

The invention discloses a micro-service fault diagnosis method, device and equipment based on a graph convolution neural network, whose main purpose is to quickly and accurately locate the root cause micro-service node and determine the type of a micro-service fault. The method comprises the following steps: acquiring alarm events of a target micro-service to be diagnosed within a preset time period before and after a fault, wherein the alarm events are generated based on multi-modal data; determining an alarm event sequence of each micro-service node in the target micro-service according to the alarm events; generating an event vector of each alarm event in the alarm event sequence, and calculating a node vector of the micro-service node corresponding to each alarm event based on the event vector; and inputting the node vector of each micro-service node into a trained fault diagnosis model to obtain a fault diagnosis result, wherein the fault diagnosis result at least comprises a positioning result of the root cause micro-service node and a prediction result of the micro-service fault type.

Description

Micro-service fault diagnosis method, device and equipment based on graph convolution neural network

Technical Field

The present application relates to the field of computer technologies, and in particular, to a method, an apparatus, and a device for diagnosing a microservice fault based on a graph convolution neural network.

Background

In recent years, with the growing adoption of the micro-service software architecture, internet enterprises have begun to split large applications built on a monolithic architecture into multiple micro-services with independent functions, so as to meet complex and diverse business demands and a continuously expanding business scale. However, once the micro-service architecture is introduced, the growth in system scale, complexity and flexibility poses greater challenges to the operation and maintenance departments of internet enterprises. These departments continuously record various state information of the micro-service system, and when an anomaly is detected, they need to locate the root cause of the fault quickly and accurately and take prompt measures to avoid excessive loss. Fault diagnosis is therefore an important link in the whole operation and maintenance process, and how to quickly and accurately locate the root cause micro-service node and analyze the fault type has become a problem to be solved urgently.

In contemporary internet enterprises, most existing fault diagnosis methods start from only one or two types of data. There is usually a dedicated data platform that collects relevant data (such as log data, index data and call chain data) while the micro-service system is running, and a monitoring system monitors the running state of the micro-service system in real time. When the system is detected to be abnormal, multi-modal data such as logs, indexes and call chains can generate a large amount of alarm information. Faced with the large amount of alarm information generated by a huge micro-service system, fault diagnosis still relies heavily on manpower and expert experience, and this approach cannot meet the requirements of large-scale micro-service systems whose scale is expanding rapidly and whose structure is increasingly complex.

Disclosure of Invention

The invention provides a micro-service fault diagnosis method, device and equipment based on a graph convolution neural network, whose main purpose is to quickly and accurately locate the root cause micro-service node and determine the type of a micro-service fault.

According to a first aspect of the present invention, there is provided a micro-service fault diagnosis method based on a graph convolution neural network, including:

collecting alarm events of target micro-services to be diagnosed in a preset time period before and after a fault, wherein the alarm events are generated based on multi-modal data;

determining an alarm event sequence of each micro-service node in the target micro-service according to the alarm event;

generating an event vector of each alarm event in the alarm event sequence, and calculating a node vector of a micro service node corresponding to each alarm event based on the event vector;

and inputting the node vector of each micro-service node into a trained fault diagnosis model to obtain a fault diagnosis result, wherein the fault diagnosis result at least comprises a positioning result of the root cause micro-service node and a prediction result of the micro-service fault type.

According to a second aspect of the present invention, there is provided a micro-service fault diagnosis apparatus based on a graph convolution neural network, the apparatus comprising:

an acquisition module, used for acquiring alarm events of a target micro-service to be diagnosed in a preset time period before and after a fault, wherein the alarm events are generated based on multi-modal data;

the determining module is used for determining an alarm event sequence of each micro-service node in the target micro-service according to the alarm event;

the calculation module is used for generating an event vector of each alarm event in the alarm event sequence and calculating a node vector of the micro service node corresponding to each alarm event based on the event vector;

and the input module is used for inputting the node vector of each micro-service node into the trained fault diagnosis model to obtain a fault diagnosis result, wherein the fault diagnosis result at least comprises a positioning result of the root cause micro-service node and a prediction result of the micro-service fault type.

According to a third aspect of the present invention, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of:

According to a fourth aspect of the present invention, there is provided a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the program:

acquiring alarm events of target micro-services to be diagnosed within a preset time period before and after a fault, wherein the alarm events are generated based on multi-modal data;

Compared with the current approach of performing fault detection on single-dimensional data, the micro-service fault diagnosis method, device and equipment based on the graph convolution neural network can acquire, based on multi-modal data, alarm events of a target micro-service to be diagnosed in a preset time period before and after a fault, and then determine an alarm event sequence of each micro-service node in the target micro-service according to the alarm events. Event vectors of all alarm events in the alarm event sequence are then generated, and node vectors of the micro-service nodes corresponding to the alarm events are calculated based on the event vectors. Finally, the node vector of each micro-service node is input into the trained fault diagnosis model to obtain a fault diagnosis result, wherein the fault diagnosis result at least comprises a positioning result of the root cause micro-service node and a prediction result of the micro-service fault type. The technical scheme combines micro-service alarm events of three modalities, makes use of the topological structure, and uses the graph convolution neural network to accurately locate the root cause micro-service node and accurately predict the micro-service fault type, so that the time operation and maintenance personnel spend on fault recovery is reduced, the process from fault occurrence to fault recovery of the micro-service system is accelerated, and loss is reduced to the greatest extent.

The foregoing description is only an overview of the technical solutions of the present application. So that the technical means of the present application can be understood more clearly and implemented according to the content of the description, and so that the above and other objects, features and advantages of the present application become more readily apparent, a detailed description of the present application is given below.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:

FIG. 1 is a schematic flowchart illustrating a micro-service fault diagnosis method based on a graph convolution neural network according to an embodiment of the present invention;

FIG. 2 is a schematic flow chart illustrating another micro-service fault diagnosis method based on a graph convolution neural network according to an embodiment of the present invention;

FIG. 3 is a diagram illustrating an example call chain provided by an embodiment of the invention;

FIG. 4 is a diagram illustrating a Span in a call chain on a time axis according to an embodiment of the present invention;

FIG. 5 is a diagram illustrating a model structure used by a training event vector generator according to an embodiment of the present invention;

FIG. 6 is a schematic structural diagram of a fault diagnosis model provided by an embodiment of the invention;

FIG. 7 is a flow chart of a micro service system service provided by an embodiment of the present invention;

FIG. 8 illustrates a microservice deployment diagram provided by an embodiment of the invention;

FIG. 9 is a diagram illustrating a topology of a homogeneous graph according to an embodiment of the present invention;

FIG. 10 illustrates a schematic diagram of a heterogeneous graph topology provided by an embodiment of the present invention;

FIG. 11 is a schematic structural diagram of a first convolutional neural network model according to an embodiment of the present invention;

FIG. 12 is a schematic structural diagram of a second convolutional neural network model provided in an embodiment of the present invention;

FIG. 13 is a schematic structural diagram illustrating a micro-service fault diagnosis apparatus based on a graph convolution neural network according to an embodiment of the present invention;

FIG. 14 is a schematic structural diagram illustrating another micro-service fault diagnosis apparatus based on a graph convolution neural network according to an embodiment of the present invention;

FIG. 15 shows a physical structure diagram of a computer device according to an embodiment of the present invention.

Detailed Description

At present, faced with complex large-scale micro-service systems, more and more enterprises are building intelligent operation and maintenance platforms, hoping to use methods from the field of artificial intelligence (machine learning and deep learning), applied to large amounts of operation and maintenance data, to solve problems that traditional operation and maintenance cannot. Given massive multi-modal data and complex, changeable topological structures, rapid and accurate fault diagnosis helps enterprises repair faults in time and avoid excessive loss. Research on the fault diagnosis problem in micro-service systems is therefore of great significance and practical value.

The invention will be described in detail hereinafter with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.

In order to locate the root cause micro-service node quickly and accurately and to determine the micro-service fault type accurately, the invention provides a micro-service fault diagnosis method based on a graph convolution neural network. As shown in FIG. 1, the method comprises the following steps:

101. Acquiring alarm events of the target micro-service to be diagnosed in a preset time period before and after the fault, wherein the alarm events are generated based on multi-modal data.

Fault diagnosis studies in recent years can be classified into four categories: those using log data, those using index data, those using call chain data, and those combining multi-modal data. However, most existing fault diagnosis methods start from only one or two types of data, and little research combines multi-modal data with the topological structure. In the fault diagnosis process of an internet enterprise, the data that the operation and maintenance data acquisition platform can collect are multi-source and heterogeneous, including but not limited to log data, platform index data (such as CPU utilization and network throughput), service index data (such as the number of successful and failed user orders per unit time in an e-commerce app), call chain data and configuration management data. When applying fault diagnosis algorithms, multi-modal data should be used as far as possible, because data from a single channel can leave blind spots. For example, a sudden rise in CPU utilization or memory utilization is usually reflected in the index data and does not appear in the log data, whereas a failed user login may be recorded in the log data or call chain data and not be reflected in the index data. Compared with using a single channel, using multi-modal data generally yields more data features, and the fault diagnosis effect is often better. In other applications of artificial intelligence, researchers have also shown theoretically that systems built on multi-modal data often perform tasks better (for example, in video classification, jointly modeling video, audio and subtitle information is more effective than single-modal modeling). In view of this, the present invention combines multi-modal operation and maintenance data (log, index and call chain data) for fault diagnosis.

For this embodiment, when a micro-service anomaly is detected, multi-modal data in a preset time period before and after the anomaly can be collected first, and alarm events are then generated based on the multi-modal data. The multi-modal data used include, but are not limited to, log data, index data and call chain data. Accordingly, the alarm events include, but are not limited to, log alarm events, index alarm events and call chain alarm events. The duration of the preset time period may be set according to the actual application scenario, for example 5 minutes, and is not specifically limited herein.

For log data, a large amount of log data is generated when micro-services run and when they communicate with one another. A log is typically semi-structured data consisting of a fixed log template portion and a variable parameter portion. During the fault occurrence period, a log alarm event generated from the log data contains the following fields:

a. micro-service node name: this field indicates which micro-service node generated the log entry;

b. timestamp: this field indicates the time at which the log entry was generated;

c. log template ID: this field is derived by the Drain algorithm [19] and is a substring of the hash value of the template to which the log entry corresponds.

Here is an example of an alarm event generated from log data:

('dbservice1', 1625122737, 'd071a2a6')

Index data can usually be divided into business indexes and system indexes. A business index is an index associated with a particular business. For example, for an e-commerce website, the number of visits to the website within 1 minute, the number of successful transactions, the number of failed transactions and the number of new user registrations can all be regarded as business indexes. System indexes are indexes related to servers, containers and the like; CPU utilization, network traffic and disk occupancy belong to system indexes (also referred to as platform indexes). The operation and maintenance department continuously records these indexes and visualizes them for viewing. It also performs anomaly detection on the indexes, and an alarm is generated automatically when an anomaly occurs.

During the fault occurrence period, the indicator alarm event generated by the indicator data contains the following fields:

a. micro-service node name: this field indicates the micro-service to which the index belongs;

b. timestamp: this field indicates the time at which the abnormal index value was recorded;

c. index ID: this field uniquely identifies one index curve;

d. index anomaly type: this field indicates the anomaly type of the index.

Here is an example of an alarm event generated by the index data:

('webservice2',1625122737,'webservice2_0.0.0.3_docker_cpu_kernel_ticks','+')

For call chain data, a call chain represents the execution of a transaction or flow in a (distributed) system. In the OpenTracing standard, a call chain is a directed acyclic graph composed of multiple Spans, each representing a named, timed, continuous execution segment in the call chain. The service calls in a call chain are initially recorded separately on different micro-service nodes and are then assembled into a complete call chain using fields in the records. As shown in FIG. 3, the call chain is presented as a tree structure and contains all the Spans associated with a certain service call. Each Span has a unique Span ID and, likewise, each call chain can be uniquely identified by a call chain ID. A Span records the micro-service instances of the caller and the callee, together with the corresponding operation name and time delay. Except for the root Span, each Span normally has a corresponding parent Span. In FIG. 3, Span A is a synchronous call, and during its execution two new synchronous calls, Span B and Span C, are generated, so Span A is the parent Span of Span B and Span C. FIG. 4 shows the Spans of the call chain of FIG. 3 on a time axis; such a timing diagram shows the execution more concretely. Span B and Span C are parallel calls generated by Span A, so their execution times overlap. Span F, Span G and Span H are synchronous calls generated by Span C, so they do not overlap on the time axis.
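
The parent/child and timing relations described for FIG. 3 and FIG. 4 can be illustrated with a minimal sketch. The `Span` class, its field names and the timing values below are hypothetical and chosen only for illustration; they are not the OpenTracing API and are not taken from the embodiment. Only Span A, Span B and Span C are modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    """Hypothetical, simplified span record (not the OpenTracing API)."""
    span_id: str
    parent_id: Optional[str]  # None marks the root Span
    start: float              # start time, arbitrary units
    end: float                # end time, arbitrary units


def children_of(spans, parent_id):
    """Return the IDs of the Spans whose parent is `parent_id`."""
    return [s.span_id for s in spans if s.parent_id == parent_id]


def overlaps(a, b):
    """True if the execution intervals of the two Spans intersect."""
    return a.start < b.end and b.start < a.end


# Illustrative timings only: B and C run in parallel under A.
spans = [
    Span("A", None, 0.0, 10.0),
    Span("B", "A", 1.0, 4.0),
    Span("C", "A", 2.0, 9.0),
]
by_id = {s.span_id: s for s in spans}
```

Under these assumed timings, `children_of(spans, "A")` lists B and C, and `overlaps(by_id["B"], by_id["C"])` is true, matching the description of parallel calls whose execution times overlap.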

During the fault occurrence period, the call chain alarm event generated by the call chain data contains the following fields:

a. timestamp: this field indicates the call initiation time;

b. caller micro-service node name: this field indicates the caller micro-service node;

c. callee micro-service node name: this field indicates the callee micro-service node.

Here is an example of an alarm event generated by call chain data:

(1625131245, 'webservice1', 'loginservice2')
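
The three alarm event layouts listed above can be written down as simple record types. This is only a sketch: the class and field names are hypothetical, and treating the timestamp as an integer number of epoch seconds is an assumption based on the example values, not something the description states.

```python
from typing import NamedTuple


class LogAlarm(NamedTuple):
    node: str          # micro-service node that generated the log entry
    timestamp: int     # generation time (epoch seconds assumed)
    template_id: str   # log template ID


class IndexAlarm(NamedTuple):
    node: str          # micro-service the index belongs to
    timestamp: int     # time the abnormal value was recorded
    index_id: str      # identifies one index curve
    anomaly_type: str  # anomaly type of the index


class CallChainAlarm(NamedTuple):
    timestamp: int     # call initiation time
    caller: str        # caller micro-service node
    callee: str        # callee micro-service node


# The example events from the description, field for field.
log_event = LogAlarm('dbservice1', 1625122737, 'd071a2a6')
index_event = IndexAlarm(
    'webservice2', 1625122737,
    'webservice2_0.0.0.3_docker_cpu_kernel_ticks', '+')
chain_event = CallChainAlarm(1625131245, 'webservice1', 'loginservice2')
```

Because `NamedTuple` instances are still tuples, each record compares equal to the plain tuple shown in the corresponding example.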

102. Determining an alarm event sequence of each micro-service node in the target micro-service according to the alarm events.

For this embodiment, after the alarm events of the target micro-service to be diagnosed within the preset time period before and after the fault are collected, the alarm events can be divided by micro-service node. Log alarm events and index alarm events only need to be attributed to the micro-service node where the event occurred, whereas a call chain alarm event needs to be attributed to both the caller micro-service node and the callee micro-service node. The alarm events on each micro-service node can then be arranged in order of occurrence, yielding a single unified sequence of alarm events across the three modalities, namely the alarm event sequence.
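
The attribution and ordering rule of step 102 can be sketched as follows. The function name and the tagged-tuple representation are hypothetical; the event layouts follow the field orders given in the examples of step 101.

```python
from collections import defaultdict


def build_event_sequences(log_events, index_events, chain_events):
    """Group alarm events by micro-service node and sort each group by time.

    log_events:   tuples (node, timestamp, template_id)
    index_events: tuples (node, timestamp, index_id, anomaly_type)
    chain_events: tuples (timestamp, caller, callee)
    Returns {node: [(timestamp, modality, event), ...]}.
    """
    per_node = defaultdict(list)
    for ev in log_events:
        per_node[ev[0]].append((ev[1], "log", ev))
    for ev in index_events:
        per_node[ev[0]].append((ev[1], "index", ev))
    for ev in chain_events:
        ts, caller, callee = ev
        # A call chain alarm is attributed to both endpoints; the set
        # avoids a duplicate entry if caller and callee coincide.
        for node in {caller, callee}:
            per_node[node].append((ts, "chain", ev))
    # Sort by timestamp only; events with equal timestamps keep
    # their insertion order because sorted() is stable.
    return {node: sorted(evs, key=lambda item: item[0])
            for node, evs in per_node.items()}
```

A call chain alarm therefore appears in two sequences, once for the caller and once for the callee, while log and index alarms appear only in the sequence of the node that produced them.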

103. Generating event vectors of all alarm events in the alarm event sequence, and calculating node vectors of the micro-service nodes corresponding to the alarm events based on the event vectors.

In a specific application scenario, in order to map the micro-service nodes into a low-dimensional vector space, the events on the micro-service nodes must first be expressed as vectors. After step 102, each micro-service node corresponds to one alarm event sequence. Let m denote the number of fault cases used for training in the historical fault case library and n the number of micro-service nodes; then m × n event sequences are obtained on the training set. Treating each event sequence as time-series data, event embeddings can be trained in a way similar to how fastText trains word embeddings. fastText can obtain vector representations of words through either supervised or unsupervised learning. The structure of the model used to train the event vector generator is shown in FIG. 5. The input of the model is the event sequence corresponding to a single micro-service node in a single fault, and the label is the fault type of that micro-service node (or "normal" if the node is normal). The event vector generator is trained using the training data in the training data set, and its training result is verified using the verification data and test data in the training data set. The vector generation accuracy of the event vector generator is calculated from the verification result and the training labels, and if this accuracy is greater than a first preset threshold, training of the event vector generator is determined to be complete. The first preset threshold may be any value from 0 to 1; the closer the set value is to 1, the higher the training precision of the event vector generator. The specific value may be set according to the actual application scenario and is not specifically limited herein.

For this embodiment, the alarm event sequence of each micro-service node may be input into the trained event vector generator, which outputs the event vector of each alarm event in the sequence. A node vector of the micro-service node corresponding to each alarm event may then be calculated based on the event vectors. However, each micro-service node generates a large number of alarm events during a micro-service fault. Analysis of these alarm events shows that some of them occur in large numbers but are not associated with the actual fault, and these bring a large amount of noise into the description of the micro-service nodes; other alarm events occur only rarely, yet inspection of their content shows that they are strongly correlated with the final root cause node and fault type. Therefore, when the node vector of a micro-service node is calculated, different weights can be assigned to the alarm events according to a certain rule. The weighted mean of the event vectors on each micro-service node, taken according to the event weights, is then used as the vector representation of that micro-service node, namely the node vector.
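
The weighted mean described above can be sketched in a few lines. The function name is hypothetical, and the weights are taken as given inputs because the description does not specify the rule that produces them.

```python
def node_vector(event_vectors, weights):
    """Weighted mean of the event vectors on one micro-service node.

    event_vectors: list of equal-length lists of floats, one per alarm event
    weights:       one non-negative weight per alarm event; the rule that
                   produces them is not specified in the description
    """
    if not event_vectors or len(event_vectors) != len(weights):
        raise ValueError("need exactly one weight per event vector")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    dim = len(event_vectors[0])
    # Component-wise weighted sum, normalized by the total weight.
    return [
        sum(w * vec[d] for vec, w in zip(event_vectors, weights)) / total
        for d in range(dim)
    ]
```

For instance, two event vectors `[1.0, 0.0]` and `[0.0, 1.0]` with weights 3 and 1 give the node vector `[0.75, 0.25]`, so the more heavily weighted event dominates the representation.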

104. Inputting the node vector of each micro-service node into the trained fault diagnosis model to obtain a fault diagnosis result, wherein the fault diagnosis result at least comprises a positioning result of the root cause micro-service node and a prediction result of the micro-service fault type.

The fault diagnosis model can be a graph convolution neural network model, and specifically can include a first convolution neural network model based on a homogeneous graph and a second convolution neural network model based on a heterogeneous graph.
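
To give a sense of what a graph convolution does with node vectors and a topology, the following is a minimal sketch of one mean-aggregation layer on a homogeneous graph. This passage does not specify the propagation rule, so the aggregation scheme, the self-loops, the ReLU activation and the treatment of edges as undirected are all assumptions made for illustration; the function and parameter names are hypothetical.

```python
def graph_conv_layer(node_vectors, edges, weight):
    """One mean-aggregation graph convolution step followed by ReLU.

    node_vectors: {node: [float, ...]} input feature vector per node
    edges:        iterable of (u, v) pairs, treated as undirected (assumption)
    weight:       matrix as a list of rows, shape (in_dim, out_dim)
    """
    # Every node is its own neighbor (self-loop), an assumed convention.
    neighbors = {n: {n} for n in node_vectors}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    in_dim, out_dim = len(weight), len(weight[0])
    out = {}
    for node, nbrs in neighbors.items():
        # Average the features of the node and its neighbors.
        agg = [sum(node_vectors[m][d] for m in nbrs) / len(nbrs)
               for d in range(in_dim)]
        # Linear transform, then ReLU.
        out[node] = [
            max(0.0, sum(agg[d] * weight[d][k] for d in range(in_dim)))
            for k in range(out_dim)
        ]
    return out
```

Each output vector mixes a node's own vector with those of the nodes it is connected to, which is how information about anomalies on neighboring micro-services can influence the representation of a given node.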

For this embodiment, in order to ensure the reliability and service quality of the micro-service, the operation and maintenance team needs to consider how to recover quickly after a fault occurs. To recover the micro-service system from a fault, the first thing to do is to locate the root cause micro-service node that caused the fault. During a fault, the fault at the root cause micro-service node may propagate along call chain links, causing other micro-services to exhibit anomalies as well. Because the number of micro-services in a micro-service system is huge and the calling relationships among them are complex, the monitoring system usually generates a large amount of alarm information during a fault; how to use this alarm information to quickly and accurately locate the root cause micro-service node is the first problem that fault diagnosis must solve. After the root cause micro-service node is located, operation and maintenance personnel urgently need to recover it from the fault; if the fault diagnosis system can automatically determine the type of the micro-service fault at this point, the time they spend on recovery is greatly reduced. Determining the micro-service fault type quickly and accurately therefore accelerates the process from fault occurrence to fault recovery and reduces loss to the greatest extent. Accordingly, using the large amount of alarm information generated by the monitoring system during a fault to quickly and accurately determine the micro-service fault type is the second problem that fault diagnosis must solve.

In a specific application scenario, before the steps of this embodiment are performed, the fault diagnosis model needs to be trained offline in advance. As shown in FIG. 6, in the offline training process, historical alarm events in historical fault cases are collected, and the training data set is divided into training data, verification data and test data. For each fault in the training data set of the historical fault case library, the log, index and call chain alarm events during the fault are collected according to the time period in which the fault occurred. The alarm event sequence of each micro-service node is then obtained, an event vector generator is obtained by analyzing the alarm event sequence on each micro-service node in each fault case, and different weights are assigned to the events according to a certain rule. Next, the weighted mean of the event vectors on each node, taken according to the event weights, is used as the vector representation of the node; these representations form the training data set. Finally, the fault diagnosis model is trained using the training data set of each node together with the corresponding topological structure and labels. Specifically, a cross entropy loss function can be used for training. The fault diagnosis model is trained using the training data in the training data set, and its training result is verified using the verification data and test data in the training data set. The fault diagnosis accuracy of the model is calculated from the verification result and the training labels, and if this accuracy is greater than a second preset threshold, training of the fault diagnosis model is determined to be complete. The second preset threshold may be the same as or different from the first preset threshold and may be any value from 0 to 1; the closer the set value is to 1, the higher the training precision of the fault diagnosis model. The specific value may be set according to the actual application scenario and is not specifically limited herein. It should be noted that, in order to avoid the risk of data leakage in the fault diagnosis model, the same training data set is used for training the event vector generator and the fault diagnosis model.
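
The loss and the completion criterion mentioned above can be sketched as follows. The function names are hypothetical, the cross entropy is shown for a single sample with class probabilities already computed, and the way predictions are produced is outside the scope of this sketch.

```python
import math


def cross_entropy(probs, label):
    """Cross entropy of one sample: -log of the probability of the true class.

    probs: predicted class probabilities (assumed to sum to 1)
    label: index of the true class
    """
    return -math.log(probs[label])


def accuracy(predictions, labels):
    """Fraction of samples whose predicted class equals the label."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def training_complete(predictions, labels, threshold):
    """Completion criterion: accuracy strictly greater than the threshold."""
    return accuracy(predictions, labels) > threshold
```

With three of four verification samples predicted correctly, the accuracy is 0.75, so training would be judged complete for a preset threshold of 0.7 but not for a threshold of 0.75.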

As a preferred mode, during offline training one fault case corresponds to one training sample of the graph convolution neural network. Because the training samples in the historical fault cases are unevenly distributed over the fault type labels and the root cause node labels, data enhancement is adopted to improve the generalization capability of the model. For multiple training samples with the same root cause node label (or fault type label), the vector representations of the samples at a certain micro-service node are randomly exchanged to obtain more training samples under the same root cause label (or fault type label). The newly obtained training samples then take part in training together with the original samples.
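
One possible reading of this exchange step is sketched below: among samples that share a label, the vectors at one chosen micro-service node are randomly permuted across the samples. The function name, the dictionary representation of a sample and the way the node is chosen are assumptions made for illustration.

```python
import random


def augment_by_swapping(samples, node, rng=None):
    """Create new samples by exchanging one node's vector among samples.

    samples: list of {node_name: vector} dicts that share the same label
    node:    the micro-service node whose vectors are exchanged
    rng:     optional random.Random instance for reproducibility
    """
    rng = rng or random.Random()
    donors = [s[node] for s in samples]
    rng.shuffle(donors)  # random permutation of the chosen node's vectors
    augmented = []
    for sample, donor_vec in zip(samples, donors):
        new_sample = dict(sample)      # shallow copy; originals are untouched
        new_sample[node] = donor_vec   # replace only the chosen node's vector
        augmented.append(new_sample)
    return augmented
```

The returned samples keep the shared label of their sources and can be appended to the original training samples. A random permutation may occasionally leave a sample unchanged, so deduplication may be worthwhile in practice.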

After training of the fault diagnosis model is judged to be complete, for this embodiment, the node vector of each micro-service node can be input into the trained fault diagnosis model to obtain a fault diagnosis result, wherein the fault diagnosis result at least comprises a positioning result of the root cause micro-service node and a prediction result of the micro-service fault type. The positioning result may be a ranking of candidate root cause micro-service nodes, and the node ranked first can be determined as the root cause micro-service node that caused the target micro-service fault. The prediction result may be a ranking of fault types, and the fault type ranked first can be determined as the fault type of the target micro-service. Furthermore, the positioning result of the root cause micro-service node and the prediction result of the micro-service fault type can be reported to the relevant operation and maintenance personnel.
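
Turning model outputs into the two rankings can be sketched as follows. The scores and the fault type names are made-up illustrative values, not results from the embodiment, and the helper name is hypothetical.

```python
def rank_descending(scores):
    """Return the keys of `scores` sorted from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


# Illustrative scores only; a real model would produce these values.
node_scores = {"webservice1": 0.12, "loginservice2": 0.71, "dbservice1": 0.17}
type_scores = {"cpu": 0.6, "memory": 0.3, "network": 0.1}

root_cause_ranking = rank_descending(node_scores)
fault_type_ranking = rank_descending(type_scores)

# The first entry of each ranking is taken as the diagnosis.
top_root_cause = root_cause_ranking[0]
top_fault_type = fault_type_ranking[0]
```

Reporting the full rankings rather than only the first entries lets operation and maintenance personnel fall back to the next candidates if the top one turns out to be wrong.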

The micro-service fault diagnosis method based on the graph convolution neural network can acquire, based on multi-modal data, alarm events of a target micro-service to be diagnosed in a preset time period before and after a fault, and then determine an alarm event sequence of each micro-service node in the target micro-service according to the alarm events. An event vector of each alarm event in the alarm event sequence is then generated, and a node vector of the micro-service node corresponding to each alarm event is calculated based on the event vector. Finally, the node vector of each micro-service node is input into the trained fault diagnosis model to obtain a fault diagnosis result, wherein the fault diagnosis result at least comprises a positioning result of the root cause micro-service node and a prediction result of the micro-service fault type. The technical scheme combines micro-service alarm events of three modalities, makes use of the topological structure, and uses the graph convolution neural network to accurately locate the root cause micro-service node and accurately predict the micro-service fault type, so that the time operation and maintenance personnel spend on fault recovery is reduced, the process from fault occurrence to fault recovery of the micro-service system is accelerated, and loss is reduced to the greatest extent.

Further, in order to better explain the micro-service fault diagnosis process, as a refinement and extension of the foregoing embodiment, an embodiment of the present invention provides another micro-service fault diagnosis method based on a graph convolution neural network. As shown in fig. 2, the method includes:

201. Acquiring alarm events of the target micro-service to be diagnosed in a preset time period before and after the fault, wherein the alarm events are generated based on the multi-modal data.

For the present embodiment, the specific implementation process may refer to the related description in step 101 of the embodiment, and is not described herein again.

202. Determining an alarm event sequence of each micro-service node in the target micro-service according to the alarm events.

For this embodiment, in a specific application scenario, the step 202 of the embodiment may specifically include: attributing the index alarm event and the log alarm event to a micro service node where the event occurs, and attributing the call chain alarm event to a calling micro service node and a called micro service node at the same time; and uniformly arranging the alarm events on each micro service node according to the sequence of the occurrence of the events to obtain an alarm event sequence of each micro service node.
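The attribution and ordering rule above can be sketched as follows. The alarm record layout (the keys `time`, `kind`, `event`, `node`, `caller`, `callee`) is a hypothetical format chosen for illustration only:

```python
from collections import defaultdict


def build_event_sequences(alarms):
    """Group alarm events by micro service node and order each group by time.

    Each alarm is a dict with hypothetical keys:
      'time'   : sortable timestamp
      'kind'   : 'metric', 'log', or 'trace'
      'event'  : event identifier
      'node'   : node name (metric and log alarms)
      'caller', 'callee' : node names (call chain alarms)
    """
    per_node = defaultdict(list)
    for alarm in alarms:
        if alarm["kind"] == "trace":
            # A call chain alarm is attributed to both ends of the call.
            targets = (alarm["caller"], alarm["callee"])
        else:
            # Metric and log alarms belong to the node where they occurred.
            targets = (alarm["node"],)
        for node in targets:
            per_node[node].append((alarm["time"], alarm["event"]))
    # sorted() is stable, so simultaneous events keep their input order.
    return {
        node: [event for _, event in sorted(items, key=lambda item: item[0])]
        for node, items in per_node.items()
    }


alarms = [
    {"time": 3, "kind": "log", "event": "log_error_burst", "node": "A"},
    {"time": 1, "kind": "metric", "event": "cpu_high", "node": "A"},
    {"time": 2, "kind": "trace", "event": "slow_call", "caller": "A", "callee": "B"},
]
sequences = build_event_sequences(alarms)
```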

203. Inputting the alarm event sequence into the trained event vector generator to generate an event vector of each alarm event in the alarm event sequence.

The event vector generator is obtained by training with a training data set after data enhancement processing.

In a specific application scenario, generally, the number of fault cases is small, the fault types are distributed unevenly, and the difference between the numbers of various types of faults is large. Under the background, before the steps of the embodiment are executed, after the training data set is obtained, the method can expand the number of samples corresponding to the fault types with a smaller number in the training data set by using a data enhancement strategy so as to solve the problems of a small number of fault cases and uneven distribution of fault categories.

On this basis, a method for training an event vector generator in advance is provided, which specifically includes:

(1) Training an event vector generator in an original training data set of a fault case library, so that all alarm events occurring in the fault case library are mapped into a low-dimensional vector space to obtain vector representation of each alarm event;

(2) Because fault cases are few and fault categories are unevenly distributed, event vectors produced by the generator trained in (1) cannot accurately capture the differences and similarities between alarms and are easily dominated by the fault categories with many samples. Therefore, the number of samples of the less frequent fault categories needs to be increased, as follows: given the event sequence of a micro service node in a certain fault, randomly select an event in the sequence; in the vector space obtained in (1), find the event whose vector is closest to that event's vector (measured by cosine similarity); and replace the selected event in the original sequence with the found event to obtain a new sample, whose label remains the original fault category;

(3) Inputting the new training samples obtained in (2) together with the original training samples into the model for retraining. It is noted that both unsupervised and supervised training (with the fault type as the label) can be performed in this training step.
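The substitution in (2) can be illustrated with a small sketch. The event names and two-dimensional vectors are made-up placeholders standing in for the vector space learned in (1), and the helper names are hypothetical:

```python
import math
import random


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def nearest_event(event, vectors):
    """Return the other event whose vector is closest by cosine similarity."""
    candidates = [e for e in vectors if e != event]
    return max(candidates, key=lambda e: cosine(vectors[event], vectors[e]))


def augment(sequence, label, vectors, rng):
    """Swap one randomly chosen event for its nearest neighbor; keep the label."""
    index = rng.randrange(len(sequence))
    new_sequence = list(sequence)
    new_sequence[index] = nearest_event(sequence[index], vectors)
    return new_sequence, label


# Placeholder vector space standing in for the output of step (1).
vectors = {
    "cpu_high": [1.0, 0.1],
    "cpu_spike": [0.9, 0.2],
    "log_error": [0.0, 1.0],
}
original = ["cpu_high", "log_error"]
new_sequence, new_label = augment(original, "CPU exception", vectors, random.Random(0))
```

Passing the random generator in explicitly keeps the augmentation reproducible, which helps when comparing models trained on differently augmented sets.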

Correspondingly, the embodiment steps may specifically include: generating a sample node vector of each micro-service node in the sample micro-service according to the historical alarm event of the sample micro-service in the initial training data set; exchanging at least one sample node vector of the sample microservices in the initial training data set according to root cause microservices node labels or microservices fault type labels of the sample microservices to obtain a training data set after data enhancement processing; and training the event vector generator by using the training data set after the data enhancement processing so as to obtain the trained event vector generator.

Through the above steps, the event vector generator can be obtained through training, and for any event that appears in the training data, the corresponding vector can be found in the resulting vector space. For a small number of events that appear in the validation data, the test data, or online fault diagnosis, the trained event vector generator cannot generate corresponding vectors; such events can be ignored during fault diagnosis, and when the number of unknown events becomes large, a new batch of data can be selected to retrain the model (in practice, data is usually collected periodically and the model is updated periodically). Further, the alarm event sequence of each micro service node may be input to the trained event vector generator, which outputs the event vector of each alarm event in the alarm event sequence.
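One way to handle events that have no learned vector is sketched below. The 20% threshold and the helper names are assumptions made for illustration; the text only says that retraining is triggered when unknown events become numerous:

```python
def split_known(sequence, vectors):
    """Separate events that have a learned vector from those that do not."""
    known = [event for event in sequence if event in vectors]
    unknown = [event for event in sequence if event not in vectors]
    return known, unknown


def needs_retraining(sequences, vectors, max_unknown_ratio=0.2):
    """Flag retraining when the share of unknown events exceeds a threshold.

    max_unknown_ratio is a hypothetical tuning knob, not a value from the text.
    """
    total = sum(len(sequence) for sequence in sequences)
    unknown = sum(len(split_known(sequence, vectors)[1]) for sequence in sequences)
    return total > 0 and unknown / total > max_unknown_ratio


vectors = {"cpu_high": [1.0, 0.1], "log_error": [0.0, 1.0]}
online_sequence = ["cpu_high", "disk_full", "log_error", "cpu_high"]

known, unknown = split_known(online_sequence, vectors)
retrain = needs_retraining([online_sequence], vectors)
```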

204. Calculating the node vector of the micro service node corresponding to each alarm event based on the event vector.

For this embodiment, when calculating the node vector of the micro service node corresponding to each alarm event, different weights may be given to each alarm event according to a certain rule. Then, a weighted mean value of the event vector of each micro service node is obtained according to the event weight to be used as a vector representation of the micro service node, namely a node vector. In this embodiment, two rules are exemplarily proposed for calculating the weight of event alarms on the microservice node during a fault:

(1) Event Frequency: the frequency with which an alarm event occurs on a micro service node during the fault period. A high event frequency does not by itself justify a high weight, because many of the alarm events generated during a fault may be irrelevant to it; the importance of the event therefore also needs to be analyzed;

(2) Event Importance: based on an analysis of all historical fault cases, an event that occurs in only a few cases is generally considered more important. That is, an event that occurs on both normal micro service nodes and root cause micro service nodes is generally considered less important, whereas an event that occurs on root cause micro service nodes but not on normal micro service nodes is considered more important.

For an event $e_i$ during a fault and the micro service node $c_j$ corresponding to the event, the event frequency and event importance can be calculated using the following formulas:

$$EF_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}, \qquad EI_i = \log \frac{|C|}{|\{k \mid e_i \in c_k\}|}$$

wherein $n_{i,j}$ is the number of occurrences of event $e_i$ in the event sequence of micro service node $c_j$; $\sum_{k} n_{k,j}$ is the number of alarm events occurring in micro service node $c_j$; $|C|$ is the product of the number of historical fault cases and the number of micro service nodes in the system; and $|\{k \mid e_i \in c_k\}|$ is the number of micro service nodes in the historical fault cases that contain alarm event $e_i$.

Finally, the event frequency and the event importance are considered together to obtain the Event Weight of event $e_i$ relative to micro service node $c_j$:

$$EW_{i,j} = EF_{i,j} \times EI_i$$

With the event weight, the vector representation of micro service node $c_j$ can be obtained as the mean of its event vectors weighted by their event weights, wherein micro service node $c_j$ has $K$ events and the $k$-th event vector is denoted by $V_k$.

Correspondingly, for the present embodiment, the steps of the embodiment may specifically include: calculating the weight index of each alarm event, wherein the weight index at least comprises event frequency and event importance degree; calculating the event weight of each alarm event relative to the micro service node corresponding to each alarm event according to the weight index of each alarm event; and calculating a weighted mean value of each alarm event based on the event vector and the event weight, wherein the weighted mean value is used as a node vector of the micro service node corresponding to each alarm event.
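The weighting scheme can be sketched under a TF-IDF-style reading, which is an assumption: event frequency is taken as the event's share of the node's alarm events, event importance as the logarithm of |C| divided by the number of node-level sequences containing the event, and the node vector as the weight-normalized mean of the event vectors. The function names and sample data are hypothetical:

```python
import math
from collections import Counter


def event_frequency(sequence):
    """Share of each event among the alarm events of one node (assumed form)."""
    counts = Counter(sequence)
    return {event: n / len(sequence) for event, n in counts.items()}


def event_importance(history):
    """Log-ratio importance (assumed form).

    history holds one event sequence per (fault case, node) pair,
    so len(history) plays the role of |C|.
    """
    containing = Counter()
    for sequence in history:
        containing.update(set(sequence))
    return {event: math.log(len(history) / k) for event, k in containing.items()}


def node_vector(sequence, vectors, importance):
    """Mean of event vectors weighted by EW = EF * EI (normalization assumed)."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for event, ef in event_frequency(sequence).items():
        if event not in vectors or event not in importance:
            continue  # events without a learned vector are ignored
        weight = ef * importance[event]
        weight_sum += weight
        for d in range(dim):
            total[d] += weight * vectors[event][d]
    if weight_sum == 0.0:
        return total  # no informative event: fall back to the zero vector
    return [value / weight_sum for value in total]


history = [["a", "b"], ["a"], ["a", "c"], ["a"]]
vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
importance = event_importance(history)
vector = node_vector(["a", "a", "b"], vectors, importance)
```

In this example event `a` appears in every historical sequence, so its importance is zero and the node vector is determined entirely by the rarer event `b`, even though `a` is more frequent on the node.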

205. Modeling to generate a target homogeneous graph topological structure and a target heterogeneous graph topological structure of the target micro service, and acquiring the relations between micro service nodes in the target micro service.

For this embodiment, the micro service system service flowchart shown in fig. 7 and the micro service deployment diagram shown in fig. 8, both corresponding to the target micro service, may first be obtained. The target homogeneous graph topological structure shown in fig. 9 is then obtained by modeling the service flowchart, and the target heterogeneous graph topological structure shown in fig. 10 is obtained by modeling the deployment diagram.

206. Inputting the node vector of each micro service node and the target homogeneous graph topological structure into the trained first convolution neural network model to obtain the positioning result of the root cause micro service node.

In a specific application scenario, the first convolutional neural network model needs to be pre-trained before the steps of the present embodiment are performed. Correspondingly, the embodiment steps may specifically include: generating a sample node vector of each micro-service node in the sample micro-service according to the historical alarm event of the sample micro-service; and taking the sample node vector and the homogeneous graph topological structure of the sample micro service as input features, taking the root cause micro service node label of the sample micro service as a training label, and training the first convolution neural network model to obtain the trained first convolution neural network model. It should be noted that, in order to ensure accurate training of the first convolutional neural network model, data enhancement processing may be performed on the initial training data set first, and the first convolutional neural network model may be trained by further using the sample microservices in the training data set after the data enhancement processing. For the implementation process of performing data enhancement processing on the initial training data set, reference may be made to the related description in step 203 of the embodiment, which is not described herein again.

As shown in fig. 11, when the trained first convolutional neural network model is used to determine the positioning result of the root cause micro service node, the target homogeneous graph topological structure and the node vectors obtained in the previous steps are used as input, and each node is embedded by two graph convolution layers (each graph convolution layer is followed by a ReLU activation layer). The mean of the node vectors is used as the vector representation of the graph. The graph vector is fed into a linear layer, whose output dimension corresponds to the positioning result of the root cause micro service node. In the first convolutional neural network model, the convolution layers can adopt the GCN graph convolution operator; for node $i$, the convolution operation at the $l$-th layer of the graph convolution neural network is:

$$h_i^{(l+1)} = \sigma\left(b^{(l)} + \sum_{j \in N(i)} \frac{1}{c_{ij}} h_j^{(l)} W^{(l)}\right)$$

wherein $h_i^{(l)}$ represents the state of node $i$ at layer $l$; $N(i)$ represents the set of neighbor nodes of node $i$; $c_{ij} = \sqrt{|N(i)|}\sqrt{|N(j)|}$ is the normalization constant; $W^{(l)}$ is the trainable weight, initialized using Xavier initialization; $\sigma$ is the activation function; and $b^{(l)}$ is the bias term.
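A forward pass of this architecture can be sketched in NumPy: two graph convolution layers each followed by ReLU, a mean readout, and a linear layer with one output per candidate node. The symmetric degree normalization, the undirected three-node graph, the layer sizes, and all function names are assumptions for illustration; the weights are untrained, so the scores only demonstrate shapes and data flow:

```python
import numpy as np


def normalized_adjacency(adj):
    """Scale each edge (i, j) by 1 / (sqrt(deg_i) * sqrt(deg_j))."""
    deg = adj.sum(axis=1)
    inv_sqrt = np.zeros_like(deg, dtype=float)
    mask = deg > 0
    inv_sqrt[mask] = 1.0 / np.sqrt(deg[mask])
    return adj * inv_sqrt[:, None] * inv_sqrt[None, :]


def gcn_layer(a_norm, h, w, b):
    """One graph convolution followed by ReLU."""
    return np.maximum(a_norm @ h @ w + b, 0.0)


def xavier(rng, fan_in, fan_out):
    """Xavier (Glorot) uniform initialization."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))


def root_cause_scores(adj, x, params):
    a_norm = normalized_adjacency(adj)
    h = gcn_layer(a_norm, x, params["w1"], params["b1"])
    h = gcn_layer(a_norm, h, params["w2"], params["b2"])
    graph_vector = h.mean(axis=0)  # mean of node embeddings
    return graph_vector @ params["w_out"] + params["b_out"]


rng = np.random.default_rng(0)
# Hypothetical 3-node call topology: 0 - 1 - 2 (treated as undirected).
adj = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
x = rng.normal(size=(3, 4))  # one 4-dimensional node vector per node
params = {
    "w1": xavier(rng, 4, 8), "b1": np.zeros(8),
    "w2": xavier(rng, 8, 8), "b2": np.zeros(8),
    "w_out": xavier(rng, 8, 3), "b_out": np.zeros(3),
}
scores = root_cause_scores(adj, x, params)
ranking = np.argsort(-scores)  # candidate nodes, most suspicious first
```

Because the readout averages over nodes before the linear layer, the output size is fixed by the number of candidate nodes in the system rather than by the per-node embeddings.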

207. Inputting the node vectors of all the micro service nodes, the target heterogeneous graph topological structure and the relations among the micro service nodes in the target micro service into the trained second convolution neural network model to obtain the prediction result of the micro service fault type.

Among them, the microservice fault types include but are not limited to:

(1) CPU exception: relating to a CPU system index;

(2) Memory exception: relating to memory system index;

(3) Memory release exception: the abnormal injection represents the change of the use mode of the system memory within a period of time;

(4) Call chain exception: exception injection represents a loss of call chain integrity;

(5) Log exception: the abnormal injection represents a change of log pattern or a sudden change in the number of log errors;

(6) Denial of access exception: the anomaly injection is expressed in the average success rate of the traffic flow.

In a specific application scenario, the second convolutional neural network model needs to be pre-trained before the steps of this embodiment are performed. Correspondingly, the embodiment steps may specifically include: generating a sample node vector of each micro-service node in the sample micro-service according to the historical alarm event of the sample micro-service; and taking the relation among the sample node vectors, the heterogeneous graph topological structure of the sample micro service and the micro service nodes in the sample micro service as input characteristics, taking the micro service fault type label of the sample micro service as a training label, and training a second convolutional neural network model to obtain a trained second convolutional neural network model. It should be noted that, in order to ensure accurate training of the second convolutional neural network model, data enhancement processing may be performed on the initial training data set first, and the second convolutional neural network model may be trained by further using the sample microservice in the training data set after the data enhancement processing. For the implementation process of performing data enhancement processing on the initial training data set, reference may be made to the related description in step 203 of the embodiment, which is not described herein again.

The structure of the second convolutional neural network model (which can also be understood as a fault diagnosis model based on a heterogeneous map) is shown in fig. 12. The input of the model comprises three parts: target heterogeneous graph topological structure, each node vector and edge type. The topological structure of the target heterogeneous graph and the input of each node vector part are the same as those of the first convolution neural network model, the types of edges need to be obtained through different relations among the micro-service nodes, and the output dimensionality is a prediction result of the micro-service fault type. The following takes the GAIA data set as an example to show the type construction manner of the edge.

In the GAIA dataset, there are three relationships common between microservice nodes:

(1) Calling. In micro service communication, when a fault occurs upstream of a call, error information (such as erroneous parameters) may be passed downstream, so that the fault propagates. Thus, in the topology, the invocation of node B by node A is characterized by the edge (A, 'call', B);

(2) Being called. In micro service communication, when a callee fails, it may return an erroneous result to the caller or return no result at all, causing the fault to propagate. Thus, in the topology, the fact that node B is invoked by node A is characterized by the edge (B, 'belled', A);

(3) Container sharing. To describe the relation of different micro service nodes sharing containers, virtual machines, hosts and the like, two edges (A, 'share', B) and (B, 'share', A) are used in the topology to characterize the container-sharing relation between node A and node B.

The three types of edges are relatively universal in the micro service system, and for other micro service systems, other types of edges can be defined according to specific conditions and used as the input of the second convolutional neural network model.
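The construction of typed edges from a call list and a deployment map might look like the sketch below. The input formats, the function name, and the reverse-edge label `called` are placeholders chosen here for illustration:

```python
def build_typed_edges(calls, deployment):
    """Build (source, relation, target) triples for the heterogeneous graph.

    calls      : iterable of (caller, callee) pairs
    deployment : dict mapping node name -> container, VM, or host identifier
    """
    edges = set()
    for caller, callee in calls:
        edges.add((caller, "call", callee))    # caller invokes callee
        edges.add((callee, "called", caller))  # callee is invoked by caller
    nodes = sorted(deployment)
    for index, a in enumerate(nodes):
        for b in nodes[index + 1:]:
            if deployment[a] == deployment[b]:
                # Sharing is symmetric, so one edge is added per direction.
                edges.add((a, "share", b))
                edges.add((b, "share", a))
    return sorted(edges)


calls = [("A", "B")]
deployment = {"A": "host1", "B": "host2", "C": "host1"}
typed_edges = build_typed_edges(calls, deployment)
```

Representing every relation as a directed triple keeps the three edge types uniform, so additional system-specific relations can be added without changing the model input format.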

In the second convolutional neural network model, the convolution layers adopt the R-GCN graph convolution operator; for node $i$, the convolution operation at the $l$-th layer of the graph convolution neural network is:

$$h_i^{(l+1)} = \sigma\left(\sum_{r} \sum_{j \in N_i^r} \frac{1}{c_{i,r}} W_r^{(l)} h_j^{(l)} + W_0^{(l)} h_i^{(l)}\right)$$

wherein $N_i^r$ represents the set of neighbor nodes of node $i$ under relation $r$; $c_{i,r}$ is a normalization constant (which can be taken as $|N_i^r|$); $\sigma$ is the activation function; $W_0^{(l)}$ is the weight of the self-loop; and $W_r^{(l)}$ is the weight for relation $r$, to which regularization is applied in the R-GCN layer.
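A single relational graph convolution layer can be sketched in NumPy as follows. It assumes that messages flow from the source to the target of each typed edge, that the normalization constant is the number of incoming neighbors per relation, and that ReLU is the activation; weight regularization is omitted, and the edge labels and function name are hypothetical:

```python
import numpy as np


def rgcn_layer(h, edges, rel_weights, self_weight):
    """One R-GCN layer with per-relation mean aggregation and a self-loop.

    h           : (num_nodes, in_dim) node states
    edges       : list of (src, relation, dst) with integer node indices;
                  src is treated as a neighbor of dst under that relation
    rel_weights : dict relation -> (in_dim, out_dim) weight matrix
    self_weight : (in_dim, out_dim) self-loop weight matrix
    """
    num_nodes = h.shape[0]
    out = h @ self_weight  # self-loop term
    for relation, weight in rel_weights.items():
        messages = np.zeros((num_nodes, weight.shape[1]))
        counts = np.zeros(num_nodes)
        for src, rel, dst in edges:
            if rel == relation:
                messages[dst] += h[src] @ weight
                counts[dst] += 1
        # Normalize by the neighbor count under this relation (at least 1).
        out += messages / np.maximum(counts, 1.0)[:, None]
    return np.maximum(out, 0.0)  # ReLU


h = np.eye(2)  # two nodes with one-hot states
edges = [(0, "call", 1), (1, "called", 0)]
rel_weights = {"call": np.eye(2), "called": np.eye(2)}
out = rgcn_layer(h, edges, rel_weights, self_weight=np.eye(2))
```

With identity weights each node's output is its own state plus the state of its single neighbor, which makes the aggregation easy to check by hand.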

The micro-service fault diagnosis method based on the graph convolution neural network can acquire alarm events of a target micro-service to be diagnosed in a preset time period before and after a fault based on multi-mode data, and further determine an alarm event sequence of each micro-service node in the target micro-service according to the alarm events; then generating event vectors of all alarm events in the alarm event sequence, and calculating node vectors of micro service nodes corresponding to all alarm events based on the event vectors; and finally, inputting the node vector of each micro service node into the trained fault diagnosis model to obtain a fault diagnosis result, wherein the fault diagnosis result at least comprises a positioning result of the root micro service node and a prediction result of the micro service fault type. According to the technical scheme, three modes of micro-service alarm events can be combined, a topological structure is used, accurate positioning of the root cause micro-service nodes and accurate prediction of micro-service fault types are achieved by using the graph convolution neural network, so that time spent by operation and maintenance personnel on fault recovery is reduced, the process from fault occurrence to fault recovery of the micro-service system is accelerated, and loss is reduced to the maximum extent.

Further, as a specific implementation of fig. 1, an embodiment of the present invention provides a microservice fault diagnosis apparatus based on a graph convolution neural network, as shown in fig. 13, the apparatus includes: an acquisition module 31, a determination module 32, a calculation module 33, and an input module 34.

The acquisition module 31 is configured to acquire an alarm event of the target microservice to be diagnosed within a preset time period before and after a fault, where the alarm event is generated based on multi-modal data;

the determining module 32 is configured to determine an alarm event sequence of each microservice node in the target microservice according to the alarm event;

the calculating module 33 is configured to generate an event vector of each alarm event in the alarm event sequence, and calculate a node vector of the micro service node corresponding to each alarm event based on the event vector;

the input module 34 may be configured to input the node vector of each micro service node into the trained fault diagnosis model, and obtain a fault diagnosis result, where the fault diagnosis result at least includes a positioning result of the root micro service node and a prediction result of the micro service fault type.

In a specific application scenario, the alarm events at least include an index alarm event, a log alarm event, and a call chain alarm event, and when determining an alarm event sequence of each micro service node in the target micro service according to the alarm event, the determining module 32 is specifically configured to attribute the index alarm event and the log alarm event to the micro service node where the event occurs, and to attribute the call chain alarm event to both the caller micro service node and the callee micro service node; and uniformly arranging the alarm events on each micro service node according to the sequence of the occurrence of the events to obtain the alarm event sequence of each micro service node.

In a specific application scenario, when generating an event vector of each alarm event in an alarm event sequence, the calculation module 33 may be specifically configured to input the alarm event sequence into a trained event vector generator, and generate an event vector of each alarm event in the alarm event sequence, where the event vector generator is obtained by training with a training data set after data enhancement processing.

In a specific application scenario, to implement data enhancement processing on an initial training data set and training an event vector generator by using the training data set after the data enhancement processing, as shown in fig. 14, the apparatus further includes: first training module 35:

the first training module 35 may be specifically configured to: generating a sample node vector of each micro-service node in the sample micro-service according to the historical alarm event of the sample micro-service in the initial training data set; exchanging at least one sample node vector of the sample microservices in the initial training data set according to root cause microservices node labels or microservices fault type labels of the sample microservices to obtain a training data set after data enhancement processing; and training the event vector generator by using the training data set after the data enhancement processing so as to obtain the trained event vector generator.

In a specific application scenario, when calculating a node vector of a micro service node corresponding to each alarm event based on an event vector, the calculating module 33 may be specifically configured to calculate a weight index of each alarm event, where the weight index at least includes an event frequency and an event importance degree; calculating the event weight of each alarm event relative to the micro service node corresponding to each alarm event according to the weight index of each alarm event; and calculating a weighted mean value of each alarm event based on the event vector and the event weight to serve as a node vector of the micro service node corresponding to each alarm event.

In a specific application scenario, the fault diagnosis model includes a first convolutional neural network model based on a homogeneous graph and a second convolutional neural network model based on a heterogeneous graph, and before the node vector of each micro service node is input into the fault diagnosis model after training and a fault diagnosis result is obtained, to implement pre-training of the fault diagnosis model, as shown in fig. 14, the apparatus further includes: a second training module 36;

the second training module 36 may be specifically configured to: generating a sample node vector of each micro-service node in the sample micro-service according to the historical alarm event of the sample micro-service; taking the sample node vector and the homogeneous graph topological structure of the sample micro service as input features, taking a root cause micro service node label of the sample micro service as a training label, and training a first convolution neural network model to obtain a trained first convolution neural network model; and taking the relation among the sample node vectors, the heterogeneous graph topological structure of the sample micro service and the micro service nodes in the sample micro service as input characteristics, taking the micro service fault type label of the sample micro service as a training label, and training a second convolutional neural network model to obtain a trained second convolutional neural network model.

In a specific application scenario, as shown in fig. 14, the apparatus further includes: a generation module 37;

the generating module 37 is configured to model and generate a target homogeneous graph topology structure and a target heterogeneous graph topology structure of the target microservice, and obtain a relationship between microservice nodes in the target microservice;

in a specific application scenario, the input module 34 is specifically configured to input the node vector of each micro service node and the target homogeneous graph topology structure into the trained first convolutional neural network model, so as to obtain a positioning result of the root cause micro service node; and inputting the node vectors of all the micro service nodes, the target heterogeneous graph topological structure and the relation among the micro service nodes in the target micro service into the trained second convolutional neural network model to obtain a prediction result of the micro service fault type.

It should be noted that other corresponding descriptions of the functional modules involved in the micro-service fault diagnosis device based on the graph convolution neural network provided in the embodiment of the present invention may refer to the corresponding description of the method shown in fig. 1, and are not described herein again.

Based on the method shown in fig. 1, correspondingly, an embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the following steps: acquiring alarm events of target micro-services to be diagnosed in a preset time period before and after a fault, wherein the alarm events are generated based on multi-mode data; determining an alarm event sequence of each micro-service node in the target micro-service according to the alarm event; generating an event vector of each alarm event in the alarm event sequence, and calculating a node vector of a micro service node corresponding to each alarm event based on the event vector; and inputting the node vector of each micro service node into the trained fault diagnosis model to obtain a fault diagnosis result, wherein the fault diagnosis result at least comprises a positioning result of the root micro service node and a prediction result of the micro service fault type.

Based on the above embodiments of the method shown in fig. 1 and fig. 2 and the apparatus shown in fig. 13 and fig. 14, an embodiment of the present invention further provides an entity structure diagram of a computer device, as shown in fig. 15, the computer device includes: a processor 41, a memory 42, and a computer program stored on the memory 42 and executable on the processor, wherein the memory 42 and the processor 41 are arranged on a bus 43 such that the following steps are performed when the processor 41 executes the program: acquiring alarm events of target micro-services to be diagnosed in a preset time period before and after a fault, wherein the alarm events are generated based on multi-mode data; determining an alarm event sequence of each micro-service node in the target micro-service according to the alarm event; generating event vectors of all alarm events in the alarm event sequence, and calculating node vectors of micro service nodes corresponding to all alarm events based on the event vectors; and inputting the node vector of each micro service node into the trained fault diagnosis model to obtain a fault diagnosis result, wherein the fault diagnosis result at least comprises a positioning result of the root micro service node and a prediction result of the micro service fault type.

The embodiment of the invention can combine the micro-service alarm events of three modes and use a topological structure, and utilizes the graph convolution neural network to realize the accurate positioning of the root cause micro-service node and the accurate prediction of the micro-service fault type, thereby reducing the time spent by operation and maintenance personnel on recovering the fault, accelerating the process from the fault occurrence to the fault recovery of the micro-service system and further reducing the loss to the maximum extent.

It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and alternatively, they may be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, and in some cases, the steps shown or described may be performed in an order different than that described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.

The present invention has been described in terms of the preferred embodiment, and it is not intended to be limited to the embodiment. Any modification, equivalent replacement, or improvement made without departing from the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims

1. A microservice fault diagnosis method based on a graph convolution neural network is characterized by comprising the following steps:

2. The method of claim 1, wherein the alarm events at least include indicator alarm events, log alarm events, and call chain alarm events, and wherein determining the sequence of alarm events for each microservice node in the target microservice based on the alarm events comprises:

attributing the index alarm event and the log alarm event to a micro service node where the event occurs, and attributing the calling chain alarm event to a calling micro service node and a called micro service node at the same time;

and uniformly arranging the alarm events on each micro service node according to the sequence of the occurrence of the events to obtain an alarm event sequence of each micro service node.

3. The method of claim 1, wherein generating an event vector for each alarm event in the sequence of alarm events comprises:

inputting the alarm event sequence into a trained event vector generator, and generating an event vector of each alarm event in the alarm event sequence, wherein the event vector generator is obtained by training with a training data set after data enhancement processing.

4. The method of claim 3, wherein prior to inputting the sequence of alarm events into a trained event vector generator and generating an event vector for each alarm event in the sequence of alarm events, the method further comprises:

generating a sample node vector of each micro-service node in the sample micro-service according to a historical alarm event of the sample micro-service in an initial training data set;

exchanging at least one sample node vector of the sample micro service in the initial training data set according to the root cause micro service node label or the micro service fault type label of the sample micro service to obtain a training data set after data enhancement processing;

and training an event vector generator by using the training data set after the data enhancement processing so as to obtain the trained event vector generator.

5. The method of claim 1, wherein calculating a node vector of the micro service node corresponding to each alarm event based on the event vector comprises:

calculating a weight index of each alarm event, wherein the weight index at least comprises an event frequency and an event importance degree;

according to the weight indexes of all the alarm events, calculating the event weight of each alarm event relative to the micro service node corresponding to each alarm event;

and calculating, for each micro service node, a weighted mean of the event vectors of the alarm events on that node based on the event weights, to serve as the node vector of that micro service node.
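The weighting of claim 5 may be illustrated with the following minimal sketch. The claim names event frequency and event importance degree as weight indexes but does not state how they are combined; the sketch assumes a TF-IDF-style product, where frequency is the event's share of the node's sequence and importance is an inverse-document-frequency score over nodes. The function `node_vector` and the example data are hypothetical.

```python
import math
from collections import Counter


def node_vector(node_events, all_node_events, event_vectors):
    """Weighted mean of event vectors for one node.

    Assumption: weight = frequency * importance, normalized to sum to 1.
    Importance is an IDF-style score that favors events seen on few nodes.
    """
    dim = len(next(iter(event_vectors.values())))
    if not node_events:
        return [0.0] * dim, {}
    counts = Counter(node_events)
    n_nodes = len(all_node_events)
    raw = {}
    for ev, c in counts.items():
        frequency = c / len(node_events)
        nodes_with_ev = sum(1 for seq in all_node_events.values() if ev in seq)
        importance = math.log((1 + n_nodes) / (1 + nodes_with_ev)) + 1.0
        raw[ev] = frequency * importance
    total = sum(raw.values())
    weights = {ev: w / total for ev, w in raw.items()}
    vec = [0.0] * dim
    for ev, w in weights.items():
        for i, x in enumerate(event_vectors[ev]):
            vec[i] += w * x
    return vec, weights


# Event names per node (already ordered) and hypothetical 2-d event vectors.
sequences = {"cart": ["cpu_high", "cpu_high", "oom_error"], "frontend": ["cpu_high"]}
event_vectors = {"cpu_high": [1.0, 0.0], "oom_error": [0.0, 1.0]}
vec, weights = node_vector(sequences["cart"], sequences, event_vectors)
print(weights, vec)
```

Under these assumptions, `oom_error` occurs on only one node and therefore receives a weight above its raw share of one third of the `cart` sequence.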

6. The method of claim 1, wherein the fault diagnosis model comprises a first convolutional neural network model based on a homogeneous graph and a second convolutional neural network model based on a heterogeneous graph, and before inputting the node vector of each micro service node into the trained fault diagnosis model and obtaining a fault diagnosis result, the method further comprises:

generating a sample node vector of each micro-service node in the sample micro-service according to a historical alarm event of the sample micro-service;

taking the sample node vectors and the homogeneous graph topological structure of the sample micro service as input features, taking the root cause micro service node label of the sample micro service as a training label, and training the first convolutional neural network model to obtain a trained first convolutional neural network model; and

taking the sample node vectors, the heterogeneous graph topological structure of the sample micro service, and the relations among the micro service nodes in the sample micro service as input features, taking the micro service fault type label of the sample micro service as a training label, and training the second convolutional neural network model to obtain a trained second convolutional neural network model.

7. The method of claim 2, wherein before inputting the node vector of each micro service node into the trained fault diagnosis model and obtaining the fault diagnosis result, the method further comprises:

modeling to generate a target homogeneous graph topological structure and a target heterogeneous graph topological structure of the target micro service, and acquiring the relationship between micro service nodes in the target micro service;

wherein inputting the node vector of each micro service node into the trained fault diagnosis model to obtain the fault diagnosis result comprises:

inputting the node vector of each micro service node and the target homogeneous graph topological structure into the trained first convolutional neural network model to obtain the positioning result of the root cause micro service node;

and inputting the node vector of each micro service node, the target heterogeneous graph topological structure and the relation among the micro service nodes in the target micro service into the trained second convolutional neural network model to obtain a prediction result of the micro service fault type.
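The root cause positioning step may be illustrated by a forward pass through a generic two-layer graph convolution over the homogeneous graph. This is a sketch under stated assumptions, not the claimed model: it uses the common symmetric normalization D^-1/2 (A + I) D^-1/2, treats the graph as undirected, and uses placeholder weight matrices `w1` and `w2` that have not been trained. All function names and example values are hypothetical.

```python
import math


def normalized_adjacency(adj):
    """Return D^-1/2 (A + I) D^-1/2 for a symmetric 0/1 adjacency matrix."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_layer(a_norm, features, weight, relu=True):
    """One graph convolution: act(A_norm @ H @ W)."""
    out = matmul(matmul(a_norm, features), weight)
    if relu:
        out = [[max(0.0, v) for v in row] for row in out]
    return out


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Homogeneous graph over 3 nodes: frontend - cart - db (assumed undirected).
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
node_vectors = [[0.1, 0.0], [0.9, 0.8], [0.2, 0.1]]  # one row per node
w1 = [[1.0, -0.5], [0.5, 1.0]]  # placeholder weights, not trained
w2 = [[1.0], [1.0]]             # maps hidden features to one score per node

a_norm = normalized_adjacency(adj)
hidden = gcn_layer(a_norm, node_vectors, w1)
scores = [row[0] for row in gcn_layer(a_norm, hidden, w2, relu=False)]
probs = softmax(scores)  # probability of each node being the root cause
ranking = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
print(ranking)
```

A softmax over the per-node scores yields a ranking of candidate root cause nodes. A trained model would learn `w1` and `w2` from the root cause micro service node labels of the sample micro services.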

8. A microservice fault diagnosis device based on a graph convolution neural network, characterized by comprising:

9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.

10. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the computer program, when executed by the processor, implements the steps of the method of any one of claims 1 to 7.