CN114296975A

CN114296975A - Distributed system call chain and log fusion anomaly detection method

Info

Publication number: CN114296975A
Application number: CN202111583157.8A
Authority: CN
Inventors: 彭鑫; 张晨曦
Original assignee: Fudan University
Current assignee: Fudan University
Priority date: 2021-12-22
Filing date: 2021-12-22
Publication date: 2022-04-08

Abstract

The invention belongs to the technical field of software engineering and cloud computing, and in particular relates to a distributed system call chain and log fusion anomaly detection method. The method is based on the call chain and log data produced while a distributed system runs. A call chain event relation graph is constructed from the call chain and log data; a graph neural network and a one-class (single-classification) deep learning method are used to learn the pattern of call chain event relation graphs during normal operation; during online use, newly generated call chain event relation graphs are checked in real time and call chains exhibiting abnormal behavior are identified. The method comprises the following steps: log event parsing, call chain event parsing, event vectorization, call chain event relation graph construction, graph neural network model training, and online anomaly detection. The invention helps operation and maintenance personnel and developers quickly discover system anomalies, generates corresponding alarm information, speeds up fault localization and the resolution of online problems, and reduces labor cost.

Description

Distributed system call chain and log fusion anomaly detection method

Technical Field

The invention belongs to the technical field of software engineering and cloud computing, and particularly relates to a distributed system call chain and log anomaly detection method.

Background

A distributed system decomposes an application into multiple independent modules; each module has its own process and runtime environment, and the processes communicate with one another over the network. The microservice architecture, which evolved from distributed systems, has become an important component of cloud-native technology. Thanks to fine-grained functional partitioning and a distributed runtime environment, microservices can be developed independently, deployed independently, and scaled elastically, and most enterprises now adopt a distributed or microservice architecture to implement their applications.

Anomaly detection is an important part of runtime monitoring. Fast and accurate anomaly detection helps a system surface problems quickly and avoids the serious consequences of fault propagation. When a monolithic system behaves abnormally, developers can detect the anomaly through logs or metrics. A distributed system, however, involves cross-process interaction, so traditional log anomaly detection techniques perform poorly on it. With the advent of distributed tracing, operation and maintenance personnel and developers can observe the cross-process interaction patterns of a distributed system, but distributed tracing systems focus on interactions among processes, and logs and distributed traces are difficult to combine effectively. As a result, anomaly detection in distributed systems remains extremely difficult.

Disclosure of Invention

The invention aims to provide a distributed system call chain and log fusion anomaly detection method based on a graph neural network, which can quickly detect the abnormal operation behavior of a distributed system.

The method uses the call chain and log data produced while a distributed system runs to construct a call chain event relation graph that describes the relations between system calls and run logs in the distributed system. Historical data is collected to train a one-class (single-classification) anomaly detection model based on a graph neural network; once the model is deployed to the online system, abnormal call chains can be detected in real time so that system problems are found quickly.

The invention mainly comprises six parts: log event parsing, call chain event parsing, event vectorization, call chain event relation graph construction, graph neural network model training, and online anomaly detection. The specific steps are as follows:

(1) Log event parsing. A log event is a system event represented by a log statement printed while the program runs. A log statement usually consists of a fixed part (the log template) and a variable part. This step parses the raw log data and uses log templates to represent the different log events. It comprises the following substeps:

1) Collect the logs produced while the system runs using a distributed log collection tool.

2) Parse the log data with a log template parsing algorithm (Drain) and take the log template corresponding to each log as its event description.

3) Extract the traceID and spanID corresponding to each log and associate them with that log.
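The three substeps above can be sketched as follows. This is a minimal illustration, not the patented implementation: the raw log line format, the regular expression, and the helper names `to_template` and `parse_logs` are all assumptions, and the digit-masking rule is only a crude stand-in for the Drain algorithm named in the text.

```python
import re
from collections import defaultdict

# Hypothetical raw log format: "<timestamp> [traceID=<id> spanID=<id>] <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\d+) \[traceID=(?P<trace>\w+) spanID=(?P<span>\w+)\] (?P<msg>.*)$"
)

def to_template(message):
    """Crude stand-in for Drain: mask every token that contains a digit with <*>."""
    tokens = message.split()
    return " ".join("<*>" if any(ch.isdigit() for ch in tok) else tok for tok in tokens)

def parse_logs(lines):
    """Group log events by (traceID, spanID); each event keeps its time and template."""
    events = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip lines that carry no trace context
        key = (match["trace"], match["span"])
        events[key].append({"time": int(match["ts"]), "template": to_template(match["msg"])})
    return dict(events)

raw = [
    "1001 [traceID=t1 spanID=s1] Order 42 created for user 7",
    "1005 [traceID=t1 spanID=s1] Order 43 created for user 9",
    "1003 [traceID=t1 spanID=s2] Payment service unavailable",
]
parsed = parse_logs(raw)
```

In this sketch the two "Order ... created" lines collapse to the same template, which is the property the event description relies on: different variable values, same log event.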

(2) Call chain event parsing. A call chain event is an event generated when the distributed system makes a cross-process call, such as a client sending a synchronous call request, a server receiving a synchronous call request, a producer producing an asynchronous call message, or a consumer consuming an asynchronous call message. This step parses the raw call chain data and divides it into different types of call chain events. It comprises the following substeps:

1) Parse each span of the Client/Server type into a Request event and a Response event, using "span type (Client/Server), event (Request/Response), span name" as the event description. This yields four event types: Client Request, Server Request, Client Response, and Server Response. The occurrence time of each event is also recorded: the occurrence time of the Request event is the start time of the span, and the occurrence time of the Response event is the end time of the span.

2) Parse the spans of the Producer/Consumer type into Producer events and Consumer events, using "span type (Producer/Consumer), span name" as the event description. This yields two event types: Producer and Consumer. The occurrence time of each event is also recorded: the occurrence time of the Producer event is the start time of the Producer-type span, and the occurrence time of the Consumer event is the start time of the Consumer-type span.
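A minimal sketch of the span-to-event mapping described above. The span dictionary layout (`span_id`, `kind`, `name`, `start`, `end`) and the function name `parse_span_events` are assumptions made for illustration; a real tracing backend exposes spans in its own schema.

```python
def parse_span_events(span):
    """Split one span into call chain events.

    `span` is a hypothetical dict with keys span_id, kind, name, start, end.
    """
    sid, kind, name = span["span_id"], span["kind"], span["name"]
    if kind in ("Client", "Server"):
        # A synchronous span yields a Request event at its start
        # and a Response event at its end.
        return [
            {"span_id": sid, "type": kind + "Request",
             "description": f"{kind} Request {name}", "time": span["start"]},
            {"span_id": sid, "type": kind + "Response",
             "description": f"{kind} Response {name}", "time": span["end"]},
        ]
    if kind in ("Producer", "Consumer"):
        # An asynchronous span yields a single event at its start time.
        return [{"span_id": sid, "type": kind,
                 "description": f"{kind} {name}", "time": span["start"]}]
    raise ValueError(f"unsupported span kind: {kind}")

client_events = parse_span_events(
    {"span_id": "s1", "kind": "Client", "name": "getOrder", "start": 100, "end": 180}
)
consumer_events = parse_span_events(
    {"span_id": "s2", "kind": "Consumer", "name": "orderTopic", "start": 200, "end": 230}
)
```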

(3) Event vectorization. Event vectorization maps log events and call chain events into a vector space and represents them as vectors, so that a deep learning model can process the events and the vectors reflect their semantic information. It comprises the following substeps:

1) Event description preprocessing. Remove stop words and non-character symbols from the event description and split compound words.

2) Word embedding. Map each word in the event description into a common vector space using a word embedding model and represent it as a vector.

3) Sentence embedding. Compute the vector for each event description from all of the word vectors in that description: the word vectors are combined with TF-IDF weights, so that words that occur less frequently receive higher weight, yielding a sentence vector.
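The preprocessing and TF-IDF weighted combination can be sketched as below. Several pieces are assumptions: the stop-word list is a tiny illustrative subset, the word embedding model is replaced by a hypothetical lookup table `word_vectors`, and the smoothed IDF formula is one common variant, since the text does not specify which TF-IDF form is used.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "for", "of", "to", "is"}  # illustrative subset only

def preprocess(description):
    """Drop non-letter symbols, split camelCase compounds, remove stop words."""
    text = re.sub(r"[^A-Za-z]+", " ", description)
    text = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def sentence_vectors(descriptions, word_vectors):
    """Combine word vectors into one vector per description using TF-IDF weights."""
    docs = [preprocess(d) for d in descriptions]
    n_docs = len(docs)
    doc_freq = Counter(w for doc in docs for w in set(doc))
    dim = len(next(iter(word_vectors.values())))
    vectors = []
    for doc in docs:
        vec = [0.0] * dim
        for word, count in Counter(doc).items():
            if word not in word_vectors:
                continue  # out-of-vocabulary words are skipped in this sketch
            tf = count / len(doc)
            idf = math.log((1 + n_docs) / (1 + doc_freq[word])) + 1.0  # assumed smoothing
            for i, value in enumerate(word_vectors[word]):
                vec[i] += tf * idf * value
        vectors.append(vec)
    return vectors

# Hypothetical 2-dimensional embeddings standing in for a trained model.
toy_vectors = {"order": [1.0, 0.0], "created": [0.0, 1.0], "failed": [0.0, 1.0]}
tokens = preprocess("Client Request getOrderInfo for user <*>")
vecs = sentence_vectors(["order created", "order failed"], toy_vectors)
```

Because "order" appears in both descriptions while "created" appears in only one, the rarer word contributes more to the first sentence vector, which is the effect the TF-IDF weighting is meant to have.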

(4) Call chain event relation graph construction. The call chain event relation graph associates the system calls and log printing behavior of the distributed system at runtime in order to describe the system's runtime behavior. The raw data comprises the runtime call chain data and runtime log data of each distributed application or service. The graph contains two types of nodes, log event nodes and call chain event nodes; the relations between nodes, which correspond to edges in the graph, comprise the sequential relation, the synchronous call relation, the synchronous response relation, and the asynchronous call relation. A typical call chain event relation graph is shown in FIG. 1. Construction of the graph comprises the following substeps:

1) Link the log events. For each span in a call chain, retrieve all log events belonging to that span, sort them by timestamp, and add a sequential-relation edge from each log event to the next.

2) Insert the span events. For each span in a call chain, retrieve all call chain events belonging to that span. Insert each call chain event into the log event sequence according to its occurrence time and add sequential-relation edges between it and its adjacent events.

3) Connect the spans. For all spans in a call chain, connect the call chain events according to the parent-child relations of the spans. For a Client/Server span: for its ServerRequest event, add a synchronous-call edge from the ClientRequest event of the parent span to that event; for its ServerResponse event, add a synchronous-response edge from that event to the ClientResponse event of the parent span. For a Producer/Consumer span: for its Consumer event, add an asynchronous-call edge from the Producer event of the parent span to that event.
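The three construction substeps can be sketched as follows. The input layout (span dictionaries with `span_id`, `parent_id`, `kind`, `start`, `end`, and a `logs_by_span` mapping) and the edge labels are assumptions for illustration. For brevity the sketch merges substeps 1) and 2) by sorting the log events and call chain events of a span together, which yields the same sequential edges as linking the logs first and inserting the span events afterwards.

```python
def build_event_graph(spans, logs_by_span):
    """Return (nodes, edges); edges are (source index, target index, relation)."""
    nodes, edges = [], []
    call_nodes = {}  # span_id -> {call chain event type: node index}
    for span in spans:
        sid, kind = span["span_id"], span["kind"]
        events = [("Log", e["time"]) for e in logs_by_span.get(sid, [])]
        if kind in ("Client", "Server"):
            events += [(kind + "Request", span["start"]), (kind + "Response", span["end"])]
        else:  # Producer or Consumer
            events.append((kind, span["start"]))
        events.sort(key=lambda ev: ev[1])  # order all events of the span by time
        call_nodes[sid] = {}
        for i, (etype, time) in enumerate(events):
            nodes.append({"span_id": sid, "type": etype, "time": time})
            idx = len(nodes) - 1
            if etype != "Log":
                call_nodes[sid][etype] = idx
            if i > 0:
                edges.append((idx - 1, idx, "sequence"))
    # Connect each span to its parent span.
    for span in spans:
        parent = call_nodes.get(span.get("parent_id"), {})
        own = call_nodes[span["span_id"]]
        if span["kind"] == "Server" and "ClientRequest" in parent:
            edges.append((parent["ClientRequest"], own["ServerRequest"], "sync_call"))
            edges.append((own["ServerResponse"], parent["ClientResponse"], "sync_response"))
        elif span["kind"] == "Consumer" and "Producer" in parent:
            edges.append((parent["Producer"], own["Consumer"], "async_call"))
    return nodes, edges

spans = [
    {"span_id": "c", "parent_id": None, "kind": "Client", "start": 1, "end": 10},
    {"span_id": "s", "parent_id": "c", "kind": "Server", "start": 2, "end": 9},
]
logs = {"s": [{"time": 5, "template": "Order <*> created"}]}
nodes, edges = build_event_graph(spans, logs)
```

In this toy trace the client span contributes two nodes, the server span three (its request, one log event, its response), and the two cross-span edges point in opposite directions: the call goes from client to server, the response from server back to client.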

(5) Training the graph-neural-network-based anomaly detection model. Call chain event relation graph data generated during normal operation of the system is fed into a graph neural network to train a one-class (single-classification) anomaly detection model. The method uses a gated graph neural network (GGNN) and deep support vector data description (DeepSVDD) to train the model, as follows: the training data is input into the gated graph neural network to obtain node vector representations; from these, the vector representation $h_g$ of each call chain event relation graph is computed with a soft attention mechanism; and the anomaly detection model is trained with deep support vector data description, which drives the gated graph neural network to learn effective graph vector representations such that most training data vectors lie inside the same hypersphere and therefore correctly reflect the relations among normal call chain events.

The training of the anomaly detection model comprises the following substeps:

1) Input the training data into the gated graph neural network to obtain vector representations. The gated graph neural network processes each call chain event relation graph in turn and, through information propagation, obtains a vector representation of every node in the graph. The input is a call chain event relation graph $g = (V, A, X)$, where $V$ is the set of nodes (events), $A$ is the adjacency matrix of the graph, and $X \in \mathbb{R}^{|V| \times d}$ is the node attribute matrix, each row $x_v$ of which holds the attributes (event vector) of node $v$.

The computation propagates information along the edges of the graph: the vector representation of each node is obtained through information propagation, using the rows and columns of the adjacency matrix that correspond to the outgoing and incoming edges of node $v$.

Based on the node vector representations, the vector representation $h_g$ of each call chain event relation graph is then computed with a soft attention mechanism.
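For reference, the standard gated graph neural network propagation and soft attention readout are written out below. This is a sketch under the assumption that the method follows the usual GGNN formulation; the symbols $h_v^{(t)}$, $a_v^{(t)}$, $z_v^{(t)}$, $r_v^{(t)}$, the weight matrices $W$, $U$, the bias $b$, the step count $T$, and the readout networks $i(\cdot)$ and $j(\cdot)$ are notation introduced here and are not taken from the text.

```latex
\begin{aligned}
h_v^{(0)} &= \left[ x_v^{\top}, \mathbf{0} \right]^{\top} \\
a_v^{(t)} &= A_{v:}^{\top} \left[ h_1^{(t-1)\top} \dots h_{|V|}^{(t-1)\top} \right]^{\top} + b \\
z_v^{(t)} &= \sigma\left( W^{z} a_v^{(t)} + U^{z} h_v^{(t-1)} \right) \\
r_v^{(t)} &= \sigma\left( W^{r} a_v^{(t)} + U^{r} h_v^{(t-1)} \right) \\
\tilde{h}_v^{(t)} &= \tanh\left( W a_v^{(t)} + U \left( r_v^{(t)} \odot h_v^{(t-1)} \right) \right) \\
h_v^{(t)} &= \left( 1 - z_v^{(t)} \right) \odot h_v^{(t-1)} + z_v^{(t)} \odot \tilde{h}_v^{(t)} \\
h_g &= \tanh\left( \sum_{v \in V} \sigma\left( i\left( h_v^{(T)}, x_v \right) \right) \odot \tanh\left( j\left( h_v^{(T)}, x_v \right) \right) \right)
\end{aligned}
```

Here $h_v^{(t)}$ is the vector representation of node $v$ after $t$ propagation steps, and $A_{v:}$ selects the parts of the adjacency matrix that correspond to the outgoing and incoming edges of node $v$. The sigmoid term inside the sum acts as the soft attention score of each node, which is what later allows high-attention nodes to be highlighted.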

2) Train the anomaly detection model with deep support vector data description. Deep support vector data description drives the gated graph neural network to learn effective graph vector representations such that most training data vectors lie inside the same hypersphere, so that the relations among normal call chain events are correctly reflected. In the loss function, $c$ is the center of the hypersphere, $R$ is the radius of the hypersphere, and the hyperparameter $\mu$ controls the proportion of call chain event relation graphs in the training set whose vectors lie outside the hypersphere.

During training, Adam is used to optimize the parameters of the gated graph neural network. Every $k$ rounds, the current optimal radius $R$ is found by linear search; its value is the $(1-\mu)$ quantile of the distances from all samples in the current training set to the center of the hypersphere.
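The loss and the radius update can be illustrated with plain Python. This sketch assumes the usual soft-boundary Deep SVDD objective, $R^2 + \frac{1}{\mu n} \sum_g \max(0, \lVert h_g - c \rVert^2 - R^2)$, with $\mu$ in the role described above and any weight regularization term omitted; the function names and the nearest-rank quantile rule are illustrative choices, not details given in the text.

```python
import math

def squared_distances(graph_vectors, center):
    """Squared Euclidean distance of every graph vector h_g to the center c."""
    return [sum((h - c) ** 2 for h, c in zip(vec, center)) for vec in graph_vectors]

def soft_boundary_loss(graph_vectors, center, radius, mu):
    """Assumed objective: R^2 + 1/(mu*n) * sum(max(0, ||h_g - c||^2 - R^2))."""
    dists = squared_distances(graph_vectors, center)
    penalty = sum(max(0.0, d - radius ** 2) for d in dists)
    return radius ** 2 + penalty / (mu * len(dists))

def update_radius(graph_vectors, center, mu):
    """Set R to the (1 - mu) quantile of the distances to the center (nearest rank)."""
    dists = sorted(math.sqrt(d) for d in squared_distances(graph_vectors, center))
    rank = math.ceil((1 - mu) * len(dists))
    return dists[min(len(dists) - 1, max(0, rank - 1))]

# Toy graph vectors: three close to the center, one far away.
vectors = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [10.0, 0.0]]
center = [0.0, 0.0]
radius = update_radius(vectors, center, mu=0.25)
loss = soft_boundary_loss(vectors, center, radius, mu=0.25)
```

With $\mu = 0.25$ and four samples, the radius lands on the third-smallest distance, so exactly one of the four vectors (25%) is left outside the hypersphere and contributes to the penalty term.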

(6) Online anomaly detection. The trained model is deployed in the system. When a new call chain is generated, steps (1) to (4) are executed in order to obtain the corresponding call chain event relation graph, which is input into the anomaly detection model to obtain its vector representation $h_g$. The distance between this point and the hypersphere is computed from the vector as the anomaly score $ans(h_g)$.

If the point lies inside the hypersphere, $ans(h_g) < 0$ and the call chain is considered normal. If the point lies outside the hypersphere, $ans(h_g) > 0$, the call chain is considered abnormal, and the system raises an alarm to notify operation and maintenance personnel. In addition, based on the soft attention mechanism of step (5), the system can visualize the abnormal call chain event relation graph and mark nodes with high attention scores in a dark color.
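A minimal sketch of the scoring rule. The exact formula is an assumption: the form $ans(h_g) = \lVert h_g - c \rVert^2 - R^2$ is the score commonly used with soft-boundary Deep SVDD and is consistent with the sign convention stated above (negative inside the hypersphere, positive outside). The function names are hypothetical.

```python
def anomaly_score(graph_vector, center, radius):
    """Assumed score: ||h_g - c||^2 - R^2; negative inside the hypersphere, positive outside."""
    squared = sum((h - c) ** 2 for h, c in zip(graph_vector, center))
    return squared - radius ** 2

def is_abnormal(graph_vector, center, radius):
    """A call chain is reported as abnormal when its score is greater than 0."""
    return anomaly_score(graph_vector, center, radius) > 0

center, radius = [0.0, 0.0], 3.0
normal_score = anomaly_score([1.0, 0.0], center, radius)
abnormal_score = anomaly_score([10.0, 0.0], center, radius)
```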

The advantages and the characteristics of the invention are mainly as follows:

(1) The invention detects log anomalies and call chain anomalies simultaneously and surfaces system problems in real time for operation and maintenance personnel, improving the speed and coverage of fault discovery.

(2) The invention represents the log events and call chain events of a distributed system, together with the associations between them, in a unified graph, which can support a variety of analysis techniques and applications.

(3) The invention trains the anomaly detection model with an unsupervised method that requires only monitoring data from normal operation of the system and does not depend on fault data, and it generalizes well.

(4) The method improves the accuracy of anomaly detection. Experiments on the open-source microservice benchmark system TrainTicket show that the method reaches an accuracy of 93% and a recall of 97%, and that its F1 score is on average 0.37 higher than that of other methods.

Drawings

FIG. 1 is a diagram illustrating call chain event relationships constructed in accordance with the present invention.

FIG. 2 is a flow chart of the method of the present invention.

Detailed Description

Taking as an example a distributed system that uses SkyWalking as the application performance monitoring platform and PyTorch as the deep learning framework, the graph-neural-network-based distributed system call chain and log fusion anomaly detection method is described further below.

For the training of the anomaly detection model, the specific process is as follows:

(1) Collect call chain and log data. Configure a SkyWalking agent for each program in the distributed system and set the call chain and log collection rules. The call chain and log data generated by normal operation of the system are collected and stored in Elasticsearch as training data.

(2) Construct the call chain event relation graph dataset. The collected training data is processed by steps (1) to (4) in order, a call chain event relation graph is constructed for each call chain, and all of the processed data is used as the graph dataset for the deep learning model.

(3) Train the anomaly detection model. PyTorch is used to implement the deep support vector data description model based on the gated graph neural network. The model is trained on the graph dataset for 100 rounds, with a 3-layer gated graph neural network, a learning rate of 0.0001, and $\mu = 0.05$.

For online anomaly detection, the specific flow is as follows:

(1) Trigger the anomaly detection process. The anomaly detection process is triggered when SkyWalking collects new call chain data.

(2) Construct the event relation graph. The call chain and its associated log data are processed by steps (1) to (4) in order, and the corresponding call chain event relation graph is constructed for the call chain.

(3) Detect anomalies. The call chain event relation graph is input into the anomaly detection model to obtain its anomaly score. If the anomaly score is greater than 0, the call chain is reported as an abnormal call chain, and operation and maintenance personnel troubleshoot the fault based on this output.

Fourteen common types of faults were injected into the open-source microservice benchmark system TrainTicket and a comparative anomaly detection experiment was carried out. The F1 score of the method is on average 0.37 higher than that of existing log anomaly detection and call chain anomaly detection methods.

Claims

1. A distributed system call chain and log fusion anomaly detection method based on a graph neural network, characterized in that a call chain event relation graph is constructed from the call chain and log data produced while the distributed system runs, the graph describing the relations between system calls and run logs in the distributed system; a one-class (single-classification) anomaly detection model based on a graph neural network is trained on collected historical data; and the trained anomaly detection model is deployed to the online system to detect abnormal call chains in real time and find system problems quickly; the specific steps are as follows:

(1) parsing log events

The log event refers to a system event represented by a log statement printed during program running; one log statement consists of a fixed part, namely a log template, and a variable part; analyzing the log events, namely analyzing original log data, and representing different log events by using a log template;

(2) parsing call chain events

The call chain event refers to an event generated when the distributed system is called in a cross-process mode, and comprises a synchronous call request sent by a client, a synchronous call request received by a server, an asynchronous call message generated by a producer and an asynchronous call message consumed by a consumer; analyzing the call chain event, namely analyzing original call chain data, and dividing the original data into different types of call chain events;

(3) event vectorization

The event vectorization is to map the log events and the call chain events to a vector space and represent the log events and the call chain events by vectors, so that a deep learning model can process the events and can reflect semantic information of the events;

(4) constructing a call chain event relationship graph

The call chain event relation graph associates the system calls and log printing behavior of the distributed system at runtime and is used to describe the runtime behavior of the distributed system; the raw data comprises the runtime call chain data and runtime log data of each distributed application or service; the call chain event relation graph contains two types of nodes, namely log event nodes and call chain event nodes; the relations among the nodes, which correspond to edges in the graph, comprise a sequential relation, a synchronous call relation, a synchronous response relation, and an asynchronous call relation;

(5) training graph neural network model

Call chain event relation graph data generated during normal operation of the system is input into a graph neural network to train a one-class (single-classification) anomaly detection model, comprising: inputting the training data into a gated graph neural network to obtain node vector representations; computing, from the node vector representations, the vector representation $h_g$ of each call chain event relation graph with a soft attention mechanism; training the anomaly detection model with deep support vector data description, which drives the gated graph neural network to learn effective graph vector representations such that most training data vectors lie inside the same hypersphere, so that the relations among normal call chain events are correctly reflected; the center of the hypersphere is denoted $c$ and its radius $R$;

(6) online anomaly detection

The trained model is deployed in the system; when a new call chain is generated, steps (1) to (4) are executed in order to obtain the corresponding call chain event relation graph, which is input into the anomaly detection model to obtain its vector representation $h_g$; the distance between this point and the hypersphere is computed from the vector as the anomaly score $ans(h_g)$;

if the point lies inside the hypersphere, $ans(h_g) < 0$ and the call chain is considered normal; if the point lies outside the hypersphere, $ans(h_g) > 0$, the call chain is considered abnormal, and the system raises an alarm to notify operation and maintenance personnel; in addition, based on the soft attention mechanism of step (5), the system visualizes the abnormal call chain event relation graph and marks nodes with high attention scores in a dark color.

2. The method for detecting the abnormal fusion of the call chain and the log of the distributed system based on the graph neural network as claimed in claim 1, wherein the analyzing the log event in the step (1) specifically comprises the following sub-steps:

1) collecting logs in the running process of the system through a distributed log collecting tool;

2) analyzing the log data by using a log template analysis algorithm, and acquiring a log template corresponding to each log as an event description;

3. The method for detecting the fusion anomaly of the call chain and the log of the distributed system based on the graph neural network as claimed in claim 2, wherein the step (2) of analyzing the call chain event specifically comprises the following substeps:

1) parsing each span of the Client/Server type into a Request event and a Response event, using "span type (Client/Server), event (Request/Response), span name" as the event description, to obtain four event types: Client Request, Server Request, Client Response, and Server Response; the occurrence time of each event is also recorded, wherein the occurrence time of the Request event is the start time of the span and the occurrence time of the Response event is the end time of the span;

2) parsing the spans of the Producer/Consumer type into Producer events and Consumer events, using "span type (Producer/Consumer), span name" as the event description, to obtain two event types: Producer and Consumer; the occurrence time of each event is also recorded, wherein the occurrence time of the Producer event is the start time of the Producer-type span and the occurrence time of the Consumer event is the start time of the Consumer-type span.

4. The method for detecting the fusion anomaly of the call chain and the log of the distributed system based on the graph neural network as claimed in claim 3, wherein the event vectorization in the step (3) specifically comprises the following sub-steps:

1) preprocessing the event description; removing stop words and non-character symbols in the event description, and splitting the combined words;

2) word embedding; mapping each word in the event description to the same vector space by using a word embedding model and representing the word in a vector;

3) sentence embedding; computing the vector for each event description from all of the word vectors in that description; the word vectors are combined with TF-IDF weights, so that words that occur less frequently receive higher weight, yielding a sentence vector.

5. The method for detecting the fusion anomaly of the call chain and the log of the distributed system based on the graph neural network as claimed in claim 4, wherein the step (4) of constructing the call chain event relation graph specifically comprises the following sub-steps:

1) linking the log events; for each span in a call chain, acquiring all log events belonging to the span; sequencing the log events according to the time stamps, and adding an edge with a sequence relation between each log event and the next log event;

2) inserting a span event; for each span in a call chain, acquiring all call chain events belonging to the span; inserting each call chain event into a log event sequence according to the occurrence time, and adding an edge with a sequence relation with the adjacent events;

3) connecting the spans; for all spans in a call chain, connecting the call chain events according to the parent-child relations of the spans; for a Client/Server span, for its ServerRequest event, adding a synchronous-call edge from the ClientRequest event of the parent span to that event, and for its ServerResponse event, adding a synchronous-response edge from that event to the ClientResponse event of the parent span; for a Producer/Consumer span, for its Consumer event, adding an asynchronous-call edge from the Producer event of the parent span to that event.

6. The method for detecting the fusion abnormality of the call chain and the log of the distributed system based on the graph neural network as claimed in claim 5, wherein the training graph neural network model in the step (5) specifically comprises the following sub-steps:

1) inputting training data into a gated graph neural network to obtain vector representations; the gated graph neural network processes each call chain event relation graph in turn and, through information propagation, obtains a vector representation of every node in the graph; the input is a call chain event relation graph $g = (V, A, X)$, where $V$ is the set of nodes, namely events; $A$ is the adjacency matrix of the graph; $X \in \mathbb{R}^{|V| \times d}$ is the node attribute matrix, each row $x_v$ of which holds the attributes of node $v$; the node vector representations are computed through information propagation, using the rows and columns of the adjacency matrix that correspond to the outgoing and incoming edges of node $v$;

2) training the anomaly detection model with deep support vector data description; deep support vector data description drives the gated graph neural network to learn effective graph vector representations such that most training data vectors lie inside the same hypersphere, so that the relations among normal call chain events are correctly reflected; in the loss function, $c$ is the center of the hypersphere, $R$ is the radius of the hypersphere, and the hyperparameter $\mu$ controls the proportion of call chain event relation graph vectors in the training set that lie outside the hypersphere;

during training, Adam is used to optimize the parameters of the gated graph neural network; every $k$ rounds, the current optimal radius $R$ is found by linear search, its value being the $(1-\mu)$ quantile of the distances from all samples in the current training set to the center of the hypersphere.