CN117827508A - Abnormality detection method based on system log data - Google Patents

Info

Publication number: CN117827508A
Application number: CN202311736838.2A
Authority: CN (China)
Legal status: Pending
Inventors: 张新野, 于铭华, 邱定
Applicant / Current Assignee: CETC 32 Research Institute
Original language: Chinese (zh)
Prior art keywords: log, drain, anomaly detection, data, lstm

Classifications

    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Debugging And Monitoring (AREA)

Abstract

The invention discloses an anomaly detection method based on system log data, which effectively addresses several problems still faced by conventional log-based anomaly detection algorithms. The method uses the Drain log parsing method to parse raw log messages, turning unstructured log data into log templates that contain only key information. The template sequence is divided with a sliding window algorithm, and the resulting event sequences serve as the input of an anomaly detection model. An unsupervised anomaly detection algorithm copes with the scarcity of labeled data, simplifies the anomaly detection process and improves detection speed. For threshold selection, the POT algorithm calculates the threshold automatically, so that the detection result is not distorted by a manually set threshold.

Description

Abnormality detection method based on system log data
Technical Field
The invention relates to an anomaly detection method based on system logs, and belongs to the field of anomaly detection for intelligent operation and maintenance.
Background
In the field of intelligent operation and maintenance, detecting abnormal behavior in system logs is an important task. A log records information emitted by a computer system, a device, software and the like under particular circumstances; it is a valuable resource for detecting system anomalies, debugging the system, optimizing system performance, or adjusting system behavior based on this information. With the increasing size and complexity of computer systems, the number of logs can reach millions, which makes it challenging for administrators to fully understand the system state and to detect anomalies efficiently. Since manual log inspection is not feasible and labor is expensive, automated log anomaly detection is an urgent task and a valuable research topic.
The workflow of a log-based anomaly detection algorithm is presented in FIG. 1. First, the system collects a large amount of log data originating from the records and reports of the various system components. Next, the algorithm preprocesses the logs, including log parsing and feature extraction: parsing converts the raw log data into structured log events, and feature extraction extracts key features from the logs; the processed data are then input into an anomaly detection model for anomaly detection.
Log-based anomaly detection algorithms in the prior art can be divided into three categories: graph-based anomaly detection, probability-analysis-based anomaly detection, and machine-learning-based anomaly detection. Graph-model-based anomaly detection models the sequential relations, association relations and textual content of logs. Probability-statistics-based anomaly detection uses association analysis, comparison and similar techniques to calculate the probability that a log is associated with an anomaly. Machine-learning-based anomaly detection either uses a clustering algorithm to find outliers, or uses a classification algorithm to learn the log patterns that occur during faults and judges whether online log data conform to those patterns.
Machine-learning-based log anomaly detection methods can be divided into two types: supervised and unsupervised. Supervised log anomaly detection includes LogRobust, SVM, LR, decision trees, and so on. Because labeling log data is difficult in practice, the invention focuses only on unsupervised log anomaly detection.
Unsupervised anomaly detection systems can be further divided into offline and online systems. An offline unsupervised anomaly detection algorithm is mainly used for debugging or detection after an anomaly has occurred; the data used for log analysis are data previously collected by the system. Xu et al. used PCA to detect anomalies: after the whole log file is parsed, a state ratio vector and a message count vector are constructed, and PCA is then applied for anomaly detection. Lou et al. used Invariant Mining (IM), mining the invariants in the logs with a singular value decomposition method and setting a threshold for comparison with each invariant candidate. However, an offline algorithm needs to collect log data for a certain time, has high latency and cannot respond to anomalies in real time, which may allow anomalies to persist or spread. Moreover, an offline algorithm must store a large amount of log data and perform heavy computation, requiring substantial storage space and computing resources. With very large data volumes, offline anomaly detection algorithms are impractical.
An online unsupervised anomaly detection algorithm performs anomaly detection in real time as events are recorded. Recently, deep learning techniques have been introduced into log anomaly detection. Du et al. proposed the anomaly detection framework DeepLog, which models log key sequences using LSTM models. LogAnomaly uses the Template2Vec method to extract the semantic information hidden in log templates, and uses a neural network to detect sequential and quantitative log anomalies. Huang et al. designed a log sequence encoder and a parameter value encoder to obtain their respective representations, and then used an attention mechanism as the final classification model. However, these model-based algorithms adopt a fixed threshold, which may lead to missed or false detections.
Disclosure of Invention
While many log-based anomaly detection algorithms exist today, existing algorithms leave many problems unsolved and still face many challenges, for example: (1) unstructured log data, whose format and language may vary significantly from system to system; even when it is known that an error has occurred, diagnosing the problem from unstructured logs is difficult. (2) Unsupervised anomaly detection: labeled logs are very scarce, so although supervised-learning-based approaches achieve high accuracy, manually labeling anomalies is time-consuming and cumbersome given the volume and velocity of log data; log anomaly detection should therefore be done in an unsupervised manner. (3) Stream processing: the log is a data stream, and real-time detection matches actual requirements better than post-hoc analysis.
To solve the above technical problems, the technical scheme of the invention provides an anomaly detection method based on system log data, characterized by comprising an offline training stage and an online detection stage, wherein:
the offline training stage comprises the following steps:
step 101, parsing the unstructured raw log with a Drain log parser and converting the raw log into structured log templates;
step 102, after log parsing, the raw log has been parsed into an event sequence, which is further divided using a sliding window method;
step 103, training an LSTM-VAE anomaly detection model, in which LSTM neurons replace the neurons in the encoding layer and decoding layer of the VAE, i.e. the LSTM extracts the long- and short-term dependencies in the input data while the log data are modeled through the variational inference of the VAE;
step 104, after the LSTM-VAE anomaly detection model is trained, obtaining the reconstruction sequence of the input data and computing the difference between the reconstruction sequence and the original sequence to obtain the reconstruction error of the data;
step 105, inputting the reconstruction error sequence obtained in step 104 into the POT algorithm, which automatically calculates the threshold α;
the online detection stage comprises the following steps:
step 201, sending the log data generated in real time to the Drain log parser through a log collector for parsing, and then dividing the event sequence with the sliding window algorithm;
step 202, inputting the event sequence obtained in step 201 into the trained LSTM-VAE anomaly detection model to obtain a reconstruction error;
step 203, comparing the threshold α obtained in the offline training stage with the reconstruction error obtained in step 202; if the reconstruction error is greater than the threshold α, an anomaly is determined to be detected.
Preferably, in step 101, the Drain log parser extracts log templates from raw log messages and splits them into disjoint log groups, wherein the Drain log parser uses a parse tree of fixed depth to guide the log-group search process.
Preferably, step 101 further comprises the following steps:
(1) preprocessing according to domain knowledge:
the Drain log parser obtains simple regular expressions, provided by the user based on domain knowledge, that represent common variables, and deletes the tokens matched by these regular expressions from the raw log message;
(2) searching by log message length:
the log data are processed from the root node of the parse tree, using the preprocessed log message;
(3) searching by preceding tokens:
traversing from the first-layer node found in step (2) towards a leaf node, wherein the Drain log parser selects the next internal node according to the tokens at the start position of the log message; only tokens that contain no digits are considered in this step, and a token containing digits matches the special internal node "<*>";
(4) searching by token similarity:
before this step, Drain has traversed to a leaf node containing a list of log groups; the log messages in these log groups follow the rules encoded in the internal nodes along the path, and the Drain log parser selects the most suitable log group from the log-group list;
(5) updating the parse tree:
if a suitable log group is returned in step (4), the Drain log parser adds the log ID of the current log message to the log IDs of the returned log group and, in addition, updates the log event of the returned log group; if no suitable log group is found, a new log group is created from the current log message, whose log IDs contain only the ID of this log message and whose log event is the log message itself; the Drain log parser then updates the parse tree with the new log group.
Preferably, in step 102, the step size Δt of the window used by the sliding window method is set smaller than the window size ΔT, where ΔT denotes the time interval covered by each window and Δt denotes the time interval by which the window slides each time.
The invention provides an unsupervised log anomaly detection scheme combining an LSTM with a variational autoencoder. It parses unstructured logs into log templates containing only key information, divides the template sequence into event sequences, and inputs the event sequences into a combined LSTM-VAE model for anomaly detection.
The method effectively addresses several problems still faced by current log-based anomaly detection algorithms. It uses the Drain log parsing method to parse raw log messages, turning unstructured log data into log templates that contain only key information. The template sequence is divided with a sliding window algorithm, and the resulting event sequences are used as the input of the anomaly detection model. An unsupervised anomaly detection algorithm copes with the scarcity of labeled data, simplifies the anomaly detection process and improves detection speed. For threshold selection, the POT algorithm calculates the threshold automatically, so that the detection result is not distorted by a manually set threshold.
In summary, the main benefits and contributions of the invention can be summarized as follows:
1. The Drain algorithm and the sliding window algorithm are used to process the raw log data, improving the efficiency, accuracy and timeliness of log-data processing.
2. The LSTM-VAE model is used for anomaly detection; it effectively captures sequence characteristics and realizes unsupervised learning, so no labeled anomaly training samples are needed, greatly reducing the cost of deploying the model.
3. Learning the threshold with the POT algorithm provides better adaptivity, flexibility and accuracy, adapting better to changes in the data.
4. Online anomaly detection offers real-time response, early discovery of anomalies, continuous monitoring, storage savings and high efficiency, and suits scenarios that must respond to anomalies promptly.
Drawings
FIG. 1 illustrates an anomaly detection workflow;
FIG. 2 illustrates an anomaly detection model framework;
FIG. 3 illustrates the parse tree structure in Drain (depth 3);
FIG. 4 illustrates a log parsing summary;
FIG. 5 illustrates a sliding window;
FIG. 6 illustrates an LSTM-VAE model framework;
FIG. 7 is a block diagram of LSTM;
FIG. 8 illustrates the VAE model structure.
Detailed Description
The invention will be further illustrated with reference to specific examples. It is to be understood that these examples are illustrative of the present invention and are not intended to limit the scope of the present invention. Further, it is understood that various changes and modifications may be made by those skilled in the art after reading the teachings of the present invention, and such equivalents are intended to fall within the scope of the claims appended hereto.
The whole flow of the invention supports real-time online monitoring based on the log stream. The overall framework, shown in FIG. 2, is divided into an offline training part and an online detection part. In the offline stage, the log data are parsed into a log template sequence in the preprocessing stage, the template sequence is further divided into a log sequence matrix with the sliding window algorithm and fed into the model for training, the reconstruction error is calculated, and the resulting reconstruction errors are input into the POT model to automatically calculate the threshold. In the online detection stage, real-time log data are streamed into the model, anomalies are detected using the threshold calculated in the offline stage, and anomalies are reported to the administrator. The individual modules are described in detail below.
(1) Offline training module
Step one: log parsing
The main task of log parsing is to convert unstructured raw logs into structured log templates. The invention uses the Drain log parser to parse the logs. The Drain log parser automatically extracts log templates from raw log messages and splits them into disjoint log groups, without requiring source code or any information other than the raw log messages. It uses a parse tree of fixed depth to guide the log-group search, effectively avoiding the construction of a very deep, unbalanced tree. As shown in FIG. 3, the depth of the leaf nodes is fixed at 3, which bounds the number of nodes Drain visits during the search, so raw log messages can be parsed accurately and efficiently in a streaming manner, providing online parsing capability.
The Drain log parser (hereinafter referred to as "Drain") specifically processes logs using the following steps:
(1) Preprocessing according to domain knowledge: Drain allows users to provide simple regular expressions, based on domain knowledge, that represent common variables such as IP addresses and block IDs. Drain then deletes the tokens matched by these regular expressions from the raw log message.
(2) Searching by log message length: Drain processes the log data from the root node of the parse tree, using the preprocessed log message. The first-layer nodes of the parse tree represent log groups of different lengths, and Drain selects the path to the corresponding node according to the length of the log message. For example, for the log message "Receive from node 4", Drain traverses to the internal node "Length: 4". This is based on the assumption that log messages with the same log event are likely to have the same length.
(3) Searching by preceding tokens: Drain traverses from the first-layer node found in step (2) towards a leaf node. (This step is based on the assumption that the tokens at the beginning of a log message are more likely to be constant.)
Drain selects the next internal node according to the token at the start position of the log message. For example, for the log message "Receive from node 4" in FIG. 3, Drain traverses from the layer-1 node "Length: 4" to the layer-2 node "Receive", because the token in the first position of the log message is "Receive". In this example the number of internal-node layers is depth-2, so the first depth-2 tokens of the log message are treated as search rules.
In some cases a log message may start with a parameter, e.g. "120 bytes received". Such log messages can cause branch explosion in the parse tree, because each parameter would be encoded in its own internal node. To avoid branch explosion, only tokens that contain no digits are considered in this step.
If a token contains digits, it matches the special internal node "<*>". For example, for the log message "120 bytes received" above, Drain traverses to the internal node "<*>" instead of "120".
(4) Searching by token similarity: before this step, Drain has traversed to a leaf node containing a list of log groups. The log messages in these log groups follow the rules encoded in the internal nodes along the path. For example, a log message in the log group in FIG. 3 contains 4 tokens and starts with "Receive". In this step, Drain selects the most suitable log group from the log-group list.
(5) Updating the parse tree: if a suitable log group is returned in step (4), Drain adds the log ID of the current log message to the log IDs of the returned log group and also updates the log event of the returned log group. If no suitable log group is found, a new log group is created from the current log message, whose log IDs contain only the ID of this log message and whose log event is the log message itself. Drain then updates the parse tree with the new log group.
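As an illustration only, the Drain-style search in steps (1)-(5) can be condensed into a toy parser. The class name, similarity measure and threshold below are assumptions for the sketch, not the patented implementation:

```python
import re

WILDCARD = "<*>"

def preprocess(message, regexes):
    # step (1): replace user-supplied variable patterns (e.g. IPs) with the wildcard
    for rx in regexes:
        message = re.sub(rx, WILDCARD, message)
    return message.split()

def seq_similarity(template, tokens):
    # fraction of positions where the template token equals the message token
    return sum(a == b for a, b in zip(template, tokens)) / len(tokens)

class SimpleDrain:
    """Toy Drain-style parser: root -> length node -> first-token node -> log groups."""

    def __init__(self, sim_threshold=0.5, regexes=()):
        self.tree = {}                      # {length: {first_token: [group, ...]}}
        self.sim_threshold = sim_threshold
        self.regexes = list(regexes)

    def parse(self, log_id, message):
        tokens = preprocess(message, self.regexes)
        length_node = self.tree.setdefault(len(tokens), {})          # step (2)
        # step (3): first token; tokens containing digits map to the wildcard node
        first = WILDCARD if any(c.isdigit() for c in tokens[0]) else tokens[0]
        groups = length_node.setdefault(first, [])
        # step (4): pick the most similar existing log group
        best = max(groups, key=lambda g: seq_similarity(g["template"], tokens),
                   default=None)
        if best and seq_similarity(best["template"], tokens) >= self.sim_threshold:
            # step (5a): generalize differing positions to the wildcard, record the ID
            best["template"] = [a if a == b else WILDCARD
                                for a, b in zip(best["template"], tokens)]
            best["ids"].append(log_id)
            return best
        group = {"template": tokens, "ids": [log_id]}   # step (5b): new log group
        groups.append(group)
        return group

parser = SimpleDrain()
parser.parse(1, "Receive from node 4")
g = parser.parse(2, "Receive from node 7")
print(" ".join(g["template"]))   # -> Receive from node <*>
```

Both messages land in the same length-4, "Receive"-rooted group, and the parameter position generalizes to the wildcard, yielding a template containing only the constant tokens.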
Step two: event sequence partitioning
After log parsing, the original log has been parsed into an event sequence, as shown in FIG. 4. The event sequence is further divided using a sliding window method, as shown in FIG. 5. The sliding window method divides the sequence based on timestamps, where the timestamp records the time at which each log entry was generated. A window has two attributes: the window size ΔT and the step size Δt, where ΔT denotes the time interval covered by each window and Δt denotes the time interval by which the window slides each time; as illustrated, the window slides backward by one step-size period. In general, the step size is smaller than the window size, so that successive windows overlap; this increases the number of divided log sequences and reduces errors caused by uneven window coverage.
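The window division described above can be sketched as follows; the function name and the half-open window convention `[start, start + ΔT)` are assumptions for illustration:

```python
def sliding_windows(events, window_size, step):
    """Divide a timestamped event sequence into (possibly overlapping) windows.

    events: list of (timestamp, event_id) pairs, sorted by timestamp.
    window_size corresponds to the window size, step to the step size;
    step < window_size makes successive windows overlap, as described above.
    """
    if not events:
        return []
    t_end = events[-1][0]
    windows, start = [], events[0][0]
    while start <= t_end:
        # keep events whose timestamp falls inside the current window
        w = [e for t, e in events if start <= t < start + window_size]
        if w:
            windows.append(w)
        start += step            # slide the window backward by one step period
    return windows

events = [(0, "E1"), (1, "E2"), (2, "E1"), (3, "E3"), (4, "E2")]
ws = sliding_windows(events, window_size=3, step=1)
print(ws)  # -> [['E1', 'E2', 'E1'], ['E2', 'E1', 'E3'], ['E1', 'E3', 'E2'], ['E3', 'E2'], ['E2']]
```

With a step size of 1 and a window size of 3, adjacent windows share two time units of events, which is exactly the overlap that reduces errors from uneven window coverage.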
Step three: LSTM-VAE anomaly detection model training
To address the problems that existing anomaly detection methods require large amounts of labeled data during training and fail to attend to the long- and short-term dependencies in the data, an LSTM-VAE network is designed by fusing the LSTM and the VAE, as shown in FIG. 6. In this network, LSTM neurons replace the neurons in the encoding and decoding layers of the VAE: the LSTM extracts the long- and short-term dependencies in the input data, and the log data are modeled through the variational inference of the VAE. This step uses normal log data for model training.
The LSTM-VAE model used in the invention combines the advantages of the LSTM model and the VAE model. The LSTM network is a recurrent neural network that effectively captures long-term dependencies and dynamic changes between input sequences. The VAE is a generative model consisting of an encoder and a decoder: the encoder maps the input data to a hidden-variable distribution in a latent space, and the decoder remaps the hidden variables to reconstruct the original data. When an anomaly occurs, the generated sample cannot match the real sample, so this mismatch serves as the basis for anomaly detection and improves detection accuracy.
LSTM model:
LSTM (Long Short-Term Memory) adds an input gate, a forget gate and an output gate to the RNN. These gating units control the flow of information during propagation, enabling the network to capture longer-range dependencies and solving the problem that an RNN cannot retain long-term memory.
The structure of the LSTM is shown in FIG. 7. i_t, f_t and o_t are the input gate, forget gate and output gate of the LSTM. The forget gate determines how much of the cell state of the previous time step is kept at the current time step; the input gate determines how much of the network's current input is stored in the cell state; the output gate determines the next hidden state. C̃_t is the candidate memory cell of the LSTM, C_t is the memory cell of the LSTM unit, H_t is the hidden unit of the LSTM, and X_t denotes the input at time t; ⊙ denotes element-wise multiplication of matrices and ⊕ denotes element-wise addition of matrices.
The formulas required for the workflow of the LSTM are as follows:
i_t = σ(W_i [H_{t-1}, X_t] + b_i)
f_t = σ(W_f [H_{t-1}, X_t] + b_f)
o_t = σ(W_o [H_{t-1}, X_t] + b_o)
C̃_t = tanh(W_c [H_{t-1}, X_t] + b_c)
C_t = (f_t ⊙ C_{t-1}) ⊕ (i_t ⊙ C̃_t)
H_t = o_t ⊙ tanh(C_t)
wherein the σ and tanh functions are activation functions, and W_i, W_c, W_f, W_o and b_i, b_c, b_f, b_o denote the corresponding weight matrices and biases, respectively.
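A minimal numerical sketch of one LSTM step following the gate formulas above; the weights here are randomly initialized for illustration only, not the trained model of the invention:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step following the gate equations above."""
    z = np.concatenate([h_prev, x_t])            # [H_{t-1}, X_t]
    i_t = sigmoid(p["W_i"] @ z + p["b_i"])       # input gate
    f_t = sigmoid(p["W_f"] @ z + p["b_f"])       # forget gate
    o_t = sigmoid(p["W_o"] @ z + p["b_o"])       # output gate
    c_tilde = np.tanh(p["W_c"] @ z + p["b_c"])   # candidate memory cell
    c_t = f_t * c_prev + i_t * c_tilde           # element-wise cell update
    h_t = o_t * np.tanh(c_t)                     # hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
hidden, n_in = 4, 3
p = {f"W_{k}": 0.1 * rng.standard_normal((hidden, hidden + n_in)) for k in "ifoc"}
p.update({f"b_{k}": np.zeros(hidden) for k in "ifoc"})
h_t, c_t = lstm_step(rng.standard_normal(n_in), np.zeros(hidden), np.zeros(hidden), p)
print(h_t.shape)  # -> (4,)
```

Each gate is a sigmoid of the same concatenated input with its own weights and bias, and the hidden state is bounded because the output gate lies in (0, 1) and tanh lies in (-1, 1).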
VAE model:
VAE (Variational AutoEncoder) is an unsupervised learning algorithm whose goal is to construct a hidden variable Z that generates the target data X, assuming that Z follows a common distribution such as a normal or uniform distribution. The aim is ultimately to train a model that can map out the original distribution through the hidden variable; the VAE design is shown in FIG. 8.
The VAE trains on the input data through the encoder to produce the mean μ and the standard deviation σ of the hidden variable Z. The hidden variable Z is not generated directly by the encoder; instead, a parameter ε is randomly sampled from the standard normal distribution and Z is obtained by resampling, as in the formula below, which avoids the problem of the gradient breaking during backpropagation in neural network training.
Z=μ+ε*σ
The loss function of the VAE model consists of two parts: one part describes the similarity between the hidden-variable distribution and the standard normal distribution and is expressed with the KL divergence; the other part is the reconstruction error, describing the degree of difference between the reconstructed data and the original data. The loss is defined as:
L(θ, φ; x^(i)) = D_KL(q_φ(z|x^(i)) || p_θ(z)) − E_{q_φ(z|x^(i))}[log p_θ(x^(i)|z)]
wherein q_φ(z|x^(i)) is the approximate posterior distribution of the hidden variable, p_θ(x|z) is the distribution of the decoder output, and p_θ(z) is the prior probability distribution of the hidden variable.
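The reparameterization Z = μ + ε·σ and the two-part loss can be sketched numerically as follows; using squared error for the reconstruction term and a Gaussian posterior with its closed-form KL divergence are simplifying assumptions for this sketch:

```python
import numpy as np

def reparameterize(mu, sigma, rng):
    # Z = mu + eps * sigma, with eps ~ N(0, I): sampling stays differentiable
    eps = rng.standard_normal(mu.shape)
    return mu + eps * sigma

def vae_loss(x, x_hat, mu, sigma):
    """Reconstruction error plus KL(q(z|x) || N(0, I)), matching the two-part loss above."""
    recon = np.sum((x - x_hat) ** 2)  # reconstruction term (squared error here)
    # closed-form KL divergence between N(mu, sigma^2) and the standard normal
    kl = -0.5 * np.sum(1 + np.log(sigma ** 2) - mu ** 2 - sigma ** 2)
    return recon + kl

rng = np.random.default_rng(0)
mu, sigma = np.zeros(2), np.ones(2)
z = reparameterize(mu, sigma, rng)
print(z.shape, vae_loss(np.zeros(4), np.zeros(4), mu, sigma))
```

When μ = 0 and σ = 1 the hidden-variable distribution already equals the standard normal, so the KL term vanishes and a perfect reconstruction gives zero total loss.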
Step four: obtaining reconstruction error E
After the model is trained, a reconstruction sequence of the input data is obtained, and a reconstruction error of the data can be obtained by calculating the difference between the reconstruction sequence and the original sequence.
Step five: acquisition threshold alpha
The invention uses extreme value theory (EVT) to set the threshold automatically. EVT is a statistical theory whose goal is to find the law of extreme events, whose distribution is considered to differ from that of the data as a whole. Its main advantage is that no assumptions about the data distribution are needed when finding extrema. POT (Peaks-Over-Threshold) is based on the second theorem of extreme value theory; its basic idea is to fit the tail of the probability distribution with a parameterized generalized Pareto distribution (Generalized Pareto Distribution, GPD). The invention uses the POT method to learn the threshold for the reconstruction error: the reconstruction error sequence obtained in step four serves as the input of this step, and the threshold α is calculated automatically by the POT method.
The POT algorithm can dynamically adjust the threshold according to the actual situation. The traditional fixed threshold value is not suitable for different data distribution and different time periods, and the POT algorithm can adaptively adjust the threshold value according to the peak value condition of the data, so that the change of the data is better adapted.
POT fits the tail of the probability distribution with the parameterized GPD, according to the following formula:
F̄_t(x) = P(X − t > x | X > t) ≈ (1 + γx/σ)^(−1/γ)
wherein t is an initial threshold, the part exceeding the threshold t is denoted X − t, and γ, σ are the parameters of the GPD.
The values of γ and σ are estimated using maximum likelihood estimation (MLE), giving the estimates γ̂ and σ̂. The quantile threshold z_q can then be calculated using the following formula:
z_q ≈ t + (σ̂/γ̂)((qN/N_t)^(−γ̂) − 1)
wherein Y_i > 0 is the part of X_i exceeding t (Y_i = X_i − t, X_i > t), q is the desired probability, N is the total number of observations, and N_t is the number of peaks, i.e. the number of Y_i.
(2) Online anomaly detection module
Step one: online data preprocessing
The log data generated in real time are sent to Drain through a corresponding log collector for parsing, and the sliding window algorithm is used for event-sequence division.
Step two: obtaining reconstruction error E
The event sequence obtained in step one is input into the trained LSTM-VAE model. Since the model is trained on normal data, normal samples are reconstructed well by the VAE model and produce small reconstruction errors, whereas abnormal samples do not match the learned data distribution, cannot be reconstructed accurately and therefore produce larger reconstruction errors.
Step three: abnormality detection
The threshold α learned in the offline stage is compared with the reconstruction error obtained in step two; an anomaly is considered detected when the reconstruction error is greater than the threshold α.
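The final comparison against the learned threshold α amounts to a simple filter over the reconstruction-error stream; a minimal sketch with illustrative values:

```python
def detect(reconstruction_errors, alpha):
    """Flag the indices of windows whose reconstruction error exceeds alpha."""
    return [i for i, e in enumerate(reconstruction_errors) if e > alpha]

# illustrative per-window reconstruction errors and threshold
errors = [0.12, 0.08, 0.95, 0.10, 1.40]
anomalies = detect(errors, alpha=0.5)
print(anomalies)  # -> [2, 4]
```

In the online setting the same comparison runs per incoming window, so each flagged index can be reported to the administrator as soon as its error is computed.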

Claims (4)

1. An anomaly detection method based on system log data, characterized by comprising an offline training stage and an online detection stage, wherein:
the offline training stage comprises the following steps:
step 101, parsing the unstructured raw log with a Drain log parser and converting the raw log into structured log templates;
step 102, after log parsing, the raw log has been parsed into an event sequence, which is further divided using a sliding window method;
step 103, training an LSTM-VAE anomaly detection model, in which LSTM neurons replace the neurons in the encoding layer and decoding layer of the VAE, i.e. the LSTM extracts the long- and short-term dependencies in the input data while the log data are modeled through the variational inference of the VAE;
step 104, after the LSTM-VAE anomaly detection model is trained, obtaining the reconstruction sequence of the input data and computing the difference between the reconstruction sequence and the original sequence to obtain the reconstruction error of the data;
step 105, inputting the reconstruction error sequence obtained in step 104 into the POT algorithm, which automatically calculates the threshold α;
the online detection stage comprises the following steps:
step 201, sending the log data generated in real time to the Drain log parser through a log collector for parsing, and then dividing the event sequence with the sliding window algorithm;
step 202, inputting the event sequence obtained in step 201 into the trained LSTM-VAE anomaly detection model to obtain a reconstruction error;
step 203, comparing the threshold α obtained in the offline training stage with the reconstruction error obtained in step 202; if the reconstruction error is greater than the threshold α, an anomaly is determined to be detected.
2. The anomaly detection method based on system log data according to claim 1, wherein in step 101 the Drain log parser extracts log templates from raw log messages and splits them into disjoint log groups, and the Drain log parser uses a parse tree of fixed depth to guide the log-group search process.
3. The anomaly detection method based on system log data according to claim 1, wherein the step 101 further comprises the steps of:
(1) preprocessing according to domain knowledge:
the Drain log parser takes simple regular expressions, provided by the user on the basis of domain knowledge, that represent common variables, and removes the tokens matched by these regular expressions from the raw log message;
(2) searching according to the length of the log information:
starting from the root node of the parse tree, the Drain log parser uses the token count of the preprocessed log message to select the corresponding first-level node;
(3) searching according to the previous mark:
traversing from the first-level node found in step (2) toward a leaf node, the Drain log parser selects the next internal node according to the tokens at the beginning positions of the log message; only tokens that contain no digits are considered in this step, and a token that contains digits matches the special internal node "*";
(4) searching by labeled similarity:
before this step, Drain has traversed to a leaf node that holds a list of log groups; the log messages in these groups follow the rules encoded in the internal nodes along the path, and the Drain log parser selects the most similar log group from this list;
(5) updating the parse tree:
if a suitable log group is returned in step (4), the Drain log parser appends the ID of the current log message to the log IDs of that group and, in addition, updates the log event of the returned group; if no suitable log group is found, a new log group is created from the current log message, whose log IDs contain only the ID of this message and whose log event is the message itself; the Drain log parser then updates the parse tree with the new log group.
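Steps (1)-(5) above can be sketched as a toy fixed-depth tree (message length, then first token, then leaf with a group list). This is an illustration, not the Drain implementation: the class name `TinyDrain`, the parameter `sim_threshold`, and the wildcard token `<*>` are assumptions for the sketch.

```python
import re
from collections import defaultdict

WILDCARD = "<*>"

def preprocess(message, domain_regexes):
    # Step (1): mask user-supplied variable patterns (e.g. IP addresses).
    for rx in domain_regexes:
        message = re.sub(rx, WILDCARD, message)
    return message.split()

def first_token_key(tokens):
    # Step (3): route by the first token; any token containing digits
    # is sent to the special internal node "*".
    tok = tokens[0] if tokens else ""
    return "*" if any(c.isdigit() for c in tok) else tok

def similarity(tokens, template):
    # Step (4): fraction of positions where the token matches the template.
    same = sum(1 for a, b in zip(tokens, template) if a == b or b == WILDCARD)
    return same / len(template)

class TinyDrain:
    """Toy fixed-depth parse tree: length -> first token -> group list."""
    def __init__(self, domain_regexes=(), sim_threshold=0.5):
        self.tree = defaultdict(lambda: defaultdict(list))
        self.regexes = domain_regexes
        self.sim = sim_threshold

    def add(self, log_id, message):
        tokens = preprocess(message, self.regexes)
        # Step (2): search by message length; step (3): by first token.
        groups = self.tree[len(tokens)][first_token_key(tokens)]
        best, best_sim = None, 0.0
        for g in groups:                      # step (4): most similar group
            s = similarity(tokens, g["template"])
            if s > best_sim:
                best, best_sim = g, s
        if best is not None and best_sim >= self.sim:
            # Step (5a): merge - generalize mismatching positions to <*>.
            best["template"] = [a if a == b else WILDCARD
                                for a, b in zip(best["template"], tokens)]
            best["ids"].append(log_id)
        else:
            # Step (5b): no suitable group - create a new one.
            best = {"template": tokens, "ids": [log_id]}
            groups.append(best)
        return " ".join(best["template"])
```

For example, two messages "Connection from 10.0.0.1 closed" and "Connection from 10.0.0.2 closed" (with an IP-masking regex) end up in one group with the template "Connection from <*> closed".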
4. The anomaly detection method based on system log data according to claim 1, wherein in step 102, the step size δt of the window used by the sliding-window method is set smaller than the window size Δt, where Δt denotes the time span covered by each window and δt denotes the interval by which the window slides each time.
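The overlapping division of claim 4 can be sketched as follows. This is an illustration only: `window` stands in for Δt and `step` for δt, and both are measured in numbers of events rather than seconds.

```python
def sliding_windows(events, window=4, step=2):
    """Divide a parsed event sequence into fixed-size windows.
    step < window (delta-t smaller than the window size) yields
    overlapping windows, so no boundary event is lost."""
    out = []
    for start in range(0, max(len(events) - window, 0) + 1, step):
        out.append(events[start:start + window])
    return out

# sliding_windows(["a","b","c","d","e","f"], window=4, step=2)
# -> [['a', 'b', 'c', 'd'], ['c', 'd', 'e', 'f']]
```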
CN202311736838.2A 2023-12-15 2023-12-15 Abnormality detection method based on system log data Pending CN117827508A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311736838.2A CN117827508A (en) 2023-12-15 2023-12-15 Abnormality detection method based on system log data

Publications (1)

Publication Number Publication Date
CN117827508A true CN117827508A (en) 2024-04-05

Family

ID=90508998

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311736838.2A Pending CN117827508A (en) 2023-12-15 2023-12-15 Abnormality detection method based on system log data

Country Status (1)

Country Link
CN (1) CN117827508A (en)

Similar Documents

Publication Publication Date Title
CN108038049B (en) Real-time log control system and control method, cloud computing system and server
CN111914873A (en) Two-stage cloud server unsupervised anomaly prediction method
CN114785666B (en) Network troubleshooting method and system
CN114296975A (en) Distributed system call chain and log fusion anomaly detection method
CN110011990A (en) Intranet security threatens intelligent analysis method
CN111949480A (en) Log anomaly detection method based on component perception
CN115828180A (en) Log anomaly detection method based on analytic optimization and time sequence convolution network
Pal et al. DLME: distributed log mining using ensemble learning for fault prediction
CN115964258A (en) Internet of things network card abnormal behavior grading monitoring method and system based on multi-time sequence analysis
CN116318830A (en) Log intrusion detection system based on generation of countermeasure network
Zhang et al. Anomaly detection of periodic multivariate time series under high acquisition frequency scene in IoT
CN113793227A (en) Human-like intelligent perception and prediction method for social network events
Su et al. KPI anomaly detection method for Data Center AIOps based on GRU-GAN
CN112039907A (en) Automatic testing method and system based on Internet of things terminal evaluation platform
CN115460061B (en) Health evaluation method and device based on intelligent operation and maintenance scene
CN115757062A (en) Log anomaly detection method based on sentence embedding and Transformer-XL
CN117827508A (en) Abnormality detection method based on system log data
CN115658546A (en) Software fault prediction method and system based on heterogeneous information network
CN114329453A (en) Anomaly detection method based on system log
Kumar et al. Rule Extraction using Machine Learning Classifiers for Complex Event Processing
CN115587007A (en) Robertta-based weblog security detection method and system
Anh et al. A novel approach for anomaly detection in automatic meter intelligence system using machine learning and pattern recognition
Yang et al. User Log Anomaly Detection System Based on Isolation Forest
CN117473571B (en) Data information security processing method and system
CN116861204B (en) Intelligent manufacturing equipment data management system based on digital twinning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination