CN113434357B

CN113434357B - Log anomaly detection method and device based on sequence prediction

Info

Publication number: CN113434357B
Application number: CN202110534643.4A
Authority: CN
Inventors: 周江; 宿林; 李波; 王伟平
Original assignee: Institute of Information Engineering of CAS
Current assignee: Institute of Information Engineering of CAS
Priority date: 2021-05-17
Filing date: 2021-05-17
Publication date: 2023-04-11
Anticipated expiration: 2041-05-17
Also published as: CN113434357A

Abstract

The invention discloses a log anomaly detection method and device based on sequence prediction. The method comprises: parsing a log sequence to be detected to obtain a log key sequence to be detected; and inputting the log key sequence to be detected into a log anomaly detection model to obtain a log anomaly detection result. The invention constructs the embedding vectors of the log keys through a semantic-based preprocessing module and learns the features of the target from its surrounding sequence. More semantic features are therefore retained, the method suits large-scale platform log data governed by many rules, detection precision is higher, and model training is more efficient.

Description

Log anomaly detection method and device based on sequence prediction

Technical Field

The invention relates to the field of computer software, in particular to a log anomaly detection method and device based on sequence prediction.

Background

With the rapid development of internet technology, the scale and complexity of modern systems keep increasing and the operation and maintenance data of platforms grows rapidly, so manual inspection has become infeasible. These large data platforms or systems often provide online services; once an attack or failure occurs, applications may crash and cause huge economic losses. It is therefore important to make full use of the valuable information in massive operation and maintenance data to analyze the system and find abnormal conditions in time. Existing log-based anomaly detection methods mainly comprise the following stages: log collection, log parsing, feature extraction and anomaly detection.

Traditional log anomaly detection relies heavily on manual work: developers manually examine system logs or write rules to detect anomalies based on their domain knowledge. With the continuous development of machine learning, log auditing methods based on machine learning have been widely studied. For example, a Chinese patent (application No. CN201910698395.X, publication No. CN110381079A) uses principal component analysis to reduce the dimensionality of log data, then trains a GRU-based classifier model on the processed training data set, and finally inputs the actual log to be detected into a GRU-SVDD comparator to detect anomalies in the log.

Log data can be parsed into a fixed part and a variable part, called the log key and the corresponding parameters. Traditional machine-learning-based detection cannot learn the association between log keys and log parameters, although such associations are abundant in the log data of many platforms. Some research works therefore try to detect log anomalies by mining the relationships between log parameters. For example, a Chinese patent (application No. CN202010880971.5, publication No. CN112069787A) parses all the parameters in the log, converts discrete parameters into continuous parameter word vectors, trains a long short-term memory neural network model on the parameter word vectors, and uses the trained model to predict the parameter word vectors at subsequent target times. In the detection stage, the cosine similarity between the predicted parameter and the target parameter is calculated; if it is lower than a threshold, a log parameter anomaly is detected.

Conventional log anomaly detection systems in industry rely largely on manual work. In practical production, however, each developer is responsible for only a certain module because of the complexity of the platform, so it is difficult to find anomalies in massive log data.

LSTM-based log anomaly detection achieves good results in automatic log auditing thanks to the ability of recurrent neural networks to handle sequence problems. The key to this type of approach is sequence prediction, i.e., predicting the log at the current time step from the information preceding the target. This approach, however, ignores the dependency of the target on the information that follows it. In addition, it focuses only on learning log sequence relations and does not consider the many correlations that exist among log keys. The similarity relations among log keys are therefore insufficiently mined and the transferability of the model is greatly limited; for example, such models perform poorly on log data with complex rules.

Disclosure of Invention

To address the defects of existing log anomaly detection methods, the invention provides a log anomaly detection method and device based on sequence prediction. The method learns the dependency between the target log and its surrounding context, so that both preceding and following information are considered and sequence relations are mined more fully. Through a semantic-based preprocessing module, log keys are converted into dense embedding vectors and the similarity relations between templates are fully learned, so the method adapts better to complex data sets. The method also adopts an attention-based mechanism, which reduces to some extent the time overhead caused by the step-by-step processing of a recurrent neural network and improves the running efficiency of the model. This is critical for a detection system that must handle massive log data in actual production.

In order to achieve the purpose, the technical scheme of the invention comprises the following steps:

A log anomaly detection method based on sequence prediction comprises the following steps:

1) Analyzing the log sequence to be detected to obtain a log key sequence to be detected;

2) Inputting a log key sequence to be detected into a log anomaly detection model to obtain a log anomaly detection result;

the log anomaly detection model is obtained through the following steps:

a) Analyzing a plurality of normal log data sequences to obtain a plurality of normal log key sequences;

b) Using a first sliding window for each normal log key sequence to obtain a plurality of training samples of each normal log key sequence;

c) Based on a preprocessing language model, obtaining word vectors of all words in each normal log key sequence, and obtaining a vector sequence of each normal log key sequence according to the word vectors and the occurrence frequency of each word in a corresponding training sample;

d) Using a second sliding window for each vector sequence and covering a position to be predicted in the window, and obtaining a plurality of position coding vectors through the positions of elements in the window;

e) Encoding each position coding vector, obtaining a predicted value for the corresponding position to be predicted from the encoding result, and performing error feedback and parameter updating according to the difference between the predicted value and the true value to obtain the log anomaly detection model.

Further, the log key sequence to be detected is obtained using the Drain method.

Further, a number of training samples for each normal log key sequence are obtained by:

1) Setting a preceding-context length m and a following-context length n, where m ≥ 0, n ≥ 0, and m + n = P, P being the length of the first sliding window;

2) Segmenting each normal log key sequence using the set preceding-context length m and following-context length n to obtain a plurality of training samples of each normal log key sequence.

Further, the preprocessing language model comprises a word2vec model.

Further, a vector sequence of each normal log key sequence is obtained by the following steps:

1) Obtaining a sentence vector representation of each log key according to the word vectors and the frequency of each word in the corresponding training sample;

2) Combining the sentence vector representations in the order of the log keys in the normal log key sequence to obtain the vector sequence of each normal log key sequence.

Further, the position to be predicted is covered using a mask mechanism.

Further, the position-coding vector is obtained by the following strategy:

1) When the dimension index i of the sentence vector representation is even, the encoding result is

$$PE(pos, i) = \sin\left(\frac{pos}{10000^{\,i/d_{model}}}\right)$$

where pos is the position of an element in the vector sequence, i is the dimension index of the vector corresponding to the log key, and d_model is the dimensionality of the sentence vector representation;

2) When the dimension index i is odd, the encoding result is

$$PE(pos, i) = \cos\left(\frac{pos}{10000^{\,(i-1)/d_{model}}}\right)$$

Further, when encoding each position encoding vector, the dependency relationship between each position encoding result is learned by using an attention mechanism.

A storage medium having a computer program stored therein, wherein the computer program is arranged to perform the above-mentioned method when executed.

An electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the method as described above.

The invention has the beneficial effects that:

1. The invention constructs the embedding vectors of the log keys through a semantic-based preprocessing module. On the one hand, the semantic associations among log key categories are fully considered, and more semantic features are retained thanks to the strong expressive power of deep learning. On the other hand, the data dimensionality is reduced and the overhead of running the model drops significantly, so the method is better suited to large-scale platform log data governed by many rules.

2. To address the insufficient mining of sequence relations by existing anomaly detection models based on recurrent neural networks, the invention provides a model based on predicting from the sequences before and after the log. It exploits the ability of the Transformer to learn target information from the surrounding sequence to further mine the dependency between the target and its surrounding sequence, and the model obtains higher detection precision on several data sets. The attention mechanism also parallelizes well, so the training efficiency of the model is significantly improved.

Drawings

FIG. 1 is an overall flow chart of the present invention, which includes four stages of initializing training data, preprocessing data, training model, and detecting.

FIG. 2 is a model framework of the present invention, including two stages of training and testing.

Detailed Description

The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The invention provides a log anomaly detection method based on context sequence prediction. The method provides an anomaly detection model for real-time log streams, consisting mainly of a multi-source log collection module, a parsing module, a preprocessing module and an anomaly detection module. Log data is collected and stored by an open-source log analysis framework. An automatic log parsing strategy then converts the unstructured log data into structured data with a fixed part and a variable part, where the fixed part is called the log key and the variable part is the parameters. In most platform log data the number of distinct log keys is limited, so many methods vectorize log keys directly with one-hot encoding. Log output follows a certain logical order, namely the execution path of the log. To mine the sequential relations of normal log streams with a neural network, the log sequence is replaced by the log key sequence, and the log keys are vectorized before being input into the network. The network consists of a multi-layer Transformer encoder; the encoded sequence is fed into a softmax network to obtain an n-dimensional vector in which the value of each dimension represents the probability that the corresponding log key in the template dictionary appears at the current position. On this basis, an embedding layer is added before the data enters the network: the one-hot vector corresponding to each log key index is converted into a sentence vector representation, which strengthens the representational power of the input vectors and preserves the semantic features of the log keys.
In addition, because the model uses the sequence that follows the target, which may affect the real-time performance of detection, a trade-off is proposed: when higher detection precision is needed, the length of the following input is increased appropriately; when detection timeliness matters more, the input following the target can be reduced or even removed.

According to the design scheme provided by the invention, a log anomaly detection method based on sequence prediction is shown in fig. 1 and specifically comprises the following steps:

Step 1. Log data collection

The samples are labeled. The log data in the training set consists of logs produced under normal conditions, while the log data in the test set contains abnormal logs.

Step 2. Log parsing

Log data is itself unstructured text, and the purpose of log parsing is to extract structured information such as users, timestamps, permissions and file names from it. Many parsing strategies exist in the industry; here the commonly used Drain method is adopted to obtain the required log key sequence from the log sequence.

The Drain log parsing strategy is based on a fixed-depth tree structure. It assumes that logs of the same length have similar business meaning, so length is an important criterion for log key extraction.
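The split into a fixed log key and variable parameters can be illustrated with a much simpler stand-in than Drain. The sketch below is not the Drain algorithm: it assumes a few hand-written regular expressions (`VARIABLE_PATTERNS`, hypothetical) for variable fields and only adds a token-count grouping to mirror the length criterion mentioned above. The function names and the sample message are illustrative.

```python
import re
from collections import defaultdict

# Hypothetical patterns for variable fields. Drain itself learns templates
# with a fixed-depth tree; these hand-written patterns are only a stand-in.
VARIABLE_PATTERNS = [
    re.compile(r"blk_-?\d+"),                    # HDFS-style block identifiers
    re.compile(r"\d+\.\d+\.\d+\.\d+(?::\d+)?"),  # IPv4 addresses, optional port
    re.compile(r"\b\d+\b"),                      # remaining bare integers
]

def split_key_and_params(message):
    """Return (log_key, parameters); every variable field becomes <*>."""
    params = []

    def grab(match):
        params.append(match.group(0))
        return "<*>"

    key = message
    # Parameters are collected pattern by pattern, not strictly left to right.
    for pattern in VARIABLE_PATTERNS:
        key = pattern.sub(grab, key)
    return key, params

def group_by_length(messages):
    """Bucket messages by token count (the length criterion)."""
    groups = defaultdict(list)
    for message in messages:
        groups[len(message.split())].append(message)
    return dict(groups)

key, params = split_key_and_params(
    "Receiving block blk_123 src: /10.0.0.1:5000 size 67108864"
)
# key    -> "Receiving block <*> src: /<*> size <*>"
# params -> ["blk_123", "10.0.0.1:5000", "67108864"]
```

Replacing each raw log line by its key in this way turns a log stream into the log key sequence used in the following steps.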

Step 3. Model training

The log anomaly detection model of the invention is shown in FIG. 2 and comprises an embedding layer, a position coding layer, a Transformer encoding layer, a feedforward neural network and a softmax network.

3.1 Embedding layer

To preserve the semantic relevance of the log keys, the template sequence is used as the training sample. The word vectors of all words in the log keys are obtained with word2vec, and the sentence vector representation of each log key is obtained as a weighted average of its word vectors, with weights determined by the frequency with which each word appears in the training samples.
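The weighted averaging can be sketched as follows. The text does not state the exact weighting, so this sketch assumes a smooth inverse-frequency weight a / (a + p(w)), where p(w) is the relative frequency of word w; the constant `a`, the toy two-dimensional word vectors and the function name are all assumptions, and in practice the word vectors would come from a trained word2vec model.

```python
from collections import Counter

def sentence_vector(log_key, word_vectors, word_counts, a=1e-3):
    """Frequency-weighted average of the word vectors of one log key."""
    total = sum(word_counts.values())
    words = [w for w in log_key.split() if w in word_vectors]
    if not words or total == 0:
        raise ValueError("log key has no known words")
    dim = len(word_vectors[words[0]])
    acc = [0.0] * dim
    weight_sum = 0.0
    for w in words:
        p = word_counts.get(w, 0) / total   # relative frequency of the word
        weight = a / (a + p)                # assumed weighting scheme
        weight_sum += weight
        for d in range(dim):
            acc[d] += weight * word_vectors[w][d]
    return [x / weight_sum for x in acc]

# Toy word vectors standing in for word2vec output (assumption).
word_vectors = {
    "Receiving": [1.0, 0.0],
    "block": [0.0, 1.0],
    "Deleting": [1.0, 1.0],
}
log_keys = ["Receiving block", "Deleting block"]
word_counts = Counter(w for k in log_keys for w in k.split())
key_vectors = {k: sentence_vector(k, word_vectors, word_counts) for k in log_keys}

# A log key sequence becomes a vector sequence by simple lookup.
key_sequence = ["Receiving block", "Deleting block", "Receiving block"]
vector_sequence = [key_vectors[k] for k in key_sequence]
```

With this weighting, the frequent word "block" contributes less to each sentence vector than the rarer words, which keeps the two templates distinguishable.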

For example, for the log key sequence {K2, K3, K5, K7, K2} and a window of size 3, two training samples are obtained: {K2, K3, K5} → {K7} and {K3, K5, K7} → {K2}. Each slide of the window yields one training sample, and all training samples together form the training set.
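The window segmentation in this example can be sketched in a few lines. The function name `make_samples` and the tuple layout are assumptions; `m` and `n` are the preceding and following context lengths, and `n=0` reproduces the example above.

```python
def make_samples(keys, m, n=0):
    """Return (preceding, following, target) for each position that has
    m keys before it and n keys after it."""
    samples = []
    for t in range(m, len(keys) - n):
        preceding = keys[t - m:t]
        following = keys[t + 1:t + 1 + n]
        samples.append((preceding, following, keys[t]))
    return samples

sequence = ["K2", "K3", "K5", "K7", "K2"]

# Window of size 3 with preceding context only, as in the example.
prefix_samples = make_samples(sequence, m=3)
# [(['K2', 'K3', 'K5'], [], 'K7'), (['K3', 'K5', 'K7'], [], 'K2')]

# One key of context on each side of the target.
context_samples = make_samples(sequence, m=1, n=1)
```

Setting `n` above zero adds following context at the cost of waiting for `n` more log entries before a position can be scored.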

3.2 Position coding layer

The vector sequence corresponding to the log stream is obtained from the sentence vector representations. Because this vector sequence is long, it is segmented with a sliding window of size h, in which the length of the context preceding the target is Li− and the length of the context following it is Li+. The position to be predicted in the window is covered with a mask mechanism, the elements at all positions in the window are position-coded, and the resulting position coding vectors are used as the input of the Transformer encoding layer. The window size h and the context lengths Li− and Li+ are adjustable and can be tuned to the accuracy and detection timeliness required in a given application.
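A minimal sketch of the masked window segmentation follows. It assumes that the window covers the preceding context, the target and the following context (so h = before + 1 + after) and that the mask is an all-zero vector; neither detail is fixed by the text, and the function and parameter names are illustrative.

```python
def masked_windows(vectors, labels, before, after, mask_value=0.0):
    """Slide a window over the vector sequence and cover the target position.

    Returns a list of (window, target_index, target_label) tuples.
    """
    dim = len(vectors[0])
    out = []
    for t in range(before, len(vectors) - after):
        window = [list(v) for v in vectors[t - before:t + after + 1]]
        window[before] = [mask_value] * dim   # assumed mask: zero vector
        out.append((window, before, labels[t]))
    return out

vectors = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
labels = [0, 1, 2, 3]   # log key index at each position
windows = masked_windows(vectors, labels, before=1, after=1)
# windows[0] -> ([[1.0, 1.0], [0.0, 0.0], [3.0, 3.0]], 1, 1)
```

The label of the covered position is kept separately so that the model's prediction for that position can be compared with the true log key during training.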

Specifically, the input is position-coded according to the following formulas:

$$PE(pos, i) = \sin\left(\frac{pos}{10000^{\,i/d_{model}}}\right) \quad \text{for even } i$$

$$PE(pos, i) = \cos\left(\frac{pos}{10000^{\,(i-1)/d_{model}}}\right) \quad \text{for odd } i$$

where pos is the position of an element in the vector sequence, with values starting from 0; i is the dimension index of the vector corresponding to the log key, with value range [0, embedding dimension); and d_model is the dimensionality of the sentence vector representation. When the dimension index is odd, the cos function is used; when it is even, the sin function is used. Each position thus receives unique position information, from which the model learns the dependencies and temporal characteristics between positions.
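The sin/cos position coding described here can be sketched as below. The sketch assumes the standard Transformer sinusoidal form with base 10000 and assumes that the code is added element-wise to the input vectors; both are conventions from the Transformer literature rather than details confirmed by this text.

```python
import math

def positional_encoding(length, d_model):
    """Sinusoidal code: sin on even dimension indices, cos on odd ones."""
    table = []
    for pos in range(length):
        row = []
        for i in range(d_model):
            # Even index i and the following odd index share one frequency.
            exponent = (i - i % 2) / d_model
            angle = pos / (10000 ** exponent)
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def add_position(window):
    """Add the position code to every vector of a window (assumed combination)."""
    table = positional_encoding(len(window), len(window[0]))
    return [[x + p for x, p in zip(vec, code)] for vec, code in zip(window, table)]

pe = positional_encoding(3, 4)
# pe[0] -> [0.0, 1.0, 0.0, 1.0]
encoded = add_position([[0.5, 0.5, 0.5, 0.5], [0.0, 0.0, 0.0, 0.0]])
```

Because every position maps to a different combination of sine and cosine values, two identical log keys at different window positions produce different inputs to the encoder.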

3.3 Transformer encoding layer and softmax network

The position coding vectors are input into the Transformer encoding layer, and the encoded sequence is fed into a softmax network to obtain an n-dimensional vector in which the value of each dimension represents the probability that the corresponding template in the template dictionary appears at the current position. The error between the predicted value and the true value is propagated back to update the parameters of the model, finally yielding the trained network. Training continues until the network converges and the detection performance on the test set is optimal, at which point the optimal model is obtained and saved. The template dictionary is obtained by parsing the log training data.
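The softmax output and the error that is fed back can be illustrated with a small sketch. The text only speaks of the difference between the predicted value and the true value; the cross-entropy loss used here is an assumption, and the scores are made-up numbers for a dictionary of three templates.

```python
import math

def softmax(scores):
    """Turn raw scores into a probability distribution over log keys."""
    peak = max(scores)                       # subtract the max for stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probabilities, true_index):
    """Assumed training loss: negative log-probability of the true log key."""
    return -math.log(probabilities[true_index])

scores = [2.0, 0.5, 0.1]        # hypothetical encoder outputs for 3 templates
probabilities = softmax(scores)
loss_if_key0 = cross_entropy(probabilities, 0)
loss_if_key2 = cross_entropy(probabilities, 2)
```

The loss is small when the true log key receives a high probability and large otherwise, which is the signal used to update the parameters.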

Specifically, the attention mechanism is used to learn the dependencies within the sequence. Attention is computed as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

where Q, K and V are obtained by multiplying the position coding vectors by three weight matrices WQ, WK and WV respectively; the three matrices are initialized and then determined during training. Q is multiplied by the transpose of K and, to prevent the result from becoming too large, divided by the square root of d_k, the dimension of the K matrix. The result is normalized and multiplied by the V matrix to give the output of the attention mechanism.
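A minimal single-head sketch of the scaled dot-product attention described in this section is given below. It assumes the standard form with a row-wise softmax between the scaling and the multiplication by V; the helper names and the identity weight matrices in the example are illustrative only.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over the rows of x."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]               # transpose of K
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]         # each row sums to 1
    return matmul(weights, v), weights

identity = [[1.0, 0.0], [0.0, 1.0]]
x = [[1.0, 0.0], [0.0, 1.0]]    # two positions, d_model = 2
output, weights = attention(x, identity, identity, identity)
```

Each row of `weights` states how strongly one position attends to every other position in the window, which is how preceding and following context both reach the covered position.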

Step 4. Model prediction. The optimal model obtained in Step 3 is used to perform sequence prediction on the real-time log stream. The log data likewise undergoes structured parsing and window segmentation to obtain a vectorized sequence, which is input into the network to obtain the probability distribution over log keys at each position.

Step 5. Anomaly determination. Based on the probability distribution obtained in Step 4, it is determined whether the log key of the actual log data under detection falls within the confidence interval predicted for that position (the N log keys with the highest probability). If so, the log is judged normal; otherwise it is judged abnormal.

The probability distribution produced by model prediction covers all log keys obtained from parsing: every log key has a corresponding probability, and the keys are ranked by probability. The confidence interval is set manually as the criterion for anomaly determination, so it can be adjusted according to the accuracy and recall of model detection observed during training.
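The top-N criterion can be sketched as follows. The function name, the example probabilities and the values of N are illustrative assumptions; log keys are represented by their index in the template dictionary.

```python
def is_anomalous(probabilities, actual_key, top_n):
    """Flag a log key that is not among the top_n most probable keys."""
    ranked = sorted(
        range(len(probabilities)),
        key=lambda index: probabilities[index],
        reverse=True,
    )
    return actual_key not in ranked[:top_n]

# Hypothetical predicted distribution over four log keys at one position.
predicted = [0.10, 0.60, 0.25, 0.05]

normal_case = is_anomalous(predicted, actual_key=2, top_n=2)     # key 2 is in the top 2
abnormal_case = is_anomalous(predicted, actual_key=0, top_n=2)   # key 0 is not
relaxed_case = is_anomalous(predicted, actual_key=0, top_n=3)    # a wider interval accepts key 0
```

A larger N accepts more log keys as normal, which lowers false alarms but can miss anomalies; this is the adjustment of the confidence interval mentioned above.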

In summary, the preprocessing method for the log key and the log anomaly detection model based on the target preceding and following sequence prediction provided by the invention can fully mine and utilize template semantics and log sequence context information, efficiently detect anomalous data in the log stream through training, and simultaneously remarkably improve the model detection efficiency.

Experimental data

The HDFS dataset consists of HDFS log data generated by Amazon EC2 nodes during 38 hours of operation. It contains 11,175,629 log entries; abnormal data, labeled by Hadoop domain experts, accounts for approximately 3%.

Model test result comparison (HDFS log)

Method	Precision	Recall	F1
Principal component analysis (PCA)	0.964	0.645	0.772
Log clustering (cluster)	0.852	0.761	0.791
Our model	0.863	0.831	0.846

The above embodiments are only for illustrating the technical solutions of the present invention and not for limiting the same, and a person skilled in the art may make modifications or equivalent substitutions to the technical solutions of the present invention without departing from the scope of the present invention, and the scope of the present invention should be determined by the claims.

Claims

1. A log anomaly detection method based on sequence prediction comprises the following steps:

the log anomaly detection model is obtained through the following steps:

a) Analyzing a plurality of normal log data sequences to obtain a plurality of normal log key sequences; analyzing the plurality of normal log data sequences to obtain a plurality of normal log key sequences, wherein the analyzing comprises:

setting a preceding length m and a following length n, wherein m ≥ 0, n ≥ 0, and m + n = P, P being the length of the first sliding window;

segmenting each normal log key sequence by using the set preceding length m and the following length n to obtain a plurality of training samples of each normal log key sequence;

d) Using a second sliding window for each vector sequence, covering a position to be predicted in the second sliding window, and obtaining a plurality of position coding vectors through the positions of elements in the second sliding window;

2. The method of claim 1, wherein the log key sequence to be detected is obtained using the Drain method.

3. The method of claim 1, wherein the preprocessing language model comprises a word2vec model.

4. The method of claim 1, wherein the vector sequence for each normal log key sequence is obtained by:

5. The method of claim 1, wherein the position to be predicted is covered using a mask mechanism.

6. The method of claim 1, wherein the position-coding vector is obtained by:

1) When the dimension index i of the sentence vector representation is even, the encoding result is

$$PE(pos, i) = \sin\left(\frac{pos}{10000^{\,i/d_{model}}}\right)$$

where pos is the position of an element in the vector sequence, i is the dimension index of the vector corresponding to the log key, and d_model is the dimensionality of the sentence vector representation;

2) When the dimension index i is odd, the encoding result is

$$PE(pos, i) = \cos\left(\frac{pos}{10000^{\,(i-1)/d_{model}}}\right)$$

7. The method of claim 1, wherein, in encoding each position coding vector, an attention mechanism is used to learn the dependency relationship between the position coding results.

8. A storage medium having a computer program stored thereon, wherein the computer program is arranged to, when run, perform the method of any of claims 1-7.

9. An electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the method according to any of claims 1-7.