CN111209168A - Log sequence anomaly detection framework based on nLSTM-self attention - Google Patents
- Publication number: CN111209168A (application CN202010037427.4A)
- Authority: CN (China)
- Prior art keywords: log, sequence, layer, self, anomaly detection
- Prior art date: 2020-01-14
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3065—Monitoring arrangements determined by the means or processing involved in reporting the monitored data
- G06F11/3072—Monitoring arrangements determined by the means or processing involved in reporting the monitored data where the reporting involves data filtering, e.g. pattern matching, time or event triggered, adaptive or policy-based reporting
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Abstract
The invention relates to a log sequence anomaly detection framework based on nLSTM-self attention, which comprises: a training model and an anomaly detection model. The training model comprises the following steps: suppose a log file contains k log templates E = {e_1, e_2, ..., e_k}. The input of the training model is a sequence of log templates: a log sequence l_{t-h}, ..., l_{t-2}, l_{t-1} of length h, where each contained log template l_i ∈ E for t-h ≤ i ≤ t-1, and the number of log templates in a sequence |l_{t-h}, ..., l_{t-2}, l_{t-1}| = m ≤ h. Each log template corresponds to a template number, from which a log template dictionary is generated; the normal log template sequences are then turned into input sequences and target data and fed into the anomaly detection model for training. The detection stage comprises: data are input in the same way as in the training stage, and the model generated in the training stage performs anomaly detection. The model output is a probability vector P = (p_1, p_2, ..., p_k), where p_i represents the probability that the target log template is e_i. If the actual target data is among the predicted values, the log sequence is judged normal; otherwise it is judged abnormal.
Description
Technical Field
The invention relates to network security technology, and in particular to a log sequence anomaly detection framework based on nLSTM-self attention.
Background
The network environment is increasingly complex, and attacks against network applications and systems emerge continuously, often combining multiple attack techniques, so existing anomaly detection methods are no longer suitable for novel attacks. Once an attack succeeds, or the network application itself behaves abnormally, immeasurable losses are brought to the owners and users of the application. The earlier attacks and faults are detected, the smaller the loss incurred.
Networks, systems and applications generate logs during operation to record running states and important events, so logs contain extremely rich dynamic information, and log analysis is important for many maintenance tasks. These include security tasks such as intrusion detection, insider threat detection and malware detection, as well as more common maintenance tasks such as detecting hardware failures. By analyzing logs, we can detect abnormal behavior and mine potential security threats.
In recent years, research using logs as an anomaly detection data source has received more and more attention. The earliest log anomaly detection methods were mostly manual, rule-based procedures, but as online service systems grow larger such methods are clearly inefficient, requiring manual inspection of a huge number of logs. With the development of machine learning, many studies adopt feature engineering and use various clustering methods to discover abnormal points or abnormal sequence patterns for anomaly detection. Liu Zhaoli et al. propose an integration method using K-prototype clustering and a K-NN classification algorithm: it analyzes the characteristics of system logs and applies the K-prototype clustering algorithm to divide a data set into different clusters; obviously normal events, which typically appear as highly coherent clusters, are then filtered out, and the remaining events are treated as anomaly candidates for further analysis. However, the clustering-based analysis method rests on the assumption that an abnormal log is an event that occurs only occasionally in the log file, and this assumption is not always true. He S. summarizes and compares several of the more advanced machine learning methods for log anomaly detection of recent years. Existing machine learning algorithms first parse the original logs into log templates and group the templates according to different grouping strategies, each group corresponding to one row; features are then extracted from the grouped sequences. This feature extraction only considers occurrence frequency and ignores the temporal order of the logs, so it can detect that an anomaly exists but cannot locate it, which is very unfavorable for later anomaly localization and diagnosis. The clustering method presupposes that an abnormal log is one that occurs only occasionally. This assumption is reasonable in most cases, but there are exceptions: a log that occurs occasionally is not necessarily abnormal, and some application systems are configured with logs that mainly record abnormal information, in which case the abnormal logs are not occasional at all. A large number of anomaly detection log mining methods are designed for specific applications: Beehive performs unsupervised clustering on features of network equipment logs to identify potential security threats, after which abnormal logs are labeled manually; Oprea et al. use belief propagation on DNS logs to detect early enterprise threats; the PerfAugur system discovers anomalies in system performance by mining features of server logs. Bovenzi et al. propose an operating-system-level anomaly detection method that is very effective for mission-critical systems. Venkatakrishnan et al. propose a diversity-based security anomaly detection approach to prevent system attacks. Zhonchang et al. propose a new method for analyzing DNS query behavior: the queried domain names and the querying hosts are mapped into a vector space with a deep learning mechanism, and correlation analysis and clustering are applied to find abnormal problems in the network such as botnets.
These methods, while accurate, are only suitable for detecting anomalies in certain types of logs, are limited to specific application scenarios, and require domain experts.
Logs are sequence data, and the order in which logs appear carries dependency relationships, sometimes spanning long distances. For example, some new attacks do not cause damage immediately: they first reach certain prerequisites, or perform some normal operations before the damage occurs, which is reflected as long-term dependencies in the log sequence. General sequence-based anomaly detection methods have also been widely studied. Earlier methods of this kind were mostly based on statistical models or Markov models. Although statistical-model-based methods are easy to implement, their accuracy can be low if their prior assumptions do not hold. Among Markov-model-based methods, the most basic one models a log event sequence with a Markov chain, i.e., estimates the corresponding probabilities from event frequencies and transition frequencies, and then judges whether a test log is abnormal by computing its occurrence probability; Ye N. proposes a detection method based on a first-order Markov chain. Although higher-order Markov chains improve descriptive power, the number of model parameters grows exponentially with the order, requiring more training logs and more memory; Bao L. et al. treat traces as sequence data and use a probabilistic-suffix-tree-based approach to organize and distinguish the important statistical properties of sequences. Recurrent neural networks are better suited to processing sequence data; in recent years LSTM has performed well in sequence prediction, and researchers have applied the model to log sequence prediction tasks. Zhang et al. use clustering to generate feature sequences from the original log text of multiple log sources and input them into an LSTM model for hardware and software failure prediction; Du Min et al., inspired by NLP, parse the original text of system logs, generate a log template sequence, and input it to an LSTM to detect denial-of-service attacks. That framework adopts one-hot encoded input and, in the anomaly detection part, trains and detects with a 2-layer stacked LSTM (denoted 2LSTM). Although its accuracy on some data sets is greatly improved over machine learning methods, the framework suffers from the insufficient representation capability of one-hot encoding and the insufficient ability of LSTM to process long sequences, and it does not perform well on all data sets. These methods are preliminary explorations of log sequence anomaly detection with LSTM, and the detection accuracy still needs further improvement.
Disclosure of Invention
The invention aims to provide a log sequence anomaly detection framework based on nLSTM-self attention to solve the problems of the prior art.
The invention relates to a log sequence anomaly detection framework based on nLSTM-self attention, characterized by comprising: a training model and an anomaly detection model; the training model comprises the following steps: suppose a log file contains k log templates E = {e_1, e_2, ..., e_k}; the input of the training model is a sequence of log templates: a log sequence l_{t-h}, ..., l_{t-2}, l_{t-1} of length h, where each contained log template l_i ∈ E for t-h ≤ i ≤ t-1, and the number of log templates in a sequence |l_{t-h}, ..., l_{t-2}, l_{t-1}| = m ≤ h; each log template corresponds to a template number, from which a log template dictionary is generated; the normal log template sequences are then turned into input sequences and target data and fed into the anomaly detection model for training; the detection stage comprises: data are input in the same way as in the training stage, and the model generated in the training stage performs anomaly detection; the model output is a probability vector P = (p_1, p_2, ..., p_k), where p_i represents the probability that the target log template is e_i; if the actual target data is among the predicted values, the log sequence is judged normal, otherwise it is judged abnormal.
According to an embodiment of the nLSTM-self attention based log sequence anomaly detection framework, the training loss function is cross entropy, and the loss function is optimized with an adaptive gradient descent method.
According to an embodiment of the nLSTM-self attention based log sequence anomaly detection framework of the present invention, a log file contains a plurality of event types, each event type contains a plurality of logs, logs belonging to the same event type share a common template, and a log sequence is treated as a sequence of occurring events, i.e., the sequence of log templates corresponding to the original log sequence; anomaly detection is performed on the log template sequence corresponding to the original log sequence.
According to an embodiment of the nLSTM-self attention based log sequence anomaly detection framework of the present invention, the anomaly detection model comprises: a word embedding layer, an n-layer long short-term memory (LSTM) neural network layer, and a self-attention layer. The word embedding layer takes the log template sequence as input, serves as the front-end input of the anomaly detection framework, and maps the number of each log template in the sequence into a dense word embedding; the nLSTM layer takes the distributed word embedding of each log template obtained by the word embedding layer as input; the self-attention layer first calculates the dependency relationship among the logs in a sequence, takes the hidden states of all top-layer LSTM units as its input, performs similarity calculation, and then performs normalization to obtain the probability weights of the self-attention value; the weighted summation of the outputs of the n-layer LSTM network is the result of the self-attention value.
According to an embodiment of the nLSTM-self attention based log sequence anomaly detection framework, an LSTM unit is defined as follows: x_t represents the word embedding of a log template, C_t denotes the cell state of the t-th LSTM unit in the current sequence, and h_t denotes the hidden state of the t-th LSTM unit in the current sequence; the σ module represents the sigmoid function, tanh represents the tanh function, ⊙ denotes the dot product, and ⊕ denotes addition. The hidden output h_t of an LSTM cell is:

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)    (1)
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)    (2)
C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)    (3)
C_t = f_t * C_{t-1} + i_t * C̃_t    (4)
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)    (5)
h_t = o_t * tanh(C_t)    (6)

Equation (1) represents the forget gate, which determines what information to discard from the cell state; equations (2), (3) and (4) represent the input gate: equations (2) and (3) determine what new information is stored in the cell state, where equation (3) computes the new candidate values, and equation (4) updates the cell state; equations (5) and (6) represent the output gate.
According to an embodiment of the log sequence anomaly detection framework based on nLSTM-self attention, the word embedding layer specifies the dimension when generating word vectors, the vectors are initialized with small random numbers, and a back propagation algorithm is used for training and updating.
According to an embodiment of the nLSTM-self attention based log sequence anomaly detection framework of the present invention, the anomaly detection model further comprises: a linear layer for converting the output of the self-attention layer into a probability vector P = (p_1, p_2, ..., p_k) of dimension k, the log file comprising k log templates, where p_i represents the probability that the next log template predicted for the current sequence is e_i.
According to an embodiment of the nLSTM-self attention based log sequence anomaly detection framework of the present invention, the output of an LSTM unit comprises a cell state and a hidden state; the hidden state and cell state of the previous LSTM unit are passed to the next LSTM unit, and the hidden state is also passed to the stacked upper LSTM layer as its input; each LSTM unit corresponds to the word embedding of one log template in the sequence, and if h is the sequence length, each LSTM layer contains h LSTM units.
In accordance with one embodiment of the nLSTM-self attention based log sequence anomaly detection framework of the present invention,
the self-attention layer first calculates the dependency relationship between logs in a sequence and expresses it as a similarity score; the similarity s(h_t, h_s) is computed by dot product, giving a non-normalized score:

α = s(h_t, h_s) = Q · Q^T    (7)

The self-attention layer takes the hidden states of all top-layer LSTM units as input; the input data Q has size batch_size × sequence length (h) × number of hidden-state neurons, and the score matrix α obtained after the similarity calculation has size batch_size × h × h, representing the pairwise dependency between the logs in the sequence.
The non-normalized scores are then normalized to obtain the probability weights of the self-attention value:

A = softmax(α)    (8)

The weighted summation of the LSTM outputs is the result of the self-attention calculation:

new_hidden = A · Q    (9)

The result of the self-attention value is a tensor of size batch_size × h × hidden_size; the last column new_hidden[:, -1] of each batch is taken as the final output of the self-attention layer.
According to an embodiment of the nLSTM-self attention based log sequence anomaly detection framework of the present invention, the nLSTM layer takes the distributed word embedding of each log template obtained by the embedding layer as the input of the bottom-layer LSTM units; the hidden state and cell state of the previous LSTM unit are passed to the next LSTM unit, so that sequence information is propagated from front to back; the hidden state is also passed correspondingly to the stacked upper LSTM layer; and the output of each top-layer LSTM unit serves as the output of the nLSTM layer and participates in the calculation of the self-attention layer.
The invention provides a log anomaly detection framework, nLSALLog, based on nLSTM-self attention, where n represents the number of stacked LSTM layers. The framework avoids the complicated feature extraction step of existing machine learning methods: relying on the strong automatic feature learning capability of deep learning, the semantic vector representations of the log templates are input into the multilayer LSTM network, and the hidden-layer state vectors and the output of the multilayer LSTM serve as the input of the self-attention layer, so the context information of the sequence is better preserved and exploited and the long-term dependence problem of the sequence is better solved. The model obtained by training on normal data can detect unknown anomalies, and the log sequence anomaly detection can locate the anomalous position, which is of great significance for later anomaly diagnosis.
Drawings
FIG. 1 is a schematic diagram of a log sequence anomaly detection framework based on nLSTM-self attention;
FIG. 2 is a schematic diagram of an anomaly detection model;
FIG. 3 shows a detailed view of the interior of an LSTM cell;
FIGS. 4a-d are schematic diagrams illustrating the anomaly detection evaluation under different parameter settings.
Detailed Description
In order to make the objects, contents, and advantages of the present invention clearer, the following detailed description of the embodiments of the present invention will be made in conjunction with the accompanying drawings and examples.
FIG. 1 is a schematic diagram of the log sequence anomaly detection framework based on nLSTM-self attention. As shown in FIG. 1, the framework includes a training phase and a detection phase. The training phase comprises: suppose a log file contains k log templates E = {e_1, e_2, ..., e_k}. The input of the training phase is a sequence of log templates: a log sequence l_{t-h}, ..., l_{t-2}, l_{t-1} of length h, where each contained log template l_i ∈ E for t-h ≤ i ≤ t-1, and the number of log templates in a sequence |l_{t-h}, ..., l_{t-2}, l_{t-1}| = m ≤ h. To facilitate data processing, each log template corresponds to a template number, from which a log template dictionary is generated; the normal log template sequences are then turned into input sequences and target data and fed into the anomaly detection model for training. The training loss function is cross entropy, optimized with the adaptive gradient descent method Adam. The detection phase comprises: data are input in the same way as in the training phase, and the model generated in the training phase performs anomaly detection. The model output is a probability vector P = (p_1, p_2, ..., p_k), where p_i represents the probability that the target log template is e_i. The prediction is in effect a multi-class problem, but the final result is a binary normal/abnormal decision, so further judgment is needed. Empirically, especially when the number of log templates is small, an input sequence may have more than one legitimate target log template, so the log templates corresponding to the g largest probability values in P are all considered normal; the "predicted values" in the question "is the target within the predicted values?" are these top-g log templates. If the actual target data is among the predicted values, the log sequence is judged normal; otherwise it is judged abnormal. A log file comprises a plurality of event types, each event type comprises a plurality of logs, and logs belonging to the same event type share a common template; a log sequence can thus be understood as a sequence of events, i.e., the sequence of log templates corresponding to an original log sequence. The invention performs anomaly detection on the log template sequence corresponding to the original log sequence.
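As a concrete illustration of the data preparation described above, the following Python sketch builds a template dictionary and generates (input sequence, target) pairs with a sliding window of length h (all function and variable names are illustrative assumptions, not part of the patented method):

```python
# Hypothetical sketch of the data preparation step; names are illustrative.

def build_template_dictionary(templates):
    """Map each distinct log template string to a template number."""
    return {tpl: idx for idx, tpl in enumerate(dict.fromkeys(templates))}

def make_training_pairs(template_sequence, h):
    """Slide a window of length h over a normal log template sequence,
    producing (input sequence, next-template target) pairs."""
    pairs = []
    for t in range(len(template_sequence) - h):
        window = template_sequence[t:t + h]   # l_{t-h}, ..., l_{t-1}
        target = template_sequence[t + h]     # the template to predict
        pairs.append((window, target))
    return pairs

# Example: template numbers already looked up in the dictionary.
seq = [0, 1, 2, 1, 3, 2, 4]
print(make_training_pairs(seq, h=3))
# [([0, 1, 2], 1), ([1, 2, 1], 3), ([2, 1, 3], 2), ([1, 3, 2], 4)]
```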
FIG. 2 is a schematic diagram of the anomaly detection model. As shown in FIG. 1 and FIG. 2, the anomaly detection model in the training phase includes 3 layers: a word embedding layer, an nLSTM layer, and a self-attention layer. The word embedding layer takes the log template sequence as input, serves as the front-end input of the anomaly detection framework, and maps the number of each log template in the sequence into a dense word embedding. As part of the anomaly detection model, the word embedding layer specifies the dimension when generating word vectors; the vectors are initialized with small random numbers and trained and updated by the back propagation algorithm. Compared with the open-source pre-trained packages Word2Vec and GloVe, the word embedding layer is a slower method, but it can customize word embeddings for a specific log data set through model training. Word embeddings trained with the neural network contain rich context information, express well the semantic regularities of the target word in the current log sequence, and also achieve dimension reduction. The nLSTM layer takes the distributed word embedding of each log template obtained by the word embedding layer as input. Taking a 2-layer LSTM as an example, the output of one LSTM unit includes a cell state and a hidden state; the hidden state and cell state of the previous LSTM unit are passed to the next LSTM unit, and the hidden state is also passed upward as input to the stacked upper LSTM layer. Each LSTM unit at the bottom layer corresponds to the word embedding of one log template in the sequence. If h is the sequence length, each LSTM layer contains h LSTM units.
FIG. 3 shows a detailed view of the interior of an LSTM cell. As shown in FIG. 3, x_t represents the word embedding of a log template, C_t denotes the cell state of the t-th LSTM unit in the current sequence, and h_t denotes the hidden state of the t-th LSTM unit in the current sequence; the σ module represents the sigmoid function, tanh represents the tanh function, ⊙ denotes the dot product, and ⊕ denotes addition. The derivation of the hidden output h_t of an LSTM cell is shown in equations (1) to (6).

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)    (1)
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)    (2)
C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)    (3)
C_t = f_t * C_{t-1} + i_t * C̃_t    (4)
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)    (5)
h_t = o_t * tanh(C_t)    (6)

Equation (1) represents the forget gate, which determines what information to discard from the cell state; equations (2), (3) and (4) represent the input gate: equations (2) and (3) determine what new information is stored in the cell state, where equation (3) computes the new candidate values, and equation (4) updates the cell state; equations (5) and (6) represent the output gate, whose output is based on the filtered current cell state. In the stacked LSTM case, this output serves as the input to the corresponding unit of the next LSTM layer.
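For clarity, a minimal NumPy sketch of a single LSTM cell step following equations (1)-(6) is given below (the function signature and parameter names are assumptions made for illustration; the patent itself publishes no code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x_t, h_prev, c_prev, W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o):
    """One LSTM cell step; weight matrices act on the concatenation [h_{t-1}, x_t]."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W_f @ z + b_f)          # (1) forget gate
    i_t = sigmoid(W_i @ z + b_i)          # (2) input gate
    c_tilde = np.tanh(W_c @ z + b_c)      # (3) candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde    # (4) cell state update
    o_t = sigmoid(W_o @ z + b_o)          # (5) output gate
    h_t = o_t * np.tanh(c_t)              # (6) hidden state
    return h_t, c_t
```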
The self-attention layer can learn the internal structure of a sequence and has many successful applications in reading comprehension, text classification, machine translation and the like. The invention applies this idea to the predictive analysis of log sequences. First, the dependency relationship between logs in a sequence is calculated and expressed as a similarity score: the similarity s(h_t, h_s) is computed by dot product, giving a non-normalized score:

α = s(h_t, h_s) = Q · Q^T    (7)

The self-attention layer takes the hidden states of all top-layer LSTM units as input; the input data Q has size batch_size × sequence length (h) × number of hidden-state neurons (hidden_size). The score matrix α obtained after the similarity calculation has size batch_size × h × h and represents the pairwise dependency between the logs in the sequence.
Softmax normalization is then performed on the non-normalized scores to obtain the probability weights of the attention value:

A = softmax(α)    (8)

The weighted summation of the LSTM outputs is the result of the attention calculation:

new_hidden = A · Q    (9)

The result of the attention value is a tensor of size batch_size × h × hidden_size; the last column new_hidden[:, -1] of each batch is taken as the final output of the self-attention layer. Since self-attention computes attention between every pair of positions, the maximum path length between any two logs is 1 no matter how far apart they are, so long-range dependencies can be captured.
A linear layer is added as the last layer of the detection model, converting the self-attention output into a probability vector P = (p_1, p_2, ..., p_k) of dimension k (the log file contains k log templates), where p_i represents the probability that the next log template of the current sequence is e_i.
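Putting the three layers and the final linear layer together, a minimal PyTorch sketch of the detection model might look as follows (class name, defaults, and wiring are assumptions for illustration; the patent does not publish an implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NLSALLog(nn.Module):
    """Word embedding -> n-layer stacked LSTM -> self-attention -> linear layer."""
    def __init__(self, k_templates, input_size=10, hidden_size=64, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(k_templates, input_size)
        self.lstm = nn.LSTM(input_size, hidden_size,
                            num_layers=n_layers, batch_first=True)
        self.linear = nn.Linear(hidden_size, k_templates)

    def forward(self, x):                        # x: (batch, h) template numbers
        e = self.embed(x)                        # (batch, h, input_size)
        Q, _ = self.lstm(e)                      # top-layer hidden states
        alpha = torch.bmm(Q, Q.transpose(1, 2))  # equation (7)
        A = F.softmax(alpha, dim=-1)             # equation (8)
        new_hidden = torch.bmm(A, Q)[:, -1]      # equation (9) + last position
        return self.linear(new_hidden)           # logits over the k templates
```

Training this sketch with PyTorch's cross-entropy loss and the Adam optimizer corresponds to the training setup described above; a softmax over the logits yields the probability vector P.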
The experimental platform is Ubuntu 16.04 LTS, a 64-bit system with 62.8 GiB of memory, an Intel Xeon(R) CPU E5-2620 v4 @ 2.10GHz × 16 processor, and a dual GeForce GTX 1080Ti/PCIe/SSE2 GPU platform. The HDFS log data set contains 11,175,629 raw log records covering 38.7 hours, collected from a cluster of 203 nodes on the Amazon EC2 platform; it is 1.6 GB in size and has 28 templates. For the HDFS log, the block_id identifier is used as the basic operation unit, and the logs contained in each block_id are regarded as one time-domain window. The logs are first divided into 575,059 time-domain windows according to block_id, and the log template sequence corresponding to the original log sequence contained in each window is generated. Each time-domain window is equivalent to a sentence and each log template inside it is equivalent to a word: if one word is abnormal, the sentence is considered abnormal. The same training data set as DeepLog is selected; all of it is normal data, and the sample size accounts for less than 1% of the whole data set. The BlueGene/L (BGL) data set, a public, partially labeled data set from the well-known high performance computing laboratory Lawrence Livermore National Labs (LLNL), contains 4,747,963 raw logs covering 215 days and is 708 MB in size. For BGL anomaly detection, the template sequence corresponding to the original logs is divided with a sliding window. Taking the logs within 6 or 8 hours as one row is unreasonable here, because some rows would contain hundreds of thousands of logs while others contain fewer than 10, so the judgment rule "one abnormal log in a row makes the sentence abnormal" does not hold under such a division. In the invention, the BGL data set is therefore not divided by a fixed time window but by the sequence length specified in the model parameters, sliding the window down one sequence at a time. 80% of the data set is selected as the training set. The information of both data sets is shown in Table 1; the last two columns, "number of abnormal/normal (windows)", give the number of abnormal/normal logs for BGL and the number of abnormal/normal session windows for HDFS.
TABLE 1 data set information
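As an illustration of the session-window grouping described above for HDFS, the following sketch groups parsed logs by block_id (the regular expression and the input layout are assumptions for illustration):

```python
import re
from collections import defaultdict

BLOCK_ID = re.compile(r"blk_-?\d+")  # assumed pattern for HDFS block ids

def group_by_block_id(parsed_logs):
    """parsed_logs: iterable of (raw_line, template_number) pairs.
    Returns one log template sequence per block_id session window."""
    sessions = defaultdict(list)
    for line, template_no in parsed_logs:
        match = BLOCK_ID.search(line)
        if match:
            sessions[match.group(0)].append(template_no)
    return sessions
```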
The evaluation criteria are as follows: for anomaly detection problems, the literature often adopts Precision, Recall and F1 values, but these indexes do not account for the fact that most log data sets are imbalanced, i.e., they contain many normal logs and few abnormal logs. The detection performance of the proposed nLSTM-self attention based anomaly detection framework is evaluated mainly by measuring the effect of detecting log sequences, using three indexes: TPR, FPR and accuracy. These respectively examine the detection effect on truly abnormal log sequences and on normal log sequences, as well as the overall accuracy, and are therefore not affected by data set imbalance. The anomaly log detection confusion matrix is shown in Table 2.
In Table 2, TP represents the number of abnormal log sequences correctly detected as abnormal; FN represents the number of abnormal log sequences erroneously detected as normal; FP represents the number of normal log sequences erroneously detected as abnormal; TN represents the number of normal log sequences correctly detected as normal. The corresponding evaluation indexes are as follows.
Table 2 log sequence anomaly detection confusion matrix
(1) The true positive rate (TPR), also called the detection rate, represents the ratio of the number of abnormal log sequences correctly detected as abnormal to the total number of abnormal log sequences; the higher the value, the better the performance. The calculation formula is: TPR = TP / (TP + FN).
(2) The false positive rate (FPR), also called the false alarm rate, represents the ratio of normal log sequences erroneously detected as abnormal to the total number of actually normal log sequences; the smaller the value, the better the performance. The calculation formula is: FPR = FP / (FP + TN).
(3) Accuracy represents the ratio of the number of correctly judged samples to the total number of samples in the test result; the larger the value, the better the performance. The calculation formula is: Accuracy = (TP + TN) / (TP + TN + FP + FN).
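These three indexes follow directly from the confusion matrix counts; a small sketch (hypothetical helper names):

```python
def tpr(tp, fn):
    """True positive rate / detection rate."""
    return tp / (tp + fn)

def fpr(fp, tn):
    """False positive rate / false alarm rate."""
    return fp / (fp + tn)

def accuracy(tp, tn, fp, fn):
    """Overall ratio of correctly judged samples."""
    return (tp + tn) / (tp + tn + fp + fn)
```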
the specific process is as follows: respectively loading the normal log set and the abnormal log set in the test set into trained models for testing by taking the same sequence length as that in the training process, and adding 1 to the normal log sequence set and FP if the first g log templates obtained by the current sequence prediction of the models do not contain the actual next log template; for an abnormal log sequence set, TP is incremented by 1. If one of the predictions in a sequence is abnormal and one is abnormal.
The performance analysis includes: experimental parameter settings, with the optimal parameters of the model determined by repeatedly testing different parameter combinations; the optimal parameters are used in subsequent comparison experiments. For the HDFS data set, the sequence length h is set to 10, the word embedding dimension input_size to 10, the number of hidden layer nodes of the LSTM units hidden_size to 64, the number of LSTM layers to 2, and the learning rate lr to 0.001; in the detection phase, g is set to 9 as described in section 3.2. For the BGL data set, the sequence length h is set to 10, the word embedding dimension input_size to 100, the number of hidden layer nodes hidden_size to 64, the number of LSTM layers to 2, and the learning rate lr to 0.0005; in the detection phase, g is set to 20.
The analysis of the experimental results comprises: experiments are carried out on the BGL and HDFS data sets respectively and compared with 2-layer recurrent neural networks RNN and GRU, finding that the 2-layer LSTM is superior for this task. Furthermore, to illustrate the influence of the word embedding layer and the self-attention layer on the detection model, the model is evaluated in three configurations: 2LSTM [9], word embedding layer + 2LSTM, and word embedding layer + 2LSTM + self-attention, each tested with the same parameter settings. The experimental results are shown in Tables 3 and 4.
As can be seen from Table 3, the 2LSTM detection model is superior to the 2RNN and 2GRU models; the performance of the detection model is greatly improved after the word embedding layer is added to 2LSTM, and the TPR and FPR improve further after the self-attention layer is added, raising the overall accuracy. The number of abnormal log sequences correctly detected as abnormal increases by a further 33, and the number of normal log sequences erroneously detected as abnormal decreases by 4.
As can be seen from Table 4, the FPR obtained by the recurrent neural network models 2RNN, 2GRU and 2LSTM on the BGL data set is high, and among the three models 2LSTM has the highest accuracy. After the word embedding layer is added to 2LSTM, the FPR falls by 57%, the TPR falls by only 0.06%, and the overall accuracy rises by 9.8%; after the self-attention layer is further added, the FPR falls by another 18.2%, the TPR changes only slightly, and the overall accuracy improves further.
Experimental verification on the two data sets shows that the word embedding layer + 2LSTM + self-attention model greatly improves the overall detection performance. In particular, adding the word embedding layer to 2LSTM greatly improves detection performance, demonstrating the effectiveness of word embedding for mining hidden log patterns: a log sequence embodies causal relationships, which are semantic relationships, so the log template sequence is modeled through word embedding of semantic features, and the 2LSTM + self-attention layers then automatically learn the hidden semantic information of the whole sentence. The experimental results also show the effectiveness of borrowing ideas from natural language processing. The self-attention layer obtains a dependency score between every pair of log templates in the current sequence, so it better represents the pairwise dependency relationships and better reflects the causal relationships in the sequence. Accordingly, the experiments confirm that the detection performance of the model improves after the self-attention layer is added.
The data sets in Table 1 were tested on the corresponding pre-trained models using 2LSTM, word embedding layer + 2LSTM, and the method of the present invention; the experimental environment is as described in section 4.1, and the run times are shown in Tables 5 and 6. The tables show that the test time does not increase with model complexity; instead, after the input vector dimension is reduced by word embedding, the test time decreases by 11%. After the self-attention layer is added, the test time is essentially unchanged while the detection performance improves. The test times of the three models on the BGL data set do not change significantly because the test set is small and run-time differences are not apparent. The detection model of the invention therefore improves the detection effect without increasing the time cost.
TABLE 3 comparison of the test results of different test models on HDFS dataset
TABLE 4 comparison of detection Effect of different detection models on BGL dataset
TABLE 5 run time (HDFS)
Parameter sensitivity analysis: the main parameters are the number of LSTM layers, the word vector dimension input_size, the number of hidden layer nodes hidden_size, and the sequence length h. Taking the BGL data set as an example, FIG. 4 shows the anomaly detection evaluation under different parameter settings; during each experiment only the parameter under evaluation is changed while the other parameters keep their optimal settings. The evaluation indexes with data labels in the figure are FPR and Accuracy. On the basis of high accuracy, a balance between a relatively low FPR and a relatively high TPR is the basis for selecting the optimal parameters. As shown in FIG. 4(c), when the word embedding dimension is set to 100 there is a relatively balanced state of FPR and TPR; when it is set to 90, although the TPR is higher, it comes at the expense of a high FPR (the lower the FPR, the better), and the overall accuracy actually decreases.
In order to fully utilize the dependency relationships among log sequences, the invention provides a general log sequence anomaly detection framework based on nLSTM-self attention. A model trained on normal data can detect unknown anomalies. Relying on the strong automatic feature learning capability of deep learning, the model feeds the word embeddings of the log templates into a multi-layer LSTM network, and the resulting hidden-layer state vectors and multi-layer LSTM outputs serve as the input of the self-attention layer, so the information of all logs in the sequence is better attended to and the long-term dependence problem of the sequence is better solved; a linear layer added as the last layer of the detection model converts the result into a probability vector, completing the prediction for the current sequence. Experimental results show that the proposed model is flexible, detects anomalies in logs well, incurs no increase in running time from the added model complexity, and achieves the best detection effect in the field of log sequence anomaly detection to date. Next, we will continue the research in two directions: first, further study the setting of g in the detection stage, since instead of a uniform setting, a better method would automatically identify the possible states of the next log event according to the current sequence; second, perform anomaly diagnosis on the basis of log anomaly detection, locating the anomaly position and analyzing its cause, to help network and system administrators.
The invention has the following advantages: (1) a general log sequence anomaly detection framework based on nLSTM-Self Attention (nLSALLog) is proposed, applying a self-attention mechanism to log anomaly detection for the first time; (2) a theoretical analysis of nLSALLog is provided, illustrating its correctness and scalability; (3) the detection performance and cost of the proposed nLSALLog are verified through experiments, and the basis of parameter setting is discussed.
The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and variations can be made without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as the protection scope of the present invention.
Claims (10)
1. An nLSTM-self attention based log sequence anomaly detection framework, comprising: a training model and an anomaly detection model; the training model comprises the following steps: suppose a log file contains k log templates E = {e_1, e_2, ..., e_k}; the input of the training model is a sequence of log templates: a log sequence l_{t-h}, ..., l_{t-2}, l_{t-1} of length h, where each contained log template l_i ∈ E for t-h ≤ i ≤ t-1, and the number of log templates in a sequence |l_{t-h}, ..., l_{t-2}, l_{t-1}| = m ≤ h; each log template corresponds to a template number, from which a log template dictionary is generated; the normal log template sequences are then turned into input sequences and target data and fed into the anomaly detection model for training; the detection stage comprises: data are input in the same way as in the training stage, and the model generated in the training stage performs anomaly detection; the model output is a probability vector P = (p_1, p_2, ..., p_k), where p_i represents the probability that the target log template is e_i; if the actual target data is among the predicted values, the log sequence is judged normal, otherwise it is judged abnormal.
2. The nLSTM-self attention based log sequence anomaly detection framework of claim 1, wherein the training loss function is cross entropy, and the loss function is optimized with an adaptive gradient descent method.
3. The nLSTM-self attention based log sequence anomaly detection framework of claim 1, wherein a log file comprises a plurality of event types, each event type comprises a plurality of logs, logs belonging to the same event type share a common template, and a log sequence is treated as a sequence of occurring events, i.e., the sequence of log templates corresponding to an original log sequence; anomaly detection is performed on the log template sequence corresponding to the original log sequence.
4. The nLSTM-self attention based log sequence anomaly detection framework of claim 1, wherein the anomaly detection model comprises: a word embedding layer, an n-layer long short-term memory neural network layer, and a self-attention layer; the word embedding layer takes the log template sequence as input, serves as the front-end input of the anomaly detection framework, and maps the number of each log template in the sequence into a dense word embedding; the nLSTM layer takes the distributed word embedding of each log template obtained by the word embedding layer as input; the self-attention layer first calculates the dependency relationship among the logs in a sequence, takes the hidden states of all top-layer long short-term memory units as its input, performs similarity calculation, and then performs normalization to obtain the probability weights of the self-attention value; the weighted summation of the outputs of the n-layer long short-term memory network is the result of the self-attention value.
5. The nLSTM-self attention based log sequence anomaly detection framework of claim 1, wherein the LSTM unit is defined as follows: x_t represents the word embedding of a log template, C_t denotes the cell state of the t-th LSTM unit in the current sequence, and h_t denotes the hidden state of the t-th LSTM unit in the current sequence; the σ module represents the sigmoid function, tanh represents the tanh function, ⊙ denotes the dot product, and ⊕ denotes addition; the hidden output h_t of an LSTM cell is:

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)    (1)
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)    (2)
C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)    (3)
C_t = f_t * C_{t-1} + i_t * C̃_t    (4)
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)    (5)
h_t = o_t * tanh(C_t)    (6)

Equation (1) represents the forget gate, which determines what information to discard from the cell state; equations (2), (3) and (4) represent the input gate: equations (2) and (3) determine what new information is stored in the cell state, where equation (3) computes the new candidate values, and equation (4) updates the cell state; equations (5) and (6) represent the output gate.
6. The nLSTM-self attention based log sequence anomaly detection framework of claim 4, wherein the word embedding layer specifies the dimension when generating word vectors, the vectors are initialized with small random numbers, and a back propagation algorithm is used for training and updating.
7. The nLSTM-self attention based log sequence anomaly detection framework of claim 4, wherein the anomaly detection model further comprises: a linear layer for converting the output of the self-attention layer into a probability vector P = (p_1, p_2, ..., p_k) of dimension k, the log file comprising k log templates, where p_i represents the probability that the next log template predicted for the current sequence is e_i.
8. The nLSTM-self attention based log sequence anomaly detection framework of claim 1, wherein the output of an LSTM unit comprises a cell state and a hidden state; the hidden state and cell state of the previous LSTM unit are passed to the next LSTM unit, and the hidden state is also passed to the stacked upper LSTM layer as its input; each LSTM unit corresponds to the word embedding of one log template in the sequence, and if h is the sequence length, each LSTM layer contains h LSTM units.
9. The nLSTM-self attention based log sequence anomaly detection framework of claim 4, wherein
the self-attention layer first calculates the dependency relationship between logs in a sequence and expresses it as a similarity score; the similarity s(h_t, h_s) is computed by dot product, giving a non-normalized score:

α = s(h_t, h_s) = Q · Q^T    (7)

the self-attention layer takes the hidden states of all top-layer LSTM units as input; the input data Q has size batch_size × sequence length (h) × number of hidden-state neurons, and the score matrix α obtained after the similarity calculation has size batch_size × h × h, representing the pairwise dependency between the logs in the sequence;
the non-normalized scores are then normalized to obtain the probability weights of the self-attention value:

A = softmax(α)    (8)

the weighted summation of the LSTM outputs is the result of the self-attention calculation:

new_hidden = A · Q    (9)

the result of the self-attention value is a tensor of size batch_size × h × hidden_size, and the last column new_hidden[:, -1] of each batch is taken as the final output of the self-attention layer.
10. The nLSTM-self attention based log sequence anomaly detection framework of claim 4, wherein the nLSTM layer takes the distributed word embedding of each log template obtained by the embedding layer as the input of the bottom-layer LSTM units; the hidden state and cell state of the previous LSTM unit are passed to the next LSTM unit, so that sequence information is propagated from front to back; the hidden state is also passed correspondingly to the stacked upper LSTM layer; and the output of each top-layer LSTM unit serves as the output of the nLSTM layer and participates in the calculation of the self-attention layer.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010037427.4A CN111209168A (en) | 2020-01-14 | 2020-01-14 | Log sequence anomaly detection framework based on nLSTM-self attention |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010037427.4A CN111209168A (en) | 2020-01-14 | 2020-01-14 | Log sequence anomaly detection framework based on nLSTM-self attention |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111209168A true CN111209168A (en) | 2020-05-29 |
Family
ID=70786070
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010037427.4A Pending CN111209168A (en) | 2020-01-14 | 2020-01-14 | Log sequence anomaly detection framework based on nLSTM-self attention |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111209168A (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103514398A (en) * | 2013-10-18 | 2014-01-15 | 中国科学院信息工程研究所 | Real-time online log detection method and system |
WO2019060327A1 (en) * | 2017-09-20 | 2019-03-28 | University Of Utah Research Foundation | Online detection of anomalies within a log using machine learning |
CN110381079A (en) * | 2019-07-31 | 2019-10-25 | 福建师范大学 | Network log anomaly detection method combining GRU and SVDD |
Non-Patent Citations (2)
Title |
---|
RUIPENG YANG ET AL: "nLSALog: An Anomaly Detection Framework for Log Sequence in Security Management", IEEE Access *
YANG RUIPENG ET AL: "Log Sequence Anomaly Detection Based on an Improved Temporal Convolutional Network", Computer Engineering *
Cited By (61)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111782460A (en) * | 2020-06-04 | 2020-10-16 | 昆山伊莱智能软件科技有限公司 | Large-scale log data anomaly detection method and device and storage medium |
CN112015705A (en) * | 2020-06-05 | 2020-12-01 | 浙商银行股份有限公司 | Block chain log monitoring method based on deep learning |
CN111930903A (en) * | 2020-06-30 | 2020-11-13 | 山东师范大学 | System anomaly detection method and system based on deep log sequence analysis |
CN111797978A (en) * | 2020-07-08 | 2020-10-20 | 北京天融信网络安全技术有限公司 | Internal threat detection method and device, electronic equipment and storage medium |
CN111930592A (en) * | 2020-07-20 | 2020-11-13 | 国网浙江省电力有限公司嘉兴供电公司 | Method and system for detecting log sequence anomalies in real time |
CN111967508A (en) * | 2020-07-31 | 2020-11-20 | 复旦大学 | Time series abnormal point detection method based on saliency map |
CN111949480A (en) * | 2020-08-10 | 2020-11-17 | 重庆大学 | Log anomaly detection method based on component perception |
CN111949480B (en) * | 2020-08-10 | 2023-08-11 | 重庆大学 | Log anomaly detection method based on component perception |
CN111930597A (en) * | 2020-08-13 | 2020-11-13 | 南开大学 | Log anomaly detection method based on transfer learning |
CN111930597B (en) * | 2020-08-13 | 2023-09-22 | 南开大学 | Log abnormality detection method based on transfer learning |
CN112085866A (en) * | 2020-08-14 | 2020-12-15 | 陕西千山航空电子有限责任公司 | Airplane abnormal state identification method based on flight parameter data |
CN112069787A (en) * | 2020-08-27 | 2020-12-11 | 西安交通大学 | Log parameter anomaly detection method based on word embedding |
CN112363896B (en) * | 2020-09-02 | 2023-12-05 | 大连大学 | Log abnormality detection system |
CN111984514B (en) * | 2020-09-02 | 2023-05-23 | 大连大学 | Log anomaly detection method based on Prophet-bLSTM-DTW |
CN112363896A (en) * | 2020-09-02 | 2021-02-12 | 大连大学 | Log anomaly detection system |
CN111984514A (en) * | 2020-09-02 | 2020-11-24 | 大连大学 | Prophet-bLSTM-DTW-based log anomaly detection method |
CN112202726A (en) * | 2020-09-10 | 2021-01-08 | 西安交通大学 | System anomaly detection method based on context awareness |
CN116349211A (en) * | 2020-09-14 | 2023-06-27 | 华为云计算技术有限公司 | Self-attention based deep learning distributed trace anomaly detection |
CN112613032B (en) * | 2020-12-15 | 2024-03-26 | 中国科学院信息工程研究所 | Host intrusion detection method and device based on system call sequence |
CN112613032A (en) * | 2020-12-15 | 2021-04-06 | 中国科学院信息工程研究所 | Host intrusion detection method and device based on system call sequence |
CN112597704B (en) * | 2020-12-24 | 2024-02-06 | 东北大学 | Engine abnormality cause analysis method, system, equipment and medium |
CN112597704A (en) * | 2020-12-24 | 2021-04-02 | 东北大学 | Engine abnormality cause analysis method, system, equipment and medium |
CN112882899A (en) * | 2021-02-25 | 2021-06-01 | 中国烟草总公司郑州烟草研究院 | Method and device for detecting log anomalies |
CN113111908A (en) * | 2021-03-03 | 2021-07-13 | 长沙理工大学 | BERT anomaly detection method and device based on template sequence or word sequence |
CN113312447A (en) * | 2021-03-10 | 2021-08-27 | 天津大学 | Semi-supervised log anomaly detection method based on probability label estimation |
CN113312447B (en) * | 2021-03-10 | 2022-07-12 | 天津大学 | Semi-supervised log anomaly detection method based on probability label estimation |
CN113434357A (en) * | 2021-05-17 | 2021-09-24 | 中国科学院信息工程研究所 | Log anomaly detection method and device based on sequence prediction |
CN113434357B (en) * | 2021-05-17 | 2023-04-11 | 中国科学院信息工程研究所 | Log anomaly detection method and device based on sequence prediction |
CN113472742B (en) * | 2021-05-28 | 2022-09-27 | 中国科学院信息工程研究所 | Internal threat detection method and device based on gated recurrent unit |
CN113472742A (en) * | 2021-05-28 | 2021-10-01 | 中国科学院信息工程研究所 | Internal threat detection method and device based on gated recurrent unit |
CN113553052A (en) * | 2021-06-09 | 2021-10-26 | 麒麟软件有限公司 | Method for automatically recognizing security-related code submissions using an Attention-coded representation |
CN113286128A (en) * | 2021-06-11 | 2021-08-20 | 上海兴容信息技术有限公司 | Method and system for detecting target object |
CN113778733A (en) * | 2021-08-31 | 2021-12-10 | 大连海事大学 | Log sequence anomaly detection method based on multi-scale MASS |
CN113778733B (en) * | 2021-08-31 | 2024-03-15 | 大连海事大学 | Log sequence anomaly detection method based on multi-scale MASS |
CN114490235A (en) * | 2021-09-01 | 2022-05-13 | 北京云集智造科技有限公司 | Algorithm model for intelligently identifying quantity relations and anomalies in log data |
CN113704201A (en) * | 2021-09-02 | 2021-11-26 | 国家电网有限公司信息通信分公司 | Log anomaly detection method and device and server |
CN113988202A (en) * | 2021-11-04 | 2022-01-28 | 季华实验室 | Mechanical arm abnormal vibration detection method based on deep learning |
CN113792820A (en) * | 2021-11-15 | 2021-12-14 | 航天宏康智能科技(北京)有限公司 | Adversarial training method and device for user behavior log anomaly detection model |
CN114138973B (en) * | 2021-12-03 | 2024-07-16 | 大连海事大学 | Log sequence anomaly detection method based on contrastive adversarial training |
CN114138973A (en) * | 2021-12-03 | 2022-03-04 | 大连海事大学 | Log sequence anomaly detection method based on contrastive adversarial training |
CN114401135A (en) * | 2022-01-14 | 2022-04-26 | 国网河北省电力有限公司电力科学研究院 | Internal threat detection method based on LSTM-Attention user and entity behavior analysis technology |
CN115034286B (en) * | 2022-04-24 | 2024-07-02 | 国家计算机网络与信息安全管理中心 | Abnormal user identification method and device based on adaptive loss function |
CN115034286A (en) * | 2022-04-24 | 2022-09-09 | 国家计算机网络与信息安全管理中心 | Abnormal user identification method and device based on adaptive loss function |
US11754999B1 (en) | 2022-06-29 | 2023-09-12 | Chengdu Qinchuan Iot Technology Co., Ltd. | Industrial internet of things based on event sequence analysis and prediction, prediction method, and storage medium thereof |
CN114819925A (en) * | 2022-06-29 | 2022-07-29 | 成都秦川物联网科技股份有限公司 | Industrial Internet of things based on event sequence analysis and prediction and control method thereof |
CN114819925B (en) * | 2022-06-29 | 2022-10-11 | 成都秦川物联网科技股份有限公司 | Industrial Internet of things system based on event sequence analysis and prediction and control method thereof |
CN115277180B (en) * | 2022-07-26 | 2023-04-28 | 电子科技大学 | Block chain log anomaly detection and tracing system |
CN115277180A (en) * | 2022-07-26 | 2022-11-01 | 电子科技大学 | Block chain log anomaly detection and tracing system |
CN115017015A (en) * | 2022-08-04 | 2022-09-06 | 北京航空航天大学 | Method and system for detecting abnormal behavior of program in edge computing environment |
CN115017015B (en) * | 2022-08-04 | 2023-01-03 | 北京航空航天大学 | Method and system for detecting abnormal behavior of program in edge computing environment |
CN115270125A (en) * | 2022-08-11 | 2022-11-01 | 江苏安超云软件有限公司 | IDS log classification prediction method, device, equipment and storage medium |
CN115604003A (en) * | 2022-10-14 | 2023-01-13 | 浙江工业大学(Cn) | System anomaly detection method based on program log data |
CN115604003B (en) * | 2022-10-14 | 2024-04-05 | 浙江工业大学 | System abnormality detection method based on program log data |
CN115794465A (en) * | 2022-11-10 | 2023-03-14 | 上海鼎茂信息技术有限公司 | Method and system for detecting log anomalies |
CN115794465B (en) * | 2022-11-10 | 2023-12-19 | 上海鼎茂信息技术有限公司 | Log abnormality detection method and system |
CN116232770A (en) * | 2023-05-08 | 2023-06-06 | 中国石油大学(华东) | Enterprise network safety protection system and method based on SDN controller |
CN117786564A (en) * | 2023-11-23 | 2024-03-29 | 重庆邮电大学 | Abnormal electricity consumption intelligent detection method |
CN117539739A (en) * | 2023-12-11 | 2024-02-09 | 国网河南省电力公司经济技术研究院 | User continuous behavior anomaly monitoring method based on double features |
CN117493220B (en) * | 2024-01-03 | 2024-03-26 | 安徽思高智能科技有限公司 | RPA process operation anomaly detection method, device and storage device |
CN117493220A (en) * | 2024-01-03 | 2024-02-02 | 安徽思高智能科技有限公司 | RPA process operation anomaly detection method, device and storage device |
CN117938555A (en) * | 2024-03-25 | 2024-04-26 | 衢州海易科技有限公司 | Log sequence and parameter anomaly detection method and system for cloud platform of Internet of vehicles |
Similar Documents
Publication | Title |
---|---|
CN111209168A (en) | Log sequence anomaly detection framework based on nLSTM-self attention | |
Song et al. | Auditing data provenance in text-generation models | |
US10600005B2 (en) | System for automatic, simultaneous feature selection and hyperparameter tuning for a machine learning model | |
Uwagbole et al. | Applied machine learning predictive analytics to SQL injection attack detection and prevention | |
CN109408389B (en) | Code defect detection method and device based on deep learning | |
CN111600919B (en) | Method and device for constructing intelligent network application protection system model | |
KR102457003B1 | A system and method for detecting domain generation algorithms (DGAs) using deep learning and signal processing techniques | |
CN113672931B (en) | Software vulnerability automatic detection method and device based on pre-training | |
Mezina et al. | Obfuscated malware detection using dilated convolutional network | |
Li et al. | Enhancing Robustness of Deep Neural Networks Against Adversarial Malware Samples: Principles, Framework, and AICS'2019 Challenge | |
Nowotny | Two challenges of correct validation in pattern recognition | |
CN112016097A (en) | Method for predicting time of network security vulnerability being utilized | |
Zhou et al. | Discrimination of rock fracture and blast events based on signal complexity and machine learning | |
US11977633B2 (en) | Augmented machine learning malware detection based on static and dynamic analysis | |
US20220327394A1 (en) | Learning support apparatus, learning support methods, and computer-readable recording medium | |
Moskal et al. | Translating intrusion alerts to cyberattack stages using pseudo-active transfer learning (PATRL) | |
CN112613032B (en) | Host intrusion detection method and device based on system call sequence | |
CN111786999B (en) | Intrusion behavior detection method, device, equipment and storage medium | |
Li et al. | Enhancing robustness of deep neural networks against adversarial malware samples: Principles, framework, and application to AICS’2019 challenge | |
Liang et al. | Automatic security classification based on incremental learning and similarity comparison | |
Lighari | Hybrid model of rule based and clustering analysis for big data security | |
Catania et al. | An analysis of convolutional neural networks for detecting DGA | |
Zhang | Clement: Machine learning methods for malware recognition based on semantic behaviours | |
KR102405799B1 (en) | Method and system for providing continuous adaptive learning over time for real time attack detection in cyberspace | |
Sivapurnima et al. | Adaptive Deep Learning Model for Software Bug Detection and Classification. |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 2020-05-29 |