CN117421195A - Multi-stage abnormality detection method and device for logs - Google Patents


Info

Publication number
CN117421195A
Authority
CN
China
Prior art keywords
log
sequence
detection
module
preprocessed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311482357.3A
Other languages
Chinese (zh)
Inventor
于中江
陶刚
杨绍平
余洋
李忠态
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Tobacco Yunnan Industrial Co Ltd
Original Assignee
China Tobacco Yunnan Industrial Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Tobacco Yunnan Industrial Co Ltd filed Critical China Tobacco Yunnan Industrial Co Ltd
Priority to CN202311482357.3A
Publication of CN117421195A
Legal status: Pending

Classifications

    • G06F11/3476 Data logging
    • G06F11/3495 Performance evaluation by tracing or monitoring for systems
    • G06F16/353 Clustering; Classification into predefined classes
    • G06F18/232 Non-hierarchical clustering techniques
    • G06F18/2433 Single-class perspective, e.g. one-against-all classification; Novelty detection; Outlier detection
    • G06F21/552 Detecting local intrusion or implementing counter-measures involving long-term monitoring or reporting
    • G06F40/30 Semantic analysis
    • G06N3/0442 Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G06N3/045 Combinations of networks
    • G06F2201/865 Monitoring of software
    • G06F2221/033 Test or assess software

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Computer Hardware Design (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Computer Security & Cryptography (AREA)
  • Quality & Reliability (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The application discloses a multi-stage anomaly detection method and device for logs. The method comprises the following steps: preprocessing the log data in an original log sequence to obtain a preprocessed log sequence; performing preliminary detection on the preprocessed log sequence, where the preliminary detection includes sequential detection; and, if the result of the preliminary detection is that no anomaly exists, performing secondary detection on the preprocessed log sequence. By analyzing log data in depth from multiple perspectives, the method and device achieve multi-stage anomaly detection, improve the accuracy of anomaly detection, and improve the robustness of the model.

Description

Multi-stage abnormality detection method and device for logs
Technical Field
The present disclosure relates to the field of computer technology, and in particular to a multi-stage log anomaly detection method and apparatus.
Background
Log anomaly detection is a technique for monitoring and identifying anomalous behavior and patterns in system logs. It is typically used to help system administrators or security teams detect potential security threats or fault conditions. Traditional log anomaly detection relies on manual analysis, which consumes considerable manpower and financial resources, and its detection efficiency is often unsatisfactory.
At present, the mainstream log anomaly detection methods at home and abroad fall into two classes: supervised methods and unsupervised methods. Supervised methods are trained with labeled training data and then used to detect anomalies in log data. Because unsupervised methods do not rely on data labels, and a large amount of log data is unlabeled, unsupervised methods can be applied well in real environments.
However, whether supervised or unsupervised, log anomaly detection for current large-scale software systems is single-stage detection, which suffers from low anomaly detection accuracy and poor model robustness.
Disclosure of Invention
According to the multi-stage log anomaly detection method and device provided by the application, log data is analyzed in depth from multiple perspectives, multi-stage anomaly detection is achieved, the accuracy of anomaly detection is improved, and the robustness of the model is improved.
The application provides a multi-stage abnormality detection method of a log, which comprises the following steps:
preprocessing log data in an original log sequence to obtain a preprocessed log sequence;
the preprocessed log sequence is subjected to preliminary detection, wherein the preliminary detection comprises sequential detection;
and if the result of the primary detection is that no abnormality exists, performing secondary detection on the preprocessed log sequence.
Preferably, a gated recurrent network is used to perform the secondary detection on the preprocessed log sequence.
Preferably, the sequential detection comprises:
extracting a detection sequence from the preprocessed log sequence according to a preset detection window size; the next log of the detection sequence in the preprocessed log sequence is a target log;
predicting the probability that the next log of the detection sequence is a target log;
judging whether the probability is smaller than a threshold value;
if so, the preprocessed log sequence is an abnormal sequence.
Preferably, preprocessing log data in an original log sequence to obtain a preprocessed log sequence, which specifically includes:
analyzing each log data in the original log sequence to obtain a corresponding log template;
vectorizing each log template to obtain a semantic vector of each log data;
and all the semantic vectors are arranged according to the sequence of the corresponding log data to obtain a preprocessed log sequence.
Preferably, vectorizing each log template to obtain the semantic vector of each log data specifically includes:
deleting the non-character marks and stop words in the log template, and splitting the compound words in the log template into two or more words;
converting the words into word vectors;
calculating the TF-IDF weight of each word vector in the log template among all word vectors converted from the original log sequence;
and calculating the semantic vector of the log data using all word vectors in the log template and their corresponding TF-IDF weights.
The application also provides a multi-stage log anomaly detection device, which comprises a preprocessing module, a preliminary detection module, and a secondary detection module;
the preprocessing module is used for preprocessing the log data in the original log sequence to obtain a preprocessed log sequence;
the preliminary detection module is used for carrying out preliminary detection on the preprocessed log sequence, wherein the preliminary detection comprises sequential detection;
and the secondary detection module is used for carrying out secondary detection on the preprocessed log sequence when the primary detection result is that no abnormality exists.
Preferably, the secondary detection module is used for performing secondary detection on the preprocessed log sequence using a gated recurrent network.
Preferably, the preliminary detection module comprises an extraction module, a prediction module, a judgment module and an abnormality judgment module;
the extraction module is used for extracting a detection sequence from the preprocessed log sequence according to a preset detection window size; the next log of the detection sequence in the preprocessed log sequence is a target log;
the prediction module is used for predicting the probability that the next log of the detection sequence is the target log;
the judging module is used for judging whether the probability is smaller than a threshold value;
and the abnormality judgment module is used for judging that the preprocessed log sequence is an abnormal sequence when the probability is smaller than the threshold value.
Preferably, the preprocessing module comprises an analysis module, a semantic vector acquisition module and a sequencing module;
the analysis module is used for analyzing each log data in the original log sequence to obtain a corresponding log template;
the semantic vector obtaining module is used for carrying out vectorization on each log template to obtain the semantic vector of each log data;
the ordering module is used for ordering all the semantic vectors according to the sequence of the corresponding log data to obtain a preprocessed log sequence.
Preferably, the semantic vector obtaining module comprises a deletion splitting module, a conversion module, a weight calculation module, and a weighting calculation module;
the deletion splitting module is used for deleting the non-character marks and stop words in the log template and splitting the compound words in the log template into two or more words;
the conversion module is used for converting words into word vectors;
the weight calculation module is used for calculating the TF-IDF weight of each word vector in the log template among all word vectors converted from the original log sequence;
the weighting calculation module is used for calculating semantic vectors of the log data by utilizing all word vectors and corresponding TF-IDF weights in the log template.
Other features of the present application and its advantages will become apparent from the following detailed description of exemplary embodiments of the present application, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description, serve to explain the principles of the application.
FIG. 1 is a flow chart of a multi-stage anomaly detection method for logs provided herein;
FIG. 2 is a schematic flow chart of obtaining semantic vectors of log data according to the present application;
fig. 3 is a block diagram of a multistage abnormality detection apparatus for logs provided in the present application.
Detailed Description
Various exemplary embodiments of the present application will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present application unless it is specifically stated otherwise.
The following description of at least one exemplary embodiment is merely exemplary in nature and is in no way intended to limit the application, its application, or uses.
Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail, but where appropriate, the techniques, methods, and apparatus should be considered part of the specification.
In all examples shown and discussed herein, any specific values should be construed as merely illustrative, and not a limitation. Thus, other examples of exemplary embodiments may have different values.
According to the multi-stage log anomaly detection method and device provided by the application, log data is analyzed in depth from multiple perspectives, multi-stage anomaly detection is achieved, the accuracy of anomaly detection is improved, and the robustness of the model is improved.
As shown in fig. 1, the multi-stage abnormality detection method for a log provided in the present application includes:
s110: preprocessing log data in the original log sequence to obtain a preprocessed log sequence.
As an embodiment, preprocessing log data in an original log sequence to obtain a preprocessed log sequence, which specifically includes:
s1101: and analyzing each log data in the original log sequence to obtain a corresponding log template.
As one example, a public dataset may be downloaded from GitHub as the original log sequence.
Because the original log sequence is unstructured data and contains much specific information (such as IP addresses and file names) that hinders automatic log analysis, each piece of log data needs to be parsed: the parameters in the log data are abstracted away, turning the log data into structured data for subsequent analysis.
As one embodiment, the Drain algorithm is used to parse the log data in the original log sequence into a series of log templates. Table 1 shows one example of log data parsing.
TABLE 1
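The parsing step can be illustrated with a minimal sketch that masks variable parameters to recover a template. The patent itself uses the Drain algorithm, which additionally clusters messages via a parse tree; the regexes and the sample HDFS-style message below are illustrative assumptions, not the patent's exact rules:

```python
import re

def parse_log(message: str) -> str:
    """Mask variable fields in a raw log message to recover its template."""
    template = re.sub(r"\b\d{1,3}(?:\.\d{1,3}){3}(?::\d+)?\b", "<*>", message)  # IP[:port]
    template = re.sub(r"blk_-?\d+", "<*>", template)  # HDFS-style block ids
    template = re.sub(r"\b\d+\b", "<*>", template)    # remaining bare numbers
    return template

print(parse_log("Received block blk_3587508140051953248 of size 67108864 from 10.251.42.84"))
# -> Received block <*> of size <*> from <*>
```

Each distinct template then acts as one event type in the subsequent vectorization and detection steps.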
S1102: Vectorize each log template to obtain the semantic vector of each log data.
As an embodiment, as shown in fig. 2, vectorizing each log template to obtain a semantic vector of each log data, which specifically includes:
p1: preprocessing the log template, deleting the non-character marks and the stop words in the log template, and splitting the combined words in the log template into two or more words.
The log template is regarded as sentences in natural language, and some non-character marks, pause words and combined words exist in the log template, all the non-character marks and pause words are deleted firstly, then the combined words are split, and the combined words are changed into two or more words.
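Step P1 can be sketched as follows. The stop-word list and the camel-case splitting rule are illustrative assumptions; the patent does not specify the exact token rules:

```python
import re

STOP_WORDS = {"a", "an", "the", "of", "for", "to", "is", "in", "on"}  # illustrative subset

def tokenize_template(template: str) -> list[str]:
    """Drop non-letter tokens, split compound (camelCase) words, remove stop words."""
    words = []
    for token in re.findall(r"[A-Za-z]+", template):  # non-character marks like <*> fall away
        # e.g. "PacketResponder" -> ["Packet", "Responder"]
        parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])", token)
        words.extend(p.lower() for p in parts if p.lower() not in STOP_WORDS)
    return words

print(tokenize_template("PacketResponder <*> for block <*> terminating"))
# -> ['packet', 'responder', 'block', 'terminating']
```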
P2: the word is converted into a word vector.
As one embodiment, word vectorization is performed with the FastText algorithm. FastText adequately captures internal relationships between words in English sentences, such as semantic similarity.
P3: Calculate the TF-IDF weight of each word vector in the log template among all word vectors converted from the original log sequence. The TF-IDF weight effectively measures the importance of a word within a sentence, which meets the need to distinguish log sequences. For example, if the word "Block" appears frequently in a certain log sequence, the word may be more representative of that log sequence, so we use the Term Frequency (TF) to describe its importance.
Specifically, TF = #word / #total, where #word is the number of occurrences of the target word in the log sequence and #total is the total number of all words in the log sequence.
On the other hand, if the word "Block" appears in all log sequences, it becomes too common to distinguish these log sequences and should therefore be down-weighted. We therefore also use the Inverse Document Frequency (IDF) as a metric: IDF = log(#L / #Lword), where #L is the total number of log sequences and #Lword is the number of log sequences containing the target word.
For each word, its TF-IDF weight is calculated as TF × IDF.
P4: Calculate the semantic vector V of the log data from all word vectors in the log template and their corresponding TF-IDF weights, as the weighted average V = (1/N) * Σ (w_i * v_i) for i = 1..N, where v_i is the i-th word vector in the template, w_i is its TF-IDF weight, and N is the number of words in the template.
therefore, the semantic vector corresponding to each log data can not only identify log sequences with similar semantics, but also distinguish different log sequences, and the robustness is greatly improved.
S1103: Arrange all semantic vectors in the order of their corresponding log data to obtain the preprocessed log sequence.
S120: Perform preliminary detection on the preprocessed log sequence. If the preliminary detection finds an anomaly, an anomaly alarm is issued; if the preliminary detection finds no anomaly, S130 is performed.
As one example, the preprocessed log sequence is initially detected by a Long Short-Term Memory (LSTM) network model. LSTM is a variant of the Recurrent Neural Network (RNN) designed to process and model long-term dependencies in sequence data. Compared with traditional RNNs, LSTM mitigates problems such as vanishing and exploding gradients by introducing a gating mechanism, thereby better capturing and memorizing long-term dependencies in sequence data.
A program usually executes according to a fixed flow, so normal logs naturally follow certain sequential patterns. In other words, for a given log sequence, if no anomaly occurs, the log template following the current log template is predictable. Therefore, in the present application, the order relationship between logs is learned with the LSTM model.
As one embodiment, the preliminary detection includes sequential detection.
As one embodiment, the sequence detection includes:
s1201: and extracting a detection sequence from the preprocessed log sequence according to a preset detection window size. The next log of the detection sequence in the preprocessed log sequence is the target log.
For the preprocessed log sequence Ω = {v1, v2, …, vn}, W is the detection window size, S is the window step size, and P is the prediction threshold. For example, if the semantic vector sequence of one log sequence is [v3, v1, v4, v6, v1, v7, v3, v5] and the detection window size W = 3, the detection sequence (v3, v1, v4) is extracted according to the window size. In this log sequence, the target log of the detection sequence (v3, v1, v4) is v6.
S1202: predicting the probability that the next log of the detection sequence is the target log.
Specifically, the trained LSTM model is used for prediction, yielding the probability p1 that the next log of the detection sequence (v3, v1, v4) is v6.
S1203: Judge whether the probability is smaller than the threshold. If yes, execute S1204; otherwise, execute S1205.
S1204: If the probability p1 is smaller than the threshold P, the preprocessed log sequence is judged to be an abnormal sequence and an anomaly alarm is issued.
S1205: If the probability p1 is not smaller than the threshold P, the preprocessed log sequence is judged to be a normal sequence.
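Steps S1201-S1205 can be sketched as follows. `predict_proba` is a hypothetical stand-in for the trained LSTM, and the toy model, window size, and threshold are illustrative:

```python
def sequential_detect(sequence, predict_proba, window=3, step=1, threshold=0.1):
    """Slide a detection window over the sequence (S1201); flag the whole
    sequence abnormal if the true next log gets probability < threshold."""
    for start in range(0, len(sequence) - window, step):
        detect_seq = tuple(sequence[start:start + window])
        target = sequence[start + window]         # the actual next log
        p = predict_proba(detect_seq, target)     # S1202: the LSTM would go here
        if p < threshold:                         # S1203/S1204: below threshold -> abnormal
            return "abnormal"
    return "normal"                               # S1205

def toy_model(detect_seq, target):
    # Hypothetical stand-in: only the pattern (v3, v1, v4) -> v6 is "learned".
    return 0.9 if (detect_seq, target) == (("v3", "v1", "v4"), "v6") else 0.01

print(sequential_detect(["v3", "v1", "v4", "v6"], toy_model))  # -> normal
print(sequential_detect(["v3", "v1", "v4", "v7"], toy_model))  # -> abnormal
```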
In addition to the sequential pattern, the template (vector) sequence also exhibits quantitative patterns. Normal program execution has certain invariants and always maintains certain quantitative relationships in the logs under different inputs and workloads. For example, every opened file is eventually closed at some stage, so under normal conditions the number of logs reporting "open file" should equal the number of logs reporting "close file". Such quantitative relationships in the logs capture normal program execution behavior. If a new log violates some invariant, it can be determined that an anomaly occurred during system execution. Therefore, in the present application, the quantitative relationships between logs are also learned with the LSTM model.
Preferably, the preliminary detection further comprises quantitative relationship detection. If the preprocessed log sequence conforms to the quantitative relationships, it is judged to be a normal sequence and step S130 continues to be executed; otherwise, it is judged to be an abnormal sequence and an anomaly alarm is issued.
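The quantitative relationship check can be sketched as a count-invariant test. The patent learns these relationships with an LSTM; the explicit pair-count formulation below is a simplifying assumption for illustration:

```python
from collections import Counter

def quantitative_detect(sequence, invariants):
    """Flag a sequence abnormal if any count invariant is violated,
    e.g. the number of "open" events must equal the number of "close" events."""
    counts = Counter(sequence)
    for a, b in invariants:
        if counts[a] != counts[b]:
            return "abnormal"
    return "normal"

print(quantitative_detect(["open", "read", "close", "open", "close"],
                          [("open", "close")]))  # -> normal
```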
S130: and if the result of the primary detection is that no abnormality exists, performing secondary detection on the preprocessed log sequence.
As an embodiment, a Gated Recurrent Unit (GRU) network is used for the secondary detection of the preprocessed log sequence: the preprocessed log sequence obtained in S110 is input into the GRU network, which outputs the probability that the sequence is abnormal. If the probability is greater than the threshold, the preprocessed log sequence is judged to be an abnormal sequence, an abnormal result is output, and an anomaly alarm is issued; otherwise, the sequence is judged to be a normal sequence and a normal result is output.
In the present application, during the training stage of the GRU network, starting from known normal log sequences (whose labels are "normal"), the labels of the unlabeled log sequences in the training set are further estimated following the idea of PU learning (Positive and Unlabeled Learning), and clustering is used to identify log sequences with similar semantics.
As one embodiment, training of the GRU network includes the steps of:
q1: all log sequences in the training set (including log sequences with labels (normal labels or abnormal labels) and log sequences without labels) are clustered, so that each clustered group is more likely to contain log sequences with similar semantics.
As one example, clustering is performed with the HDBSCAN algorithm (Hierarchical Density-Based Spatial Clustering of Applications with Noise).
Q2: Predict the labels of the unlabeled log sequences according to the clustering result to obtain an optimized training set.
As one embodiment, each unlabeled log sequence is assigned a probabilistic label by estimating the probability that it belongs to each label, to reduce the impact of noise on model training.
Specifically, if a labeled log sequence A exists in the cluster containing an unlabeled log sequence B, the probability that B has the same label as A is inferred from A. If no labeled log sequence exists in B's cluster, the label is inferred by searching for a similar cluster that does contain labels. In this way, every log sequence receives a label, making the optimized training set more learnable.
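Steps Q1-Q2 can be sketched as follows, assuming the clustering (HDBSCAN in the patent) has already grouped the sequences. The data layout and the probability-of-normal formulation are illustrative assumptions:

```python
def propagate_labels(clusters):
    """Assign each unlabeled sequence a probabilistic label from its cluster.

    `clusters` maps a cluster id to a list of (sequence_id, label) pairs,
    where label is "normal", "abnormal", or None for unlabeled sequences.
    """
    probs = {}
    for members in clusters.values():
        labeled = [lab for _, lab in members if lab is not None]
        if not labeled:
            continue  # the patent falls back to the most similar labeled cluster
        p_normal = labeled.count("normal") / len(labeled)
        for seq_id, lab in members:
            if lab is None:
                probs[seq_id] = p_normal  # probability this sequence is normal
    return probs

clusters = {0: [("A", "normal"), ("B", None)],
            1: [("C", "abnormal"), ("D", None)]}
print(propagate_labels(clusters))  # -> {'B': 1.0, 'D': 0.0}
```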
Q3: Train the GRU network with the optimized training set to obtain the trained GRU model.
Based on the method, the application also provides a log multi-stage abnormality detection device. As shown in fig. 3, the multi-stage abnormality detection apparatus of the log includes a preprocessing module 310, a primary detection module 320, and a secondary detection module 330.
The preprocessing module 310 is configured to preprocess log data in an original log sequence, and obtain a preprocessed log sequence.
The preliminary detection module 320 is configured to perform preliminary detection on the preprocessed log sequence, where the preliminary detection includes sequential detection.
The secondary detection module 330 is configured to perform secondary detection on the preprocessed log sequence when the result of the primary detection is that no abnormality exists.
Preferably, the secondary detection module 330 is configured to perform secondary detection on the preprocessed log sequence by using the gated loop network.
Preferably, the preliminary detection module 320 includes an extraction module 3201, a prediction module 3202, a judgment module 3203, and an abnormality judgment module 3204.
The extraction module 3201 is configured to extract a detection sequence from the preprocessed log sequence according to a preset detection window size. The next log of the detection sequence in the preprocessed log sequence is the target log.
The prediction module 3202 is configured to predict a probability that a next log of the detection sequence is a target log.
The decision module 3203 is configured to determine whether the probability is less than a threshold.
The anomaly determination module 3204 is configured to determine that the preprocessed log sequence is an abnormal sequence when the probability is less than the threshold.
Preferably, the preprocessing module 310 includes a parsing module 3101, a semantic vector obtaining module 3102, and a ranking module 3103.
The parsing module 3101 is configured to parse each log data in the original log sequence to obtain a corresponding log template.
The semantic vector obtaining module 3102 is configured to vectorize each log template to obtain a semantic vector of each log data.
The sorting module 3103 is configured to sort all semantic vectors according to the sequence of the corresponding log data, and obtain a preprocessed log sequence.
Preferably, the semantic vector obtaining module 3102 includes a delete splitting module, a transform module, a weight calculation module, and a weighting calculation module.
The deletion splitting module is used for deleting the non-character marks and the pause words in the log template and splitting the combined words in the log template into two or more words.
The translation module is used for translating words into word vectors.
The weight calculation module is used for calculating the TF-IDF weight of each word vector in the log template among all word vectors converted from the original log sequence.
The weighting calculation module is used for calculating semantic vectors of the log data by utilizing all word vectors and corresponding TF-IDF weights in the log template.
Preferably, the multi-stage anomaly detection device is a trained neural network model comprising at least an LSTM model for the preliminary detection and a GRU network model for the secondary detection; during training, the whole model is optimized through coordinated training of the two models.
To illustrate the detection effect of the present application, baseline systems were compared with the model of the present application on the public datasets HDFS and BGL. Table 2 shows the performance comparison between the present application and the baseline models on HDFS, and Table 3 shows the comparison on BGL.
TABLE 2 comparative experiments on HDFS
Method        Precision    Recall    F1-score
DeepLog       0.953        0.961     0.957
LogAnomaly    0.960        0.940     0.950
LogRobust     0.980        1.000     0.999
CNN           0.946        0.995     0.970
Our model     0.997        0.998     0.998
Table 3 comparative experiments on BGL
Method        Precision    Recall    F1-score
DeepLog       0.900        0.960     0.929
LogAnomaly    0.970        0.940     0.960
LogRobust     0.912        0.964     0.937
CNN           0.966        0.977     0.972
Our model     0.989        0.977     0.983
As can be seen from Tables 2 and 3, the model of the present application achieves good results on both the HDFS and BGL datasets, and the experiments demonstrate the effectiveness of the multi-stage anomaly detection method of the present application. Because the method mines log data deeply from multiple angles and makes fuller use of the information contained in the log data, it performs well in precision, recall, and F1-score, alleviating the problems of low anomaly detection accuracy and poor model robustness in current large-scale systems.
Although specific embodiments of the present application have been described in detail by way of example, it will be appreciated by those skilled in the art that the above examples are for illustration only and are not intended to limit the scope of the present application. It will be appreciated by those skilled in the art that modifications may be made to the above embodiments without departing from the scope and spirit of the present application. The scope of the application is defined by the appended claims.

Claims (10)

1. A multi-stage anomaly detection method for a log, comprising:
preprocessing log data in an original log sequence to obtain a preprocessed log sequence;
performing preliminary detection on the preprocessed log sequence, wherein the preliminary detection comprises sequential detection;
and if the result of the preliminary detection is that no abnormality exists, performing secondary detection on the preprocessed log sequence.
2. The multi-stage anomaly detection method of claim 1, wherein secondary detection is performed on the preprocessed log sequence using a gated recurrent network.
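The gated recurrent network of claim 2 can be illustrated with a single scalar GRU cell. This is a minimal sketch with hypothetical weights; an actual secondary detector would use a trained, vectorized GRU:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x: float, h: float, w: dict) -> float:
    """One scalar GRU step: update gate z, reset gate r, candidate state."""
    z = sigmoid(w["wz_x"] * x + w["wz_h"] * h + w["bz"])   # update gate
    r = sigmoid(w["wr_x"] * x + w["wr_h"] * h + w["br"])   # reset gate
    h_cand = math.tanh(w["wh_x"] * x + w["wh_h"] * (r * h) + w["bh"])
    return (1.0 - z) * h + z * h_cand                      # interpolated new state

# With all-zero weights the gates equal 0.5 and the candidate is 0,
# so each step simply halves the hidden state.
w0 = {k: 0.0 for k in ("wz_x", "wz_h", "bz", "wr_x", "wr_h", "br",
                       "wh_x", "wh_h", "bh")}
print(gru_step(1.0, 1.0, w0))  # → 0.5
```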
3. The multi-stage anomaly detection method of claim 1, wherein the sequential detection comprises:
extracting a detection sequence from the preprocessed log sequence according to a preset detection window size; wherein the log following the detection sequence in the preprocessed log sequence is a target log;
predicting the probability that the next log of the detection sequence is the target log;
judging whether the probability is smaller than a threshold value;
and if so, determining that the preprocessed log sequence is an abnormal sequence.
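The sequential detection steps above can be sketched as a sliding-window check. The function and the toy predictor below are hypothetical stand-ins; in the actual method the probability would come from a trained sequence model:

```python
from typing import Callable, Sequence

def sequential_detect(
    logs: Sequence[str],
    window: int,
    threshold: float,
    predict_prob: Callable[[Sequence[str], str], float],
) -> bool:
    """Slide a window over the log sequence; flag the sequence as anomalous
    if the log that actually follows a window is ever predicted with
    probability below the threshold."""
    for i in range(len(logs) - window):
        detection_seq = logs[i : i + window]
        target = logs[i + window]          # the log that actually follows
        if predict_prob(detection_seq, target) < threshold:
            return True                    # abnormal sequence
    return False                           # no anomaly found at this stage

# Toy predictor: event 'E5' is considered very unlikely after any window.
toy = lambda win, nxt: 0.01 if nxt == "E5" else 0.9
print(sequential_detect(["E1", "E2", "E3", "E4"], 2, 0.1, toy))  # → False
print(sequential_detect(["E1", "E2", "E5", "E4"], 2, 0.1, toy))  # → True
```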
4. The multi-stage anomaly detection method of claim 1, wherein preprocessing log data in an original log sequence to obtain a preprocessed log sequence specifically comprises:
analyzing each log data in the original log sequence to obtain a corresponding log template;
vectorizing each log template to obtain a semantic vector of each log data;
and arranging all semantic vectors in the order of their corresponding log data to obtain the preprocessed log sequence.
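The parsing step of claim 4 maps each raw log line to a log template by masking its variable fields. A minimal regex-based sketch is shown below; the patterns are illustrative assumptions, and a real system would often use a dedicated log parser such as Drain:

```python
import re

def to_template(log_line: str) -> str:
    """Mask variable fields so log lines with the same event shape
    map to the same template string."""
    t = re.sub(r"blk_-?\d+", "<*>", log_line)   # HDFS block IDs
    t = re.sub(r"/[\w./-]+", "<*>", t)          # file-system paths
    t = re.sub(r"\b\d+\b", "<*>", t)            # bare numbers
    return t

print(to_template("Received block blk_3587508140051953248 of size 67108864"))
# → 'Received block <*> of size <*>'
```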
5. The multi-stage anomaly detection method of claim 4, wherein vectorizing each log template to obtain a semantic vector of each log data specifically comprises:
deleting non-alphabetic tokens and stop words from the log template, and splitting compound words in the log template into two or more words;
converting each word into a word vector;
calculating the TF-IDF weight of each word vector in the log template among all word vectors converted from the original log sequence;
and calculating the semantic vector of the log data using all word vectors in the log template and their corresponding TF-IDF weights.
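The vectorization steps of claim 5 can be sketched as a TF-IDF-weighted sum of word vectors. The embedding lookup and the exact weighting scheme below are assumptions for illustration; in practice the word vectors would come from a pretrained embedding model:

```python
import math
from collections import Counter

def semantic_vector(template_words, all_templates, word_vecs):
    """TF-IDF-weighted sum of the word vectors in one log template.
    `word_vecs` stands in for a pretrained embedding lookup (hypothetical)."""
    n_docs = len(all_templates)
    tf = Counter(template_words)
    dim = len(next(iter(word_vecs.values())))
    vec = [0.0] * dim
    for word, count in tf.items():
        df = sum(1 for t in all_templates if word in t)   # document frequency
        idf = math.log(n_docs / df)
        weight = (count / len(template_words)) * idf       # tf * idf
        for j in range(dim):
            vec[j] += weight * word_vecs[word][j]
    return vec

# Toy corpus of two templates with 2-dimensional toy embeddings.
templates = [["receive", "block"], ["delete", "block"]]
vecs = {"receive": [1.0, 0.0], "delete": [0.0, 1.0], "block": [0.5, 0.5]}
v = semantic_vector(templates[0], templates, vecs)
# "block" appears in every template, so its IDF (and contribution) is zero.
```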
6. A multi-stage anomaly detection device for logs, characterized by comprising a preprocessing module, a preliminary detection module, and a secondary detection module;
the preprocessing module is used for preprocessing log data in the original log sequence to obtain a preprocessed log sequence;
the preliminary detection module is used for carrying out preliminary detection on the preprocessed log sequence, and the preliminary detection comprises sequential detection;
and the secondary detection module is used for performing secondary detection on the preprocessed log sequence when the preliminary detection result is that no abnormality exists.
7. The multi-stage anomaly detection device of claim 6, wherein the secondary detection module is configured to perform secondary detection on the preprocessed log sequence using a gated recurrent network.
8. The multi-stage abnormality detection device according to claim 6, wherein the preliminary detection module includes an extraction module, a prediction module, a judgment module, and an abnormality judgment module;
the extraction module is used for extracting a detection sequence from the preprocessed log sequence according to a preset detection window size; wherein, the next log of the detection sequence in the preprocessed log sequence is a target log;
the prediction module is used for predicting the probability that the next log of the detection sequence is the target log;
the judging module is used for judging whether the probability is smaller than a threshold value or not;
and the abnormality judgment module is used for judging that the preprocessed log sequence is an abnormal sequence when the probability is smaller than the threshold value.
9. The multi-stage anomaly detection device of claim 6, wherein the preprocessing module comprises a parsing module, a semantic vector obtaining module, and a sorting module;
the analysis module is used for analyzing each log data in the original log sequence to obtain a corresponding log template;
the semantic vector obtaining module is used for carrying out vectorization on each log template to obtain a semantic vector of each log data;
the sorting module is used for arranging all semantic vectors according to the sequence of the corresponding log data to obtain the preprocessed log sequence.
10. The multi-stage anomaly detection device of claim 9, wherein the semantic vector obtaining module comprises a deletion splitting module, a conversion module, a weight calculation module, and a weighting calculation module;
the deletion splitting module is used for deleting the non-character marks and the pause words in the log template and splitting the combined words in the log template into two or more words;
the conversion module is used for converting the words into word vectors;
the weight calculation module is used for calculating TF-IDF weight of each word vector in the log template in all word vectors converted by the original log sequence;
the weighting calculation module is used for calculating semantic vectors of the log data by utilizing all word vectors and corresponding TF-IDF weights in the log template.
CN202311482357.3A 2023-11-08 2023-11-08 Multi-stage abnormality detection method and device for logs Pending CN117421195A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311482357.3A CN117421195A (en) 2023-11-08 2023-11-08 Multi-stage abnormality detection method and device for logs


Publications (1)

Publication Number Publication Date
CN117421195A true CN117421195A (en) 2024-01-19

Family

ID=89524734

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311482357.3A Pending CN117421195A (en) 2023-11-08 2023-11-08 Multi-stage abnormality detection method and device for logs

Country Status (1)

Country Link
CN (1) CN117421195A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination