CN117421195A - Multi-stage abnormality detection method and device for logs - Google Patents


Info

Publication number
CN117421195A
Authority
CN
China
Prior art keywords
log
sequence
detection
module
preprocessed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311482357.3A
Other languages
Chinese (zh)
Inventor
于中江
陶刚
杨绍平
余洋
李忠态
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Tobacco Yunnan Industrial Co Ltd
Original Assignee
China Tobacco Yunnan Industrial Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Tobacco Yunnan Industrial Co Ltd filed Critical China Tobacco Yunnan Industrial Co Ltd
Priority to CN202311482357.3A
Publication of CN117421195A
Legal status: Pending

Classifications

    • G06F11/3476 Data logging
    • G06F11/3495 Performance evaluation by tracing or monitoring for systems
    • G06F16/353 Clustering; Classification into predefined classes
    • G06F18/232 Non-hierarchical clustering techniques
    • G06F18/2433 Single-class perspective, e.g. one-against-all classification; Novelty detection; Outlier detection
    • G06F21/552 Detecting local intrusion or implementing counter-measures involving long-term monitoring or reporting
    • G06F40/30 Semantic analysis
    • G06N3/0442 Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G06N3/045 Combinations of networks
    • G06F2201/865 Monitoring of software
    • G06F2221/033 Test or assess software

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Computer Hardware Design (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Computer Security & Cryptography (AREA)
  • Quality & Reliability (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The application discloses a multi-stage anomaly detection method and device for logs. The method comprises the following steps: preprocessing the log data in an original log sequence to obtain a preprocessed log sequence; performing preliminary detection on the preprocessed log sequence, where the preliminary detection includes sequential detection; and, if the result of the preliminary detection is that no anomaly exists, performing secondary detection on the preprocessed log sequence. By analyzing log data in depth from multiple perspectives, the method and device achieve multi-stage anomaly detection, improve the accuracy of anomaly detection, and improve the robustness of the model.

Description

Multi-stage abnormality detection method and device for logs
Technical Field
The present disclosure relates to the field of computer technology, and in particular to a multi-stage log anomaly detection method and apparatus.
Background
Log anomaly detection is a technique for monitoring and identifying anomalous behavior and patterns in system logs. It is typically used to help system administrators or security teams detect potential security threats or fault conditions. Traditional log anomaly detection relies on manual analysis, which consumes considerable manpower and financial resources, and its detection efficiency is often unsatisfactory.
At present, the mainstream log anomaly detection methods at home and abroad fall into two classes: supervised methods and unsupervised methods. Supervised methods are trained with labeled training data and then used to detect anomalies in log data. Because unsupervised methods do not rely on data labels, and a large amount of log data is unlabeled, unsupervised methods can be applied well in real environments.
However, whether supervised or unsupervised, log anomaly detection for current large-scale software systems is single-stage detection, which suffers from low anomaly detection accuracy and poor model robustness.
Disclosure of Invention
According to the multi-stage log anomaly detection method and device provided by the application, log data is analyzed in depth from multiple perspectives, multi-stage anomaly detection is achieved, the accuracy of anomaly detection is improved, and the robustness of the model is improved.
The application provides a multi-stage abnormality detection method of a log, which comprises the following steps:
preprocessing log data in an original log sequence to obtain a preprocessed log sequence;
the preprocessed log sequence is subjected to preliminary detection, wherein the preliminary detection comprises sequential detection;
and if the result of the primary detection is that no abnormality exists, performing secondary detection on the preprocessed log sequence.
Preferably, a gated recurrent network is used to perform the secondary detection on the preprocessed log sequence.
Preferably, the sequential detection comprises:
extracting a detection sequence from the preprocessed log sequence according to a preset detection window size; the next log of the detection sequence in the preprocessed log sequence is a target log;
predicting the probability that the next log of the detection sequence is a target log;
judging whether the probability is smaller than a threshold value;
if so, the preprocessed log sequence is an abnormal sequence.
Preferably, preprocessing log data in an original log sequence to obtain a preprocessed log sequence, which specifically includes:
analyzing each log data in the original log sequence to obtain a corresponding log template;
vectorizing each log template to obtain a semantic vector of each log data;
and all the semantic vectors are arranged according to the sequence of the corresponding log data to obtain a preprocessed log sequence.
Preferably, vectorizing each log template to obtain the semantic vector of each log data specifically includes:
deleting the non-character marks and stop words in the log template, and splitting the compound words in the log template into two or more words;
converting the words into word vectors;
calculating the TF-IDF weight of each word vector in the log template among all word vectors converted from the original log sequence;
and calculating the semantic vector of the log data using all word vectors in the log template and their corresponding TF-IDF weights.
The application also provides a multi-stage log anomaly detection device, which comprises a preprocessing module, a preliminary detection module, and a secondary detection module;
the preprocessing module is used for preprocessing the log data in the original log sequence to obtain a preprocessed log sequence;
the preliminary detection module is used for carrying out preliminary detection on the preprocessed log sequence, wherein the preliminary detection comprises sequential detection;
and the secondary detection module is used for carrying out secondary detection on the preprocessed log sequence when the primary detection result is that no abnormality exists.
Preferably, the secondary detection module is used for performing secondary detection on the preprocessed log sequence using a gated recurrent network.
Preferably, the preliminary detection module comprises an extraction module, a prediction module, a judgment module and an abnormality judgment module;
the extraction module is used for extracting a detection sequence from the preprocessed log sequence according to a preset detection window size; the next log of the detection sequence in the preprocessed log sequence is a target log;
the prediction module is used for predicting the probability that the next log of the detection sequence is the target log;
the judging module is used for judging whether the probability is smaller than a threshold value;
and the abnormality judgment module is used for judging that the preprocessed log sequence is an abnormal sequence when the probability is smaller than the threshold value.
Preferably, the preprocessing module comprises an analysis module, a semantic vector acquisition module and a sequencing module;
the analysis module is used for analyzing each log data in the original log sequence to obtain a corresponding log template;
the semantic vector obtaining module is used for carrying out vectorization on each log template to obtain the semantic vector of each log data;
the ordering module is used for ordering all the semantic vectors according to the sequence of the corresponding log data to obtain a preprocessed log sequence.
Preferably, the semantic vector obtaining module comprises a deletion splitting module, a conversion module, a weight calculation module, and a weighting calculation module;
the deletion splitting module is used for deleting the non-character marks and stop words in the log template and splitting the compound words in the log template into two or more words;
the conversion module is used for converting words into word vectors;
the weight calculation module is used for calculating the TF-IDF weight of each word vector in the log template among all word vectors converted from the original log sequence;
the weighting calculation module is used for calculating semantic vectors of the log data by utilizing all word vectors and corresponding TF-IDF weights in the log template.
Other features of the present application and its advantages will become apparent from the following detailed description of exemplary embodiments of the present application, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description, serve to explain the principles of the application.
FIG. 1 is a flow chart of a multi-stage anomaly detection method for logs provided herein;
FIG. 2 is a schematic flow chart of obtaining semantic vectors of log data according to the present application;
fig. 3 is a block diagram of a multistage abnormality detection apparatus for logs provided in the present application.
Detailed Description
Various exemplary embodiments of the present application will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present application unless it is specifically stated otherwise.
The following description of at least one exemplary embodiment is merely exemplary in nature and is in no way intended to limit the application, its application, or uses.
Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail, but where appropriate, the techniques, methods, and apparatus should be considered part of the specification.
In all examples shown and discussed herein, any specific values should be construed as merely illustrative, and not a limitation. Thus, other examples of exemplary embodiments may have different values.
According to the multi-stage log anomaly detection method and device provided by the application, log data is analyzed in depth from multiple perspectives, multi-stage anomaly detection is achieved, the accuracy of anomaly detection is improved, and the robustness of the model is improved.
As shown in fig. 1, the multi-stage abnormality detection method for a log provided in the present application includes:
s110: preprocessing log data in the original log sequence to obtain a preprocessed log sequence.
As an embodiment, preprocessing log data in an original log sequence to obtain a preprocessed log sequence, which specifically includes:
s1101: and analyzing each log data in the original log sequence to obtain a corresponding log template.
As one example, a public dataset may be downloaded from GitHub as the original log sequence.
Because the original log sequence is unstructured data and contains much specific information (such as IP addresses and file names) that hinders automatic log analysis, each piece of log data needs to be parsed: the parameters in the log data are abstracted away, turning the log data into structured data for subsequent analysis.
As one embodiment, the Drain algorithm is used to parse the log data in the original log sequence into a series of log templates. Table 1 shows one example of log data parsing.
TABLE 1
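The parsing step can be illustrated with a minimal sketch that masks variable parameters to recover a template. The patent itself uses the Drain algorithm, which additionally clusters messages via a parse tree; the regexes and the sample HDFS-style message below are illustrative assumptions, not the patent's exact rules:

```python
import re

def parse_log(message: str) -> str:
    """Mask variable fields in a raw log message to recover its template."""
    template = re.sub(r"\b\d{1,3}(?:\.\d{1,3}){3}(?::\d+)?\b", "<*>", message)  # IP[:port]
    template = re.sub(r"blk_-?\d+", "<*>", template)  # HDFS-style block ids
    template = re.sub(r"\b\d+\b", "<*>", template)    # remaining bare numbers
    return template

print(parse_log("Received block blk_3587508140051953248 of size 67108864 from 10.251.42.84"))
# -> Received block <*> of size <*> from <*>
```

Each distinct template then acts as one event type in the subsequent vectorization and detection steps.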
S1102: Vectorize each log template to obtain the semantic vector of each log data.
As an embodiment, as shown in fig. 2, vectorizing each log template to obtain a semantic vector of each log data, which specifically includes:
p1: preprocessing the log template, deleting the non-character marks and the stop words in the log template, and splitting the combined words in the log template into two or more words.
The log template is regarded as sentences in natural language, and some non-character marks, pause words and combined words exist in the log template, all the non-character marks and pause words are deleted firstly, then the combined words are split, and the combined words are changed into two or more words.
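Step P1 can be sketched as follows. The stop-word list and the camel-case splitting rule are illustrative assumptions; the patent does not specify the exact token rules:

```python
import re

STOP_WORDS = {"a", "an", "the", "of", "for", "to", "is", "in", "on"}  # illustrative subset

def tokenize_template(template: str) -> list[str]:
    """Drop non-letter tokens, split compound (camelCase) words, remove stop words."""
    words = []
    for token in re.findall(r"[A-Za-z]+", template):  # non-character marks like <*> fall away
        # e.g. "PacketResponder" -> ["Packet", "Responder"]
        parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])", token)
        words.extend(p.lower() for p in parts if p.lower() not in STOP_WORDS)
    return words

print(tokenize_template("PacketResponder <*> for block <*> terminating"))
# -> ['packet', 'responder', 'block', 'terminating']
```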
P2: the word is converted into a word vector.
As one embodiment, word vectorization is performed with the FastText algorithm. FastText adequately captures internal relationships between words in English sentences, such as semantic similarity.
P3: Calculate the TF-IDF weight of each word vector in the log template among all word vectors converted from the original log sequence. The TF-IDF weight effectively measures the importance of a word within a sentence, which meets the need to distinguish log sequences. For example, if the word "Block" appears frequently in a certain log sequence, the word may be more representative of that log sequence, so we use the Term Frequency (TF) to describe its importance.
Specifically, TF = #word / #total, where #word is the number of occurrences of the target word in the log sequence and #total is the total number of all words in the log sequence.
On the other hand, if the word "Block" appears in all log sequences, it becomes too common to distinguish these log sequences and should therefore be down-weighted. We therefore also use the Inverse Document Frequency (IDF) as a metric: IDF = log(#L / #Lword), where #L is the total number of log sequences and #Lword is the number of log sequences containing the target word.
For each word, its TF-IDF weight is calculated as TF × IDF.
P4: Calculate the semantic vector V of the log data from all word vectors in the log template and their corresponding TF-IDF weights, as the weighted average V = (1/N) * Σ (w_i * v_i) for i = 1..N, where v_i is the i-th word vector in the template, w_i is its TF-IDF weight, and N is the number of words in the template.
therefore, the semantic vector corresponding to each log data can not only identify log sequences with similar semantics, but also distinguish different log sequences, and the robustness is greatly improved.
S1103: Arrange all semantic vectors in the order of their corresponding log data to obtain the preprocessed log sequence.
S120: Perform preliminary detection on the preprocessed log sequence. If the preliminary detection finds an anomaly, an anomaly alarm is issued; if the preliminary detection finds no anomaly, S130 is performed.
As one example, the preprocessed log sequence is initially detected by a Long Short-Term Memory (LSTM) network model. LSTM is a variant of the Recurrent Neural Network (RNN) designed to process and model long-term dependencies in sequence data. Compared with traditional RNNs, LSTM mitigates problems such as vanishing and exploding gradients by introducing a gating mechanism, thereby better capturing and memorizing long-term dependencies in sequence data.
A program usually executes according to a fixed flow, so normal logs naturally follow certain sequential patterns. In other words, for a given log sequence, if no anomaly occurs, the log template following the current log template is predictable. Therefore, in the present application, the order relationship between logs is learned with the LSTM model.
As one embodiment, the preliminary detection includes sequential detection.
As one embodiment, the sequence detection includes:
s1201: and extracting a detection sequence from the preprocessed log sequence according to a preset detection window size. The next log of the detection sequence in the preprocessed log sequence is the target log.
For the preprocessed log sequence Ω = {v1, v2, …, vn}, W is the detection window size, S is the window step size, and P is the prediction threshold. For example, if the semantic vector sequence of one log sequence is [v3, v1, v4, v6, v1, v7, v3, v5] and the detection window size W = 3, the detection sequence (v3, v1, v4) is extracted according to the window size. In this log sequence, the target log of the detection sequence (v3, v1, v4) is v6.
S1202: predicting the probability that the next log of the detection sequence is the target log.
Specifically, the trained LSTM model is used for prediction, yielding the probability p1 that the next log of the detection sequence (v3, v1, v4) is v6.
S1203: Judge whether the probability is smaller than the threshold. If yes, execute S1204; otherwise, execute S1205.
S1204: If the probability p1 is smaller than the threshold P, the preprocessed log sequence is judged to be an abnormal sequence and an anomaly alarm is issued.
S1205: If the probability p1 is not smaller than the threshold P, the preprocessed log sequence is judged to be a normal sequence.
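Steps S1201-S1205 can be sketched as follows. `predict_proba` is a hypothetical stand-in for the trained LSTM, and the toy model, window size, and threshold are illustrative:

```python
def sequential_detect(sequence, predict_proba, window=3, step=1, threshold=0.1):
    """Slide a detection window over the sequence (S1201); flag the whole
    sequence abnormal if the true next log gets probability < threshold."""
    for start in range(0, len(sequence) - window, step):
        detect_seq = tuple(sequence[start:start + window])
        target = sequence[start + window]         # the actual next log
        p = predict_proba(detect_seq, target)     # S1202: the LSTM would go here
        if p < threshold:                         # S1203/S1204: below threshold -> abnormal
            return "abnormal"
    return "normal"                               # S1205

def toy_model(detect_seq, target):
    # Hypothetical stand-in: only the pattern (v3, v1, v4) -> v6 is "learned".
    return 0.9 if (detect_seq, target) == (("v3", "v1", "v4"), "v6") else 0.01

print(sequential_detect(["v3", "v1", "v4", "v6"], toy_model))  # -> normal
print(sequential_detect(["v3", "v1", "v4", "v7"], toy_model))  # -> abnormal
```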
In addition to the sequential pattern, the template (vector) sequence also exhibits quantitative patterns. Normal program execution has certain invariants and always maintains certain quantitative relationships in the logs under different inputs and workloads. For example, every opened file is eventually closed at some stage, so under normal conditions the number of logs reporting "open file" should equal the number of logs reporting "close file". Such quantitative relationships in the logs capture normal program execution behavior. If a new log violates some invariant, it can be determined that an anomaly occurred during system execution. Therefore, in the present application, the quantitative relationships between logs are also learned with the LSTM model.
Preferably, the preliminary detection further comprises quantitative relationship detection. If the preprocessed log sequence conforms to the quantitative relationships, it is judged to be a normal sequence and step S130 continues to be executed; otherwise, it is judged to be an abnormal sequence and an anomaly alarm is issued.
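The quantitative relationship check can be sketched as a count-invariant test. The patent learns these relationships with an LSTM; the explicit pair-count formulation below is a simplifying assumption for illustration:

```python
from collections import Counter

def quantitative_detect(sequence, invariants):
    """Flag a sequence abnormal if any count invariant is violated,
    e.g. the number of "open" events must equal the number of "close" events."""
    counts = Counter(sequence)
    for a, b in invariants:
        if counts[a] != counts[b]:
            return "abnormal"
    return "normal"

print(quantitative_detect(["open", "read", "close", "open", "close"],
                          [("open", "close")]))  # -> normal
```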
S130: and if the result of the primary detection is that no abnormality exists, performing secondary detection on the preprocessed log sequence.
As an embodiment, a Gated Recurrent Unit (GRU) network is used for the secondary detection of the preprocessed log sequence: the preprocessed log sequence obtained in S110 is input into the GRU network, which outputs the probability that the sequence is abnormal. If the probability is greater than the threshold, the preprocessed log sequence is judged to be an abnormal sequence, an abnormal result is output, and an anomaly alarm is issued; otherwise, the sequence is judged to be a normal sequence and a normal result is output.
In the present application, during the training stage of the GRU network, starting from known normal log sequences (whose labels are "normal"), the labels of the unlabeled log sequences in the training set are further estimated following the idea of PU learning (Positive and Unlabeled Learning), and clustering is used to identify log sequences with similar semantics.
As one embodiment, training of the GRU network includes the steps of:
q1: all log sequences in the training set (including log sequences with labels (normal labels or abnormal labels) and log sequences without labels) are clustered, so that each clustered group is more likely to contain log sequences with similar semantics.
As one example, clustering is performed with the HDBSCAN algorithm (Hierarchical Density-Based Spatial Clustering of Applications with Noise).
Q2: Predict the labels of the unlabeled log sequences according to the clustering result to obtain an optimized training set.
As one embodiment, each unlabeled log sequence is assigned a probabilistic label by estimating the probability that it belongs to each label, to reduce the impact of noise on model training.
Specifically, if a labeled log sequence A exists in the cluster containing an unlabeled log sequence B, the probability that B has the same label as A is inferred from A. If no labeled log sequence exists in B's cluster, the label is inferred by searching for a similar cluster that does contain labels. In this way, every log sequence receives a label, making the optimized training set more learnable.
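Steps Q1-Q2 can be sketched as follows, assuming the clustering (HDBSCAN in the patent) has already grouped the sequences. The data layout and the probability-of-normal formulation are illustrative assumptions:

```python
def propagate_labels(clusters):
    """Assign each unlabeled sequence a probabilistic label from its cluster.

    `clusters` maps a cluster id to a list of (sequence_id, label) pairs,
    where label is "normal", "abnormal", or None for unlabeled sequences.
    """
    probs = {}
    for members in clusters.values():
        labeled = [lab for _, lab in members if lab is not None]
        if not labeled:
            continue  # the patent falls back to the most similar labeled cluster
        p_normal = labeled.count("normal") / len(labeled)
        for seq_id, lab in members:
            if lab is None:
                probs[seq_id] = p_normal  # probability this sequence is normal
    return probs

clusters = {0: [("A", "normal"), ("B", None)],
            1: [("C", "abnormal"), ("D", None)]}
print(propagate_labels(clusters))  # -> {'B': 1.0, 'D': 0.0}
```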
Q3: Train the GRU network with the optimized training set to obtain the trained GRU model.
Based on the method, the application also provides a log multi-stage abnormality detection device. As shown in fig. 3, the multi-stage abnormality detection apparatus of the log includes a preprocessing module 310, a primary detection module 320, and a secondary detection module 330.
The preprocessing module 310 is configured to preprocess log data in an original log sequence, and obtain a preprocessed log sequence.
The preliminary detection module 320 is configured to perform preliminary detection on the preprocessed log sequence, where the preliminary detection includes sequential detection.
The secondary detection module 330 is configured to perform secondary detection on the preprocessed log sequence when the result of the primary detection is that no abnormality exists.
Preferably, the secondary detection module 330 is configured to perform secondary detection on the preprocessed log sequence by using the gated loop network.
Preferably, the preliminary detection module 320 includes an extraction module 3201, a prediction module 3202, a judgment module 3203, and an abnormality judgment module 3204.
The extraction module 3201 is configured to extract a detection sequence from the preprocessed log sequence according to a preset detection window size. The next log of the detection sequence in the preprocessed log sequence is the target log.
The prediction module 3202 is configured to predict a probability that a next log of the detection sequence is a target log.
The decision module 3203 is configured to determine whether the probability is less than a threshold.
The anomaly determination module 3204 is configured to determine that the preprocessed log sequence is an abnormal sequence when the probability is less than the threshold.
Preferably, the preprocessing module 310 includes a parsing module 3101, a semantic vector obtaining module 3102, and a ranking module 3103.
The parsing module 3101 is configured to parse each log data in the original log sequence to obtain a corresponding log template.
The semantic vector obtaining module 3102 is configured to vectorize each log template to obtain a semantic vector of each log data.
The sorting module 3103 is configured to sort all semantic vectors according to the sequence of the corresponding log data, and obtain a preprocessed log sequence.
Preferably, the semantic vector obtaining module 3102 includes a delete splitting module, a transform module, a weight calculation module, and a weighting calculation module.
The deletion splitting module is used for deleting the non-character marks and the pause words in the log template and splitting the combined words in the log template into two or more words.
The translation module is used for translating words into word vectors.
The weight calculation module is used for calculating the TF-IDF weight of each word vector in the log template among all word vectors converted from the original log sequence.
The weighting calculation module is used for calculating semantic vectors of the log data by utilizing all word vectors and corresponding TF-IDF weights in the log template.
Preferably, the multi-stage anomaly detection device is a trained neural network model comprising at least an LSTM model for the preliminary detection and a GRU network model for the secondary detection; during training, the whole model is optimized through coordinated training of the two models.
To illustrate the detection effect of the present application, baseline systems were compared with the model of the present application on the public datasets HDFS and BGL. Table 2 shows the performance comparison between the present application and the baseline models on HDFS, and Table 3 shows the comparison on BGL.
TABLE 2 comparative experiments on HDFS
Method        Precision    Recall    F1-score
DeepLog       0.953        0.961     0.957
LogAnomaly    0.960        0.940     0.950
LogRobust     0.980        1.000     0.999
CNN           0.946        0.995     0.970
Our model     0.997        0.998     0.998
Table 3 comparative experiments on BGL
Method        Precision    Recall    F1-score
DeepLog       0.900        0.960     0.929
LogAnomaly    0.970        0.940     0.960
LogRobust     0.912        0.964     0.937
CNN           0.966        0.977     0.972
Our model     0.989        0.977     0.983
As can be seen from Tables 2 and 3, the model of the present application achieves good results on both the HDFS and BGL datasets, and the experiments demonstrate the effectiveness of the multi-stage anomaly detection method of the present application. Because the method mines log data deeply from multiple angles and makes fuller use of the information contained in the log data, it performs well in precision, recall, and F1-score, alleviating the problems of low anomaly detection accuracy and poor model robustness in current large-scale systems.
Although specific embodiments of the present application have been described in detail by way of example, it will be appreciated by those skilled in the art that the above examples are for illustration only and are not intended to limit the scope of the present application. It will be appreciated by those skilled in the art that modifications may be made to the above embodiments without departing from the scope and spirit of the present application. The scope of the application is defined by the appended claims.

Claims (10)

1. A multi-stage anomaly detection method for a log, comprising:
preprocessing log data in an original log sequence to obtain a preprocessed log sequence;
performing preliminary detection on the preprocessed log sequence, wherein the preliminary detection comprises sequential detection;
and if the result of the preliminary detection is that no abnormality exists, performing secondary detection on the preprocessed log sequence.
2. The multi-stage anomaly detection method of claim 1, wherein secondary detection is performed on the preprocessed log sequence using a gated recurrent network.
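The gated recurrent network of claim 2 can be illustrated with a single scalar GRU cell. This is a minimal sketch with hypothetical weights; an actual secondary detector would use a trained, vectorized GRU:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x: float, h: float, w: dict) -> float:
    """One scalar GRU step: update gate z, reset gate r, candidate state."""
    z = sigmoid(w["wz_x"] * x + w["wz_h"] * h + w["bz"])   # update gate
    r = sigmoid(w["wr_x"] * x + w["wr_h"] * h + w["br"])   # reset gate
    h_cand = math.tanh(w["wh_x"] * x + w["wh_h"] * (r * h) + w["bh"])
    return (1.0 - z) * h + z * h_cand                      # interpolated new state

# With all-zero weights the gates equal 0.5 and the candidate is 0,
# so each step simply halves the hidden state.
w0 = {k: 0.0 for k in ("wz_x", "wz_h", "bz", "wr_x", "wr_h", "br",
                       "wh_x", "wh_h", "bh")}
print(gru_step(1.0, 1.0, w0))  # → 0.5
```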
3. The multi-stage anomaly detection method of claim 1, wherein the sequential detection comprises:
extracting a detection sequence from the preprocessed log sequence according to a preset detection window size; wherein the log following the detection sequence in the preprocessed log sequence is a target log;
predicting the probability that the next log of the detection sequence is the target log;
judging whether the probability is smaller than a threshold value;
and if so, determining that the preprocessed log sequence is an abnormal sequence.
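The sequential detection steps above can be sketched as a sliding-window check. The function and the toy predictor below are hypothetical stand-ins; in the actual method the probability would come from a trained sequence model:

```python
from typing import Callable, Sequence

def sequential_detect(
    logs: Sequence[str],
    window: int,
    threshold: float,
    predict_prob: Callable[[Sequence[str], str], float],
) -> bool:
    """Slide a window over the log sequence; flag the sequence as anomalous
    if the log that actually follows a window is ever predicted with
    probability below the threshold."""
    for i in range(len(logs) - window):
        detection_seq = logs[i : i + window]
        target = logs[i + window]          # the log that actually follows
        if predict_prob(detection_seq, target) < threshold:
            return True                    # abnormal sequence
    return False                           # no anomaly found at this stage

# Toy predictor: event 'E5' is considered very unlikely after any window.
toy = lambda win, nxt: 0.01 if nxt == "E5" else 0.9
print(sequential_detect(["E1", "E2", "E3", "E4"], 2, 0.1, toy))  # → False
print(sequential_detect(["E1", "E2", "E5", "E4"], 2, 0.1, toy))  # → True
```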
4. The multi-stage anomaly detection method of claim 1, wherein preprocessing log data in an original log sequence to obtain a preprocessed log sequence specifically comprises:
analyzing each log data in the original log sequence to obtain a corresponding log template;
vectorizing each log template to obtain a semantic vector of each log data;
and arranging all semantic vectors in the order of their corresponding log data to obtain the preprocessed log sequence.
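The parsing step of claim 4 maps each raw log line to a log template by masking its variable fields. A minimal regex-based sketch is shown below; the patterns are illustrative assumptions, and a real system would often use a dedicated log parser such as Drain:

```python
import re

def to_template(log_line: str) -> str:
    """Mask variable fields so log lines with the same event shape
    map to the same template string."""
    t = re.sub(r"blk_-?\d+", "<*>", log_line)   # HDFS block IDs
    t = re.sub(r"/[\w./-]+", "<*>", t)          # file-system paths
    t = re.sub(r"\b\d+\b", "<*>", t)            # bare numbers
    return t

print(to_template("Received block blk_3587508140051953248 of size 67108864"))
# → 'Received block <*> of size <*>'
```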
5. The multi-stage anomaly detection method of claim 4, wherein vectorizing each log template to obtain a semantic vector of each log data specifically comprises:
deleting non-alphabetic tokens and stop words from the log template, and splitting compound words in the log template into two or more words;
converting each word into a word vector;
calculating the TF-IDF weight of each word vector in the log template among all word vectors converted from the original log sequence;
and calculating the semantic vector of the log data using all word vectors in the log template and their corresponding TF-IDF weights.
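The vectorization steps of claim 5 can be sketched as a TF-IDF-weighted sum of word vectors. The embedding lookup and the exact weighting scheme below are assumptions for illustration; in practice the word vectors would come from a pretrained embedding model:

```python
import math
from collections import Counter

def semantic_vector(template_words, all_templates, word_vecs):
    """TF-IDF-weighted sum of the word vectors in one log template.
    `word_vecs` stands in for a pretrained embedding lookup (hypothetical)."""
    n_docs = len(all_templates)
    tf = Counter(template_words)
    dim = len(next(iter(word_vecs.values())))
    vec = [0.0] * dim
    for word, count in tf.items():
        df = sum(1 for t in all_templates if word in t)   # document frequency
        idf = math.log(n_docs / df)
        weight = (count / len(template_words)) * idf       # tf * idf
        for j in range(dim):
            vec[j] += weight * word_vecs[word][j]
    return vec

# Toy corpus of two templates with 2-dimensional toy embeddings.
templates = [["receive", "block"], ["delete", "block"]]
vecs = {"receive": [1.0, 0.0], "delete": [0.0, 1.0], "block": [0.5, 0.5]}
v = semantic_vector(templates[0], templates, vecs)
# "block" appears in every template, so its IDF (and contribution) is zero.
```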
6. A multi-stage anomaly detection device for logs, characterized by comprising a preprocessing module, a preliminary detection module, and a secondary detection module;
the preprocessing module is used for preprocessing log data in the original log sequence to obtain a preprocessed log sequence;
the preliminary detection module is used for carrying out preliminary detection on the preprocessed log sequence, and the preliminary detection comprises sequential detection;
and the secondary detection module is used for performing secondary detection on the preprocessed log sequence when the preliminary detection result is that no abnormality exists.
7. The multi-stage anomaly detection device of claim 6, wherein the secondary detection module is configured to perform secondary detection on the preprocessed log sequence using a gated recurrent network.
8. The multi-stage abnormality detection device according to claim 6, wherein the preliminary detection module includes an extraction module, a prediction module, a judgment module, and an abnormality judgment module;
the extraction module is used for extracting a detection sequence from the preprocessed log sequence according to a preset detection window size; wherein, the next log of the detection sequence in the preprocessed log sequence is a target log;
the prediction module is used for predicting the probability that the next log of the detection sequence is the target log;
the judging module is used for judging whether the probability is smaller than a threshold value or not;
and the abnormality judgment module is used for judging that the preprocessed log sequence is an abnormal sequence when the probability is smaller than the threshold value.
9. The multi-stage anomaly detection device of claim 6, wherein the preprocessing module comprises a parsing module, a semantic vector obtaining module, and a sorting module;
the analysis module is used for analyzing each log data in the original log sequence to obtain a corresponding log template;
the semantic vector obtaining module is used for carrying out vectorization on each log template to obtain a semantic vector of each log data;
the sorting module is used for arranging all semantic vectors according to the sequence of the corresponding log data to obtain the preprocessed log sequence.
10. The multi-stage anomaly detection device of claim 9, wherein the semantic vector obtaining module comprises a deletion splitting module, a conversion module, a weight calculation module, and a weighting calculation module;
the deletion splitting module is used for deleting the non-character marks and the pause words in the log template and splitting the combined words in the log template into two or more words;
the conversion module is used for converting the words into word vectors;
the weight calculation module is used for calculating TF-IDF weight of each word vector in the log template in all word vectors converted by the original log sequence;
the weighting calculation module is used for calculating semantic vectors of the log data by utilizing all word vectors and corresponding TF-IDF weights in the log template.
CN202311482357.3A 2023-11-08 2023-11-08 Multi-stage abnormality detection method and device for logs Pending CN117421195A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311482357.3A CN117421195A (en) 2023-11-08 2023-11-08 Multi-stage abnormality detection method and device for logs


Publications (1)

Publication Number Publication Date
CN117421195A true CN117421195A (en) 2024-01-19

Family

ID=89524734

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311482357.3A Pending CN117421195A (en) 2023-11-08 2023-11-08 Multi-stage abnormality detection method and device for logs

Country Status (1)

Country Link
CN (1) CN117421195A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination