
CN114969761A - Log anomaly detection method based on LDA topic features

Info

Publication number: CN114969761A
Application number: CN202210689100.4A
Authority: CN
Inventors: 戴华; 孙雪奎; 周建国; 周倩; 杨庚; 陈燕俐
Original assignee: Nanjing University of Posts and Telecommunications
Current assignee: Nanjing University of Posts and Telecommunications
Priority date: 2022-06-17
Filing date: 2022-06-17
Publication date: 2022-08-30

Abstract

The invention discloses a log anomaly detection method based on LDA topic features. The method comprises two stages: model training and anomaly detection. In the model training stage, a log parser parses the system log into a log template set and a log triple set. The log template set is used to train an LDA (latent Dirichlet allocation) model, yielding a log template topic classification model, LDA-CM. The LDA-CM model then converts the log triples into per-process log template topic sequences, from which training samples are constructed with a sliding window mechanism; the training samples are finally fed to an LSTM (long short-term memory) model to train a log anomaly detection model, LSTM-ADM. In the anomaly detection stage, the process log to be detected is converted into the corresponding template topic sequence, which is then input into the LSTM-ADM model to detect anomalies in the process log.

Description

Log anomaly detection method based on LDA topic features

Technical Field

The invention belongs to the technical field of security, and particularly relates to a log anomaly detection method based on LDA topic features.

Background

In the field of system security, detecting software or system anomalies through logs is a common protective measure. Bugs are inevitable in software systems of every scale, from small and simple programs to large and complex ones, including distributed file systems and high-performance cloud computing management platforms, and these bugs can cause the system to run abnormally. Furthermore, attackers may exploit vulnerabilities in software and systems to launch dangerous attacks that compromise the system. Timely and accurate detection of these anomalies is therefore crucial to building secure and trusted systems. However, existing anomaly detection methods cannot accurately learn the semantic features that distinguish normal logs from abnormal logs, so they generalize poorly and do not perform well in practice.

Logs are a common and major data source for anomaly detection in almost all computer systems; they record a series of significant events describing the running state of software and systems. Existing methods that analyze system logs for anomaly detection fall into four categories: methods based on principal component analysis (PCA) that detect anomalies from log data counts, methods based on invariant mining (IM) that capture recurring patterns in logs, workflow-based methods, and deep learning-based methods. The first three categories can achieve good results in specific application scenarios but cannot be used to detect different kinds of attacks. Deep learning-based methods use log templates for classification to learn behavior patterns within a log sequence. Current deep learning-based methods cannot accurately learn the semantic relationships among logs; when new log templates are injected, their stability is severely affected and the model may fail. In addition, performance metrics such as precision, recall, and F1 score (the harmonic mean of precision and recall) need further improvement to cope with complex and changing software and systems.

Disclosure of Invention

To learn the semantic relationships among logs more accurately and to detect abnormal process or system behavior more effectively from unstructured log records, the invention provides a log anomaly detection method based on LDA topic features.

To achieve this purpose, the invention adopts the following technical solution:

The invention relates to a log anomaly detection method based on LDA topic features, which mainly comprises two stages: first, a model training stage, in which training samples are constructed by extracting the template topic features of log data and an anomaly detection model is then trained; and second, an anomaly detection stage, in which the anomaly detection model is used to detect anomalies in process logs.

Model training stage:

(1) Acquire system log data $L = \{log_1, log_2, \dots, log_n\}$ and the corresponding process set $P = \{p_1, p_2, \dots, p_f\}$, where the logs in $L$ are generated by the processes in $P$. Parse the logs in $L$ with a log parser to generate a log template set $K = \{k_1, k_2, \dots, k_m\}$ and a log triple set $D = \{d_1, d_2, \dots, d_n\}$, where $d_i$ is the log triple $(k, pid, ts)$ corresponding to $log_i$, in which $k$ is the log template, $pid$ is the process identifier, and $ts$ is the timestamp at which the log was generated.

(2) Preprocess the log template set $K$ and input the preprocessed data into an LDA model whose preset topic set is $T = \{t_1, t_2, \dots, t_r\}$; training generates the LDA-based log template topic classification model LDA-CM.

(3) Initialize a log template topic mapping dictionary TD, and use the LDA-CM model to compute the topic probability distribution vector $\Theta_i$ of each log template $k_i$ in $K$. The dimension of $\Theta_i$ equals the number of topics in $T$, and $\Theta_i[j]$ is the probability that $k_i$ belongs to topic $t_j$. Then obtain the topic $t_x$ corresponding to the maximum probability value in $\Theta_i$ and add the mapping $\{k_i \to t_x\}$ to TD.

(4) Process the log triple set $D$ according to the processes in $P$, and use TD to establish the log template topic sequence of each process in $P$; the resulting sequence set is denoted $S = \{S_1, S_2, \dots, S_f\}$. The specific steps are as follows:

(4a) Divide $D$ into subsets according to the ID of each process in $P$, i.e. the $pid$ in the log triples, and sort the log triples in each subset by timestamp $ts$, thereby generating for each process $p_i$ a corresponding log triple sequence $D_i$. Then extract the log template from each log triple in $D_i$ to obtain the log template sequence $\langle k_{i,1}, k_{i,2}, \dots, k_{i,q} \rangle$ of process $p_i$.

(4b) For each log template $k_{i,j}$ in the log template sequence of each process $p_i$ in $P$, use the log template topic mapping in TD to determine the topic $t_x$ of $k_{i,j}$, and thereby establish the log template topic sequence $S_i = \langle t_{i,1}, t_{i,2}, \dots, t_{i,q} \rangle$ for process $p_i$. The log template topic sequences of all processes in $P$ form the set $S = \{S_1, S_2, \dots, S_f\}$.

(5) Using a sliding window mechanism, process the log template topic sequence $S_i$ of each process $p_i$ in $S$ to generate a training sample set TP. The specific steps are as follows:

(5a) Initialize the sliding window length to $h$, the sliding step to 1, and the training sample set TP to empty.

(5b) Process the log template topic sequence $S_i = \langle t_{i,1}, t_{i,2}, \dots, t_{i,q} \rangle$ of each process $p_i$ in $S$. If the number of log template topics in $S_i$ does not exceed $h$, i.e. $q \le h$, the log template topic window set $W_i$ corresponding to $S_i$ is empty. Otherwise, move the sliding window to construct the log template topic window set $W_i = \{w_{i,1}, w_{i,2}, \dots, w_{i,y}\}$ corresponding to $S_i$, where $w_{i,1} = \langle t_{i,1}, t_{i,2}, \dots, t_{i,h} \rangle$, $w_{i,2} = \langle t_{i,2}, t_{i,3}, \dots, t_{i,h+1} \rangle$, …, $w_{i,y} = \langle t_{i,q-h}, t_{i,q-h+1}, \dots, t_{i,q-1} \rangle$, so that $y = q - h$. Then construct the training sample pairs $(w_{i,1}, t_{i,h+1})$, $(w_{i,2}, t_{i,h+2})$, …, $(w_{i,y}, t_{i,q})$ and add them to TP.

(6) Train an LSTM model with the training samples in TP to generate the LSTM-based process log anomaly detection model LSTM-ADM.
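Steps (3) and (4) of the training stage can be illustrated with a short Python sketch. It assumes that the topic probability vectors have already been produced by a trained LDA-CM model and are supplied as plain lists, and that topics are represented by their integer index in $T$. The function names `build_topic_dict` and `build_topic_sequences`, the example templates, and the probability values are hypothetical and serve only as illustration.

```python
from collections import defaultdict


def build_topic_dict(theta):
    """Step (3): map each log template to its most probable topic index.

    theta maps a template string to its topic probability vector, as an
    LDA-CM model would return it (the model itself is not shown here).
    """
    td = {}
    for template, probs in theta.items():
        # Index of the maximum probability; ties resolve to the lowest index.
        td[template] = max(range(len(probs)), key=probs.__getitem__)
    return td


def build_topic_sequences(triples, td):
    """Step (4): group triples (k, pid, ts) by pid, sort by ts, map k to topic."""
    by_pid = defaultdict(list)
    for k, pid, ts in triples:
        by_pid[pid].append((ts, k))
    sequences = {}
    for pid, items in by_pid.items():
        items.sort(key=lambda item: item[0])  # order by timestamp only
        sequences[pid] = [td[k] for _, k in items]
    return sequences


# Illustrative data only (assumed values, not taken from the patent).
theta = {
    "open file <*>": [0.7, 0.2, 0.1],
    "read block <*>": [0.1, 0.8, 0.1],
    "close file <*>": [0.6, 0.1, 0.3],
}
triples = [
    ("read block <*>", 17, 2.0),
    ("open file <*>", 17, 1.0),
    ("open file <*>", 42, 1.5),
    ("close file <*>", 17, 3.0),
]
td = build_topic_dict(theta)
sequences = build_topic_sequences(triples, td)
print(td)
print(sequences)
```

Keeping TD as an ordinary dictionary means that each template is classified by the topic model only once; later occurrences are resolved by lookup.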

Anomaly detection stage:

(1) Let the log sequence of the process $p$ to be detected be $L_p = \langle log_{p,1}, log_{p,2}, \dots, log_{p,v} \rangle$. Process the logs in $L_p$ in order with the log parser to generate the log template sequence $K_p = \langle k_{p,1}, k_{p,2}, \dots, k_{p,v} \rangle$ of the process.

(2) Use the log template topic mapping dictionary TD to convert $K_p$ into a log template topic sequence $S_p$. The specific steps are as follows:

(2a) Initialize $S_p$ as an empty sequence.

(2b) Process each log template $k_{p,i}$ in $K_p$ in order. Check whether the log template topic mapping dictionary TD contains a topic mapping $\{k_{p,i} \to t_j\}$ for $k_{p,i}$. If it does, append $t_j$ to $S_p$. Otherwise, input $k_{p,i}$ into the LDA-CM model to obtain its log template topic probability distribution vector $\Theta_{p,i}$; let the maximum probability value in $\Theta_{p,i}$ be $\Theta_{p,i}[x]$, so that the topic corresponding to log template $k_{p,i}$ is $t_x$; then append $t_x$ to $S_p$, create a new log template topic mapping $\{k_{p,i} \to t_x\}$, and add it to TD. This finally yields the log template topic sequence $S_p = \langle t_{p,1}, t_{p,2}, \dots, t_{p,v} \rangle$ corresponding to $K_p$.

(3) Using a sliding window mechanism, process $S_p$, perform anomaly detection with the LSTM-ADM model, and return the detection result. The specific steps are as follows:

(3a) Initialize the sliding window length to $h$ and the sliding step to 1; set the log template topic window set $W_p$ to empty and the detection pair set DP to empty.

(3b) Compare the number of log template topics in $S_p$ with the sliding window length. If the former is not greater than the latter, i.e. $v \le h$, anomaly detection for process $p$ ends; otherwise, continue with the next step.

(3c) Move the sliding window to construct the log template topic window set $W_p = \{w_{p,1}, w_{p,2}, \dots, w_{p,y}\}$ corresponding to $S_p$, where $w_{p,1} = \langle t_{p,1}, t_{p,2}, \dots, t_{p,h} \rangle$, $w_{p,2} = \langle t_{p,2}, t_{p,3}, \dots, t_{p,h+1} \rangle$, …, $w_{p,y} = \langle t_{p,v-h}, t_{p,v-h+1}, \dots, t_{p,v-1} \rangle$. Then construct the detection pair set $DP = \{(w_{p,1}, t_{p,h+1}), (w_{p,2}, t_{p,h+2}), \dots, (w_{p,y}, t_{p,v})\}$.

(3d) For each detection pair $(w_{p,i}, t_{p,h+i})$ in DP, input the log template topic window $w_{p,i}$ into LSTM-ADM to obtain the predicted log template topic probability distribution vector $V_{p,i}$ for the log that follows window $w_{p,i}$. The dimension of $V_{p,i}$ equals the number of topics in $T$, and $V_{p,i}[j]$ is the probability that the predicted log template topic of the log following window $w_{p,i}$ is topic $t_j$. Then take the topics corresponding to the $g$ largest probability values in $V_{p,i}$ to form the predicted log template topic set $CS_{p,i}$. If $t_{p,h+i} \notin CS_{p,i}$, process $p$ is abnormal and detection ends.

(3e) When all detection pairs in DP have been processed and no anomaly has been detected, process $p$ has no anomaly and detection ends.
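The detection loop of step (3) can be sketched in Python as follows. The callable `predict` stands in for the trained LSTM-ADM model and is an assumption of this sketch, as are the helper names `top_g` and `detect` and the toy predictor. The text only states that detection ends when $v \le h$; returning `False` in that case is likewise an assumption.

```python
def top_g(probs, g):
    """Return the indices of the g largest probabilities as a set."""
    order = sorted(range(len(probs)), key=lambda j: probs[j], reverse=True)
    return set(order[:g])


def detect(topics, predict, h, g):
    """Return True if the topic sequence is judged abnormal, else False.

    predict takes a window (tuple of topic indices) and returns a
    probability vector over all topics; it stands in for LSTM-ADM.
    """
    if len(topics) <= h:
        # Too short to form a detection pair (assumed to mean "no anomaly").
        return False
    for i in range(len(topics) - h):
        window = tuple(topics[i:i + h])
        actual = topics[i + h]
        if actual not in top_g(predict(window), g):
            return True  # actual topic is outside the g most likely topics
    return False


def toy_predict(window):
    """Hypothetical predictor: expects topics to cycle 0 -> 1 -> 2 -> 0."""
    probs = [0.1, 0.1, 0.1]
    probs[(window[-1] + 1) % 3] = 0.8
    return probs


print(detect([0, 1, 2, 0, 1], toy_predict, h=2, g=1))
print(detect([0, 1, 2, 2, 1], toy_predict, h=2, g=1))
```

A larger $g$ makes the check more tolerant, since more candidate topics count as normal for each window.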

The beneficial effects of the invention are as follows:

The invention proposes, for the first time, a log anomaly detection method based on LDA topic features. The method extracts features from logs and converts them into log template topics, thereby overcoming the shortcomings of existing anomaly detection methods based on log templates.

The LDA topic model used by the method is an unsupervised model: it needs only log template data as a corpus and a specified number of topics, and it can be trained without labels to obtain the LDA-CM topic classification model, so the method is easy to implement.

In addition, the LDA-CM topic classification model can match a newly added log template to the most relevant log template topic, which addresses the lack of robustness of existing methods when new log templates are injected.

Drawings

FIG. 1 is a diagram of log data preprocessing according to the present invention.

FIG. 2 is a flowchart of an overall framework of the LDA topic feature-based log anomaly detection method of the present invention.

Detailed Description

In the following description, for purposes of explanation, numerous implementation details are set forth in order to provide a thorough understanding of the embodiments of the invention. It should be understood, however, that these implementation details are not to be interpreted as limiting the invention. That is, in some embodiments of the invention, such implementation details are not necessary.

As shown in FIG. 2, the invention is a log anomaly detection method based on LDA topic features, which comprises two stages:

Model training stage: first, a log parser parses the system log into a log template set and a log triple set, where the log template set is used to train an LDA (latent Dirichlet allocation) model to obtain the log template topic classification model LDA-CM. Then the LDA-CM model converts the log triples into per-process log template topic sequences, training samples are constructed with a sliding window mechanism, and the training samples are finally input into an LSTM model to train the log anomaly detection model LSTM-ADM.

Anomaly detection stage: first, the process log to be detected is converted into the corresponding template topic sequence, which is then input into the LSTM-ADM model generated in the model training stage to detect anomalies in the process log.
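Both stages rely on the same sliding-window construction: a window of $h$ consecutive topics is paired with the topic that immediately follows it. A minimal Python sketch, assuming topics are represented as integer indices and using the hypothetical helper name `window_pairs`, could look as follows:

```python
def window_pairs(topics, h):
    """Return (window, next_topic) pairs for one topic sequence.

    With q = len(topics), this yields q - h pairs when q > h and no
    pairs otherwise, using a sliding step of 1.
    """
    if h <= 0:
        raise ValueError("window length h must be positive")
    return [
        (tuple(topics[i:i + h]), topics[i + h])
        for i in range(len(topics) - h)
    ]


pairs = window_pairs([3, 1, 4, 1, 5], h=3)
print(pairs)
```

In the training stage the pairs serve as (input, label) samples; in the detection stage the second element of each pair is the observed topic that is compared with the model's prediction.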

The invention is further illustrated below with reference to the accompanying figures 1-2.

The log anomaly detection method based on LDA topic features provided by the invention comprises the following steps:

In the model training stage:

(1) Acquire system log data $L = \{log_1, log_2, \dots, log_n\}$ and the process set $P = \{p_1, p_2, \dots, p_f\}$, where the logs in $L$ are generated by the processes in $P$. Parse the logs in $L$ with a log parser to generate a log template set $K = \{k_1, k_2, \dots, k_m\}$ and a log triple set $D = \{d_1, d_2, \dots, d_n\}$, where $d_i$ is the log triple $(k, pid, ts)$ corresponding to $log_i$, in which $k$ is the log template, $pid$ is the process identifier, and $ts$ is the timestamp at which the log was generated.

(2) Split each log template in the log template set $K$ into words to obtain a word list WL, convert each word in WL to lowercase, and filter out stop words and identifiers that carry no semantic meaning. Finally, convert the word list into a corpus entry using the bag-of-words model and add it to the corpus list CL. Input the corpus list CL into an LDA model whose preset topic set is $T = \{t_1, t_2, \dots, t_r\}$; training generates the LDA-based log template topic classification model LDA-CM.
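The preprocessing in step (2) can be sketched in plain Python as follows. The stop-word list, the regular expression used to drop tokens without letters (such as the `<*>` placeholder or numeric values), and the function names `tokenize` and `build_corpus` are assumptions made for illustration; the text does not specify them. The LDA training itself is not shown.

```python
import re
from collections import Counter

# Illustrative stop-word list; a real deployment would use a fuller one.
STOP_WORDS = {"a", "an", "the", "to", "of", "for", "from", "in", "on", "is"}


def tokenize(template):
    """Split a template into lowercase words and drop stop words.

    Only alphabetic runs are kept, so placeholders and numbers vanish.
    """
    words = re.findall(r"[A-Za-z]+", template)
    return [w.lower() for w in words if w.lower() not in STOP_WORDS]


def build_corpus(templates):
    """Return (vocab, corpus) in bag-of-words form.

    vocab maps word -> integer id; each corpus entry is a sorted list
    of (word_id, count) pairs for one template.
    """
    docs = [tokenize(t) for t in templates]
    vocab = {}
    for doc in docs:
        for word in doc:
            vocab.setdefault(word, len(vocab))
    corpus = [sorted(Counter(vocab[w] for w in doc).items()) for doc in docs]
    return vocab, corpus


templates = [
    "Received block <*> of size <*> from <*>",
    "Deleting block <*> file <*>",
]
vocab, corpus = build_corpus(templates)
print(vocab)
print(corpus)
```

A list of (word id, count) pairs per document is a common input representation for LDA implementations, which is why the sketch stops at that point.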


In the anomaly detection stage:

(2) Use the log template topic mapping dictionary TD to convert $K_p$ into a log template topic sequence $S_p$. The specific steps are as follows:

(2a) Initialize $S_p$ as an empty sequence.

(2b) Process each log template $k_{p,i}$ in $K_p$ in order. Query the log template topic mapping dictionary TD for a topic mapping $\{k_{p,i} \to t_j\}$ of $k_{p,i}$. If one exists, append $t_j$ to $S_p$. Otherwise, input $k_{p,i}$ into the LDA-CM model to obtain its log template topic probability distribution vector $\Theta_{p,i}$; let the maximum probability value in $\Theta_{p,i}$ be $\Theta_{p,i}[x]$, so that the topic corresponding to log template $k_{p,i}$ is $t_x$; then append $t_x$ to $S_p$, create a new log template topic mapping $\{k_{p,i} \to t_x\}$, and add it to TD. This finally yields the log template topic sequence $S_p = \langle t_{p,1}, t_{p,2}, \dots, t_{p,v} \rangle$ corresponding to $K_p$.

(3) Using a sliding window mechanism, process $S_p$, perform anomaly detection with the LSTM-ADM model, and return the detection result. The specific steps are as follows:

(3c) Move the sliding window to construct the log template topic window set $W_p = \{w_{p,1}, w_{p,2}, \dots, w_{p,y}\}$ corresponding to $S_p$, where $w_{p,1} = \langle t_{p,1}, t_{p,2}, \dots, t_{p,h} \rangle$, $w_{p,2} = \langle t_{p,2}, t_{p,3}, \dots, t_{p,h+1} \rangle$, …, $w_{p,y} = \langle t_{p,v-h}, t_{p,v-h+1}, \dots, t_{p,v-1} \rangle$, and generate the detection pair set $DP = \{(w_{p,1}, t_{p,h+1}), (w_{p,2}, t_{p,h+2}), \dots, (w_{p,y}, t_{p,v})\}$.

(3d) For each detection pair $(w_{p,i}, t_{p,h+i})$ in DP, input the log template topic window $w_{p,i}$ into LSTM-ADM to obtain the predicted log template topic probability distribution vector $V_{p,i}$ for the log that follows window $w_{p,i}$. The dimension of $V_{p,i}$ equals the number of topics in $T$, and $V_{p,i}[j]$ is the probability that the predicted log template topic of the log following window $w_{p,i}$ is topic $t_j$. Then take the topics corresponding to the $g$ largest probability values in $V_{p,i}$ to form the predicted log template topic set $CS_{p,i}$. If $t_{p,h+i} \notin CS_{p,i}$, process $p$ is abnormal and detection ends.

(3e) When all detection pairs in DP have been processed and no anomaly has been detected, process $p$ has no anomaly and detection ends.
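The lookup-with-fallback logic of step (2b) can be sketched in Python as follows. The callable `infer` stands in for the LDA-CM model and is hypothetical, as are the function name `to_topic_sequence`, the toy classifier, and the example templates; topics are again represented by integer indices.

```python
def to_topic_sequence(templates, td, infer):
    """Convert a template sequence to a topic sequence, extending td.

    td is the template -> topic dictionary from training. infer takes a
    template and returns its topic probability vector (stand-in for LDA-CM).
    """
    sequence = []
    for k in templates:
        if k not in td:
            # Unseen template: classify it once and remember the mapping.
            probs = infer(k)
            td[k] = max(range(len(probs)), key=probs.__getitem__)
        sequence.append(td[k])
    return sequence


def toy_infer(template):
    """Hypothetical classifier: topic 0 for file events, topic 1 otherwise."""
    return [0.9, 0.1] if "file" in template else [0.2, 0.8]


td = {"open file <*>": 0}
seq = to_topic_sequence(
    ["open file <*>", "delete file <*>", "send packet <*>"], td, toy_infer
)
print(seq)
print(td)
```

Because a new template is mapped to an existing topic rather than to a new class, the input and output dimensions of the detection model stay fixed when new templates appear.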

Therefore, the anomaly detection method based on LDA topic features is more robust and easier to implement.

The above description is only an embodiment of the present invention, and is not intended to limit the present invention. Various modifications and alterations to this invention will become apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the scope of the claims of the present invention.

Claims

1. A log anomaly detection method based on LDA topic features, characterized in that the log anomaly detection method comprises the following steps:

Step 1, model training: in the model training stage, first, a log parser is used to parse a system log into a log template set and a log triple set, wherein the log template set is used to train an LDA (latent Dirichlet allocation) model to obtain a log template topic classification model LDA-CM; then, the LDA-CM model is used to convert the log triples into process log template topics, training samples are constructed using a sliding window mechanism, and the training samples are finally input into an LSTM model, which is trained to generate a log anomaly detection model LSTM-ADM;

Step 2, anomaly detection: in the anomaly detection stage, first, the process log to be detected is converted into a corresponding template topic sequence, which is then input into the LSTM-ADM model of Step 1 to perform anomaly detection on the process log.

2. The log anomaly detection method based on LDA topic features according to claim 1, characterized in that the model training of Step 1 specifically comprises the following steps:

Step 1-1: acquiring system log data $L = \{log_1, log_2, \dots, log_n\}$ and the corresponding process set $P = \{p_1, p_2, \dots, p_f\}$, wherein the logs in $L$ are generated by the processes in $P$; parsing the logs in $L$ with the log parser to generate a log template set $K = \{k_1, k_2, \dots, k_m\}$ and a log triple set $D = \{d_1, d_2, \dots, d_n\}$ corresponding to $L$, wherein $d_i$ is the log triple $(k, pid, ts)$ corresponding to $log_i$, $k$ is the log template, $pid$ is the process identifier, and $ts$ is the timestamp at which the log was generated;

Step 1-2: preprocessing the log template set $K$, and inputting the preprocessed data into an LDA model whose preset topic set is $T = \{t_1, t_2, \dots, t_r\}$ for training, to generate the LDA-based log template topic classification model LDA-CM;

Step 1-3: initializing a log template topic mapping dictionary TD, and using the LDA-CM model generated in Step 1-2 to compute the topic probability distribution vector $\Theta_i$ of each log template $k_i$ in the log template set $K$, wherein the dimension of $\Theta_i$ equals the number of topics in $T$ and $\Theta_i[j]$ is the probability that $k_i$ belongs to topic $t_j$; then obtaining the topic $t_x$ corresponding to the maximum probability value in $\Theta_i$, and adding the mapping $\{k_i \to t_x\}$ to TD;

Step 1-4: processing the log triple set $D$ according to the processes in the process set $P$, and using the log template topic mapping dictionary TD to establish the log template topic sequence of each process in the process set $P$, the resulting sequence set being denoted $S = \{S_1, S_2, \dots, S_f\}$;

Step 1-5: using a sliding window mechanism, processing the log template topic sequence $S_i$ of each process $p_i$ in the sequence set $S$ formed in Step 1-4 to generate a training sample set TP;

Step 1-6: training an LSTM model with the training samples in the training sample set TP generated in Step 1-5 to generate the LSTM-based process log anomaly detection model LSTM-ADM.

3. The log anomaly detection method based on LDA topic features according to claim 2, characterized in that the processing of the log triple set $D$ in Step 1-4 comprises the following specific steps:

Step 1-4-1: dividing $D$ into subsets according to the ID of each process in the process set $P$, i.e. the $pid$ in the log triples, and sorting the log triples in each subset by timestamp $ts$, thereby constructing for each process $p_i$ a corresponding log triple sequence $D_i$; then extracting the log template from each log triple in $D_i$ to obtain the log template sequence $\langle k_{i,1}, k_{i,2}, \dots, k_{i,q} \rangle$ of process $p_i$;

Step 1-4-2: for each log template $k_{i,j}$ in the log template sequence of each process $p_i$ in the process set $P$, using the log template topic mapping in the log template topic mapping dictionary TD to determine the topic $t_x$ of $k_{i,j}$, and thereby establishing the log template topic sequence $S_i = \langle t_{i,1}, t_{i,2}, \dots, t_{i,q} \rangle$ for process $p_i$, the log template topic sequences of all processes in $P$ forming the set $S = \{S_1, S_2, \dots, S_f\}$.

4. The log anomaly detection method based on LDA topic features according to claim 2, characterized in that the generation of the training sample set TP in Step 1-5 comprises the following specific steps:

Step 1-5-1: initializing the sliding window length to $h$, the sliding step to 1, and the training sample set TP to empty;

Step 1-5-2: processing the log template topic sequence $S_i = \langle t_{i,1}, t_{i,2}, \dots, t_{i,q} \rangle$ of each process $p_i$ in the sequence set $S$; if the number of log template topics in $S_i$ does not exceed $h$, i.e. $q \le h$, the log template topic window set $W_i$ corresponding to $S_i$ is empty; otherwise, moving the sliding window to construct the log template topic window set $W_i = \{w_{i,1}, w_{i,2}, \dots, w_{i,y}\}$ corresponding to $S_i$, wherein $w_{i,1} = \langle t_{i,1}, t_{i,2}, \dots, t_{i,h} \rangle$, $w_{i,2} = \langle t_{i,2}, t_{i,3}, \dots, t_{i,h+1} \rangle$, …, $w_{i,y} = \langle t_{i,q-h}, t_{i,q-h+1}, \dots, t_{i,q-1} \rangle$; then constructing the training sample pairs $(w_{i,1}, t_{i,h+1})$, $(w_{i,2}, t_{i,h+2})$, …, $(w_{i,y}, t_{i,q})$ and adding these training sample pairs to TP.

5. The log anomaly detection method based on LDA topic features according to claim 1, characterized in that the anomaly detection of Step 2 specifically comprises the following steps:

Step 2-1: letting the log sequence of the process $p$ to be detected be $L_p = \langle log_{p,1}, log_{p,2}, \dots, log_{p,v} \rangle$, and processing the logs in $L_p$ in order with the log parser to generate the log template sequence $K_p = \langle k_{p,1}, k_{p,2}, \dots, k_{p,v} \rangle$ of the process;

Step 2-2: using the log template topic mapping dictionary TD to convert the log template sequence $K_p$ into a log template topic sequence $S_p = \langle t_{p,1}, t_{p,2}, \dots, t_{p,v} \rangle$;

Step 2-3: using a sliding window mechanism, processing the sequence $S_p = \langle t_{p,1}, t_{p,2}, \dots, t_{p,v} \rangle$ obtained in Step 2-2, performing anomaly detection with the LSTM-ADM model, and returning the detection result.

6. The log anomaly detection method based on LDA topic features according to claim 5, characterized in that the conversion of the log template sequence $K_p$ into the log template topic sequence $S_p$ in Step 2-2 comprises the following specific steps:

Step 2-2-1: initializing $S_p$ as an empty sequence;

Step 2-2-2: processing each log template $k_{p,i}$ in $K_p$ in order, and querying the log template topic mapping dictionary TD for a topic mapping $\{k_{p,i} \to t_j\}$ of $k_{p,i}$; if one exists, appending $t_j$ to $S_p$; otherwise, inputting $k_{p,i}$ into the LDA-CM model to obtain its log template topic probability distribution vector $\Theta_{p,i}$, letting the maximum probability value in $\Theta_{p,i}$ be $\Theta_{p,i}[x]$, so that the topic corresponding to log template $k_{p,i}$ is $t_x$, then appending $t_x$ to $S_p$, creating a new log template topic mapping $\{k_{p,i} \to t_x\}$, and adding the mapping to TD; finally obtaining the log template topic sequence $S_p = \langle t_{p,1}, t_{p,2}, \dots, t_{p,v} \rangle$ corresponding to $K_p$.

7. The log anomaly detection method based on LDA topic features according to claim 5, characterized in that the process log anomaly detection performed in Step 2-3 on the log template topic sequence $S_p = \langle t_{p,1}, t_{p,2}, \dots, t_{p,v} \rangle$ of process $p$ comprises the following specific steps:

Step 2-3-1: initializing the sliding window length to $h$ and the sliding step to 1, and setting the log template topic window set $W_p$ to empty and the detection pair set DP to empty;

Step 2-3-2: comparing the number of log template topics in $S_p$ with the sliding window length; if the number of log template topics is not greater than the sliding window length, i.e. $v \le h$, ending anomaly detection for process $p$; otherwise, continuing with the next step;

Step 2-3-3: moving the sliding window to construct the log template topic window set $W_p = \{w_{p,1}, w_{p,2}, \dots, w_{p,y}\}$ corresponding to $S_p$, wherein $w_{p,1} = \langle t_{p,1}, t_{p,2}, \dots, t_{p,h} \rangle$, $w_{p,2} = \langle t_{p,2}, t_{p,3}, \dots, t_{p,h+1} \rangle$, …, $w_{p,y} = \langle t_{p,v-h}, t_{p,v-h+1}, \dots, t_{p,v-1} \rangle$; then constructing the detection pair set $DP = \{(w_{p,1}, t_{p,h+1}), (w_{p,2}, t_{p,h+2}), \dots, (w_{p,y}, t_{p,v})\}$;

Step 2-3-4: for each detection pair $(w_{p,i}, t_{p,h+i})$ in DP, inputting the log template topic window $w_{p,i}$ into the log anomaly detection model LSTM-ADM to obtain the predicted log template topic probability distribution vector $V_{p,i}$ for the log that follows window $w_{p,i}$, wherein the dimension of $V_{p,i}$ equals the number of topics in $T$ and $V_{p,i}[j]$ is the probability that the predicted log template topic of the log following window $w_{p,i}$ is topic $t_j$; then taking the topics corresponding to the $g$ largest probability values in $V_{p,i}$ to form the predicted log template topic set $CS_{p,i}$; if $t_{p,h+i} \notin CS_{p,i}$, process $p$ is abnormal and detection ends;

Step 2-3-5: when all detection pairs in DP have been processed and no anomaly has been detected, process $p$ has no anomaly and detection ends.