CN114969761A - Log anomaly detection method based on LDA theme characteristics - Google Patents

Info

Publication number
CN114969761A
Authority
CN
China
Prior art keywords
log
template
topic
model
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210689100.4A
Other languages
Chinese (zh)
Inventor
戴华
孙雪奎
周建国
周倩
杨庚
陈燕俐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN202210689100.4A
Publication of CN114969761A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50: Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/57: Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
    • G06F21/577: Assessing vulnerabilities and evaluating computer system security
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50: Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55: Detecting local intrusion or implementing counter-measures
    • G06F21/552: Detecting local intrusion or implementing counter-measures involving long-term monitoring or reporting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50: Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55: Detecting local intrusion or implementing counter-measures
    • G06F21/56: Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/566: Dynamic detection, i.e. detection performed at run-time, e.g. emulation, suspicious activities
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/044: Recurrent networks, e.g. Hopfield networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F2221/00: Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/03: Indexing scheme relating to G06F21/50, monitoring users, programs or devices to maintain the integrity of platforms
    • G06F2221/033: Test or assess software

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Virology (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a log anomaly detection method based on LDA topic features. The method comprises two stages: model training and anomaly detection. In the model training stage, a log parser parses the system log into a log template set and a log triple set. The log template set is used to train an LDA (latent Dirichlet allocation) model, yielding a log template topic classification model, LDA-CM. The LDA-CM model then converts the log triples into per-process log template topic sequences, a sliding window mechanism constructs training samples from these sequences, and the samples are input into an LSTM (long short-term memory) model, which is trained to generate the log anomaly detection model, LSTM-ADM. In the anomaly detection stage, the process log under test is converted into its corresponding template topic sequence, which is then input into the LSTM-ADM model to detect anomalies in the process log.

Description

Log anomaly detection method based on LDA theme characteristics
Technical Field
The invention belongs to the technical field of security, and in particular relates to a log anomaly detection method based on LDA topic features.
Background
In the field of system security, detecting software or system anomalies through logs is a common protection measure. Bugs inevitably exist in software of every scale, from small, simple programs to large, complex systems such as distributed file systems and high-performance cloud computing management platforms, and these bugs can cause the system to behave abnormally. Furthermore, an attacker may exploit vulnerabilities in software and systems to launch attacks that compromise the system. Timely and accurate detection of these anomalies is therefore crucial to building a secure and trusted system. However, existing anomaly detection methods cannot accurately learn the semantic differences between normal and abnormal logs, so they generalize poorly and do not perform well in practice.
Logs are a common and major data source for anomaly detection in almost all computer systems; they record the significant events that describe the running state of software and systems. Existing methods that analyze system logs for anomaly detection fall into four categories: detection based on principal component analysis (PCA) over log counts, detection based on invariant mining (IM), which captures co-occurrence patterns among logs, workflow-based detection, and deep-learning-based detection. The first three categories can achieve good results in specific application scenarios, but they do not carry over to different kinds of attacks. The last category, deep learning, classifies over log templates to learn the behavior patterns within a log sequence. Current deep-learning-based methods cannot accurately learn the semantic relations among logs, and their stability suffers greatly when a new log template is injected, to the point that the model may fail. In addition, their precision, recall, and F-score (the harmonic mean of precision and recall) need further improvement to cope with complex and changing software and systems.
Disclosure of Invention
To learn the semantic relations among logs more accurately and to detect abnormal behavior of a process or system more effectively from unstructured log records, the invention provides a log anomaly detection method based on LDA topic features.
To achieve this purpose, the invention adopts the following technical solution:
The log anomaly detection method based on LDA topic features mainly comprises two stages. The first is the model training stage, in which training samples are constructed by extracting the template topic features of log data and an anomaly detection model is then trained. The second is the anomaly detection stage, in which the anomaly detection model is used to detect anomalies in process logs.
Model training stage:
(1) Acquire the system log data $L = \{log_1, log_2, \ldots, log_n\}$ and the corresponding process set $P = \{p_1, p_2, \ldots, p_f\}$, where the logs in $L$ are generated by the processes in $P$. Parse the logs in $L$ with a log parser to generate a log template set $K = \{k_1, k_2, \ldots, k_m\}$ and a log triple set $D = \{d_1, d_2, \ldots, d_n\}$, where $d_i$ is the log triple $(k, pid, ts)$ corresponding to $log_i$, $k$ is the log template, $pid$ is the process identifier, and $ts$ is the timestamp at which the log was generated.
(2) Preprocess the log template set $K$ and input the preprocessed data into an LDA model whose preset topic set is $T = \{t_1, t_2, \ldots, t_r\}$; training generates the LDA-based log template topic classification model LDA-CM.
(3) Initialize a log template topic mapping dictionary TD. Use the LDA-CM model to compute the topic probability distribution vector $\Theta_i$ of each log template $k_i$ in $K$. The length of $\Theta_i$ equals the number of topics in $T$, and $\Theta_i[j]$ is the probability that $k_i$ belongs to topic $t_j$. Then find the topic $t_x$ corresponding to the maximum probability value in $\Theta_i$ and add the mapping $\{k_i \to t_x\}$ to TD.
(4) Process the log triple set $D$ according to the processes in $P$, and use TD to build the log template topic sequence of each process in $P$; the resulting sequence set is denoted $S = \{S_1, S_2, \ldots, S_f\}$. The specific steps are as follows:
(4a) Divide $D$ into subsets by the ID of each process in $P$, i.e., by the $pid$ in the log triples, and sort the log triples in each subset by timestamp $ts$, so that each process $p_i$ obtains a corresponding log triple sequence $D_i$. Then extract the log template from each log triple in $D_i$ to obtain the log template sequence $\langle k_{i,1}, k_{i,2}, \ldots, k_{i,q} \rangle$ of process $p_i$.
(4b) For each log template $k_{i,j}$ in the log template sequence of each process $p_i$ in $P$, use the log template topic mapping in TD to determine the topic $t_x$ of $k_{i,j}$, and thereby build the log template topic sequence $S_i = \langle t_{i,1}, t_{i,2}, \ldots, t_{i,q} \rangle$ of process $p_i$. The log template topic sequences of all processes in $P$ form the set $S = \{S_1, S_2, \ldots, S_f\}$.
(5) Using a sliding window mechanism, process the log template topic sequence $S_i$ of each process $p_i$ in $S$ to generate a training sample set TP. The specific steps are as follows:
(5a) Initialize the sliding window length to $h$, the sliding step to 1, and the training sample set TP to empty.
(5b) Process the log template topic sequence $S_i = \langle t_{i,1}, t_{i,2}, \ldots, t_{i,q} \rangle$ of each process $p_i$ in $S$. If the number of log template topics in $S_i$ is not greater than $h$, i.e., $q \le h$, the log template topic window set $W_i$ corresponding to $S_i$ is empty. Otherwise, move the sliding window to construct the window set $W_i = \{w_{i,1}, w_{i,2}, \ldots, w_{i,y}\}$ corresponding to $S_i$, where $w_{i,1} = \langle t_{i,1}, t_{i,2}, \ldots, t_{i,h} \rangle$, $w_{i,2} = \langle t_{i,2}, t_{i,3}, \ldots, t_{i,h+1} \rangle$, …, $w_{i,y} = \langle t_{i,q-h}, t_{i,q-h+1}, \ldots, t_{i,q-1} \rangle$; then construct the training sample pairs $(w_{i,1}, t_{i,h+1})$, $(w_{i,2}, t_{i,h+2})$, …, $(w_{i,y}, t_{i,q})$ and add them to TP.
(6) Train an LSTM model with the training samples in TP to generate the LSTM-based process log anomaly detection model LSTM-ADM.
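Steps (4) and (5) can be sketched in a few lines of Python. This is only an illustrative sketch, not the patent's implementation: the function names, the representation of topics as integers, and the toy templates `"A"`, `"B"`, `"C"` are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Triple = Tuple[str, int, float]       # (template k, process id pid, timestamp ts)
Sample = Tuple[Tuple[int, ...], int]  # (window w, topic that follows the window)


def build_topic_sequences(triples: Sequence[Triple], td: Dict[str, int]) -> Dict[int, List[int]]:
    """Step (4): split D by pid, sort each subset by ts, map templates to topics via TD."""
    by_pid: Dict[int, List[Triple]] = defaultdict(list)
    for triple in triples:
        by_pid[triple[1]].append(triple)
    return {
        pid: [td[k] for k, _, _ in sorted(group, key=lambda d: d[2])]
        for pid, group in by_pid.items()
    }


def sliding_window_pairs(seq: Sequence[int], h: int) -> List[Sample]:
    """Step (5): windows of length h, step 1, each paired with the next topic."""
    if len(seq) <= h:  # q <= h: the window set is empty
        return []
    return [(tuple(seq[j:j + h]), seq[j + h]) for j in range(len(seq) - h)]


# Toy data, made up for illustration only.
td = {"A": 0, "B": 1, "C": 2}
triples = [("A", 1, 3.0), ("B", 1, 1.0), ("A", 2, 2.0), ("C", 1, 2.0), ("B", 1, 4.0)]
sequences = build_topic_sequences(triples, td)
tp = [pair for seq in sequences.values() for pair in sliding_window_pairs(seq, h=2)]
```

With window length `h`, a sequence of `q` topics yields `q - h` sample pairs, which matches the last window ending at $t_{i,q-1}$ and the last label being $t_{i,q}$.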
Anomaly detection stage:
(1) Let the log sequence of the process $p$ under test be $L_p = \langle log_{p,1}, log_{p,2}, \ldots, log_{p,v} \rangle$. Process the logs in $L_p$ in order with the log parser to generate the log template sequence $K_p = \langle k_{p,1}, k_{p,2}, \ldots, k_{p,v} \rangle$ of the process.
(2) Use the log template topic mapping dictionary TD to convert $K_p$ into the log template topic sequence $S_p$. The specific steps are as follows:
(2a) Initialize $S_p$ as an empty sequence.
(2b) Process each log template $k_{p,i}$ in $K_p$ in order. Check whether TD contains a topic mapping $\{k_{p,i} \to t_j\}$ for $k_{p,i}$. If it does, append $t_j$ to $S_p$. Otherwise, input $k_{p,i}$ into the LDA-CM model to obtain its log template topic probability distribution vector $\Theta_{p,i}$; let the maximum probability value in $\Theta_{p,i}$ be $\Theta_{p,i}[x]$, so that the topic of log template $k_{p,i}$ is $t_x$; then append $t_x$ to $S_p$, create the new log template topic mapping $\{k_{p,i} \to t_x\}$, and add it to TD. The result is the log template topic sequence $S_p = \langle t_{p,1}, t_{p,2}, \ldots, t_{p,v} \rangle$ corresponding to $K_p$.
(3) Using a sliding window mechanism, process $S_p$, perform anomaly detection with the LSTM-ADM model, and return the detection result. The specific steps are as follows:
(3a) Initialize the sliding window length to $h$ and the sliding step to 1; set the log template topic window set $W_p$ to empty and the detection pair set DP to empty.
(3b) Compare the number of log template topics in $S_p$ with the sliding window length. If the former is not greater than the latter, i.e., $v \le h$, the anomaly detection of process $p$ ends; otherwise, continue to the next step.
(3c) Move the sliding window to construct the log template topic window set $W_p = \{w_{p,1}, w_{p,2}, \ldots, w_{p,y}\}$ corresponding to $S_p$, where $w_{p,1} = \langle t_{p,1}, t_{p,2}, \ldots, t_{p,h} \rangle$, $w_{p,2} = \langle t_{p,2}, t_{p,3}, \ldots, t_{p,h+1} \rangle$, …, $w_{p,y} = \langle t_{p,v-h}, t_{p,v-h+1}, \ldots, t_{p,v-1} \rangle$; then construct the detection pair set $DP = \{(w_{p,1}, t_{p,h+1}), (w_{p,2}, t_{p,h+2}), \ldots, (w_{p,y}, t_{p,v})\}$.
(3d) For each detection pair $(w_{p,i}, t_{p,h+i})$ in DP, input the log template topic window $w_{p,i}$ into LSTM-ADM to obtain the predicted log template topic probability distribution vector $V_{p,i}$ for the log that follows window $w_{p,i}$. The length of $V_{p,i}$ equals the number of topics in $T$, and $V_{p,i}[j]$ is the probability that the log template topic of the log following $w_{p,i}$ is $t_j$. Then take the topics corresponding to the $g$ largest probability values in $V_{p,i}$ to form the predicted log template topic set $CS_{p,i}$. If $t_{p,h+i} \notin CS_{p,i}$, process $p$ is abnormal and detection ends.
(3e) When all detection pairs in DP have been processed and no anomaly has been detected, process $p$ is normal and detection ends.
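The detection loop of step (3) can be sketched as follows. This is a hedged illustration rather than the patent's code: `predict` stands in for the trained LSTM-ADM model, `toy_predict` is a made-up substitute so the example runs without a neural network, and returning `False` for a sequence that is too short is an assumption (the text only says that detection ends in that case).

```python
from typing import Callable, List, Sequence, Set


def top_g_topics(probs: Sequence[float], g: int) -> Set[int]:
    """Indices of the g largest probabilities, i.e. the candidate set CS."""
    ranked = sorted(range(len(probs)), key=lambda j: probs[j], reverse=True)
    return set(ranked[:g])


def is_abnormal(topic_seq: Sequence[int], h: int, g: int,
                predict: Callable[[Sequence[int]], List[float]]) -> bool:
    """Flag the process as soon as an observed topic is outside the top-g prediction."""
    if len(topic_seq) <= h:  # v <= h: nothing to check (assumed to mean "no anomaly")
        return False
    for i in range(len(topic_seq) - h):
        window, actual = topic_seq[i:i + h], topic_seq[i + h]
        if actual not in top_g_topics(predict(window), g):
            return True
    return False


def toy_predict(window: Sequence[int]) -> List[float]:
    """Hypothetical stand-in for LSTM-ADM: expects topics to cycle 0 -> 1 -> 2 -> 0."""
    probs = [0.1, 0.1, 0.1]
    probs[(window[-1] + 1) % 3] = 0.8
    return probs


normal = is_abnormal([0, 1, 2, 0, 1], h=2, g=1, predict=toy_predict)
abnormal = is_abnormal([0, 1, 2, 2, 1], h=2, g=1, predict=toy_predict)
```

A larger `g` makes the check more permissive: with `g` equal to the number of topics, every observed topic falls inside the candidate set and no process is ever flagged.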
The invention has the beneficial effects that:
The invention proposes, for the first time, a log anomaly detection method based on LDA topic features. The method extracts the features of logs and converts the logs into log template topics, thereby overcoming the shortcomings of existing log-template-based anomaly detection methods.
The LDA topic model used by the method is unsupervised: it needs only the log template data as a corpus and a specified number of topics, so the LDA-CM topic classification model can be trained without labels, which makes the method easy to implement.
In addition, the LDA-CM topic classification model can match a newly added log template to its most relevant log template topic, which addresses the robustness problem that existing methods face when a new log template is injected.
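The handling of unseen templates in step (2b) of the detection stage amounts to a dictionary lookup with a classifier fallback. The sketch below illustrates that idea only; `topic_distribution` is a hypothetical callable standing in for the LDA-CM model, and `fake_lda_cm` returns a fixed, made-up distribution.

```python
from typing import Callable, Dict, List, Sequence


def templates_to_topics(templates: Sequence[str], td: Dict[str, int],
                        topic_distribution: Callable[[str], List[float]]) -> List[int]:
    """Map each template to a topic via TD; classify and cache templates TD has not seen."""
    topics: List[int] = []
    for k in templates:
        if k not in td:
            theta = topic_distribution(k)  # topic probability vector for k
            td[k] = max(range(len(theta)), key=theta.__getitem__)  # index of the max
        topics.append(td[k])
    return topics


calls: List[str] = []


def fake_lda_cm(template: str) -> List[float]:
    """Hypothetical stand-in for LDA-CM; records which templates it was asked about."""
    calls.append(template)
    return [0.1, 0.7, 0.2]


td = {"open file <*>": 0}
topics = templates_to_topics(
    ["open file <*>", "close socket <*>", "close socket <*>"], td, fake_lda_cm
)
```

Because the new mapping is written back to TD, the classifier is consulted once per new template; later occurrences of the same template are resolved by the dictionary.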
Drawings
FIG. 1 is a diagram of log data preprocessing according to the present invention.
FIG. 2 is a flowchart of an overall framework of the LDA topic feature-based log anomaly detection method of the present invention.
Detailed Description
In the following description, for purposes of explanation, numerous implementation details are set forth in order to provide a thorough understanding of the embodiments of the invention. It should be understood, however, that these implementation details are not to be interpreted as limiting the invention. That is, in some embodiments of the invention, such implementation details are not necessary.
As shown in FIG. 2, the invention is a log anomaly detection method based on LDA topic features, which comprises two stages:
Model training stage: a log parser first parses the system log into a log template set and a log triple set, where the log template set is used to train an LDA (latent Dirichlet allocation) model to obtain the log template topic classification model LDA-CM. The LDA-CM model then converts the log triples into process log template topic sequences, a sliding window mechanism constructs training samples, and the training samples are finally input into an LSTM model, which is trained to generate the log anomaly detection model LSTM-ADM.
Anomaly detection stage: the process log under test is first converted into its corresponding template topic sequence, which is then input into the LSTM-ADM model obtained in the model training stage to detect anomalies in the process log.
The invention is further illustrated below with reference to the accompanying figures 1-2.
The log anomaly detection method based on LDA topic features provided by the invention comprises the following steps.
In the model training stage:
(1) Acquire the system log data $L = \{log_1, log_2, \ldots, log_n\}$ and the process set $P = \{p_1, p_2, \ldots, p_f\}$, where the logs in $L$ are generated by the processes in $P$. Parse the logs in $L$ with a log parser to generate a log template set $K = \{k_1, k_2, \ldots, k_m\}$ and a log triple set $D = \{d_1, d_2, \ldots, d_n\}$, where $d_i$ is the log triple $(k, pid, ts)$ corresponding to $log_i$, $k$ is the log template, $pid$ is the process identifier, and $ts$ is the timestamp at which the log was generated.
(2) Split each log template in the log template set $K$ into words to obtain a word list WL, convert each word in WL to lowercase, and filter out stop words and identifiers that carry no semantics. Then convert the word list into a corpus using a bag-of-words model and add it to a corpus list CL. Finally, input CL into an LDA model whose preset topic set is $T = \{t_1, t_2, \ldots, t_r\}$; training generates the LDA-based log template topic classification model LDA-CM.
(3) Initialize a log template topic mapping dictionary TD. Use the LDA-CM model to compute the topic probability distribution vector $\Theta_i$ of each log template $k_i$ in $K$. The length of $\Theta_i$ equals the number of topics in $T$, and $\Theta_i[j]$ is the probability that $k_i$ belongs to topic $t_j$. Then find the topic $t_x$ corresponding to the maximum probability value in $\Theta_i$ and add the mapping $\{k_i \to t_x\}$ to TD.
(4) Process the log triple set $D$ according to the processes in $P$, and use TD to build the log template topic sequence of each process in $P$; the resulting sequence set is denoted $S = \{S_1, S_2, \ldots, S_f\}$. The specific steps are as follows:
(4a) Divide $D$ into subsets by the ID of each process in $P$, i.e., by the $pid$ in the log triples, and sort the log triples in each subset by timestamp $ts$, so that each process $p_i$ obtains a corresponding log triple sequence $D_i$. Then extract the log template from each log triple in $D_i$ to obtain the log template sequence $\langle k_{i,1}, k_{i,2}, \ldots, k_{i,q} \rangle$ of process $p_i$.
(4b) For each log template $k_{i,j}$ in the log template sequence of each process $p_i$ in $P$, use the log template topic mapping in TD to determine the topic $t_x$ of $k_{i,j}$, and thereby build the log template topic sequence $S_i = \langle t_{i,1}, t_{i,2}, \ldots, t_{i,q} \rangle$ of process $p_i$. The log template topic sequences of all processes in $P$ form the set $S = \{S_1, S_2, \ldots, S_f\}$.
(5) Using a sliding window mechanism, process the log template topic sequence $S_i$ of each process $p_i$ in $S$ to generate a training sample set TP. The specific steps are as follows:
(5a) Initialize the sliding window length to $h$, the sliding step to 1, and the training sample set TP to empty.
(5b) Process the log template topic sequence $S_i = \langle t_{i,1}, t_{i,2}, \ldots, t_{i,q} \rangle$ of each process $p_i$ in $S$. If the number of log template topics in $S_i$ is not greater than $h$, i.e., $q \le h$, the log template topic window set $W_i$ corresponding to $S_i$ is empty. Otherwise, move the sliding window to construct the window set $W_i = \{w_{i,1}, w_{i,2}, \ldots, w_{i,y}\}$ corresponding to $S_i$, where $w_{i,1} = \langle t_{i,1}, t_{i,2}, \ldots, t_{i,h} \rangle$, $w_{i,2} = \langle t_{i,2}, t_{i,3}, \ldots, t_{i,h+1} \rangle$, …, $w_{i,y} = \langle t_{i,q-h}, t_{i,q-h+1}, \ldots, t_{i,q-1} \rangle$; then construct the training sample pairs $(w_{i,1}, t_{i,h+1})$, $(w_{i,2}, t_{i,h+2})$, …, $(w_{i,y}, t_{i,q})$ and add them to TP.
(6) Train an LSTM model with the training samples in TP to generate the LSTM-based process log anomaly detection model LSTM-ADM.
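The preprocessing in step (2) can be sketched in plain Python, assuming the corpus conversion in that step is a bag-of-words representation. Everything here is illustrative: the stop-word list, the regular expression used to drop non-alphabetic tokens such as the `<*>` wildcard, and the two example templates are assumptions, not details given in the text. The resulting list of `(word id, count)` pairs is the kind of input that a typical LDA implementation accepts.

```python
import re
from collections import Counter
from typing import Dict, List, Sequence, Tuple

# Illustrative stop-word list; a real deployment would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "to", "for", "in", "on", "is"}


def tokenize(template: str) -> List[str]:
    """Split a template into words, lowercase them, drop stop words and non-word tokens."""
    words = re.findall(r"[A-Za-z]+", template)  # discards wildcards, digits, punctuation
    return [w.lower() for w in words if w.lower() not in STOP_WORDS]


def build_corpus(templates: Sequence[str]) -> Tuple[Dict[str, int], List[List[Tuple[int, int]]]]:
    """Return a word-to-id vocabulary and one bag-of-words vector per template."""
    vocab: Dict[str, int] = {}
    corpus: List[List[Tuple[int, int]]] = []
    for template in templates:
        counts = Counter(tokenize(template))
        for word in counts:
            vocab.setdefault(word, len(vocab))  # assign ids in order of first appearance
        corpus.append(sorted((vocab[w], c) for w, c in counts.items()))
    return vocab, corpus


# Made-up templates in the style of a log parser's output.
templates = ["Receiving block <*> of size <*>", "Deleting block <*> file <*>"]
vocab, cl = build_corpus(templates)
```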
In the anomaly detection stage:
(1) Let the log sequence of the process $p$ under test be $L_p = \langle log_{p,1}, log_{p,2}, \ldots, log_{p,v} \rangle$. Process the logs in $L_p$ in order with the log parser to generate the log template sequence $K_p = \langle k_{p,1}, k_{p,2}, \ldots, k_{p,v} \rangle$ of the process.
(2) Use the log template topic mapping dictionary TD to convert $K_p$ into the log template topic sequence $S_p$. The specific steps are as follows:
(2a) Initialize $S_p$ as an empty sequence.
(2b) Process each log template $k_{p,i}$ in $K_p$ in order. Check whether TD contains a topic mapping $\{k_{p,i} \to t_j\}$ for $k_{p,i}$. If it does, append $t_j$ to $S_p$. Otherwise, input $k_{p,i}$ into the LDA-CM model to obtain its log template topic probability distribution vector $\Theta_{p,i}$; let the maximum probability value in $\Theta_{p,i}$ be $\Theta_{p,i}[x]$, so that the topic of log template $k_{p,i}$ is $t_x$; then append $t_x$ to $S_p$, create the new log template topic mapping $\{k_{p,i} \to t_x\}$, and add it to TD. The result is the log template topic sequence $S_p = \langle t_{p,1}, t_{p,2}, \ldots, t_{p,v} \rangle$ corresponding to $K_p$.
(3) Using a sliding window mechanism, process $S_p$, perform anomaly detection with the LSTM-ADM model, and return the detection result. The specific steps are as follows:
(3a) Initialize the sliding window length to $h$ and the sliding step to 1; set the log template topic window set $W_p$ to empty and the detection pair set DP to empty.
(3b) Compare the number of log template topics in $S_p$ with the sliding window length. If the former is not greater than the latter, i.e., $v \le h$, the anomaly detection of process $p$ ends; otherwise, continue to the next step.
(3c) Move the sliding window to construct the log template topic window set $W_p = \{w_{p,1}, w_{p,2}, \ldots, w_{p,y}\}$ corresponding to $S_p$, where $w_{p,1} = \langle t_{p,1}, t_{p,2}, \ldots, t_{p,h} \rangle$, $w_{p,2} = \langle t_{p,2}, t_{p,3}, \ldots, t_{p,h+1} \rangle$, …, $w_{p,y} = \langle t_{p,v-h}, t_{p,v-h+1}, \ldots, t_{p,v-1} \rangle$; then generate the detection pair set $DP = \{(w_{p,1}, t_{p,h+1}), (w_{p,2}, t_{p,h+2}), \ldots, (w_{p,y}, t_{p,v})\}$.
(3d) For each detection pair $(w_{p,i}, t_{p,h+i})$ in DP, input the log template topic window $w_{p,i}$ into LSTM-ADM to obtain the predicted log template topic probability distribution vector $V_{p,i}$ for the log that follows window $w_{p,i}$. The length of $V_{p,i}$ equals the number of topics in $T$, and $V_{p,i}[j]$ is the probability that the log template topic of the log following $w_{p,i}$ is $t_j$. Then take the topics corresponding to the $g$ largest probability values in $V_{p,i}$ to form the predicted log template topic set $CS_{p,i}$. If $t_{p,h+i} \notin CS_{p,i}$, process $p$ is abnormal and detection ends.
(3e) When all detection pairs in DP have been processed and no anomaly has been detected, process $p$ is normal and detection ends.
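Read together, steps (3b) to (3e) describe a single decision rule. The block below is one way to state it compactly; it is a restatement under the assumption that the window count is $y = v - h$ (which follows from the last window ending at $t_{p,v-1}$) and that a process with $v \le h$ is simply not flagged.

```latex
\[
CS_{p,i} = \{\, t_j \in T \;:\; V_{p,i}[j] \text{ is among the } g \text{ largest entries of } V_{p,i} \,\},
\qquad i = 1, \ldots, y, \quad y = v - h,
\]
\[
p \text{ is abnormal} \iff v > h \ \text{ and } \ \exists\, i \in \{1, \ldots, y\} :\; t_{p,h+i} \notin CS_{p,i}.
\]
```

For example, with $v = 5$ and $h = 3$ there are $y = 2$ detection pairs, $(w_{p,1}, t_{p,4})$ and $(w_{p,2}, t_{p,5})$, and the process is flagged if either $t_{p,4} \notin CS_{p,1}$ or $t_{p,5} \notin CS_{p,2}$.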
The anomaly detection method based on LDA topic features is therefore more robust and easier to implement.
The above description is only an embodiment of the present invention, and is not intended to limit the present invention. Various modifications and alterations to this invention will become apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the scope of the claims of the present invention.

Claims (7)

1. A log anomaly detection method based on LDA topic features, characterized in that the log anomaly detection method comprises the following steps:
Step 1, model training: in the model training stage, a log parser first parses the system log into a log template set and a log triple set, wherein the log template set is used to train an LDA (latent Dirichlet allocation) model to obtain the log template topic classification model LDA-CM; the LDA-CM model then converts the log triples into process log template topic sequences, a sliding window mechanism constructs training samples, and the training samples are finally input into an LSTM model, which is trained to generate the log anomaly detection model LSTM-ADM;
Step 2, anomaly detection: in the anomaly detection stage, the process log under test is first converted into its corresponding template topic sequence, which is then input into the LSTM-ADM model of step 1 to detect anomalies in the process log.
2. The log anomaly detection method based on LDA topic features according to claim 1, characterized in that the model training of step 1 specifically comprises the following steps:
Step 1-1: acquire the system log data $L = \{log_1, log_2, \ldots, log_n\}$ and the corresponding process set $P = \{p_1, p_2, \ldots, p_f\}$, wherein the logs in $L$ are generated by the processes in $P$; parse the logs in $L$ with a log parser to generate a log template set $K = \{k_1, k_2, \ldots, k_m\}$ and a log triple set $D = \{d_1, d_2, \ldots, d_n\}$ corresponding to $L$, wherein $d_i$ is the log triple $(k, pid, ts)$ corresponding to $log_i$, $k$ is the log template, $pid$ is the process identifier, and $ts$ is the timestamp at which the log was generated;
Step 1-2: preprocess the log template set $K$ and input the preprocessed data into an LDA model whose preset topic set is $T = \{t_1, t_2, \ldots, t_r\}$; training generates the LDA-based log template topic classification model LDA-CM;
Step 1-3: initialize a log template topic mapping dictionary TD, and use the LDA-CM model generated in step 1-2 to compute the topic probability distribution vector $\Theta_i$ of each log template $k_i$ in the log template set $K$, wherein the length of $\Theta_i$ equals the number of topics in $T$ and $\Theta_i[j]$ is the probability that $k_i$ belongs to topic $t_j$; then find the topic $t_x$ corresponding to the maximum probability value in $\Theta_i$ and add the mapping $\{k_i \to t_x\}$ to TD;
Step 1-4: process the log triple set $D$ according to the processes in the process set $P$, and use the log template topic mapping dictionary TD to build the log template topic sequence of each process in $P$, the resulting sequence set being denoted $S = \{S_1, S_2, \ldots, S_f\}$;
Step 1-5: using a sliding window mechanism, process the log template topic sequence $S_i$ of each process $p_i$ in the sequence set $S$ formed in step 1-4 to generate a training sample set TP;
Step 1-6: train an LSTM model with the training samples in the training sample set TP generated in step 1-5 to generate the LSTM-based process log anomaly detection model LSTM-ADM.
3. The log anomaly detection method based on LDA topic features according to claim 2, characterized in that the processing of the log triple set $D$ in step 1-4 comprises the following specific steps:
Step 1-4-1: divide $D$ into subsets by the ID of each process in the process set $P$, i.e., by the $pid$ in the log triples, and sort the log triples in each subset by timestamp $ts$, thereby constructing a corresponding log triple sequence $D_i$ for each process $p_i$; then extract the log template from each log triple in $D_i$ to obtain the log template sequence $\langle k_{i,1}, k_{i,2}, \ldots, k_{i,q} \rangle$ of process $p_i$;
Step 1-4-2: for each log template $k_{i,j}$ in the log template sequence of each process $p_i$ in the process set $P$, use the log template topic mapping in the log template topic mapping dictionary TD to determine the topic $t_x$ of $k_{i,j}$, and thereby build the log template topic sequence $S_i = \langle t_{i,1}, t_{i,2}, \ldots, t_{i,q} \rangle$ of process $p_i$; the log template topic sequences of all processes in $P$ form the set $S = \{S_1, S_2, \ldots, S_f\}$.
4. The log anomaly detection method based on LDA topic features according to claim 2, characterized in that the generation of the training sample set TP in step 1-5 comprises the following specific steps:
Step 1-5-1: initialize the sliding window length to $h$, the sliding step to 1, and the training sample set TP to empty;
Step 1-5-2: process the log template topic sequence $S_i = \langle t_{i,1}, t_{i,2}, \ldots, t_{i,q} \rangle$ of each process $p_i$ in the sequence set $S$; if the number of log template topics in $S_i$ is not greater than $h$, i.e., $q \le h$, the log template topic window set $W_i$ corresponding to $S_i$ is empty; otherwise, move the sliding window to construct the window set $W_i = \{w_{i,1}, w_{i,2}, \ldots, w_{i,y}\}$ corresponding to $S_i$, wherein $w_{i,1} = \langle t_{i,1}, t_{i,2}, \ldots, t_{i,h} \rangle$, $w_{i,2} = \langle t_{i,2}, t_{i,3}, \ldots, t_{i,h+1} \rangle$, …, $w_{i,y} = \langle t_{i,q-h}, t_{i,q-h+1}, \ldots, t_{i,q-1} \rangle$; then construct the training sample pairs $(w_{i,1}, t_{i,h+1})$, $(w_{i,2}, t_{i,h+2})$, …, $(w_{i,y}, t_{i,q})$ and add them to TP.
5. The log anomaly detection method based on LDA topic features according to claim 1, characterized in that the anomaly detection of step 2 specifically comprises the following steps:
Step 2-1: let the log sequence of the process $p$ under test be $L_p = \langle log_{p,1}, log_{p,2}, \ldots, log_{p,v} \rangle$; process the logs in $L_p$ in order with the log parser to generate the log template sequence $K_p = \langle k_{p,1}, k_{p,2}, \ldots, k_{p,v} \rangle$ of the process;
Step 2-2: use the log template topic mapping dictionary TD to convert the log template sequence $K_p$ into the log template topic sequence $S_p = \langle t_{p,1}, t_{p,2}, \ldots, t_{p,v} \rangle$;
Step 2-3: using a sliding window mechanism, process the sequence $S_p = \langle t_{p,1}, t_{p,2}, \ldots, t_{p,v} \rangle$ of step 2-2, perform anomaly detection with the LSTM-ADM model, and return the detection result.
6. The log anomaly detection method based on LDA topic features according to claim 5, wherein step 2-2 converts the log template sequence K_p into the log template topic sequence S_p by the following specific steps:
Step 2-2-1: Initialize S_p as an empty sequence;
Step 2-2-2: Process each log template k_{p,i} in K_p in order: query the log template topic mapping dictionary TD for a topic mapping {k_{p,i} → t_j}; if one exists, append t_j to S_p; otherwise, input k_{p,i} into the LDA-CM model to obtain the log template topic probability distribution vector Θ_{p,i} of k_{p,i}, find the maximum value in Θ_{p,i}, denoted Θ_{p,i}[x], so that the topic corresponding to log template k_{p,i} is t_x, append t_x to S_p, and at the same time create a new log template topic mapping k_{p,i} → t_x and add it to TD; this finally yields the log template topic sequence S_p = <t_{p,1}, t_{p,2}, …, t_{p,v}> corresponding to K_p.
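The lookup-with-fallback logic of step 2-2-2 might look like the sketch below. The LDA-CM model is represented by a hypothetical callable infer_topic_distribution that returns a list of probabilities; that callable, the function name, and the use of integer topic indices are assumptions, not the patent's interface.

```python
def to_topic_sequence(template_seq, topic_dict, infer_topic_distribution):
    """Convert a template sequence K_p into a topic sequence S_p.

    topic_dict               -- stand-in for TD; updated in place
    infer_topic_distribution -- hypothetical stand-in for the LDA-CM model:
                                takes a template, returns a probability list
    """
    topic_seq = []  # Step 2-2-1: S_p starts empty.
    for template in template_seq:
        if template not in topic_dict:
            # Unknown template: ask the model and take the most likely topic.
            theta = infer_topic_distribution(template)
            best_topic = max(range(len(theta)), key=theta.__getitem__)
            # Record the new mapping so later occurrences hit the dictionary.
            topic_dict[template] = best_topic
        topic_seq.append(topic_dict[template])
    return topic_seq
```

Because new mappings are written back to the dictionary, the model is consulted at most once per previously unseen template.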
7. The log anomaly detection method based on LDA topic features according to claim 5, wherein step 2-3 performs process log anomaly detection on the log template topic sequence S_p = <t_{p,1}, t_{p,2}, …, t_{p,v}> of process p by the following specific steps:
Step 2-3-1: Initialize the sliding window length to h and the sliding step to 1, and set the log template topic window set W_p and the detection pair set DP to empty;
Step 2-3-2: Check whether the number of log template topics in S_p does not exceed the sliding window length, i.e. v ≤ h; if so, anomaly detection for process p ends; otherwise, continue with the next step;
Step 2-3-3: Move the sliding window to construct the log template topic window set W_p = {w_{p,1}, w_{p,2}, …, w_{p,y}} corresponding to S_p, where w_{p,1} = <t_{p,1}, t_{p,2}, …, t_{p,h}>, w_{p,2} = <t_{p,2}, t_{p,3}, …, t_{p,h+1}>, …, w_{p,y} = <t_{p,v-h}, t_{p,v-h+1}, …, t_{p,v-1}>; then construct the detection pair set DP = {(w_{p,1}, t_{p,h+1}), (w_{p,2}, t_{p,h+2}), …, (w_{p,y}, t_{p,v})};
Step 2-3-4: For each detection pair (w_{p,i}, t_{p,h+i}) in DP, input the log template topic window w_{p,i} into the log anomaly detection model LSTM-ADM to obtain the predicted log template topic probability distribution vector V_{p,i} for the log following window w_{p,i}; the dimension of V_{p,i} equals the number of topics in T, and V_{p,i}[j] is the probability that the log template topic of the log following window w_{p,i} is topic t_j; then take the topics corresponding to the g largest probability values in V_{p,i} to form the predicted log template topic set CS_{p,i}; if t_{p,h+i} ∉ CS_{p,i}, process p is abnormal and detection ends;
Step 2-3-5: When all detection pairs in DP have been processed and no anomaly has been detected, process p is not abnormal and detection ends.
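The detection loop of steps 2-3-2 to 2-3-5 can be sketched as follows. The LSTM-ADM model is replaced by a hypothetical callable predict that returns one probability per topic; toy_predict is a made-up stand-in used only to exercise the loop. Returning False for a sequence with v ≤ h is also an assumption, since the claim only states that detection ends in that case.

```python
def detect_anomaly(topic_seq, h, g, predict):
    """Return True if the process is flagged abnormal, else False.

    predict -- hypothetical stand-in for LSTM-ADM: takes a window of h
               topic indices, returns a probability for every topic.
    """
    v = len(topic_seq)
    if v <= h:
        return False  # too short to form a detection pair (assumed result)
    for j in range(v - h):
        window, actual = topic_seq[j:j + h], topic_seq[j + h]
        probs = predict(window)
        # Candidate set: the g topics with the largest predicted probability.
        ranked = sorted(range(len(probs)), key=lambda t: probs[t], reverse=True)
        if actual not in ranked[:g]:
            return True  # observed topic is not among the candidates
    return False


def toy_predict(window):
    """Toy model over 3 topics: expects the topic after the last one, cyclically."""
    probs = [0.05, 0.05, 0.05]
    probs[(window[-1] + 1) % 3] = 0.9
    return probs
```

With h = 2 and g = 1, the toy model accepts the sequence [0, 1, 2, 0, 1] and flags [0, 1, 2, 2, 1], because the fourth topic breaks the cyclic pattern. A larger g widens the candidate set and so tolerates more variation before flagging.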
CN202210689100.4A 2022-06-17 2022-06-17 Log anomaly detection method based on LDA theme characteristics Pending CN114969761A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210689100.4A CN114969761A (en) 2022-06-17 2022-06-17 Log anomaly detection method based on LDA theme characteristics

Publications (1)

Publication Number Publication Date
CN114969761A true CN114969761A (en) 2022-08-30

Family

ID=82963994

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210689100.4A Pending CN114969761A (en) 2022-06-17 2022-06-17 Log anomaly detection method based on LDA theme characteristics

Country Status (1)

Country Link
CN (1) CN114969761A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116841650A (en) * 2023-08-31 2023-10-03 腾讯科技(深圳)有限公司 Sample construction method, device, equipment and storage medium
CN116841650B (en) * 2023-08-31 2023-11-21 腾讯科技(深圳)有限公司 Sample construction method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN107315956B (en) A graph-theoretic method for quickly and accurately detecting zero-day malware
CN109918505B (en) Network security event visualization method based on text processing
CN113434357A (en) Log abnormity detection method and device based on sequence prediction
CN109670318B (en) Vulnerability detection method based on cyclic verification of nuclear control flow graph
US11533373B2 (en) Global iterative clustering algorithm to model entities' behaviors and detect anomalies
CN111598179A (en) Power monitoring system user abnormal behavior analysis method, storage medium and equipment
CN117081858B (en) Intrusion behavior detection method, system, equipment and medium based on multi-decision tree
CN112100137A (en) Unmanned aerial vehicle anomaly detection method based on multi-log collaborative analysis
CN116107834A (en) Log abnormality detection method, device, equipment and storage medium
Liu et al. FewM-HGCL: Few-shot malware variants detection via heterogeneous graph contrastive learning
CN114969761A (en) Log anomaly detection method based on LDA theme characteristics
Xie et al. An attention-based gru network for anomaly detection from system logs
CN112583847B (en) Method for network security event complex analysis for medium and small enterprises
CN111786999B (en) Intrusion behavior detection method, device, equipment and storage medium
CN115221013B (en) Method, device and equipment for determining log mode
CN116467720A (en) Intelligent contract vulnerability detection method based on graph neural network and electronic equipment
CN112733144B (en) Intelligent malicious program detection method based on deep learning technology
CN115278752A (en) AI (Artificial intelligence) detection method for abnormal logs of 5G (fifth generation) communication system
CN111079145B (en) Malicious program detection method based on graph processing
CN111565192A (en) Credibility-based multi-model cooperative defense method for internal network security threats
Xie et al. Industrial Internet Vulnerability Detection Method Based on CBAM-CNN-SVM
Zheng et al. Using complex network communities to evaluate the correctness of object detection
Nandakumar et al. A Novel Approach to User Agent String Parsing for Vulnerability Analysis Using Multi-Headed Attention
CN111125699B (en) Malicious program visual detection method based on deep learning
Chen et al. Avminer: Expansible and semantic-preserving anti-virus labels mining method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination