CN115757062A - Log anomaly detection method based on sentence embedding and Transformer-XL - Google Patents

Log anomaly detection method based on sentence embedding and Transformer-XL

Info

Publication number
CN115757062A
CN115757062A
Authority
CN
China
Prior art keywords
log
sentence embedding
model
structured
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211437329.5A
Other languages
Chinese (zh)
Inventor
周宇
曹英楠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics filed Critical Nanjing University of Aeronautics and Astronautics
Priority to CN202211437329.5A priority Critical patent/CN115757062A/en
Publication of CN115757062A publication Critical patent/CN115757062A/en
Pending legal-status Critical Current

Classifications

    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D — CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a log anomaly detection method based on sentence embedding and Transformer-XL, which comprises the following steps: obtaining structured log data without variable log parameters from unstructured raw log data through a log parser; preprocessing the structured log data; performing semantic vectorization representation of the structured log statements according to pre-trained GloVe word vectors and the InferSent sentence embedding model; generating relative position encodings of the log events by using a sinusoidal position encoding matrix; and taking the semantic vectors and relative position encodings of the log statements as the input of a Transformer-XL classification model to obtain the log anomaly detection result. Compared with the prior art, the invention uses the sentence embedding technique to represent the semantic information of a whole log statement as a vector, which effectively prevents the loss of valuable information in log messages, and uses the Transformer-XL model to overcome the limitations of RNN-based models, thereby improving the accuracy of anomaly detection.

Description

Log anomaly detection method based on sentence embedding and Transformer-XL
Technical Field
The invention belongs to the technical field of log anomaly detection, and particularly relates to a log anomaly detection method based on sentence embedding and Transformer-XL.
Background
Anomaly detection aims to discover abnormal system behaviors in time and plays an important role in the incident management of large-scale systems: it allows system developers (or operators) to find problems promptly and resolve them immediately, thereby reducing system downtime. Log data are text data produced by the print statements that program developers embed in their code to assist debugging; they record variable information, program execution states, and so on at run time. Monitoring data focus on system states and coarse-grained application states, such as process states and service states, whereas log data focus on fine-grained application states and cross-component program execution logic. Log data can locate specific log and event information and can locate an abnormal request instance, i.e., the log output sequence of a distributed, cross-component request, which reflects the execution trace of the request to a certain extent. Log data are therefore better suited to anomaly detection and subsequent fault diagnosis tasks.
With conventional standalone systems, developers manually review system logs or write rules to detect anomalies based on their domain knowledge, additionally using keyword searching or regular expression matching. However, such anomaly detection, which relies heavily on manual log auditing, has become inadequate for large-scale systems for the following reasons: 1) The large scale and parallelism of modern systems make system behavior highly complex; each developer is usually responsible only for sub-components of the system and may have only an incomplete understanding of the overall system behavior, so identifying problems from a large number of logs is a huge challenge. 2) Modern systems generate large volumes of logs, at speeds of about 50 GB per hour, which makes it very difficult to manually identify key information from noisy data for anomaly detection. 3) Large-scale systems are typically built with various fault-tolerance mechanisms; systems sometimes run the same task redundantly and may even actively terminate speculative tasks to improve performance. In such cases, conventional methods based on keyword searching become ineffective for extracting suspicious log messages, which may result in many false positives and substantially increase the workload of manual inspection.
At present, many conventional machine learning models have been proposed to identify abnormal events from log messages; these methods extract useful features from log messages and analyze log data with machine learning algorithms. However, conventional machine learning models have difficulty capturing the temporal information of discrete log messages. In recent years, the recurrent neural network (RNN) has been widely used for log anomaly detection among deep learning models because it can model sequence data, but modeling log data with RNNs has some limitations. For example, an RNN cannot make every log in a sequence encode context information from both the left and right contexts; RNNs focus primarily on capturing correlations between log messages in a normal sequence, and when such a correlation in the log sequence is broken, the RNN model cannot correctly predict the next log message from the previous ones. To address the inability of RNN models to encode log semantic information and log context information, word embedding has been applied to log statements to obtain semantic vectors, but the meaningful semantic information that word embedding can extract is limited.
The Transformer-XL model and the sentence embedding technique effectively remedy the shortcomings of RNNs and of word embedding, respectively. The present invention therefore improves the log anomaly detection method by combining these two techniques, which effectively improves the accuracy of anomaly detection.
Disclosure of Invention
In view of the above disadvantages of the prior art, the present invention provides a log anomaly detection method based on sentence embedding and Transformer-XL, so as to solve the problems of RNN model deficiencies and the limited semantic information of word embedding in the prior art.
In order to achieve the above purpose, the technical solution adopted by the invention is as follows:
The invention discloses a log anomaly detection method based on sentence embedding and Transformer-XL, which comprises the following steps:
(1) Extracting log events from the log statements of unstructured raw log data with the log parser Drain to obtain structured log data without variable log parameters;
(2) Preprocessing structured log data;
(3) Performing semantic vectorization representation on the structured log statement according to the pre-trained GloVe word vector and the InferSent sentence embedding model;
(4) Generating a relative position code of the log event by using a sine position code matrix;
(5) Taking the semantic vector and relative position encoding of the log statement as the input of a Transformer-XL classification model to obtain the log anomaly detection result.
Preferably, the step (1) of converting the unstructured log data into structured log data by the log parser Drain specifically includes: when new raw log data arrive, Drain preprocesses them with simple regular expressions based on domain knowledge and then searches for a log group according to specially designed rules encoded in the nodes of an internal tree; if a suitable log group is found, the log data are matched with the log events stored in that group, otherwise a new log group is created for the log data.
Preferably, the preprocessing of the structured log data in step (2) specifically includes: removing non-character tokens (delimiters, operators, etc.) and stop words from the structured log data, and splitting compound words into individual words according to camel case.
Preferably, the semantic vectorization representation of the structured log statements according to the pre-trained GloVe word vectors and the InferSent sentence embedding model in step (3) specifically includes: loading the pre-trained GloVe word vectors and the InferSent sentence embedding model, setting the relevant parameters of the InferSent model, building a vocabulary from the existing structured log statements, and encoding the structured log statements with the InferSent model to generate sentence embeddings with a fixed dimension of 300, i.e., semantic vectors.
Preferably, the generating of the relative position code of the log event by using the sinusoidal position code matrix in the step (4) specifically includes: the sinusoidal position code calculation formula is as follows:
PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
where PE(pos, 2i) and PE(pos, 2i+1) are respectively the 2i-th and (2i+1)-th components of the encoding vector at position pos, and d is the vector dimension.
Preferably, the Transformer-XL classification model in step (5) specifically includes: the semantic vector and relative position encoding of the log statement are used as the input of the model; the encoder part of the model comprises a multi-head attention layer and a position-wise feed-forward layer; the multi-head attention layer computes attention weights for each log statement with different attention patterns, and the calculation formula is as follows:
Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
Considering sentence embedding and relative position encoding, the complete expression of the attention score Q^T K between query position i and key position j is:
A(i, j) = E_{x_i}^T W_q^T W_{k,E} E_{x_j} + E_{x_i}^T W_q^T W_{k,R} R_{i-j} + u^T W_{k,E} E_{x_j} + v^T W_{k,R} R_{i-j}
where E_x denotes the input embedding of a token; W_q is the query matrix; W_{k,E} and W_{k,R} denote the content-based and position-based key matrices, respectively, indicating that the input sequence and the position encoding do not share weights; R_{i-j} denotes the relative position encoding, and i - j >= 0 since position i attends only to the preceding sequence; and the two newly introduced learnable parameters u and v indicate that the corresponding query vectors are the same for all query positions, i.e., the attention bias toward different words remains the same regardless of the query position.
The model output is fed to a pooling layer, a dropout layer, and a fully connected layer, and finally a softmax classifier computes the probability that the log sequence is normal or abnormal, thereby identifying abnormal logs.
The invention has the following beneficial effects:
the invention utilizes a sentence embedding technology to represent the semantic information of the whole log statement machine as a vector, thereby effectively preventing the loss of valuable information in the log message, and also utilizes a Transformer-XL model to overcome the limitation of the model based on RNN (radio network node), thereby improving the accuracy of abnormal detection.
Drawings
Fig. 1 is a schematic diagram of the principle framework of the present invention.
Detailed Description
In order to facilitate understanding of those skilled in the art, the present invention will be further described with reference to the following examples and drawings, which are not intended to limit the present invention.
Referring to Fig. 1, the log anomaly detection method based on sentence embedding and Transformer-XL according to the present invention includes the following steps:
(1) Extracting log events from the log statements of unstructured raw log data with the log parser Drain to obtain structured log data without variable log parameters;
(2) Preprocessing structured log data;
(3) Performing semantic vectorization representation on the structured log sentences according to the pre-trained GloVe word vectors and the InferSent sentence embedding model;
(4) Generating a relative position code of the log event by using a sine position code matrix;
(5) Taking the semantic vector and relative position encoding of the log statement as the input of a Transformer-XL classification model to obtain the log anomaly detection result.
The step (1), converting unstructured log data into structured log data with the log parser Drain, specifically includes: when new raw log data arrive, Drain preprocesses them with simple regular expressions based on domain knowledge and then searches for a log group according to specially designed rules encoded in the nodes of an internal tree; if a suitable log group is found, the log data are matched with the log events stored in that group, otherwise a new log group is created for the log data.
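As an illustrative, non-limiting sketch of this step, the example below uses the open-source drain3 package (a public implementation of the Drain parser); the TemplateMiner API calls and the sample log lines are assumptions for illustration and are not part of the invention.

```python
# Illustrative sketch of step (1): extracting log events with the Drain parser.
# Assumes the open-source drain3 package (pip install drain3); the sample log
# lines below are hypothetical.
from drain3 import TemplateMiner

miner = TemplateMiner()

raw_logs = [
    "Received block blk_3587 of size 67108864 from /10.251.42.84",
    "Received block blk_9912 of size 67108864 from /10.251.43.21",
    "PacketResponder 1 for block blk_3587 terminating",
]

structured_events = []
for line in raw_logs:
    result = miner.add_log_message(line)                 # match an existing log group or create a new one
    structured_events.append(result["template_mined"])   # log event with variable parameters removed

for event in structured_events:
    print(event)  # variable parts such as block IDs and IP addresses become the <*> wildcard
```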
In step (2), preprocessing the structured log data specifically includes: removing non-character tokens (delimiters, operators, etc.) and stop words from the structured log data, and splitting compound words into individual words according to camel case.
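A minimal sketch of this preprocessing is given below; the regular expressions and the stop-word list are illustrative assumptions rather than the exact rules of the invention.

```python
# Illustrative sketch of step (2): cleaning a structured log event.
import re

STOP_WORDS = {"the", "a", "an", "of", "to", "for", "is"}  # assumed minimal stop-word list

def split_camel_case(token: str) -> list:
    # e.g. "PacketResponder" -> ["Packet", "Responder"]
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", token).split()

def preprocess(event: str) -> list:
    event = re.sub(r"[^A-Za-z ]+", " ", event)  # drop non-character tokens (delimiters, operators, ...)
    words = []
    for token in event.split():
        for word in split_camel_case(token):
            if word.lower() not in STOP_WORDS:  # drop stop words
                words.append(word.lower())
    return words

print(preprocess("PacketResponder <*> for block <*> terminating"))
# -> ['packet', 'responder', 'block', 'terminating']
```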
In step (3), performing semantic vectorization representation of the structured log statements according to the pre-trained GloVe word vectors and the InferSent sentence embedding model specifically includes: loading the pre-trained GloVe word vectors and the InferSent sentence embedding model, setting the relevant parameters of the InferSent model, building a vocabulary from the existing structured log statements, and encoding the structured log statements with the InferSent model to generate sentence embeddings with a fixed dimension of 300, i.e., semantic vectors.
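A possible realization of this step, following the publicly released InferSent implementation (the InferSent class from models.py in the facebookresearch/InferSent repository) together with pre-trained GloVe vectors, is sketched below. The file paths and parameter values are illustrative assumptions; note that the publicly released encoder produces 4096-dimensional sentence vectors, so obtaining the 300-dimensional embeddings described here would require an encoder configured and trained with a correspondingly smaller BiLSTM dimension.

```python
# Illustrative sketch of step (3): encoding structured log statements into semantic vectors
# with pre-trained GloVe word vectors and the InferSent sentence embedding model.
# Assumes models.py from the public InferSent repository and local copies of the
# pre-trained weights and GloVe vectors; paths and parameters are examples only.
import torch
from models import InferSent

params_model = {
    "bsize": 64,
    "word_emb_dim": 300,   # GloVe word-vector dimension
    "enc_lstm_dim": 2048,  # dimension of the released encoder (yields 4096-d sentence vectors)
    "pool_type": "max",
    "dpout_model": 0.0,
    "version": 1,
}
model = InferSent(params_model)
model.load_state_dict(torch.load("encoder/infersent1.pkl"))
model.set_w2v_path("GloVe/glove.840B.300d.txt")

log_statements = [
    "received block of size from",
    "packet responder for block terminating",
]
model.build_vocab(log_statements, tokenize=True)                # vocabulary from existing log statements
semantic_vectors = model.encode(log_statements, tokenize=True)  # one fixed-dimension vector per statement
print(semantic_vectors.shape)
```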
The step (4) specifically comprises: generating a relative position code of the log event by using a sine position code matrix, wherein the sine position code is calculated by the following formula:
PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
where PE(pos, 2i) and PE(pos, 2i+1) are respectively the 2i-th and (2i+1)-th components of the encoding vector at position pos, and d is the vector dimension.
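For reference, the sinusoidal position encoding matrix defined by these formulas can be generated as in the following sketch; the sequence length and dimension are illustrative values.

```python
# Illustrative sketch of step (4): building the sinusoidal position encoding matrix
# PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d)).
import numpy as np

def sinusoidal_position_encoding(seq_len: int, d: int) -> np.ndarray:
    pos = np.arange(seq_len)[:, None]        # positions 0 .. seq_len - 1
    two_i = np.arange(0, d, 2)[None, :]      # even dimension indices 2i
    angles = pos / np.power(10000.0, two_i / d)
    pe = np.zeros((seq_len, d))
    pe[:, 0::2] = np.sin(angles)             # even components
    pe[:, 1::2] = np.cos(angles)             # odd components
    return pe

pe = sinusoidal_position_encoding(seq_len=50, d=300)  # illustrative sizes
print(pe.shape)                                       # (50, 300)
```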
The Transformer-XL classification model in step (5) specifically comprises: the semantic vector and relative position encoding of the log statement are used as the input of the model; the encoder part of the model comprises a multi-head attention layer and a position-wise feed-forward layer; the multi-head attention layer computes attention weights for each log statement with different attention patterns, and the calculation formula is as follows:
Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
Considering sentence embedding and relative position encoding, the complete expression of the attention score Q^T K between query position i and key position j is:
A(i, j) = E_{x_i}^T W_q^T W_{k,E} E_{x_j} + E_{x_i}^T W_q^T W_{k,R} R_{i-j} + u^T W_{k,E} E_{x_j} + v^T W_{k,R} R_{i-j}
where E_x denotes the input embedding of a token; W_q is the query matrix; W_{k,E} and W_{k,R} denote the content-based and position-based key matrices, respectively, indicating that the input sequence and the position encoding do not share weights; R_{i-j} denotes the relative position encoding, and i - j >= 0 since position i attends only to the preceding sequence; and the two newly introduced learnable parameters u and v indicate that the corresponding query vectors are the same for all query positions, i.e., the attention bias toward different words remains the same regardless of the query position.
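To make the above decomposition concrete, the sketch below computes the relative-position attention scores for a single attention head with random inputs; the dimensions, initialization, and masking are illustrative assumptions, and this is a simplified single-head version rather than a complete Transformer-XL implementation.

```python
# Illustrative single-head sketch of the relative attention score
# A(i, j) = (W_q E_i)^T (W_kE E_j) + (W_q E_i)^T (W_kR R_{i-j})
#           + u^T (W_kE E_j) + v^T (W_kR R_{i-j})
# Dimensions and random inputs are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 6, 300

E = rng.normal(size=(seq_len, d))            # semantic vectors of the log statements

# Sinusoidal relative position encodings R_0 .. R_{seq_len-1}, as defined in step (4).
pos = np.arange(seq_len)[:, None]
two_i = np.arange(0, d, 2)[None, :]
R = np.zeros((seq_len, d))
R[:, 0::2] = np.sin(pos / 10000.0 ** (two_i / d))
R[:, 1::2] = np.cos(pos / 10000.0 ** (two_i / d))

W_q = rng.normal(size=(d, d)) * 0.01         # query matrix
W_kE = rng.normal(size=(d, d)) * 0.01        # content-based key matrix
W_kR = rng.normal(size=(d, d)) * 0.01        # position-based key matrix (weights not shared)
u = rng.normal(size=d) * 0.01                # learnable content bias
v = rng.normal(size=d) * 0.01                # learnable position bias

scores = np.full((seq_len, seq_len), -np.inf)
for i in range(seq_len):
    q_i = W_q @ E[i]
    for j in range(i + 1):                   # position i attends only to the preceding sequence
        k_content = W_kE @ E[j]
        k_position = W_kR @ R[i - j]
        scores[i, j] = q_i @ k_content + q_i @ k_position + u @ k_content + v @ k_position

attn = np.exp(scores / np.sqrt(d))
attn = attn / attn.sum(axis=1, keepdims=True)  # softmax over the allowed positions
print(attn.shape)                              # (6, 6) lower-triangular attention weights
```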
The model output is fed to a pooling layer, a dropout layer, and a fully connected layer, and finally a softmax classifier computes the probability that the log sequence is normal or abnormal, thereby identifying abnormal logs.
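The output stage described above can be sketched in PyTorch as follows; the encoder output is replaced by a random placeholder tensor, and the pooling choice, dropout rate, and layer sizes are illustrative assumptions rather than parameters fixed by the invention.

```python
# Illustrative sketch of the classification head on top of the Transformer-XL encoder:
# pooling -> dropout -> fully connected layer -> softmax over {normal, abnormal}.
# The encoder output below is a random placeholder; all sizes are illustrative.
import torch
import torch.nn as nn

batch_size, seq_len, d_model = 32, 50, 300
encoder_output = torch.randn(batch_size, seq_len, d_model)  # stands in for the Transformer-XL output

pooled = encoder_output.mean(dim=1)     # pooling layer over the log sequence
dropout = nn.Dropout(p=0.1)             # dropout layer (rate assumed)
fc = nn.Linear(d_model, 2)              # fully connected layer: normal vs. abnormal

logits = fc(dropout(pooled))
probs = torch.softmax(logits, dim=-1)   # probability that each log sequence is normal or abnormal
predictions = probs.argmax(dim=-1)
print(probs.shape, predictions.shape)   # torch.Size([32, 2]) torch.Size([32])
```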
In summary, the invention: (1) extracts log events from the log statements of unstructured raw log data with the log parser Drain to obtain structured log data without variable log parameters; (2) preprocesses the structured log data; (3) performs semantic vectorization representation of the structured log statements according to the pre-trained GloVe word vectors and the InferSent sentence embedding model; (4) generates relative position encodings of the log events with a sinusoidal position encoding matrix; and (5) takes the semantic vectors and relative position encodings of the log statements as the input of a Transformer-XL classification model to obtain the log anomaly detection result. The invention uses the sentence embedding technique to represent the semantic information of a whole log statement as a vector, which effectively prevents the loss of valuable information in log messages, and uses the Transformer-XL model to overcome the limitations of RNN (recurrent neural network)-based models, thereby improving the accuracy of anomaly detection.
The above are only preferred embodiments of the present invention, and the protection scope of the present invention is not limited to the above embodiments; all technical solutions falling under the idea of the present invention belong to the protection scope of the present invention. It should be noted that modifications and adaptations made by those skilled in the art without departing from the principles of the present invention are also considered to be within the protection scope of the present invention.

Claims (6)

1. A log anomaly detection method based on sentence embedding and Transformer-XL is characterized by comprising the following steps:
(1) Extracting log events from the log statements of unstructured raw log data with the log parser Drain to obtain structured log data without variable log parameters;
(2) Preprocessing structured log data;
(3) Performing semantic vectorization representation on the structured log sentences according to the pre-trained GloVe word vectors and the InferSent sentence embedding model;
(4) Generating a relative position code of the log event by using a sine position code matrix;
(5) Taking the semantic vector and relative position encoding of the log statement as the input of a Transformer-XL classification model to obtain the log anomaly detection result.
2. The method for detecting log anomalies based on sentence embedding and Transformer-XL as claimed in claim 1, wherein the step (1) of converting unstructured log data into structured log data by the log parser Drain specifically includes: when new raw log data arrive, Drain preprocesses them with simple regular expressions based on domain knowledge and then searches for a log group according to specially designed rules encoded in the nodes of an internal tree; if a suitable log group is found, the log data are matched with the log events stored in that group, otherwise a new log group is created for the log data.
3. The method for detecting log anomalies based on sentence embedding and Transformer-XL as claimed in claim 2, wherein the preprocessing of the structured log data in step (2) specifically comprises: removing non-character tokens (delimiters, operators, etc.) and stop words from the structured log data, and splitting compound words into individual words according to camel case.
4. The method according to claim 3, wherein the semantic vectorization representation of the structured log statements according to the pre-trained GloVe word vectors and the InferSent sentence embedding model in step (3) specifically comprises: loading the pre-trained GloVe word vectors and the InferSent sentence embedding model, setting the relevant parameters of the InferSent model, building a vocabulary from the existing structured log statements, and encoding the structured log statements with the InferSent model to generate sentence embeddings with a fixed dimension of 300, i.e., semantic vectors.
5. The log anomaly detection method based on sentence embedding and Transformer-XL according to claim 1, wherein the step (4) of generating the relative position encoding of the log events by using a sinusoidal position encoding matrix specifically comprises: the sinusoidal position encoding calculation formula is as follows:
PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
wherein PE(pos, 2i) and PE(pos, 2i+1) are respectively the 2i-th and (2i+1)-th components of the encoding vector at position pos, and d is the vector dimension.
6. The method for detecting log anomalies based on sentence embedding and Transformer-XL as claimed in claim 1, wherein the Transformer-XL classification model in step (5) specifically includes: the semantic vector and relative position encoding of the log statement are used as the input of the model; the encoder part of the model comprises a multi-head attention layer and a position-wise feed-forward layer; the multi-head attention layer computes attention weights for each log statement with different attention patterns, and the calculation formula is as follows:
Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
Considering sentence embedding and relative position encoding, the complete expression of the attention score Q^T K between query position i and key position j is:
A(i, j) = E_{x_i}^T W_q^T W_{k,E} E_{x_j} + E_{x_i}^T W_q^T W_{k,R} R_{i-j} + u^T W_{k,E} E_{x_j} + v^T W_{k,R} R_{i-j}
wherein E_x denotes the input embedding of a token; W_q is the query matrix; W_{k,E} and W_{k,R} denote the content-based and position-based key matrices, respectively, indicating that the input sequence and the position encoding do not share weights; R_{i-j} denotes the relative position encoding, and i - j >= 0 since position i attends only to the preceding sequence; and the two newly introduced learnable parameters u and v indicate that the corresponding query vectors are the same for all query positions, i.e., the attention bias toward different words remains the same regardless of the query position.
The model output is fed to a pooling layer, a dropout layer, and a fully connected layer, and finally a softmax classifier computes the probability that the log sequence is normal or abnormal, thereby identifying abnormal logs.
CN202211437329.5A 2022-11-16 2022-11-16 Log anomaly detection method based on sentence embedding and Transformer-XL Pending CN115757062A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211437329.5A CN115757062A (en) 2022-11-16 2022-11-16 Log anomaly detection method based on sentence embedding and Transformer-XL

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211437329.5A CN115757062A (en) 2022-11-16 2022-11-16 Log anomaly detection method based on sentence embedding and Transformer-XL

Publications (1)

Publication Number Publication Date
CN115757062A true CN115757062A (en) 2023-03-07

Family

ID=85372275

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211437329.5A Pending CN115757062A (en) 2022-11-16 2022-11-16 Log anomaly detection method based on sentence embedding and Transformer-XL

Country Status (1)

Country Link
CN (1) CN115757062A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116405326A (en) * 2023-06-07 2023-07-07 厦门瞳景智能科技有限公司 Information security management method and system based on block chain
CN116405326B (en) * 2023-06-07 2023-10-20 厦门瞳景智能科技有限公司 Information security management method and system based on block chain

Similar Documents

Publication Publication Date Title
Zhang et al. Robust log-based anomaly detection on unstable log data
Le et al. Log-based anomaly detection without log parsing
CN104598813B (en) Computer intrusion detection method based on integrated study and semi-supervised SVM
CN113326244B (en) Abnormality detection method based on log event graph and association relation mining
CN113434357A (en) Log abnormity detection method and device based on sequence prediction
CN112307473A (en) Malicious JavaScript code detection model based on Bi-LSTM network and attention mechanism
Zhang et al. Log sequence anomaly detection based on local information extraction and globally sparse transformer model
CN115344414A (en) Log anomaly detection method and system based on LSTM-Transformer
CN114968727B (en) Database through infrastructure fault positioning method based on artificial intelligence operation and maintenance
Zhang et al. Putracead: Trace anomaly detection with partial labels based on gnn and pu learning
CN115757062A (en) Log anomaly detection method based on sentence embedding and Transformer-XL
CN113779590B (en) Source code vulnerability detection method based on multidimensional characterization
CN114416479A (en) Log sequence anomaly detection method based on out-of-stream regularization
Huang et al. Improving log-based anomaly detection by pre-training hierarchical transformers
Huangfu et al. System failure detection using deep learning models integrating timestamps with nonuniform intervals
CN116074092B (en) Attack scene reconstruction system based on heterogram attention network
CN114969334B (en) Abnormal log detection method and device, electronic equipment and readable storage medium
Egersdoerfer et al. Clusterlog: Clustering logs for effective log-based anomaly detection
Chen et al. Unsupervised Anomaly Detection Based on System Logs.
Shahid et al. Anomaly detection in system logs in the sphere of digital economy
CN115587007A (en) Robertta-based weblog security detection method and system
Xiao et al. Detecting anomalies in cluster system using hybrid deep learning model
CN113326371A (en) Event extraction method fusing pre-training language model and anti-noise interference remote monitoring information
Wang et al. FastTransLog: A Log-based Anomaly Detection Method based on Fastformer
Ouyang et al. Binary vulnerability mining based on long short-term memory network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination