CN115757062A - Log anomaly detection method based on sentence embedding and Transformer-XL - Google Patents

Log anomaly detection method based on sentence embedding and Transformer-XL

Info

Publication number
CN115757062A
CN115757062A
Authority
CN
China
Prior art keywords
log
sentence embedding
model
structured
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211437329.5A
Other languages
Chinese (zh)
Inventor
周宇
曹英楠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics filed Critical Nanjing University of Aeronautics and Astronautics
Priority to CN202211437329.5A priority Critical patent/CN115757062A/en
Publication of CN115757062A publication Critical patent/CN115757062A/en
Pending legal-status Critical Current

Classifications

    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D — CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a log anomaly detection method based on sentence embedding and Transformer-XL, which comprises the following steps: obtaining structured log data without variable log parameters from unstructured raw log data through a log parser; preprocessing the structured log data; performing semantic vectorization representation of the structured log statements according to pre-trained GloVe word vectors and the InferSent sentence embedding model; generating relative position encodings of the log events by using a sinusoidal position encoding matrix; and taking the semantic vectors and relative position encodings of the log statements as the input of a Transformer-XL classification model to obtain the log anomaly detection result. Compared with the prior art, the invention uses the sentence embedding technique to represent the semantic information of a whole log statement as a vector, which effectively prevents the loss of valuable information in log messages, and uses the Transformer-XL model to overcome the limitations of RNN-based models, thereby improving the accuracy of anomaly detection.

Description

Log anomaly detection method based on sentence embedding and Transformer-XL
Technical Field
The invention belongs to the technical field of log anomaly detection, and particularly relates to a log anomaly detection method based on sentence embedding and Transformer-XL.
Background
Anomaly detection aims to discover abnormal system behaviors in time and plays an important role in the incident management of large-scale systems: it allows system developers (or operators) to find problems promptly and resolve them immediately, thereby reducing system downtime. Log data are text data produced by the print statements that program developers embed in their code to assist debugging; they record variable information, program execution states, and so on at run time. Monitoring data focus on system states and coarse-grained application states, such as process states and service states, whereas log data focus on fine-grained application states and cross-component program execution logic. Log data can locate specific log and event information and can locate an abnormal request instance, i.e., the log output sequence of a distributed, cross-component request, which reflects the execution trace of the request to a certain extent. Log data are therefore better suited to anomaly detection and subsequent fault diagnosis tasks.
With conventional standalone systems, developers manually review system logs or write rules to detect anomalies based on their domain knowledge, additionally using keyword searching or regular expression matching. However, such anomaly detection, which relies heavily on manual log auditing, has become inadequate for large-scale systems for the following reasons: 1) The large scale and parallelism of modern systems make system behavior highly complex; each developer is usually responsible only for sub-components of the system and may have only an incomplete understanding of the overall system behavior, so identifying problems from a large number of logs is a huge challenge. 2) Modern systems generate large volumes of logs, at speeds of about 50 GB per hour, which makes it very difficult to manually identify key information from noisy data for anomaly detection. 3) Large-scale systems are typically built with various fault-tolerance mechanisms; systems sometimes run the same task redundantly and may even actively terminate speculative tasks to improve performance. In such cases, conventional methods based on keyword searching become ineffective for extracting suspicious log messages, which may result in many false positives and substantially increase the workload of manual inspection.
At present, many conventional machine learning models have been proposed to identify abnormal events from log messages; these methods extract useful features from log messages and analyze log data with machine learning algorithms. However, conventional machine learning models have difficulty capturing the temporal information of discrete log messages. In recent years, the recurrent neural network (RNN) has been widely used for log anomaly detection among deep learning models because it can model sequence data, but modeling log data with RNNs has some limitations. For example, an RNN cannot make every log in a sequence encode context information from both the left and right contexts; RNNs focus primarily on capturing correlations between log messages in a normal sequence, and when such a correlation in the log sequence is broken, the RNN model cannot correctly predict the next log message from the previous ones. To address the inability of RNN models to encode log semantic information and log context information, word embedding has been applied to log statements to obtain semantic vectors, but the meaningful semantic information that word embedding can extract is limited.
The Transformer-XL model and the sentence embedding technique effectively remedy the shortcomings of RNNs and of word embedding, respectively. The present invention therefore improves the log anomaly detection method by combining these two techniques, which effectively improves the accuracy of anomaly detection.
Disclosure of Invention
In view of the above disadvantages of the prior art, the present invention provides a log anomaly detection method based on sentence embedding and Transformer-XL, so as to solve the problems of RNN model deficiencies and the limited semantic information of word embedding in the prior art.
In order to achieve the above purpose, the technical solution adopted by the invention is as follows:
The invention discloses a log anomaly detection method based on sentence embedding and Transformer-XL, which comprises the following steps:
(1) Extracting log events from the log statements of unstructured raw log data with the log parser Drain to obtain structured log data without variable log parameters;
(2) Preprocessing structured log data;
(3) Performing semantic vectorization representation on the structured log statement according to the pre-trained GloVe word vector and the InferSent sentence embedding model;
(4) Generating a relative position code of the log event by using a sine position code matrix;
(5) Taking the semantic vector and relative position encoding of the log statement as the input of a Transformer-XL classification model to obtain the log anomaly detection result.
Preferably, the step (1) of converting the unstructured log data into structured log data by the log parser Drain specifically includes: when new raw log data arrive, Drain preprocesses them with simple regular expressions based on domain knowledge and then searches for a log group according to specially designed rules encoded in the nodes of an internal tree; if a suitable log group is found, the log data are matched with the log events stored in that group, otherwise a new log group is created for the log data.
Preferably, the preprocessing of the structured log data in step (2) specifically includes: removing non-character tokens (delimiters, operators, etc.) and stop words from the structured log data, and splitting compound words into individual words according to camel case.
Preferably, the semantic vectorization representation of the structured log statements according to the pre-trained GloVe word vectors and the InferSent sentence embedding model in step (3) specifically includes: loading the pre-trained GloVe word vectors and the InferSent sentence embedding model, setting the relevant parameters of the InferSent model, building a vocabulary from the existing structured log statements, and encoding the structured log statements with the InferSent model to generate sentence embeddings with a fixed dimension of 300, i.e., semantic vectors.
Preferably, the generating of the relative position code of the log event by using the sinusoidal position code matrix in the step (4) specifically includes: the sinusoidal position code calculation formula is as follows:
PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
where PE(pos, 2i) and PE(pos, 2i+1) are respectively the 2i-th and (2i+1)-th components of the encoding vector at position pos, and d is the vector dimension.
Preferably, the Transformer-XL classification model in step (5) specifically includes: the semantic vector and relative position encoding of the log statement are used as the input of the model; the encoder part of the model comprises a multi-head attention layer and a position-wise feed-forward layer; the multi-head attention layer computes attention weights for each log statement with different attention patterns, and the calculation formula is as follows:
Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
Considering sentence embedding and relative position encoding, the complete expression of the attention score Q^T K between query position i and key position j is:
A(i, j) = E_{x_i}^T W_q^T W_{k,E} E_{x_j} + E_{x_i}^T W_q^T W_{k,R} R_{i-j} + u^T W_{k,E} E_{x_j} + v^T W_{k,R} R_{i-j}
where E_x denotes the input embedding of a token; W_q is the query matrix; W_{k,E} and W_{k,R} denote the content-based and position-based key matrices, respectively, indicating that the input sequence and the position encoding do not share weights; R_{i-j} denotes the relative position encoding, and i - j >= 0 since position i attends only to the preceding sequence; and the two newly introduced learnable parameters u and v indicate that the corresponding query vectors are the same for all query positions, i.e., the attention bias toward different words remains the same regardless of the query position.
The model output is fed to a pooling layer, a dropout layer, and a fully connected layer, and finally a softmax classifier computes the probability that the log sequence is normal or abnormal, thereby identifying abnormal logs.
The invention has the following beneficial effects:
the invention utilizes a sentence embedding technology to represent the semantic information of the whole log statement machine as a vector, thereby effectively preventing the loss of valuable information in the log message, and also utilizes a Transformer-XL model to overcome the limitation of the model based on RNN (radio network node), thereby improving the accuracy of abnormal detection.
Drawings
Fig. 1 is a schematic diagram of the principle framework of the present invention.
Detailed Description
In order to facilitate understanding of those skilled in the art, the present invention will be further described with reference to the following examples and drawings, which are not intended to limit the present invention.
Referring to Fig. 1, the log anomaly detection method based on sentence embedding and Transformer-XL according to the present invention includes the following steps:
(1) Extracting log events from the log statements of unstructured raw log data with the log parser Drain to obtain structured log data without variable log parameters;
(2) Preprocessing structured log data;
(3) Performing semantic vectorization representation on the structured log sentences according to the pre-trained GloVe word vectors and the InferSent sentence embedding model;
(4) Generating a relative position code of the log event by using a sine position code matrix;
(5) Taking the semantic vector and relative position encoding of the log statement as the input of a Transformer-XL classification model to obtain the log anomaly detection result.
The step (1), converting unstructured log data into structured log data with the log parser Drain, specifically includes: when new raw log data arrive, Drain preprocesses them with simple regular expressions based on domain knowledge and then searches for a log group according to specially designed rules encoded in the nodes of an internal tree; if a suitable log group is found, the log data are matched with the log events stored in that group, otherwise a new log group is created for the log data.
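As an illustrative, non-limiting sketch of this step, the example below uses the open-source drain3 package (a public implementation of the Drain parser); the TemplateMiner API calls and the sample log lines are assumptions for illustration and are not part of the invention.

```python
# Illustrative sketch of step (1): extracting log events with the Drain parser.
# Assumes the open-source drain3 package (pip install drain3); the sample log
# lines below are hypothetical.
from drain3 import TemplateMiner

miner = TemplateMiner()

raw_logs = [
    "Received block blk_3587 of size 67108864 from /10.251.42.84",
    "Received block blk_9912 of size 67108864 from /10.251.43.21",
    "PacketResponder 1 for block blk_3587 terminating",
]

structured_events = []
for line in raw_logs:
    result = miner.add_log_message(line)                 # match an existing log group or create a new one
    structured_events.append(result["template_mined"])   # log event with variable parameters removed

for event in structured_events:
    print(event)  # variable parts such as block IDs and IP addresses become the <*> wildcard
```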
In step (2), preprocessing the structured log data specifically includes: removing non-character tokens (delimiters, operators, etc.) and stop words from the structured log data, and splitting compound words into individual words according to camel case.
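A minimal sketch of this preprocessing is given below; the regular expressions and the stop-word list are illustrative assumptions rather than the exact rules of the invention.

```python
# Illustrative sketch of step (2): cleaning a structured log event.
import re

STOP_WORDS = {"the", "a", "an", "of", "to", "for", "is"}  # assumed minimal stop-word list

def split_camel_case(token: str) -> list:
    # e.g. "PacketResponder" -> ["Packet", "Responder"]
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", token).split()

def preprocess(event: str) -> list:
    event = re.sub(r"[^A-Za-z ]+", " ", event)  # drop non-character tokens (delimiters, operators, ...)
    words = []
    for token in event.split():
        for word in split_camel_case(token):
            if word.lower() not in STOP_WORDS:  # drop stop words
                words.append(word.lower())
    return words

print(preprocess("PacketResponder <*> for block <*> terminating"))
# -> ['packet', 'responder', 'block', 'terminating']
```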
In step (3), performing semantic vectorization representation of the structured log statements according to the pre-trained GloVe word vectors and the InferSent sentence embedding model specifically includes: loading the pre-trained GloVe word vectors and the InferSent sentence embedding model, setting the relevant parameters of the InferSent model, building a vocabulary from the existing structured log statements, and encoding the structured log statements with the InferSent model to generate sentence embeddings with a fixed dimension of 300, i.e., semantic vectors.
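A possible realization of this step, following the publicly released InferSent implementation (the InferSent class from models.py in the facebookresearch/InferSent repository) together with pre-trained GloVe vectors, is sketched below. The file paths and parameter values are illustrative assumptions; note that the publicly released encoder produces 4096-dimensional sentence vectors, so obtaining the 300-dimensional embeddings described here would require an encoder configured and trained with a correspondingly smaller BiLSTM dimension.

```python
# Illustrative sketch of step (3): encoding structured log statements into semantic vectors
# with pre-trained GloVe word vectors and the InferSent sentence embedding model.
# Assumes models.py from the public InferSent repository and local copies of the
# pre-trained weights and GloVe vectors; paths and parameters are examples only.
import torch
from models import InferSent

params_model = {
    "bsize": 64,
    "word_emb_dim": 300,   # GloVe word-vector dimension
    "enc_lstm_dim": 2048,  # dimension of the released encoder (yields 4096-d sentence vectors)
    "pool_type": "max",
    "dpout_model": 0.0,
    "version": 1,
}
model = InferSent(params_model)
model.load_state_dict(torch.load("encoder/infersent1.pkl"))
model.set_w2v_path("GloVe/glove.840B.300d.txt")

log_statements = [
    "received block of size from",
    "packet responder for block terminating",
]
model.build_vocab(log_statements, tokenize=True)                # vocabulary from existing log statements
semantic_vectors = model.encode(log_statements, tokenize=True)  # one fixed-dimension vector per statement
print(semantic_vectors.shape)
```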
The step (4) specifically comprises: generating a relative position code of the log event by using a sine position code matrix, wherein the sine position code is calculated by the following formula:
PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
where PE(pos, 2i) and PE(pos, 2i+1) are respectively the 2i-th and (2i+1)-th components of the encoding vector at position pos, and d is the vector dimension.
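For reference, the sinusoidal position encoding matrix defined by these formulas can be generated as in the following sketch; the sequence length and dimension are illustrative values.

```python
# Illustrative sketch of step (4): building the sinusoidal position encoding matrix
# PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d)).
import numpy as np

def sinusoidal_position_encoding(seq_len: int, d: int) -> np.ndarray:
    pos = np.arange(seq_len)[:, None]        # positions 0 .. seq_len - 1
    two_i = np.arange(0, d, 2)[None, :]      # even dimension indices 2i
    angles = pos / np.power(10000.0, two_i / d)
    pe = np.zeros((seq_len, d))
    pe[:, 0::2] = np.sin(angles)             # even components
    pe[:, 1::2] = np.cos(angles)             # odd components
    return pe

pe = sinusoidal_position_encoding(seq_len=50, d=300)  # illustrative sizes
print(pe.shape)                                       # (50, 300)
```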
The Transformer-XL classification model in step (5) specifically comprises: the semantic vector and relative position encoding of the log statement are used as the input of the model; the encoder part of the model comprises a multi-head attention layer and a position-wise feed-forward layer; the multi-head attention layer computes attention weights for each log statement with different attention patterns, and the calculation formula is as follows:
Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
Considering sentence embedding and relative position encoding, the complete expression of the attention score Q^T K between query position i and key position j is:
A(i, j) = E_{x_i}^T W_q^T W_{k,E} E_{x_j} + E_{x_i}^T W_q^T W_{k,R} R_{i-j} + u^T W_{k,E} E_{x_j} + v^T W_{k,R} R_{i-j}
where E_x denotes the input embedding of a token; W_q is the query matrix; W_{k,E} and W_{k,R} denote the content-based and position-based key matrices, respectively, indicating that the input sequence and the position encoding do not share weights; R_{i-j} denotes the relative position encoding, and i - j >= 0 since position i attends only to the preceding sequence; and the two newly introduced learnable parameters u and v indicate that the corresponding query vectors are the same for all query positions, i.e., the attention bias toward different words remains the same regardless of the query position.
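To make the above decomposition concrete, the sketch below computes the relative-position attention scores for a single attention head with random inputs; the dimensions, initialization, and masking are illustrative assumptions, and this is a simplified single-head version rather than a complete Transformer-XL implementation.

```python
# Illustrative single-head sketch of the relative attention score
# A(i, j) = (W_q E_i)^T (W_kE E_j) + (W_q E_i)^T (W_kR R_{i-j})
#           + u^T (W_kE E_j) + v^T (W_kR R_{i-j})
# Dimensions and random inputs are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 6, 300

E = rng.normal(size=(seq_len, d))            # semantic vectors of the log statements

# Sinusoidal relative position encodings R_0 .. R_{seq_len-1}, as defined in step (4).
pos = np.arange(seq_len)[:, None]
two_i = np.arange(0, d, 2)[None, :]
R = np.zeros((seq_len, d))
R[:, 0::2] = np.sin(pos / 10000.0 ** (two_i / d))
R[:, 1::2] = np.cos(pos / 10000.0 ** (two_i / d))

W_q = rng.normal(size=(d, d)) * 0.01         # query matrix
W_kE = rng.normal(size=(d, d)) * 0.01        # content-based key matrix
W_kR = rng.normal(size=(d, d)) * 0.01        # position-based key matrix (weights not shared)
u = rng.normal(size=d) * 0.01                # learnable content bias
v = rng.normal(size=d) * 0.01                # learnable position bias

scores = np.full((seq_len, seq_len), -np.inf)
for i in range(seq_len):
    q_i = W_q @ E[i]
    for j in range(i + 1):                   # position i attends only to the preceding sequence
        k_content = W_kE @ E[j]
        k_position = W_kR @ R[i - j]
        scores[i, j] = q_i @ k_content + q_i @ k_position + u @ k_content + v @ k_position

attn = np.exp(scores / np.sqrt(d))
attn = attn / attn.sum(axis=1, keepdims=True)  # softmax over the allowed positions
print(attn.shape)                              # (6, 6) lower-triangular attention weights
```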
The model output is fed to a pooling layer, a dropout layer, and a fully connected layer, and finally a softmax classifier computes the probability that the log sequence is normal or abnormal, thereby identifying abnormal logs.
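The output stage described above can be sketched in PyTorch as follows; the encoder output is replaced by a random placeholder tensor, and the pooling choice, dropout rate, and layer sizes are illustrative assumptions rather than parameters fixed by the invention.

```python
# Illustrative sketch of the classification head on top of the Transformer-XL encoder:
# pooling -> dropout -> fully connected layer -> softmax over {normal, abnormal}.
# The encoder output below is a random placeholder; all sizes are illustrative.
import torch
import torch.nn as nn

batch_size, seq_len, d_model = 32, 50, 300
encoder_output = torch.randn(batch_size, seq_len, d_model)  # stands in for the Transformer-XL output

pooled = encoder_output.mean(dim=1)     # pooling layer over the log sequence
dropout = nn.Dropout(p=0.1)             # dropout layer (rate assumed)
fc = nn.Linear(d_model, 2)              # fully connected layer: normal vs. abnormal

logits = fc(dropout(pooled))
probs = torch.softmax(logits, dim=-1)   # probability that each log sequence is normal or abnormal
predictions = probs.argmax(dim=-1)
print(probs.shape, predictions.shape)   # torch.Size([32, 2]) torch.Size([32])
```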
In summary, the invention: (1) extracts log events from the log statements of unstructured raw log data with the log parser Drain to obtain structured log data without variable log parameters; (2) preprocesses the structured log data; (3) performs semantic vectorization representation of the structured log statements according to the pre-trained GloVe word vectors and the InferSent sentence embedding model; (4) generates relative position encodings of the log events with a sinusoidal position encoding matrix; and (5) takes the semantic vectors and relative position encodings of the log statements as the input of a Transformer-XL classification model to obtain the log anomaly detection result. The invention uses the sentence embedding technique to represent the semantic information of a whole log statement as a vector, which effectively prevents the loss of valuable information in log messages, and uses the Transformer-XL model to overcome the limitations of RNN (recurrent neural network)-based models, thereby improving the accuracy of anomaly detection.
The above are only preferred embodiments of the present invention, and the protection scope of the present invention is not limited to the above embodiments; all technical solutions falling under the idea of the present invention belong to the protection scope of the present invention. It should be noted that modifications and adaptations made by those skilled in the art without departing from the principles of the present invention are also considered to be within the protection scope of the present invention.

Claims (6)

1. A log anomaly detection method based on sentence embedding and Transformer-XL is characterized by comprising the following steps:
(1) Extracting log events from the log statements of unstructured raw log data with the log parser Drain to obtain structured log data without variable log parameters;
(2) Preprocessing structured log data;
(3) Performing semantic vectorization representation on the structured log sentences according to the pre-trained GloVe word vectors and the InferSent sentence embedding model;
(4) Generating a relative position code of the log event by using a sine position code matrix;
(5) Taking the semantic vector and relative position encoding of the log statement as the input of a Transformer-XL classification model to obtain the log anomaly detection result.
2. The method for detecting log anomalies based on sentence embedding and Transformer-XL as claimed in claim 1, wherein the step (1) of converting unstructured log data into structured log data by the log parser Drain specifically includes: when new raw log data arrive, Drain preprocesses them with simple regular expressions based on domain knowledge and then searches for a log group according to specially designed rules encoded in the nodes of an internal tree; if a suitable log group is found, the log data are matched with the log events stored in that group, otherwise a new log group is created for the log data.
3. The method for detecting log anomalies based on sentence embedding and Transformer-XL as claimed in claim 2, wherein the preprocessing of the structured log data in step (2) specifically comprises: removing non-character tokens (delimiters, operators, etc.) and stop words from the structured log data, and splitting compound words into individual words according to camel case.
4. The method according to claim 3, wherein the semantic vectorization representation of the structured log statements according to the pre-trained GloVe word vectors and the InferSent sentence embedding model in step (3) specifically comprises: loading the pre-trained GloVe word vectors and the InferSent sentence embedding model, setting the relevant parameters of the InferSent model, building a vocabulary from the existing structured log statements, and encoding the structured log statements with the InferSent model to generate sentence embeddings with a fixed dimension of 300, i.e., semantic vectors.
5. The log anomaly detection method based on sentence embedding and Transformer-XL according to claim 1, wherein the step (4) of generating the relative position encoding of the log events by using a sinusoidal position encoding matrix specifically comprises: the sinusoidal position encoding calculation formula is as follows:
PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
wherein PE(pos, 2i) and PE(pos, 2i+1) are respectively the 2i-th and (2i+1)-th components of the encoding vector at position pos, and d is the vector dimension.
6. The method for detecting log anomalies based on sentence embedding and Transformer-XL as claimed in claim 1, wherein the Transformer-XL classification model in step (5) specifically includes: the semantic vector and relative position encoding of the log statement are used as the input of the model; the encoder part of the model comprises a multi-head attention layer and a position-wise feed-forward layer; the multi-head attention layer computes attention weights for each log statement with different attention patterns, and the calculation formula is as follows:
Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
Considering sentence embedding and relative position encoding, the complete expression of the attention score Q^T K between query position i and key position j is:
A(i, j) = E_{x_i}^T W_q^T W_{k,E} E_{x_j} + E_{x_i}^T W_q^T W_{k,R} R_{i-j} + u^T W_{k,E} E_{x_j} + v^T W_{k,R} R_{i-j}
wherein E_x denotes the input embedding of a token; W_q is the query matrix; W_{k,E} and W_{k,R} denote the content-based and position-based key matrices, respectively, indicating that the input sequence and the position encoding do not share weights; R_{i-j} denotes the relative position encoding, and i - j >= 0 since position i attends only to the preceding sequence; and the two newly introduced learnable parameters u and v indicate that the corresponding query vectors are the same for all query positions, i.e., the attention bias toward different words remains the same regardless of the query position.
The model output is fed to a pooling layer, a dropout layer, and a fully connected layer, and finally a softmax classifier computes the probability that the log sequence is normal or abnormal, thereby identifying abnormal logs.
CN202211437329.5A 2022-11-16 2022-11-16 Log anomaly detection method based on sentence embedding and Transformer-XL Pending CN115757062A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211437329.5A CN115757062A (en) 2022-11-16 2022-11-16 Log anomaly detection method based on sentence embedding and Transformer-XL

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211437329.5A CN115757062A (en) 2022-11-16 2022-11-16 Log anomaly detection method based on sentence embedding and Transformer-XL

Publications (1)

Publication Number Publication Date
CN115757062A true CN115757062A (en) 2023-03-07

Family

ID=85372275

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211437329.5A Pending CN115757062A (en) 2022-11-16 2022-11-16 Log anomaly detection method based on sentence embedding and Transformer-XL

Country Status (1)

Country Link
CN (1) CN115757062A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116405326A (en) * 2023-06-07 2023-07-07 厦门瞳景智能科技有限公司 Information security management method and system based on block chain
CN116405326B (en) * 2023-06-07 2023-10-20 厦门瞳景智能科技有限公司 Information security management method and system based on block chain

Similar Documents

Publication Publication Date Title
Zhang et al. Robust log-based anomaly detection on unstable log data
Le et al. Log-based anomaly detection without log parsing
CN104598813B (en) Computer intrusion detection method based on integrated study and semi-supervised SVM
CN113326244B (en) Abnormality detection method based on log event graph and association relation mining
CN113434357A (en) Log abnormity detection method and device based on sequence prediction
CN112307473A (en) Malicious JavaScript code detection model based on Bi-LSTM network and attention mechanism
Zhang et al. Log sequence anomaly detection based on local information extraction and globally sparse transformer model
CN115344414A (en) Log anomaly detection method and system based on LSTM-Transformer
CN114968727B (en) Database through infrastructure fault positioning method based on artificial intelligence operation and maintenance
Zhang et al. Putracead: Trace anomaly detection with partial labels based on gnn and pu learning
CN115757062A (en) Log anomaly detection method based on sentence embedding and Transformer-XL
CN113779590B (en) Source code vulnerability detection method based on multidimensional characterization
CN114416479A (en) Log sequence anomaly detection method based on out-of-stream regularization
Huang et al. Improving log-based anomaly detection by pre-training hierarchical transformers
Huangfu et al. System failure detection using deep learning models integrating timestamps with nonuniform intervals
CN116074092B (en) Attack scene reconstruction system based on heterogram attention network
CN114969334B (en) Abnormal log detection method and device, electronic equipment and readable storage medium
Egersdoerfer et al. Clusterlog: Clustering logs for effective log-based anomaly detection
Chen et al. Unsupervised Anomaly Detection Based on System Logs.
Shahid et al. Anomaly detection in system logs in the sphere of digital economy
CN115587007A (en) Robertta-based weblog security detection method and system
Xiao et al. Detecting anomalies in cluster system using hybrid deep learning model
CN113326371A (en) Event extraction method fusing pre-training language model and anti-noise interference remote monitoring information
Wang et al. FastTransLog: A Log-based Anomaly Detection Method based on Fastformer
Ouyang et al. Binary vulnerability mining based on long short-term memory network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination