CN117688488A - Log anomaly detection method based on semantic vectorization representation - Google Patents

Log anomaly detection method based on semantic vectorization representation Download PDF

Info

Publication number
CN117688488A
CN117688488A (application CN202311611947.1A)
Authority
CN
China
Prior art keywords
log
words
semantic
anomaly detection
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311611947.1A
Other languages
Chinese (zh)
Inventor
章一磊
苑淑晴
龚声望
张广泽
王俊辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Normal University
Original Assignee
Anhui Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Normal University filed Critical Anhui Normal University
Priority to CN202311611947.1A priority Critical patent/CN117688488A/en
Publication of CN117688488A publication Critical patent/CN117688488A/en
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a log anomaly detection method based on semantic vectorization representation, which comprises the following steps: S1, preprocessing log data to remove redundant information; S2, capturing the contextual representation of words in the log message, extracting the semantic meaning of the original log message, and representing it as a semantic vector; S3, classifying with a classifier model based on a self-attention mechanism, and outputting a log anomaly detection result based on the classification result. The invention has the advantages that the semantic information of the log sequence is fully utilized and the reliability and accuracy of log anomaly detection are improved.

Description

Log anomaly detection method based on semantic vectorization representation
Technical Field
The invention relates to the field of computer log anomaly detection, and in particular to a log anomaly detection method based on semantic vectorization representation.
Background
With the continued development and popularization of computer technology, computer systems and network systems play an increasingly important role in daily life and work. These systems generate large amounts of log data, which record system operation in text form, capturing the status of key points and important activities, events and anomaly information, to assist administrators in system monitoring, troubleshooting, performance optimization and so on. Log anomaly detection can therefore help locate anomalies and support root-cause analysis, reducing downtime and ensuring the normal operation of the system.
Traditional log anomaly detection methods are mainly based on rules or machine learning algorithms, such as outlier detection, clustering and classification. However, these methods have some limitations when processing complex, high-dimensional log data. First, rule-based approaches often rely on manually defined rules to identify anomalies; such rules may not cover all anomalies, and designing and maintaining them is itself a cumbersome task. Second, conventional machine learning methods typically rely on manually designed feature engineering to extract features suitable for anomaly detection, but the high dimensionality and unstructured nature of log data make the feature engineering process very complex, and important features are easily missed. In addition, conventional methods often require long processing times on large-scale log data and have difficulty meeting the requirements of real-time monitoring and rapid anomaly detection.
To overcome the limitations of traditional methods, deep learning has become a new research hotspot in the field of log anomaly detection. Deep learning is a machine learning technique based on multi-layer neural networks, and its biggest advantage is that it automatically learns feature representations from data without relying on manually designed rules or feature engineering. Deep learning learns high-order features of data through multi-level abstract representations and can capture complex nonlinear relations in the data, so it excels at processing complex log data. However, existing log anomaly detection techniques based on deep learning models still have certain limitations: they ignore the problem of shared semantics among multiple log sources and cannot fully utilize the context information of the logs, which limits the accuracy of system anomaly detection to a certain extent. Meanwhile, most existing methods parse the original logs into log templates before anomaly detection and delete some messages; this can lose valuable information and lead to misunderstanding of the semantics of the log messages.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a log anomaly detection method based on semantic vectorization representation, so as to improve the reliability and accuracy of log anomaly detection.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows: a log anomaly detection method based on semantic vectorization representation comprises the following steps:
s1, preprocessing log data to remove redundant information;
s2, capturing context representation of words in the log message, extracting semantic meaning of the original log message, and representing the semantic meaning as a semantic vector;
s3, classifying by adopting a classifier model based on a self-attention mechanism, and outputting a log abnormality detection result based on the classification result.
Step S1 includes splitting the original log message into a set of words by separators, converting uppercase letters in the log message to lowercase letters, and deleting all non-character tokens from the log message, finally obtaining a group of words corresponding to the log message, wherein each word in the group of words is called a token.
The original log message is a semi-structured text comprising a header and content; the message header includes a timestamp, a verbosity level representing the severity level of the event, and a component, and the log content consists of a constant portion and a variable portion.
In step S2, a BERT model is used to capture the contextual representation of the words in the log message, extract the semantic meaning of the original log message, and represent it as a semantic vector.
The step S2 includes:
converting text in the log message into words by using a vocabulary, tokenizing text that is not in the vocabulary into sub-words by using a WordPiece module, forming a set of words and sub-words, which is then fed into the BERT model and encoded into a vector representation with a fixed dimension.
The WordPiece module is configured as follows: WordPiece first includes all characters and symbols in its basic vocabulary S; a language model is then used to calculate the likelihood of the sentence. Defining j to denote the j-th group log sequence and k to denote the k-th log in the sequence, the log data consists of n sub-words t_1, t_2, ..., t_n, where t_i denotes a sub-word. If each sub-word occurs independently, the likelihood of the log data is equivalent to the product of the probabilities of all sub-words:
log P(t_1 t_2 ... t_n) = Σ_{i=1}^{n} log P(t_i)
Each time, WordPiece selects two sub-words t_x and t_y from the vocabulary and merges them into a new sub-word t_z, obtaining by merging a new word that is not in the word stock, and then recalculates the likelihood of the sentence. The change in the likelihood of the log data can be expressed as:
Δ = log P(t_z) − ( log P(t_x) + log P(t_y) ) = log [ P(t_z) / ( P(t_x) · P(t_y) ) ]
the change of the likelihood value is mutual information between two words, and each time WordPiece selects two combined words, the two combined words have the largest mutual information value, namely the two words have stronger relevance on a language model; training a language model from the basic vocabulary, and selecting adjacent subwords which maximize the probability of the language model to be added to the vocabulary;
after merging the words, new words that are not in the word stock are obtained and added to the vocabulary, and the language model is trained again on the new vocabulary; these steps are repeated until the desired vocabulary size is reached.
The BERT model uses the BERT-base model: the input set of words and sub-words is encoded with a 12-layer Transformer encoder with 768 hidden units per layer, each Transformer layer comprising a multi-head self-attention sub-layer and a position-wise feed-forward sub-layer, wherein residual connections are employed in both sub-layers, followed by layer normalization.
The semantic vector corresponding to the fixed-dimension vector representation output by the BERT encoder is input into a classification model based on a multi-head self-attention mechanism and convolution layers to detect anomalies in the log messages.
The sum of the position embedding of the log sequence and the semantic vectors of the log sequence is input into the classifier model for anomaly detection.
The classifier model comprises an Attention block and a position feed-forward layer. The Attention block is a hybrid structure: a multi-head self-attention mechanism captures global context information, and a convolution layer extracts local context information; finally, an add operation is applied to the global and local contexts to extract the global-local context as output. The output then passes through layer normalization and enters the position feed-forward layer, whose inter-layer features are connected into a feed-forward network comprising two fully connected layers, each with a ReLU activation function; the data is first mapped to a high-dimensional space and then mapped back to a low-dimensional space. Through normalization, the data is moved into the active region of the ReLU activation function for nonlinear mapping learning. Finally, the output of the classifier model is fed to a pooling layer, a dropout layer and a fully connected layer, and normal/abnormal log sequences are identified using a Softmax function.
The invention has the advantages that: the semantic information of the log sequence is fully utilized, and the reliability and accuracy of log anomaly detection are improved. The semantic meaning of the log messages can be effectively represented. Since the anomaly detection language model uses the original log messages (after preprocessing) for anomaly detection, the problem of inaccurate log parsing is avoided. The results also indicate that the model can effectively learn the meaning of OOV words.
Drawings
The contents of the drawings and the marks in the drawings of the present specification are briefly described as follows:
FIG. 1 is a schematic diagram of the composition of a log message;
FIG. 2 is a schematic diagram of BERT feature extraction in accordance with the present invention;
FIG. 3 is a schematic diagram of the structure of the classifier model of the present invention;
FIG. 4 is a frame diagram of the LogFormer according to the present invention.
Detailed Description
The following detailed description of the invention refers to the accompanying drawings, which illustrate preferred embodiments of the invention in further detail.
In this embodiment, a log anomaly detection method (LogFormer) is provided that does not require log parsing. LogFormer extracts the semantic meaning of the original log messages and represents them as semantic vectors. These semantic vectors are then used to detect anomalies through a classification model based on a self-attention mechanism, which can effectively capture context information from the log sequence. Sufficient experiments were performed on two common public data sets, and the experimental results confirm the validity of LogFormer. Unlike existing methods, LogFormer does not depend on any log parsing, thereby preventing information loss due to log parsing errors. A training model with a multi-head self-attention mechanism and convolution can learn context information from the vector representation of a log sequence. Experiments were performed on two public data sets with precision, recall and F1-score as evaluation indexes. The results show that our model is effective in extracting semantics. The main contributions herein are as follows:
(1) We propose LogFormer, a deep learning method based on a BERT encoder and a self-attention module, which can detect system anomalies without log parsing, eliminating the negative impact of log parsing on system anomaly detection.
(2) The method uses BERT to encode the semantic information of log messages, so that the semantic meaning of the log context can be fully mined, and long-term dependencies between log sequences can be effectively captured with the help of the multi-head self-attention mechanism and the convolutional neural network.
(3) We evaluated LogFormer using public datasets. The experimental results demonstrate the effectiveness and robustness of LogFormer.
1. In this section, the LogFormer anomaly detection framework is described. LogFormer first performs preprocessing to remove redundant information, uses the BERT model to capture the contextual representation of words in the log messages, and then, to better understand the semantics of the logs, performs classification with a classifier model based on a self-attention mechanism, so as to detect anomalies in the system logs.
1.1 Log data preprocessing
The original log message is semi-structured text that contains a header and content. The message header is determined by the logging framework and includes a timestamp, a verbosity level (e.g., WARN/INFO) representing the severity of the event, and a component. The log content consists of a constant part (revealing the key of the event template) and a variable part (carrying the parameters of the dynamic runtime information), as shown in FIG. 1.
Preprocessing the log data is the first step of modeling. In this step, we first tokenize the log message into a set of words. For better tokenization, we split log messages using the common separators of logging systems (i.e., spaces, colons, commas, etc.). Each uppercase letter is then converted to a lowercase letter, and all non-character tokens are deleted from the word set. These non-characters include operators, punctuation and digits, as they generally represent variables in a log message and carry no information. For the original log message (17/06/09 20:11:10 INFO storage.BlockManager: Found block rdd 42 17 locally), the message is first split into a set of words based on the common separators, then each letter is converted to lowercase, and all non-characters are deleted from the word set. Finally, a set of words {info, store, blockmanager, found, block, locally} is obtained, and each word is called a token.
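As an illustration of this preprocessing step, the following is a minimal Python sketch; the separator set and the function name preprocess_log are illustrative assumptions and are not part of the original disclosure:

```python
import re

def preprocess_log(raw_message: str) -> list[str]:
    """Split a raw log message into lowercase word tokens, dropping non-character tokens."""
    # Split on common separators used by logging frameworks (spaces, colons, commas, dots, etc.).
    tokens = re.split(r"[\s:,=()\[\]/._]+", raw_message)
    words = []
    for tok in tokens:
        tok = tok.lower()            # convert uppercase letters to lowercase letters
        if tok and tok.isalpha():    # delete non-character tokens (operators, punctuation, digits)
            words.append(tok)
    return words

print(preprocess_log("17/06/09 20:11:10 INFO storage.BlockManager: Found block rdd_42_17 locally"))
# -> ['info', 'storage', 'blockmanager', 'found', 'block', 'rdd', 'locally']
```

The exact contents of the resulting word set depend on the chosen separator set, but the output is close to the word set given above.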
1.2 vector representation
Log messages record important system operation information, and each log message records a system event at a specific time. The output of these statements is the system's operational state and behavior described by the developer in natural language, including operations, errors, warnings and states that occur in the system. Existing methods typically analyze only the message content and discard the other information. In this work, LogFormer uses all of the text information, such as the verbosity level, component and content, to extract the semantic meaning of the log message. To preserve semantic information and capture the relationship between existing logs and new log messages, this stage represents log messages as vectors.
1.2.1 WordPiece Tokenization
One basic tokenization method is to break text into words. However, using only this method, words not contained in the vocabulary are treated as OOV (out-of-vocabulary) words. Modern NLP models solve this problem by tokenizing text into subword units, which typically retain semantic information. Thus, even if the model does not know a word, a single subword token may retain enough information for the model to infer its meaning to some extent. One popular subword tokenization technique, known as WordPiece, is also applicable to other NLP models and is widely used in recent language modeling studies. It involves splitting text into smaller units called tokens (e.g., words or word segments) in order to convert an unstructured input string into a sequence of discrete elements suitable for a machine learning (ML) model.
We choose WordPiece to reduce the vocabulary and handle OOV words. WordPiece first includes all characters and symbols in its basic vocabulary S. A language model is then used to calculate the likelihood of a sentence. Defining j to denote the j-th group log sequence and k to denote the k-th log in the sequence, the log data consists of n sub-words t_1, t_2, ..., t_n, where t_i denotes a sub-word. If each sub-word occurs independently, the likelihood of the log data is equivalent to the product of the probabilities of all sub-words:
log P(t_1 t_2 ... t_n) = Σ_{i=1}^{n} log P(t_i)
Each time, WordPiece selects two sub-words t_x and t_y from the vocabulary and merges them into a new sub-word t_z, obtaining by merging a new word that is not in the word stock, and then recalculates the likelihood of the sentence. The change in the likelihood of the log data can be expressed as:
Δ = log P(t_z) − ( log P(t_x) + log P(t_y) ) = log [ P(t_z) / ( P(t_x) · P(t_y) ) ]
from the formula, the change of likelihood value is the mutual information between two words. WordPieces select two words that merge each time, they have the largest mutual information value, i.e., the two words have a strong relevance on the language model. The sentences correspond to log sentences, and the composition is shown in figure 1. Words are words in the log statement that are only part of the log statement. The important meanings in a single log statement are typically expressed by these words. The phrase: after preprocessing, the single log sentence removes some unimportant numbers or symbols in the sentence, splits the sentence into individual words, and the words together form a phrase similar to { a word, b word, c word … … }.
Starting from the basic vocabulary, the language model is trained and the adjacent sub-words that maximize the language-model likelihood are selected and added to the vocabulary. Merging words yields new words that are not in the word stock; these are added to the vocabulary and the language model is trained again on the new vocabulary. These steps are repeated until the desired vocabulary size is reached. In this way, the number of OOV words can be effectively reduced; for example, the rare word "blockmanager" is split into the more frequent sub-words {"block", "manager"}, whose meaning can still be fully captured. The vocabulary can thereby be reduced, and many similar and semantically close words in the vocabulary can be merged. Here the language model refers to the overall anomaly detection training model, and WordPiece is the tokenizer: when a sub-word is to be added to a phrase, whether its influence on the overall language model is positive or negative determines whether it is actually added.
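To make the merge criterion above concrete, here is a small pure-Python sketch that scores adjacent sub-word pairs by the likelihood gain (mutual information) described above; the toy corpus, the counting scheme and the function name best_merge are illustrative assumptions, not the actual WordPiece implementation used by BERT:

```python
import math
from collections import Counter

def best_merge(tokenized_corpus: list[list[str]]) -> tuple[tuple[str, str], float]:
    """Return the adjacent sub-word pair whose merge gives the largest likelihood gain
    log P(t_z) - log P(t_x) - log P(t_y), i.e. the pair with the largest mutual information."""
    unigrams = Counter(t for sent in tokenized_corpus for t in sent)
    pairs = Counter((s[i], s[i + 1]) for s in tokenized_corpus for i in range(len(s) - 1))
    total = sum(unigrams.values())

    def gain(pair):
        x, y = pair
        p_xy = pairs[pair] / total   # estimated probability of the merged sub-word t_z
        return math.log(p_xy / ((unigrams[x] / total) * (unigrams[y] / total)))

    best = max(pairs, key=gain)
    return best, gain(best)

corpus = [["block", "manager"], ["block", "manager"], ["found", "block", "locally"]]
print(best_merge(corpus))   # the adjacent pair with the highest mutual information and its score
```

The best-scoring pair is merged into a new vocabulary entry, the corpus is re-tokenized, and the procedure is repeated until the desired vocabulary size is reached.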
1.2.2 BERT Encoder
The log format is affected by different specification requirements. To preserve the key meaning of log statements, word embedding is introduced to encode the templates as vectors. To overcome the challenges of ambiguous words and changing events in log data, an advanced word embedding method is needed. Pre-trained language models have made considerable progress in the NLP field, in particular BERT developed by Google, a pre-trained language model trained on the Wikipedia corpus and a book corpus. Compared with other embedding methods, large pre-trained language models provide a sufficiently large word database to encode words more accurately.
After tokenization, the set of words and sub-words is passed to the BERT model and encoded into a vector representation with a fixed dimension. LogFormer uses the BERT-base model, which contains 12 layers of Transformer encoders with 768 hidden units per encoder layer. Each Transformer layer includes a multi-head self-attention sub-layer and a position-wise feed-forward sub-layer; residual connections are employed in both sub-layers, followed by layer normalization. A complete Transformer includes a Transformer encoder and a Transformer decoder, but only the Transformer encoder is used herein, so Transformer layer = Transformer encoder in this description.
FIG. 2 shows the feature extraction part of BERT. The log is first tokenized into M tokens (Tok denotes a token). BERT adds a [CLS] tag at the beginning of the sentence, which marks the starting position of the sentence. The embedding layer generates an embedded vector E_i for [CLS] and each token, where i refers to the i-th sub-word in the sentence. The embedded vectors E_i are then fed to the Transformer encoder (TM) as model input. Each layer generates an embedding for each sub-word in the log message. In our work, the word embeddings generated by the last encoding layer of BERT are used. The semantic vector of a log message, E_semantic, is computed as the average of the word embeddings of its sub-words.
Since any word not present in the vocabulary (i.e., an OOV word) is decomposed into sub-words, BERT can learn the representation vector of an OOV word based on the meanings of the sub-word set. Furthermore, the position embedding layer allows BERT to capture the representation of a word in its context within the log message. BERT also contains a self-attention mechanism that, when processing a log statement, relates every word in the statement to the other words, effectively measuring the importance of each word in the sentence.
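As a sketch of this encoding step, the following uses the Hugging Face transformers library with a generic bert-base-uncased checkpoint; the checkpoint name and the mean-pooling helper log_to_semantic_vector are assumptions for illustration, since the text does not specify a particular implementation:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")   # WordPiece tokenizer
bert = AutoModel.from_pretrained("bert-base-uncased")            # 12-layer Transformer encoder, 768 hidden units

def log_to_semantic_vector(log_message: str) -> torch.Tensor:
    """Encode one preprocessed log message into a fixed-dimension semantic vector (E_semantic)."""
    inputs = tokenizer(log_message, return_tensors="pt", truncation=True)
    with torch.no_grad():
        last_hidden = bert(**inputs).last_hidden_state           # (1, num_subwords, 768), last encoding layer
    return last_hidden.mean(dim=1).squeeze(0)                    # average the sub-word embeddings -> (768,)

e_semantic = log_to_semantic_vector("info storage blockmanager found block locally")
print(e_semantic.shape)   # torch.Size([768])
```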
1.3 design classifier
After word embedding, the sub-words of each log message are converted into a semantic vector E_semantic, and the semantic vectors of the log messages are then taken as the input (i.e., X = {x_1, x_2, ..., x_n}). To better understand the semantic information of the log data, LogFormer employs a classification model based on a multi-head self-attention mechanism and convolution layers to detect various anomalies.
1.3.1 Relative position
The order of the log sequence conveys important information for the anomaly detection task. The BERT encoder represents each log message as a fixed-dimension vector, and log messages with similar meanings are closer to each other. However, these vectors do not contain the relative position information of the log messages within the log sequence. The LogFormer model therefore inputs the sum of the position embedding and the log sequence embedding into the classifier model. The position embedding T ∈ R^{T×d} is generated by encoding the position information of the log sequence using sine functions of different frequencies:
PE(pos, 2r) = sin( pos / 10000^{2r/d} ),   PE(pos, 2r+1) = cos( pos / 10000^{2r/d} )
where pos is the position in the log sequence and r indexes the r-th dimension of the d-dimensional embedding. Each dimension of the position code corresponds to a sinusoid, and the wavelengths of the trigonometric functions differ across dimensions, ranging in geometric progression from 2π to 10000 × 2π.
A sinusoidal encoder is applied to each position k in the log sequence X, generating a position embedding PE(k) using the sin and cos functions; PE(k) is then added to the semantic vector at position k:
x_k' = x_k + PE(k)
where x_k is the semantic vector E_semantic of the k-th log in the j-th group of log sequences. In this way, the model can learn the relative position information of each log message in the sequence and can distinguish between log messages at different positions. In the next step, the position-enhanced vectors are fed into the classification model (see FIG. 4, step 3).
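A minimal sketch of this positional encoding and its addition to the semantic vectors (standard sinusoidal encoding in NumPy; the function name and the example sequence length are illustrative):

```python
import numpy as np

def sinusoidal_position_embedding(seq_len: int, d: int) -> np.ndarray:
    """Position embedding in R^{seq_len x d}: sin on even dimensions, cos on odd dimensions."""
    pos = np.arange(seq_len)[:, None]            # position index in the log sequence
    r = np.arange(d // 2)[None, :]               # dimension index of the d-dimensional embedding
    angle = pos / np.power(10000.0, 2 * r / d)   # wavelengths from 2*pi to 10000*2*pi
    pe = np.zeros((seq_len, d))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

# Add the position embedding to the semantic vectors of a log sequence of length 20:
semantic_vectors = np.random.randn(20, 768)      # stand-in for the BERT outputs E_semantic
model_input = semantic_vectors + sinusoidal_position_embedding(20, 768)
```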
1.3.2 Classifier
In conventional Transformers with post-LN, the gradient of the output layer is relatively large, which leads to training instability if a learning-rate warm-up strategy is not employed. The cause of the unstable training is that, in the early stage of training, the residual connection results in a large offset in the output. In the LogFormer model, LN normalization is therefore applied to the data first, so that the contribution of the residual branch is controlled and training stability is ensured.
In the LogFormer module, the classifier is designed as an Attention block and a position feed-forward layer. The main module, the Attention block, is a hybrid structure: a multi-head self-attention mechanism captures the global context information, and a convolution layer extracts the local context information. Finally, an add operation is applied to the global and local contexts to extract the global-local context. The given input, added to the position embedding, is fed to the classifier model as in FIG. 3.
After the data first enter the LN layer, the hidden dimensions are normalized, i.e., the operation is performed over the different features of a single sample. This method is not affected by the batch size and ensures that the semantic vectors converted from the words of each sequence are on the same scale. Given the input representation of the log template sequence, the mean is subtracted from the input, the result is divided by the standard deviation, and a linear mapping is applied. With H hidden nodes and network layer l, the normalization statistics μ and σ of LN are:
μ^l = (1/H) Σ_{i=1}^{H} x_i^l,    σ^l = sqrt( (1/H) Σ_{i=1}^{H} (x_i^l − μ^l)^2 )
the statistics are calculated independently of the number of samples, and the number of the statistics is only determined by the number of hidden nodes. Normalized values:
where ε is a very small quantity (default 10^-5) used to prevent the denominator from being zero, and γ and β are two trainable parameters that ensure the normalization operation does not destroy the previously learned information.
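A small sketch of this layer normalization over the hidden dimension (PyTorch; functionally the same as torch.nn.LayerNorm, written out explicitly here to mirror the formulas above):

```python
import torch

def layer_norm(x: torch.Tensor, gamma: torch.Tensor, beta: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Normalize each sample over its H hidden features, then apply the trainable scale and shift."""
    mu = x.mean(dim=-1, keepdim=True)                       # mean over the hidden dimension
    var = x.var(dim=-1, unbiased=False, keepdim=True)       # variance over the hidden dimension
    return gamma * (x - mu) / torch.sqrt(var + eps) + beta  # eps keeps the denominator away from zero

h = 768
x = torch.randn(4, 20, h)                                   # (batch, sequence length, hidden nodes H)
out = layer_norm(x, gamma=torch.ones(h), beta=torch.zeros(h))
```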
The main module, the Attention block, constructs two branches: a local branch and a global branch. Multi-head self-attention is employed in the Multi-head Self-attention module; multi-head attention uses L parallel self-attention heads to jointly capture information at different positions and in different subspaces of the log sequence. The h-th attention head of the attention layer is defined as:
head_h = softmax( (Y_j W_h^Q)(Y_j W_h^K)^T / sqrt(d_v) ) (Y_j W_h^V)
wherein d is v Is the dimension of one header of the attention layer,is the linear projection weight on the h attention head, the dimension is +.>The attention mechanism enables logforce to capture long-term dependencies between log sequences, obtain an attention score matrix for each log message for different attention patterns, for higher efficiency and stronger sequence modeling to extract global context. Multi-head attention connects L parallel heads together as follows:
F(Y_j) = Concat(head_1, ..., head_L) W_o    (1)
where W_o ∈ R^{(L·d_v)×d_o} is a projection matrix and d_o is the dimension of the multi-head attention sub-layer output.
The locally enhanced local branch employs two parallel convolution layers with kernel size 1 to extract the local context. The feature detection layers of the CNN learn from the training data; because the neurons on the same feature map share the same weights, the network can learn in parallel, and local weight sharing allows local context information to be extracted efficiently. Two normalization operations are added after the convolution layers. Finally, the global and local training results are summed to combine the extracted features, and the resulting global-local information is used as the output.
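A condensed PyTorch sketch of this Attention block is given below: pre-LN normalization, a multi-head self-attention global branch, a kernel-size-1 convolution local branch, and an add operation combining the two. Layer sizes, the number of heads and the class name are illustrative assumptions, not the exact configuration of the original model:

```python
import torch
import torch.nn as nn

class GlobalLocalAttentionBlock(nn.Module):
    """Multi-head self-attention (global branch) + kernel-1 convolutions (local branch), combined by add."""
    def __init__(self, d_model: int = 768, num_heads: int = 8):
        super().__init__()
        self.ln = nn.LayerNorm(d_model)                                         # pre-LN normalization
        self.mha = nn.MultiheadAttention(d_model, num_heads, batch_first=True)  # global branch
        self.conv1 = nn.Conv1d(d_model, d_model, kernel_size=1)                 # local branch: two parallel
        self.conv2 = nn.Conv1d(d_model, d_model, kernel_size=1)                 # kernel-size-1 convolutions
        self.local_norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:         # x: (batch, seq_len, d_model)
        y = self.ln(x)                                           # normalize first to stabilize training
        global_ctx, _ = self.mha(y, y, y)                        # multi-head self-attention over the sequence
        local = self.conv1(y.transpose(1, 2)) + self.conv2(y.transpose(1, 2))
        local_ctx = self.local_norm(local.transpose(1, 2))       # normalization after the convolutions
        return x + global_ctx + local_ctx                        # add: global-local context (with residual)

out = GlobalLocalAttentionBlock()(torch.randn(4, 20, 768))       # (batch, sequence length, hidden)
```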
The output then passes through layer normalization and enters the position feed-forward layer; the inter-layer features are connected into a feed-forward network comprising two fully connected layers, each with a ReLU activation function. Mapping the data first to a high-dimensional space and then back to a low-dimensional space allows more abstract features to be learned. Through normalization, the data is moved into the active region of the activation function ReLU for nonlinear mapping learning; the important parts are strengthened and the unimportant parts are weakened, which strengthens the expressive power of a word and its influence on other words, avoids large offsets in the early stage of training, and accelerates the convergence of the training model.
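The position feed-forward layer described above can be sketched as follows; the expansion factor of 4 and the placement of the activation follow the common Transformer convention and are assumptions, since the text does not give the exact dimensions:

```python
import torch.nn as nn

position_feed_forward = nn.Sequential(
    nn.LayerNorm(768),         # normalization moves the data into the active region of ReLU
    nn.Linear(768, 4 * 768),   # map to a higher-dimensional space
    nn.ReLU(),                 # nonlinear mapping
    nn.Linear(4 * 768, 768),   # map back to the lower-dimensional space
)
```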
FIG. 4 shows the framework of LogFormer. Finally, the output of the classifier model is fed to a pooling layer, a dropout layer and a fully connected layer. Normal/abnormal log sequences are identified using the Softmax function:
σ(z)_m = exp(z_m) / Σ_{c=1}^{C} exp(z_c)
where m denotes the index of the output node; Softmax "compresses" the C-dimensional vector z into another C-dimensional real vector σ(z) such that 0 < σ(z)_m < 1 and Σ_m σ(z)_m = 1. The cross-entropy classification loss computed on the Softmax result is:
Loss = − Σ_{c=1}^{C} y_c log( σ(z)_c )
where y_c is the true label of the sample. The value inside the log is the Softmax value of the correct class of the data; the larger its proportion, the smaller the loss of the sample, which matches the intended behavior.
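As a sketch of this output stage (pooling, dropout, fully connected layer, Softmax and cross-entropy loss); in PyTorch, nn.CrossEntropyLoss combines the Softmax and the cross-entropy formula above, and the dropout rate shown is an illustrative value:

```python
import torch
import torch.nn as nn

d_model, num_classes = 768, 2                        # C = 2 classes: normal / abnormal
dropout = nn.Dropout(p=0.1)                          # dropout (discard) layer
fc = nn.Linear(d_model, num_classes)                 # fully connected layer
loss_fn = nn.CrossEntropyLoss()                      # Softmax + cross-entropy in one step

features = torch.randn(4, 20, d_model)               # classifier-model output for a batch of log sequences
pooled = features.mean(dim=1)                        # pooling layer over the sequence dimension
logits = fc(dropout(pooled))                         # (batch, 2)
probs = torch.softmax(logits, dim=-1)                # sigma(z): entries in (0, 1), summing to 1 per sample
loss = loss_fn(logits, torch.tensor([0, 1, 0, 0]))   # labels: 0 = normal, 1 = abnormal
```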
1.4 anomaly detection
Following the above steps, we can train the classifier model for log-based anomaly detection. When a new group of log messages arrives, LogFormer first preprocesses them. The new log messages are then converted into semantic vectors. The log sequence, represented as a list of semantic vectors, is fed into the trained model. Finally, whether the log sequence is abnormal can be predicted by the classifier model.
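Putting the steps together, inference on a new group of log messages can be sketched as follows; the helper functions reuse the illustrative sketches above (preprocess_log, log_to_semantic_vector, sinusoidal_position_embedding) and the trained classifier is passed in, so this is a pipeline outline rather than the exact implementation:

```python
import torch

def detect_anomaly(raw_log_messages: list[str], classifier_model: torch.nn.Module) -> bool:
    """Preprocess -> semantic vectors -> add position embedding -> classifier; returns True if abnormal."""
    cleaned = [" ".join(preprocess_log(m)) for m in raw_log_messages]            # step S1: preprocessing
    vectors = torch.stack([log_to_semantic_vector(c) for c in cleaned])          # step S2: semantic vectors (T, 768)
    pe = torch.tensor(sinusoidal_position_embedding(len(vectors), vectors.shape[1]), dtype=torch.float32)
    with torch.no_grad():
        logits = classifier_model((vectors + pe).unsqueeze(0))                   # step S3: classification, (1, 2)
    return bool(logits.argmax(dim=-1).item() == 1)                               # class 1 = abnormal
```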
Experimental results and analysis
Experimental data sets: experiments were performed on the HDFS, BGL, Spirit and Thunderbird datasets. The details of these datasets are as follows. HDFS dataset: the HDFS dataset contains 11,175,629 logs generated by Hadoop, collected from more than 200 Amazon EC2 nodes. The sessions of the HDFS dataset are grouped by blockID, i.e., one blockID corresponds to one session. Among the 11,175,629 logs there are 575,061 sessions in total, which have been labeled as normal or abnormal by Hadoop domain experts. The number of normal sessions is 558,223 and the number of abnormal sessions is 16,838.
BGL dataset: the BGL dataset was generated by a Blue Gene/L supercomputer consisting of 128K processors and deployed in the lorensliefremor national laboratory (LLNL). The BGL dataset contained 4,747,963 logs, 348,460 exceptions. The specific experimental data set settings are shown in table 3.1:
TABLE 3.1 Experimental dataset information
Evaluation index
Log anomaly detection is essentially a binary classification problem. Considering the imbalance of the log data sets, in which normal logs greatly outnumber abnormal logs, we evaluate the effectiveness of LogFormer in log sequence anomaly detection using precision, recall and F1-Score.
Precision: the percentage of detected anomalies that are true log sequence anomalies.
Recall: the percentage of true log sequence anomalies that are successfully detected.
F1-Score (F1): the harmonic mean of precision and recall; more generally, F_β = (1 + β²) · precision · recall / (β² · precision + recall).
where β is a parameter: 0 < β < 1 gives greater weight to precision, and β > 1 gives greater weight to recall. In general, β is set to 1, i.e., precision and recall are weighted equally, and we set β = 1 here.
TP (true positive) is the number of abnormal log sequences correctly detected as abnormal, FP (false positive) is the number of normal log sequences incorrectly detected as abnormal, FN (false negative) is the number of abnormal log sequences incorrectly detected as normal, and TN (true negative) is the number of normal log sequences correctly identified as normal.
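These metrics follow directly from the four counts; a minimal sketch:

```python
def evaluate(tp: int, fp: int, fn: int, beta: float = 1.0) -> tuple[float, float, float]:
    """Precision, recall and F-beta score computed from the confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_beta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, f_beta

print(evaluate(tp=90, fp=10, fn=20))   # illustrative counts, not experimental results
```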
Experimental results
In this section, verification is performed on the HDFS and BGL datasets. We compare LogFormer with the existing methods SVM, PCA, LogClustering, LogRobust, DeepLog (a log-key-sequence-based method) and LogAnomaly (a log-template-based method). Table 3.2 shows the experimental performance results on the HDFS dataset. Table 3.3 shows the experimental performance results on the BGL dataset.
TABLE 3.2 experimental results on HDFS data sets
TABLE 3.3 experimental results on BGL dataset
As can be seen from the experimental results, the conventional methods PCA and LogCluster have lower performance on the recall and F1-Score indexes, while SVM, LogCluster and DeepLog perform better in precision but their F1-Score is not ideal. In addition, LogFormer achieves higher precision on both datasets, and its F1-Score is also at a higher level compared with the various methods, which shows that the semantic vectorization representation captures the semantics of log sequences better than template-based vectorization and demonstrates the effectiveness of the log anomaly detection method in a parsing-free mode.
It should be apparent that the specific implementation of the present invention is not limited to the modes described above; various insubstantial modifications made using the method concept and technical scheme of the present invention all fall within the scope of protection of the present invention.

Claims (10)

1. A log anomaly detection method based on semantic vectorization representation, characterized in that the method comprises the following steps:
s1, preprocessing log data to remove redundant information;
s2, capturing context representation of words in the log message, extracting semantic meaning of the original log message, and representing the semantic meaning as a semantic vector;
s3, classifying by adopting a classifier model based on a self-attention mechanism, and outputting a log abnormality detection result based on the classification result.
2. The log anomaly detection method based on semantic vectorization representation according to claim 1, wherein:
step S1 includes splitting original log information into log information by separator, converting uppercase letters in the log information into lowercase letters, deleting all non-characters in the log information, and finally obtaining a group of words corresponding to the log information, wherein each word in each group of words is called a token.
3. The log anomaly detection method based on semantic vectorization representation according to claim 2, wherein:
the original log message is a semi-structured text comprising a header and content; the message header includes a timestamp, a verbosity level representing the severity level of the event, and a component, and the log content consists of a constant portion and a variable portion.
4. A log anomaly detection method based on semantic vectorized representation as claimed in any one of claims 1 to 3, wherein:
in step S2, a BERT model is used to capture the contextual representation of the words in the log message, extract the semantic meaning of the original log message, and represent it as a semantic vector.
5. The log anomaly detection method based on semantic vectorization representation according to claim 4, wherein:
the step S2 includes:
converting text in the log message into words by using a vocabulary, tokenizing text that is not in the vocabulary into sub-words by using a WordPiece module, forming a set of words and sub-words, which is then fed into the BERT model and encoded into a vector representation with a fixed dimension.
6. The log anomaly detection method based on semantic vectorization representation according to claim 5, wherein:
the WordPiece module is configured as follows: WordPiece first includes all characters and symbols in its basic vocabulary S; a language model is then used to calculate the likelihood of the sentence. Defining j to denote the j-th group log sequence and k to denote the k-th log in the sequence, the log data consists of n sub-words t_1, t_2, ..., t_n, where t_i denotes a sub-word. If each sub-word occurs independently, the likelihood of the log data is equivalent to the product of the probabilities of all sub-words:
log P(t_1 t_2 ... t_n) = Σ_{i=1}^{n} log P(t_i)
Each time, WordPiece selects two sub-words t_x and t_y from the vocabulary and merges them into a new sub-word t_z, obtaining by merging a new word that is not in the word stock, and then recalculates the likelihood of the sentence. The change in the likelihood of the log data can be expressed as:
Δ = log P(t_z) − ( log P(t_x) + log P(t_y) ) = log [ P(t_z) / ( P(t_x) · P(t_y) ) ]
the change in the likelihood is the mutual information between the two sub-words; each time, WordPiece merges the pair of sub-words with the largest mutual information value, i.e., the two sub-words with the strongest association under the language model. Starting from the basic vocabulary, the language model is trained and the adjacent sub-words that maximize the language-model likelihood are selected and added to the vocabulary;
after merging the words, new words that are not in the word stock are obtained and added to the vocabulary, and the language model is trained again on the new vocabulary; these steps are repeated until the desired vocabulary size is reached.
7. The log anomaly detection method based on semantic vectorization representation according to claim 4, wherein:
the BERT model uses the BERT-base model: the input set of words and sub-words is encoded with a 12-layer Transformer encoder with 768 hidden units per layer, each Transformer layer comprising a multi-head self-attention sub-layer and a position-wise feed-forward sub-layer, wherein residual connections are employed in both sub-layers, followed by layer normalization.
8. A method for log anomaly detection based on semantic vectorized representation as claimed in any one of claims 1 to 7 wherein:
the semantic vector corresponding to the fixed-dimension vector representation output by the BERT encoder is input into a classification model based on a multi-head self-attention mechanism and convolution layers to detect anomalies in the log messages.
9. The method for detecting log anomalies based on semantic vectorization representation according to claim 8, wherein:
the sum of the position embedding of the log sequence and the semantic vectors of the log sequence is input into the classifier model for anomaly detection.
10. The method for detecting log anomalies based on semantic vectorization representation according to claim 9, wherein:
the classifier model comprises an Attention block and a position feed-forward layer. The Attention block is a hybrid structure: a multi-head self-attention mechanism captures global context information, and a convolution layer extracts local context information; finally, an add operation is applied to the global and local contexts to extract the global-local context as output. The output then passes through layer normalization and enters the position feed-forward layer, whose inter-layer features are connected into a feed-forward network comprising two fully connected layers, each with a ReLU activation function; the data is first mapped to a high-dimensional space and then mapped back to a low-dimensional space. Through normalization, the data is moved into the active region of the ReLU activation function for nonlinear mapping learning. Finally, the output of the classifier model is fed to a pooling layer, a dropout layer and a fully connected layer, and normal/abnormal log sequences are identified using a Softmax function.
CN202311611947.1A 2023-11-29 2023-11-29 Log anomaly detection method based on semantic vectorization representation Pending CN117688488A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311611947.1A CN117688488A (en) 2023-11-29 2023-11-29 Log anomaly detection method based on semantic vectorization representation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311611947.1A CN117688488A (en) 2023-11-29 2023-11-29 Log anomaly detection method based on semantic vectorization representation

Publications (1)

Publication Number Publication Date
CN117688488A true CN117688488A (en) 2024-03-12

Family

ID=90129279

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311611947.1A Pending CN117688488A (en) 2023-11-29 2023-11-29 Log anomaly detection method based on semantic vectorization representation

Country Status (1)

Country Link
CN (1) CN117688488A (en)

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination