CN113312447B - Semi-supervised log anomaly detection method based on probability label estimation


Info

Publication number
CN113312447B
Authority
CN
China
Prior art keywords
log
vector
label
probability
network model
Prior art date: 2021-03-10
Legal status
Active
Application number
CN202110261887.XA
Other languages
Chinese (zh)
Other versions
CN113312447A (en)
Inventor
杨林
于瑞国
陈俊洁
王赞
王维靖
姜佳君
Current Assignee
Tianjin University
Original Assignee
Tianjin University
Priority date: 2021-03-10
Filing date: 2021-03-10
Publication date: 2022-07-12
Application filed by Tianjin University
Priority to CN202110261887.XA
Publication of CN113312447A
Application granted
Publication of CN113312447B
Legal status: Active


Classifications

    • G06F 16/3344: Query execution using natural language analysis
    • G06F 16/3346: Query execution using probabilistic model
    • G06F 16/35: Clustering; Classification (information retrieval of unstructured textual data)
    • G06N 3/04: Neural networks; Architecture, e.g. interconnection topology
    • G06N 3/08: Neural networks; Learning methods


Abstract

The invention discloses a semi-supervised log anomaly detection method based on probability label estimation, comprising the following steps. Step 1: vectorize the given log event data to be detected. Step 2: cluster the given input logs and automatically label them with probability labels. Step 3: train a GRU network model on training data comprising the known normal log event data and the probability-labeled log event data obtained in step 2, and detect log anomalies with the trained GRU model. Compared with the prior art, the method 1) quickly and accurately detects anomalies that may arise while a software system is running, improving the stability of the system; 2) constructs a model with strong robustness and wide applicability; and 3) avoids the false alarms and the invalidation of historical information caused by system evolution and log evolution, guaranteeing the effect of the whole log anomaly detection method.

Description

Semi-supervised log anomaly detection method based on probability label estimation
Technical Field
The invention relates to the field of operation and maintenance of computer software, in particular to a log-based system anomaly detection method.
Background
Regarding system anomaly detection based on log analysis: systems often produce a large number of logs at runtime. A log mainly comprises a logical description of system operation and a state description of system operation. The logical description of the system at runtime is represented as a series of system "events", such as calls to a certain module, accesses to a certain external interface, and interactions with a database. This series of events describes the different phases the system passes through while providing a certain service. The state description of the running system is represented by a set of "parameters", such as system CPU usage, memory usage, and the volume of HTTP request bodies and response packets. These parameters describe the state of the whole system when it runs to a specific stage and constitute a quantitative description of system operation. When the system is in an online operating state, anomalies may arise for various reasons, the most common being external network attacks, abnormal traffic, and bugs or defects in the software itself. Currently, the industry mainly adopts log-based anomaly detection: after a system anomaly occurs, the staff responsible for system operation and maintenance manually extract the relevant logs from the system for manual analysis, and, through analysis of the log events and log variables, locate the anomaly and submit it to the responsible team for further diagnosis and repair. This widely adopted practice has two shortcomings. First, log anomaly detection requires a large amount of domain knowledge for support: only a worker who deeply understands the whole system can judge whether an anomaly has occurred by analyzing the description of the system's running state in the logs. Second, manual anomaly detection often lags behind: owing to the complexity of the system structure, a system anomaly often causes visible problems only after the system has run for some time, at which point maintenance personnel are notified to analyze its location and cause. If mitigation or repair of an anomaly is delayed for a long time, huge losses are caused to the company.
Regarding semi-supervised machine learning methods: machine learning methods can currently be divided into two broad categories, supervised and unsupervised. Supervised machine learning uses labeled training data and fits the features of the training data through training, reducing the difference between model predictions and labels. In contrast, unsupervised learning uses unlabeled training data in order to learn static statistical features in the data, through techniques such as clustering and principal component analysis. Compared with unsupervised methods, supervised methods fit the training data better and achieve better final results. Unsupervised methods, however, benefit from not requiring labeling of the training data; in situations such as log analysis, where labeling the data wastes a great deal of resources or is almost impossible, unsupervised methods are considered first.
In addition to the above two types of machine learning methods, many researchers in recent years have explored how to combine the advantages of supervised and unsupervised methods, developing semi-supervised machine learning methods. The core idea of semi-supervised machine learning is to build a learner using model assumptions about the data distribution and to label the unlabeled data. The advantage is that, using only a small amount of labeled data, a model effect similar to that of supervised learning can be obtained, reducing the dependence of supervised learning on data labeling. Meanwhile, the final goal of most semi-supervised learning methods is to train, by some means, a supervised learning model, so the overall model effect can be guaranteed.
Deep Learning (DL) is one of the most popular machine learning methods of recent years and has been widely applied in many fields. In the field of software engineering, deep learning models have also been studied intensively. At present, in fields such as log analysis, the most widely used model is the Long Short-Term Memory network (LSTM). The LSTM is a recurrent neural network that retains the state information learned by the model through looping and iteration, and is mainly used for processing natural-language text and similar data. Owing to the high degree of similarity and the inherent latent relationship between logs and natural language, LSTM has found its primary application in the field of log analysis.
Currently, there are three main types of methods in the field of system anomaly detection based on log analysis: (1) unsupervised learning methods represented by PCA and LogCluster; (2) supervised learning algorithms represented by LogRobust; and (3) semi-supervised learning methods represented by DeepLog and LogAnomaly. However, the research field of log anomaly detection still faces the following challenges:
1) Existing unsupervised learning methods, which take the frequency with which log events appear in a log sequence as the index of whether the system is abnormal, cannot achieve high coverage over all anomaly types. Detecting system anomalies by judging differences in the number and types of log events in the static distributions of normal and abnormal logs has the following defects. First, the method ignores the sequential relation of log events in the time dimension; this relation represents the flow of the running system, so the method cannot detect system anomalies caused by out-of-order log events. Second, using the frequency of occurrence of log events as a feature cannot accommodate the evolution of the logs: when a new log appears in the logs under test, or a log changes, the method can capture only part of the log events and therefore cannot make an accurate judgment.
2) Existing semi-supervised learning methods perform clustering or learn the normal state of the system from unlabeled data or from labeled normal log data only; these methods perform poorly and often generate a large number of false alarms. Limited by the quantity and quality of the normal logs, the model may generate many false positives. The false-alarm problem becomes more serious when the logs evolve: any newly generated or changed log may be judged abnormal and reported, which can even increase the workload of system operation and maintenance personnel to a certain extent.
3) Supervised machine learning methods, represented by LogRobust, require a large amount of data with normal and abnormal labels for fitting. The quantity and quality of the training data directly determine the effect of the final model, yet high-quality annotated data are very difficult to obtain in large quantities. The difficulty lies mainly in two aspects. First, when an anomaly occurs, the complexity of the system and the high concurrency of threads cause a very large number of logs to be generated at the same time, and distinguishing and labeling them one by one consumes a great deal of labor. Second, owing to log evolution, especially under the rapid development of today's "cloud services", where the microservice architecture is widely applied and system evolution is further accelerated, the features in historical logs often become invalid within a short time; keeping a supervised learning method effective therefore requires a large amount of continuous manual labeling.
How to overcome the above three challenges is a technical problem urgently awaiting a solution in the field.
Disclosure of Invention
The invention provides a semi-supervised log anomaly detection method based on probability label estimation, aiming to break the dilemma that existing log anomaly detection depends on a large amount of manually labeled data and to address the challenge of model false alarms caused by log evolution.
The semi-supervised log anomaly detection method based on probability label estimation of the invention comprises the following steps:
step 1, vectorizing log data to be detected:
For a given log, the invention first extracts the log events through log parsing, i.e., the natural-language descriptions of the system's running logic in the log. Second, drawing on publicly available pre-trained word vectors from natural language processing research at home and abroad, a term frequency-inverse document frequency (TF-IDF) weighted summation is performed over the words in each log event. TF-IDF was originally a technique for extracting keywords from natural-language paragraphs: the TF-IDF score is obtained by multiplying the frequency of a word in a given paragraph by the inverse document frequency of the word over the whole natural-language corpus. The higher the TF-IDF score, the higher the weight of the corresponding word in its paragraph, meaning the word contributes more to the semantics of the whole paragraph. For a word in a log event, the invention multiplies the frequency of the word in the log event by the inverse document frequency of the word over all non-repeating log events to obtain a weight score; this score is applied to the original semantic vector of the word, and finally the weighted vectors of all words in the log event are summed to obtain the vector of the log event. The specific processing and formula of log vectorization are as follows:
In order to accurately extract the natural-language semantics contained in a log, and considering the influence of feature information such as keywords on semantic content, the keyword extraction technique TF-IDF (term frequency-inverse document frequency) from natural language processing is borrowed and improved according to the characteristics of log events. For each word in a log event, its TF-IDF weight with respect to the log event is computed from the frequency of the word within the log event and the frequency of the word across all different logs in the training set; the semantic vectors of all words in the log event are then summed with their TF-IDF weights, finally yielding the semantic vector representation of the log event. For a log event E = {w_1, …, w_i, …, w_n}, where w_i denotes a word in the log event, the log event vector V_E is calculated as shown in equation (1):
$$V_E = \sum_{i=1}^{n} TF(w_i)\,IDF(w_i)\,embed(w_i), \qquad TF(w) = \frac{\#w}{len(E)}, \qquad IDF(w) = \log\frac{\#L}{\#L_w} \tag{1}$$
where #w denotes the number of occurrences of the target word in the log event, #L denotes the number of all non-repeating log events in the training set, #L_w denotes the number of log events in the training set containing the target word w, len(E) denotes the length of the whole log event, i.e., the total number of words it contains, embed(w) denotes the semantic vector of the target word w in natural language, TF(w) denotes the frequency of a word within the log event, and IDF(w) denotes the inverse document frequency of a word over all non-repeating log events; the result is the vector of the log event.
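For illustration, the following is a minimal Python sketch of this vectorization step according to equation (1); the tokenized event, the list of distinct training events, the embedding dictionary of pre-trained word vectors, and the vector dimension are assumed inputs of the example, not prescribed by the invention.

```python
import math
import numpy as np

def event_vector(event, train_events, embeddings, dim=300):
    """TF-IDF weighted sum of word vectors for one log event (equation (1)).
    `event` is a list of tokens, `train_events` the distinct training events,
    `embeddings` a dict mapping words to pre-trained vectors."""
    num_events = len(train_events)                        # #L
    vec = np.zeros(dim)
    for w in event:
        tf = event.count(w) / len(event)                  # TF(w) = #w / len(E)
        n_l_w = sum(w in e for e in train_events)         # #L_w
        idf = math.log(num_events / max(n_l_w, 1))        # IDF(w)
        vec += tf * idf * embeddings.get(w, np.zeros(dim))
    return vec
```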
Step 2, automatically labeling the given input logs with probability labels:
step 2-1, with the help of the logs generated while the system runs normally, the semantic information contained in the normal logs is obtained through analysis, and the normal logs are separated from the abnormal logs by an unsupervised clustering algorithm, yielding labeled normal logs and "suspected abnormal" log sequences;
The unsupervised clustering algorithm adopted is the hierarchical density-based clustering algorithm HDBSCAN. The final state of a log is either normal or abnormal; an unknown label is represented by a neutral position set between the two. The non-outlier degree predicted by HDBSCAN is calculated from the outlier score, as shown in equation (2):
$$P(x) = \left|\, y - \frac{o(x)}{2} \,\right| \tag{2}$$

where y ∈ {0, 1} is the cluster label assigned to the data x to be labeled and o(x) ∈ [0, 1] is the outlier score given by HDBSCAN.
Equation (2) states that, according to the label y and the outlier score given after HDBSCAN clustering of the data x to be labeled, the original hard label value of 0 or 1 is corrected to a number between 0 and 1 according to the outlier score; this corrected value is the probability label P. When the outlier score is 1, i.e., the point is an edge point of its cluster, the point should be considered neutral, and the probability label value becomes 0.5.
The non-outlier degree predicted by HDBSCAN obtained above is also the final labeling result of the HDBSCAN clustering.
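The following is a minimal Python sketch of this labeling stage, assuming the open-source hdbscan package; the min_cluster_size value is an assumption of the example. The rule that a cluster containing any known-normal sequence is treated as normal follows the claim below.

```python
import numpy as np
import hdbscan  # open-source HDBSCAN implementation

def probabilistic_labels(X, known_normal_idx):
    """Cluster sequence vectors X and soften the labels per equation (2).
    `known_normal_idx` indexes the sequences known to be normal."""
    clusterer = hdbscan.HDBSCAN(min_cluster_size=10).fit(X)  # size is assumed
    outlier = np.nan_to_num(clusterer.outlier_scores_)       # o(x) in [0, 1]
    # A cluster containing any known-normal sequence is normal (y = 0);
    # all other clusters, and noise points (-1), are suspected abnormal (y = 1).
    normal_clusters = set(clusterer.labels_[known_normal_idx]) - {-1}
    y = np.array([0 if c in normal_clusters else 1 for c in clusterer.labels_])
    return np.abs(y - outlier / 2.0)                         # probability label P
```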
Step 3, training a GRU network model on training data comprising the known normal log event data and the probability-labeled log event data obtained in step 2, and performing log anomaly detection with the GRU network model:
For any given log sequence S = {e_1, …, e_t, …, e_n}, where e_t denotes the log event at the t-th position in the log sequence, the iteration process of the GRU network model is as follows:
For a time step t, the value z_t of the "update gate" unit and the value r_t of the "reset gate" unit are first calculated from the hidden state at time t-1, as shown in equation (3):
$$z_t = \sigma(W_z x_t + U_z h_{t-1}), \qquad r_t = \sigma(W_r x_t + U_r h_{t-1}) \tag{3}$$

where σ(·) denotes the sigmoid function.
Here x_t denotes the input variable at the current time, computed as in step 1 above (the vectorization of the log event data to be detected); h_{t-1} denotes the model's hidden state variable at time t-1, a quantized representation of the information contained in the sequence up to time t-1; and W_z, W_r, U_z, U_r are GRU network model parameters, numerical matrices distinguished by their subscripts. All GRU network model parameters are continuously adjusted according to the training data during the training of the GRU network model, and their initial values are generally randomly initialized according to a certain probability distribution (the method of the invention uses a normal distribution);
After the values z_t and r_t of the two gating units are obtained, the GRU network performs further calculations through them. The reset gate unit r_t determines how much of the hidden state at time t-1 should be recorded in the "memory" at time t, and merges it with the input x_t at time t, retaining the important content of the whole sequence by "resetting" the historical information. The update gate unit z_t determines the final hidden state at the current time: the hidden state at time t is obtained by controlling the proportions of the hidden state at time t-1 and the intermediate memory output by the reset gate. In the GRU network, the values of the update gate unit and the reset gate unit at any time t are decimals between 0 and 1, as shown in equation (4):
$$\tilde{h}_t = \tanh\!\left(W x_t + U \,(r_t \odot h_{t-1})\right), \qquad h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \tag{4}$$
where W and U are also GRU network model parameters, similar to the model parameters in the preceding gate computations but independent of them;
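By way of illustration only, the following minimal PyTorch sketch implements one iteration of equations (3) and (4); the dimensions and the 0.01 scaling of the normal initialization are assumptions of the example, not values fixed by the invention.

```python
import torch

d_in, d_h = 300, 128  # assumed input (log event vector) and hidden sizes
Wz, Wr, W = (torch.randn(d_h, d_in) * 0.01 for _ in range(3))  # normal init
Uz, Ur, U = (torch.randn(d_h, d_h) * 0.01 for _ in range(3))

def gru_step(x_t, h_prev):
    """One GRU iteration over a log event vector x_t (equations (3)-(4))."""
    z_t = torch.sigmoid(Wz @ x_t + Uz @ h_prev)          # update gate z_t
    r_t = torch.sigmoid(Wr @ x_t + Ur @ h_prev)          # reset gate r_t
    h_tilde = torch.tanh(W @ x_t + U @ (r_t * h_prev))   # candidate state
    return (1 - z_t) * h_prev + z_t * h_tilde            # new hidden state h_t

# Iterating over a sequence of event vectors yields h_1, ..., h_n:
#   h = torch.zeros(d_h)
#   for x_t in sequence: h = gru_step(x_t, h)
```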
After the hidden state values h_1, h_2, …, h_n at all time steps are obtained in the iterative process of the GRU, the hidden state values are combined into a single vector through a pooling layer, converted into a classification label through a fully connected layer, and normalized. Each hidden state of the GRU network represents the information contained in the sequence up to the corresponding time, and the pooling layer (Pooling()) assists in screening the "important" parts of the high-dimensional hidden vectors. For example, the max-pooling layer (maxPooling()) selects, from all hidden states, the maximum value in each dimension, representing the state in which the corresponding dimension has the greatest influence over the whole sequence; the selected values of all dimensions constitute the final sequence representation vector O. After pooling, the fully connected layer MLP() extracts and compresses the relations between the different dimensions of the high-dimensional vector and reduces its dimensionality to suit the overall classification task. Finally, to represent the probability more intuitively, the two-dimensional output is normalized with an activation function (e.g., softmax() or tanh(); the invention uses tanh(), converting all of its variables into decimals in the range -1 to 1), as shown in equation (5):
$$O = maxPooling(h_1, h_2, \dots, h_n), \qquad P = \tanh(MLP(O)) \tag{5}$$
Finally, the probabilities P that the given input log is classified as normal and as abnormal are obtained;
In the selection of the pooling layer of the final model, the invention uses a self-attention mechanism layer; the self-attention mechanism performs the pooling operation by calculating weights for the events at different positions and different weights for the dimensions of the hidden vectors, and these weights are learned and optimized together with the whole anomaly detection model;
Combining these two stages improves the robustness of the whole log anomaly detection method. The calculation formula of the self-attention mechanism is shown in equation (6):
$$V_S = \tanh(W_A H + \beta), \qquad H = [h_1, h_2, \dots, h_n] \tag{6}$$
where W_A and β denote the weight matrix and the bias vector learned by the self-attention mechanism, H denotes the final hidden-layer output of the GRU network model, and V_S denotes the vector representation of the final log sequence.
After the vector representation of the log sequence is obtained, the final classification is performed through a nonlinear transformation: V_S is multiplied by a linear transformation matrix W_Non-Linear and then passed through an activation function, as shown in equation (7):
$$P(normal, abnormal) = \tanh(W_{Non\text{-}Linear}\, V_S) \tag{7}$$
where tanh denotes the hyperbolic tangent function, also called an activation function; its purpose is to normalize the values in the final probability vector into the range -1 to 1, realizing the nonlinear transformation.
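To make the structure of step 3 concrete, the following is a minimal PyTorch sketch of the overall detector; the layer sizes, the use of nn.GRU, and the softmax-weighted reading of the attention pooling in equation (6) are assumptions of this illustration, not limitations of the method.

```python
import torch
import torch.nn as nn

class Detector(nn.Module):
    """Sketch: GRU encoder, self-attention pooling (eq. (6)),
    and nonlinear classification (eq. (7))."""
    def __init__(self, d_event=300, d_hidden=128):
        super().__init__()
        self.gru = nn.GRU(d_event, d_hidden, batch_first=True)
        self.attn = nn.Linear(d_hidden, 1)             # W_A and beta of eq. (6)
        self.out = nn.Linear(d_hidden, 2, bias=False)  # W_Non-Linear of eq. (7)

    def forward(self, x):                              # x: (batch, n, d_event)
        H, _ = self.gru(x)                             # hidden states h_1 ... h_n
        alpha = torch.softmax(torch.tanh(self.attn(H)), dim=1)  # per-event weights
        V_S = (alpha * H).sum(dim=1)                   # pooled sequence vector V_S
        return torch.tanh(self.out(V_S))               # P(normal, abnormal)
```

During training, the probability labels obtained in step 2 can serve as soft targets; the exact loss function is not prescribed by the description above.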
Compared with the prior art, the semi-supervised log anomaly detection method based on probability label estimation (PLELog) of the invention achieves the following positive technical effects:
1) the method can quickly and accurately detect anomalies that may arise while the software system is running, improving the stability of the system;
2) the constructed model has strong robustness and wide applicability;
3) by deep learning of the natural-language semantics in the logs, combined with the attention mechanism from natural language processing, the influence of log evolution is reduced; the false alarms and the invalidation of historical information caused by system evolution and log evolution are avoided, guaranteeing the effect of the whole log anomaly detection method.
Drawings
FIG. 1 is an overall flowchart of the semi-supervised log anomaly detection method based on probability label estimation (PLELog) of the present invention;
FIG. 2 is a schematic diagram of a specific implementation process of the semi-supervised log anomaly detection method based on probability label estimation (PLELog) of the present invention.
Detailed Description
The technical solution of the present invention is further described in detail below with reference to the accompanying drawings and specific embodiments.
The semi-supervised log anomaly detection method based on probability label estimation (PLELog) is implemented in the Python language, using the PyTorch framework as the support library for the deep learning model. The technical scheme of the method comprises three main steps, which are also its main innovations: 1) automatic labeling based on probability label estimation; 2) log anomaly detection based on a Gated Recurrent Unit (GRU) network; and 3) improving the robustness of the log anomaly detection model on the basis of the natural-language semantics of logs and a self-attention mechanism.
Fig. 1 is the overall flowchart of the semi-supervised log anomaly detection method based on probability label estimation of the present invention. The specific steps are described as follows:
Steps 1 to 3 of the specific embodiment are carried out exactly as described above in the Disclosure of Invention: the log data to be detected are vectorized according to equation (1); the given input logs are clustered with HDBSCAN and automatically labeled with probability labels according to equation (2); and the GRU network model is trained on the known normal log event data together with the probability-labeled log event data obtained in step 2 and used for log anomaly detection according to equations (3) to (7).
In order to verify the effectiveness of the log anomaly detection method of the invention and its improvement over existing log anomaly detection methods, a comparison experiment was performed on two public log datasets. Three of the most advanced methods in the field of log anomaly detection at home and abroad were selected for comparison: the LogCluster method proposed by Lin et al. in 2016, the LogAnomaly method proposed by Meng et al. in 2019, and the LogRobust method proposed by Zhang et al. in 2019. The LogCluster method, like many other unsupervised methods, represents log sequences by the frequency of occurrence of log events: on the basis of log occurrence frequencies, assisted by different weights for different log events, LogCluster represents the log sequences of the training set as fixed-length vectors. On these vectors, a multi-layer clustering algorithm performs unsupervised clustering to obtain the center of each cluster; when detecting unknown logs, the distances from a log's representation vector to the cluster centers determine its category. The LogAnomaly method is representative of the latest semi-supervised learning methods. Unlike LogCluster, LogAnomaly does not use occurrence counts as the basis of log representation, but represents log events through the natural-language semantics implied by the logs. LogAnomaly builds a sequence model of logs in the normal state by learning the order of log events in normal log sequences; any sequence under test that violates this model is considered to contain an anomaly. LogRobust is a typical representative of the application of recurrent neural networks, represented by LSTM, to log anomaly detection in recent years; with the help of pre-labeled normal and abnormal logs, it constructs a classification model by learning the sequential and semantic differences between them and uses it to detect unknown log sequences.
These three methods were selected for the comparison experiment because they represent the latest research progress in unsupervised, semi-supervised, and supervised log anomaly detection respectively, and comparison against them can sufficiently demonstrate the effectiveness of the invention in log anomaly detection. Table 1 shows the log anomaly detection effect of the three comparison methods and of the invention on HDFS and BGL.
TABLE 1
[Table 1 appears as images in the original publication; it reports the precision, recall, and F1 value of each method on the HDFS and BGL datasets.]
The HDFS data are log data generated by distributed tasks such as MapReduce running on a Hadoop cluster, where normal and abnormal log data were produced by means such as artificial anomaly injection. The BGL data are logs generated by the operation of a supercomputer; because operation naturally produces all kinds of data, a team of foreign experts manually labeled and open-sourced a portion of the data (spanning more than 200 days) and contributed it for research use. It should be noted that the BGL dataset has a long time span during which the logs evolved; the invention uses the BGL data as a representative of log evolution to verify its effectiveness in log evolution scenarios. By contrast, the HDFS dataset covers a short period during which the logs did not change, and is used to simulate the traditional anomaly detection scenario based on stable logs.
As for evaluation metrics, the effectiveness metrics of traditional classification methods are used to evaluate the method, namely precision, recall, and the F1 value. Precision refers to how many of the log anomalies detected by the model are genuinely abnormal logs. Recall refers to how many of all types of anomalies can be detected by the model. Precision and recall are to some extent two opposing metrics, because a classification model usually sets a threshold to determine the final category of the current data, and different settings of this threshold cause precision and recall to vary in opposite directions. In order to unify precision and recall into one index for quantitative analysis, researchers often use the F1 value, the harmonic mean of precision and recall, to measure the overall model. The F1 value is calculated as shown in equation (8):
$$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall} \tag{8}$$
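As a concrete illustration, the small helper below computes the three metrics from raw counts; the counts in the usage comment are made up for the example.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 (equation (8)) from true positives,
    false positives, and false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# e.g., 96 true anomalies detected, 4 false alarms, 2 missed anomalies:
# precision_recall_f1(96, 4, 2) -> (0.96, ~0.980, ~0.970)
```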
according to the results in table 1, the present invention can be found to have an average detection accuracy of more than 95% on two data sets. Meanwhile, the method can also find more than 96% of different types of abnormalities, and the overall performance is excellent.
Compared with existing semi-supervised and unsupervised methods, the invention achieves an obvious improvement on the HDFS dataset (a scenario where the logs are relatively stable), in both precision and recall. On the other hand, in the setting of the comparison experiment, the comparison with LogRobust shows that the semi-supervised method can achieve an effect similar to a fully supervised learning algorithm, greatly saving the resources required for manual labeling and improving the efficiency of log anomaly detection.
The results on the BGL dataset prove that the invention has very good applicability to log evolution scenarios. LogCluster attains relatively high precision for log anomaly detection under log evolution, but cannot fully cover the newly generated anomaly types. By contrast, the LogAnomaly method generates a large number of false alarms and its overall precision is very low, because the normal state learned from historical information cannot be applied to the scenario after the logs change. The invention combines the advantages of unsupervised and supervised learning algorithms and, through abstracting the natural-language semantics of logs, automatic labeling, and a GRU network model combined with a self-attention mechanism, solves the false alarm problem in log evolution scenarios. Compared with existing methods, the invention obviously improves the effect of log anomaly detection. On the other hand, the comparison with LogRobust further proves that the invention can achieve a log anomaly detection effect similar to fully supervised learning.
On this basis, in order to further prove that the invention can accurately and effectively detect anomalies generated in real production, experiments were also performed on the software logs of two domestic industrial systems; the results of the log anomaly detection experiments on the two real systems are shown in Table 2.
TABLE 2
[Table 2 appears as an image in the original publication; it reports the detection results on the software logs of the two real industrial systems.]
The experimental results show that the invention has very good detection capability for log anomalies generated in the real world; the recall rates reach 100% and 99.1% respectively, indicating that the invention can detect almost all types of anomalies. This set of experiments further validates the effectiveness of the invention.
In order to precisely attribute the effectiveness of the log anomaly detection technique based on probability label estimation, the invention adopts the idea of controlled variables and carries out an independent comparison test on each component of the proposed technique. To further highlight its effectiveness, experiments were performed on the BGL dataset using the technique of the invention, PLELog, and its variants. Table 3 shows the experimental effects of the different PLELog variants on the BGL dataset.
TABLE 3
[Table 3 appears as an image in the original publication; it reports the results of the PLELog variants on the BGL dataset.]
As can be seen from the experimental results in Table 3, probability label estimation effectively protects the proposed method from the "noise" introduced by automatic labeling, improving the precision of log anomaly detection by about 37%. On the other hand, the self-attention mechanism also helps counter the influence of log evolution, improving the overall effect by about 11%. This set of experimental results further verifies the validity of the proposed method and the necessity of its individual components.
The invention, for the first time, uses the idea of semi-supervised learning to combine a clustering method with a supervised learning method, providing a log anomaly detection method requiring little manual intervention. Meanwhile, the method combines probability label estimation with a self-attention mechanism to ensure that system anomalies can still be accurately detected when the logs evolve, improving the overall robustness of the method. The invention verifies its effectiveness on the key task of log anomaly detection by comparison with the most advanced existing methods on two public datasets, and further verifies its effect in real production scenarios through tests on two real industrial datasets.

Claims (1)

1. A semi-supervised log anomaly detection method based on probability label estimation is characterized by comprising the following steps:
step 1: vectorizing the given log event data to be detected:
firstly, extracting log events through log parsing, namely the natural-language descriptions of the system's running logic in the log; secondly, multiplying the frequency TF(w) of a word appearing in a log event by the inverse document frequency IDF(w) of the word appearing in all non-repeating log events to obtain a weight score, applying the weight score to the original semantic vector of the word, and finally summing the weighted vectors of all words in the log event to obtain the vector V_E of the log event;
for a log event E = {w_1, …, w_i, …, w_n}, where w_i denotes a word in the log event, the log event vector V_E is calculated as shown in equation (1):
$$V_E = \sum_{i=1}^{n} TF(w_i)\,IDF(w_i)\,embed(w_i), \qquad TF(w) = \frac{\#w}{len(E)}, \qquad IDF(w) = \log\frac{\#L}{\#L_w} \tag{1}$$
where #w denotes the number of occurrences of the target word in the log event, #L denotes the number of all non-repeating log events in the training set, #L_w denotes the number of log events in the training set containing the target word w, len(E) denotes the length of the whole log event, and embed(w) denotes the semantic vector of the target word w in natural language, finally yielding the vector V_E of the log event;
Step 2, clustering the given input logs and automatically labeling the probability labels:
for the data with unknown labels, clustering is performed together with the labeled normal logs by an unsupervised clustering algorithm to obtain the clustering result and the corresponding outlier scores, and the non-outlier degree predicted by HDBSCAN is calculated from the outlier score, as shown in equation (2):
$$P(x) = \left|\, y - \frac{o(x)}{2} \,\right| \tag{2}$$
according to equation (2), for data x to be labeled, based on the label y and the outlier score o(x) given after HDBSCAN clustering, the label value that is originally 0 or 1 is corrected to a number between 0 and 1 according to the outlier score, namely the probability label P;
then the probability label P of each cluster is analyzed: if a clustering result contains a known normal log, the cluster is a set of normal logs; otherwise, it is a set of abnormal logs;
for the unlabeled log event data in a cluster, a probability label given by equation (2) replaces the original cluster label;
step 3, training a GRU network model on training data comprising the known normal log event data and the probability-labeled log event data obtained in step 2, and performing log anomaly detection with the GRU network model:
for any given log sequence S = {e_1, …, e_t, …, e_n}, where e_t denotes the log event at the t-th position in the log sequence, the iteration process of the GRU network model is as follows:
for a time step t, the value z_t of the update gate unit and the value r_t of the reset gate unit are first calculated from the hidden state at time t-1, as shown in equation (3):
$$z_t = \sigma(W_z x_t + U_z h_{t-1}), \qquad r_t = \sigma(W_r x_t + U_r h_{t-1}) \tag{3}$$
where x_t denotes the input variable at the current time and h_{t-1} denotes the model's hidden state variable at time t-1, a quantized representation of the information contained in the sequence up to time t-1; W_z, W_r, U_z and U_r are GRU network model parameters, numerical matrices distinguished by their subscripts; all GRU network model parameters are continuously adjusted according to the training data during the training of the GRU network model, and their initial values are randomly initialized according to a certain probability distribution;
after the values z_t and r_t of the two gating units are obtained, the GRU network model performs further calculations through them; the value r_t of the reset gate unit determines how much of the hidden state at time t-1 should be recorded in the memory at time t, and merges it with the input x_t at time t, retaining the important content of the whole sequence by resetting the historical information; the value z_t of the update gate unit determines the final hidden state at the current time, which is obtained by controlling the proportions of the hidden state at time t-1 and the intermediate memory output by the reset gate unit; in the GRU network model, the values of the update gate unit and the reset gate unit at any time t are decimals between 0 and 1, as shown in equation (4):
$$\tilde{h}_t = \tanh\!\left(W x_t + U \,(r_t \odot h_{t-1})\right), \qquad h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \tag{4}$$
where W and U are also GRU network model parameters;
after the hidden state values h_1, h_2, …, h_n at all time steps are obtained in the iterative process of the GRU network model, the hidden state values are combined into a vector through a pooling layer, converted into a classification label through a fully connected layer MLP(), and normalized; the pooling layer comprises a max-pooling layer maxPooling(); each hidden state of the GRU network model represents the information contained in the sequence up to the corresponding time; the max-pooling layer maxPooling() selects, from all hidden states, the maximum value in each dimension, representing the state in which the corresponding dimension has the greatest influence over the whole sequence, and the selected values of all dimensions constitute the final sequence representation vector O; after the screening by maxPooling(), the fully connected layer MLP() extracts and compresses the relations between the different dimensions of the high-dimensional vector, and its output is a two-dimensional vector representing the probabilities that the given sequence is classified as normal and as abnormal respectively; finally, the two-dimensional vector is normalized using the tanh() function, converting all of its variables into decimals in the range -1 to 1, as shown in equation (5):
$$O = maxPooling(h_1, h_2, \dots, h_n), \qquad P = \tanh(MLP(O)) \tag{5}$$
finally, the probabilities P that the given input log is classified as normal and as abnormal are obtained;
in the selection of the pooling layer of the final model, a self-attention mechanism layer performs the pooling operation by calculating weights for the events at different positions and different weights for the dimensions of the hidden vectors;
the calculation formula of the self-attention mechanism layer is shown in formula (6):
$$V_S = \tanh(W_A H + \beta), \qquad H = [h_1, h_2, \dots, h_n] \tag{6}$$
where W_A and β denote the weight matrix and the bias vector learned by the self-attention mechanism, H denotes the final hidden-layer output of the GRU network model, and V_S denotes the vector representation of the final log sequence;
after the vector representation V_S of the log sequence is obtained, the final classification is performed through a nonlinear transformation, as shown in equation (7):
$$P(normal, abnormal) = \tanh(W_{Non\text{-}Linear}\, V_S) \tag{7}$$
where tanh denotes the activation function, whose purpose is to normalize the values in the final probability vector into the range -1 to 1, realizing the nonlinear transformation.
Application CN202110261887.XA, priority date 2021-03-10, filing date 2021-03-10: Semi-supervised log anomaly detection method based on probability label estimation. Granted as CN113312447B (status: Active).

Priority Applications (1)

Application Number: CN202110261887.XA; Priority Date: 2021-03-10; Filing Date: 2021-03-10; Title: Semi-supervised log anomaly detection method based on probability label estimation

Publications (2)

CN113312447A (en): published 2021-08-27
CN113312447B (en): published 2022-07-12

Family

ID=77371831






Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant