CN113312447B - Semi-supervised log anomaly detection method based on probability label estimation


Info

Publication number
CN113312447B
Authority
CN
China
Prior art keywords
log
vector
label
probability
network model
Prior art date: 2021-03-10
Legal status
Active
Application number
CN202110261887.XA
Other languages
Chinese (zh)
Other versions
CN113312447A (en)
Inventor
杨林
于瑞国
陈俊洁
王赞
王维靖
姜佳君
Current Assignee
Tianjin University
Original Assignee
Tianjin University
Priority date: 2021-03-10
Filing date: 2021-03-10
Publication date: 2022-07-12
Application filed by Tianjin University
Priority to CN202110261887.XA
Publication of CN113312447A
Application granted
Publication of CN113312447B
Legal status: Active


Classifications

    • G06F 16/3344: Query execution using natural language analysis
    • G06F 16/3346: Query execution using probabilistic model
    • G06F 16/35: Clustering; Classification (information retrieval of unstructured textual data)
    • G06N 3/04: Neural networks; Architecture, e.g. interconnection topology
    • G06N 3/08: Neural networks; Learning methods


Abstract

The invention discloses a semi-supervised log anomaly detection method based on probability label estimation, comprising the following steps. Step 1: vectorize the given log event data to be detected. Step 2: cluster the given input logs and automatically label them with probability labels. Step 3: train a GRU network model on training data comprising the known normal log event data and the probability-labeled log event data obtained in step 2, and detect log anomalies with the trained GRU model. Compared with the prior art, the method 1) quickly and accurately detects anomalies that may arise while a software system is running, improving the stability of the system; 2) constructs a model with strong robustness and wide applicability; and 3) avoids the false alarms and the invalidation of historical information caused by system evolution and log evolution, guaranteeing the effect of the whole log anomaly detection method.

Description

Semi-supervised log anomaly detection method based on probability label estimation
Technical Field
The invention relates to the field of operation and maintenance of computer software, in particular to a log-based system anomaly detection method.
Background
Regarding system anomaly detection based on log analysis: systems often produce a large number of logs at runtime. A log mainly comprises a logical description of system operation and a state description of system operation. The logical description of the system at runtime is represented as a series of system "events", such as calls to a certain module, accesses to a certain external interface, and interactions with a database. This series of events describes the different phases the system passes through while providing a certain service. The state description of the running system is represented by a set of "parameters", such as system CPU usage, memory usage, and the volume of HTTP request bodies and response packets. These parameters describe the state of the whole system when it runs to a specific stage and constitute a quantitative description of system operation. When the system is in an online operating state, anomalies may arise for various reasons, the most common being external network attacks, abnormal traffic, and bugs or defects in the software itself. Currently, the industry mainly adopts log-based anomaly detection: after a system anomaly occurs, the staff responsible for system operation and maintenance manually extract the relevant logs from the system for manual analysis, and, through analysis of the log events and log variables, locate the anomaly and submit it to the responsible team for further diagnosis and repair. This widely adopted practice has two shortcomings. First, log anomaly detection requires a large amount of domain knowledge for support: only a worker who deeply understands the whole system can judge whether an anomaly has occurred by analyzing the description of the system's running state in the logs. Second, manual anomaly detection often lags behind: owing to the complexity of the system structure, a system anomaly often causes visible problems only after the system has run for some time, at which point maintenance personnel are notified to analyze its location and cause. If mitigation or repair of an anomaly is delayed for a long time, huge losses are caused to the company.
Regarding semi-supervised machine learning methods: machine learning methods can currently be divided into two broad categories, supervised and unsupervised. Supervised machine learning uses labeled training data and fits the features of the training data through training, reducing the difference between model predictions and labels. In contrast, unsupervised learning uses unlabeled training data in order to learn static statistical features in the data, through techniques such as clustering and principal component analysis. Compared with unsupervised methods, supervised methods fit the training data better and achieve better final results. Unsupervised methods, however, benefit from not requiring labeling of the training data; in situations such as log analysis, where labeling the data wastes a great deal of resources or is almost impossible, unsupervised methods are considered first.
In addition to the above two types of machine learning methods, many researchers in recent years have explored how to combine the advantages of supervised and unsupervised methods, developing semi-supervised machine learning methods. The core idea of semi-supervised machine learning is to build a learner using model assumptions about the data distribution and to label the unlabeled data. The advantage is that, using only a small amount of labeled data, a model effect similar to that of supervised learning can be obtained, reducing the dependence of supervised learning on data labeling. Meanwhile, the final goal of most semi-supervised learning methods is to train, by some means, a supervised learning model, so the overall model effect can be guaranteed.
Deep Learning (DL) is one of the most popular machine learning methods of recent years and has been widely applied in many fields. In the field of software engineering, deep learning models have also been studied intensively. At present, in fields such as log analysis, the most widely used model is the Long Short-Term Memory network (LSTM). The LSTM is a recurrent neural network that retains the state information learned by the model through looping and iteration, and is mainly used for processing natural-language text and similar data. Owing to the high degree of similarity and the inherent latent relationship between logs and natural language, LSTM has found its primary application in the field of log analysis.
Currently, there are three main types of methods in the field of system anomaly detection based on log analysis: (1) unsupervised learning methods represented by PCA and LogCluster; (2) supervised learning algorithms represented by LogRobust; and (3) semi-supervised learning methods represented by DeepLog and LogAnomaly. However, the research field of log anomaly detection still faces the following challenges:
1) Existing unsupervised learning methods, which take the frequency with which log events appear in a log sequence as the index of whether the system is abnormal, cannot achieve high coverage over all anomaly types. Detecting system anomalies by judging differences in the number and types of log events in the static distributions of normal and abnormal logs has the following defects. First, the method ignores the sequential relation of log events in the time dimension; this relation represents the flow of the running system, so the method cannot detect system anomalies caused by out-of-order log events. Second, using the frequency of occurrence of log events as a feature cannot accommodate the evolution of the logs: when a new log appears in the logs under test, or a log changes, the method can capture only part of the log events and therefore cannot make an accurate judgment.
2) Existing semi-supervised learning methods perform clustering or learn the normal state of the system from unlabeled data or from labeled normal log data only; these methods perform poorly and often generate a large number of false alarms. Limited by the quantity and quality of the normal logs, the model may generate many false positives. The false-alarm problem becomes more serious when the logs evolve: any newly generated or changed log may be judged abnormal and reported, which can even increase the workload of system operation and maintenance personnel to a certain extent.
3) Supervised machine learning methods, represented by LogRobust, require a large amount of data with normal and abnormal labels for fitting. The quantity and quality of the training data directly determine the effect of the final model, yet high-quality annotated data are very difficult to obtain in large quantities. The difficulty lies mainly in two aspects. First, when an anomaly occurs, the complexity of the system and the high concurrency of threads cause a very large number of logs to be generated at the same time, and distinguishing and labeling them one by one consumes a great deal of labor. Second, owing to log evolution, especially under the rapid development of today's "cloud services", where the microservice architecture is widely applied and system evolution is further accelerated, the features in historical logs often become invalid within a short time; keeping a supervised learning method effective therefore requires a large amount of continuous manual labeling.
How to overcome the above three challenges is a technical problem urgently awaiting a solution in the field.
Disclosure of Invention
The invention provides a semi-supervised log anomaly detection method based on probability label estimation, aiming to break the dilemma that existing log anomaly detection depends on a large amount of manually labeled data and to address the challenge of model false alarms caused by log evolution.
The semi-supervised log anomaly detection method based on probability label estimation of the invention comprises the following steps:
step 1, vectorizing log data to be detected:
For a given log, the invention first extracts the log events through log parsing, i.e., the natural-language descriptions of the system's running logic in the log. Second, drawing on publicly available pre-trained word vectors from natural language processing research at home and abroad, a term frequency-inverse document frequency (TF-IDF) weighted summation is performed over the words in each log event. TF-IDF was originally a technique for extracting keywords from natural-language paragraphs: the TF-IDF score is obtained by multiplying the frequency of a word in a given paragraph by the inverse document frequency of the word over the whole natural-language corpus. The higher the TF-IDF score, the higher the weight of the corresponding word in its paragraph, meaning the word contributes more to the semantics of the whole paragraph. For a word in a log event, the invention multiplies the frequency of the word in the log event by the inverse document frequency of the word over all non-repeating log events to obtain a weight score; this score is applied to the original semantic vector of the word, and finally the weighted vectors of all words in the log event are summed to obtain the vector of the log event. The specific processing and formula of log vectorization are as follows:
In order to accurately extract the natural-language semantics contained in a log, and considering the influence of feature information such as keywords on semantic content, the keyword extraction technique TF-IDF (term frequency-inverse document frequency) from natural language processing is borrowed and improved according to the characteristics of log events. For each word in a log event, its TF-IDF weight with respect to the log event is computed from the frequency of the word within the log event and the frequency of the word across all different logs in the training set; the semantic vectors of all words in the log event are then summed with their TF-IDF weights, finally yielding the semantic vector representation of the log event. For a log event E = {w_1, …, w_i, …, w_n}, where w_i denotes a word in the log event, the log event vector V_E is calculated as shown in equation (1):
$$V_E = \sum_{i=1}^{n} TF(w_i)\,IDF(w_i)\,embed(w_i), \qquad TF(w) = \frac{\#w}{len(E)}, \qquad IDF(w) = \log\frac{\#L}{\#L_w} \tag{1}$$
where #w denotes the number of occurrences of the target word in the log event, #L denotes the number of all non-repeating log events in the training set, #L_w denotes the number of log events in the training set containing the target word w, len(E) denotes the length of the whole log event, i.e., the total number of words it contains, embed(w) denotes the semantic vector of the target word w in natural language, TF(w) denotes the frequency of a word within the log event, and IDF(w) denotes the inverse document frequency of a word over all non-repeating log events; the result is the vector of the log event.
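For illustration, the following is a minimal Python sketch of this vectorization step according to equation (1); the tokenized event, the list of distinct training events, the embedding dictionary of pre-trained word vectors, and the vector dimension are assumed inputs of the example, not prescribed by the invention.

```python
import math
import numpy as np

def event_vector(event, train_events, embeddings, dim=300):
    """TF-IDF weighted sum of word vectors for one log event (equation (1)).
    `event` is a list of tokens, `train_events` the distinct training events,
    `embeddings` a dict mapping words to pre-trained vectors."""
    num_events = len(train_events)                        # #L
    vec = np.zeros(dim)
    for w in event:
        tf = event.count(w) / len(event)                  # TF(w) = #w / len(E)
        n_l_w = sum(w in e for e in train_events)         # #L_w
        idf = math.log(num_events / max(n_l_w, 1))        # IDF(w)
        vec += tf * idf * embeddings.get(w, np.zeros(dim))
    return vec
```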
Step 2, automatically labeling the given input logs with probability labels:
step 2-1, with the help of the logs generated while the system runs normally, the semantic information contained in the normal logs is obtained through analysis, and the normal logs are separated from the abnormal logs by an unsupervised clustering algorithm, yielding labeled normal logs and "suspected abnormal" log sequences;
The unsupervised clustering algorithm adopted is the hierarchical density-based clustering algorithm HDBSCAN. The final state of a log is either normal or abnormal; an unknown label is represented by a neutral position set between the two. The non-outlier degree predicted by HDBSCAN is calculated from the outlier score, as shown in equation (2):
$$P(x) = \left|\, y - \frac{o(x)}{2} \,\right| \tag{2}$$

where y ∈ {0, 1} is the cluster label assigned to the data x to be labeled and o(x) ∈ [0, 1] is the outlier score given by HDBSCAN.
Equation (2) states that, according to the label y and the outlier score given after HDBSCAN clustering of the data x to be labeled, the original hard label value of 0 or 1 is corrected to a number between 0 and 1 according to the outlier score; this corrected value is the probability label P. When the outlier score is 1, i.e., the point is an edge point of its cluster, the point should be considered neutral, and the probability label value becomes 0.5.
The non-outlier degree predicted by HDBSCAN obtained above is also the final labeling result of the HDBSCAN clustering.
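The following is a minimal Python sketch of this labeling stage, assuming the open-source hdbscan package; the min_cluster_size value is an assumption of the example. The rule that a cluster containing any known-normal sequence is treated as normal follows the claim below.

```python
import numpy as np
import hdbscan  # open-source HDBSCAN implementation

def probabilistic_labels(X, known_normal_idx):
    """Cluster sequence vectors X and soften the labels per equation (2).
    `known_normal_idx` indexes the sequences known to be normal."""
    clusterer = hdbscan.HDBSCAN(min_cluster_size=10).fit(X)  # size is assumed
    outlier = np.nan_to_num(clusterer.outlier_scores_)       # o(x) in [0, 1]
    # A cluster containing any known-normal sequence is normal (y = 0);
    # all other clusters, and noise points (-1), are suspected abnormal (y = 1).
    normal_clusters = set(clusterer.labels_[known_normal_idx]) - {-1}
    y = np.array([0 if c in normal_clusters else 1 for c in clusterer.labels_])
    return np.abs(y - outlier / 2.0)                         # probability label P
```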
Step 3, training a GRU network model on training data comprising the known normal log event data and the probability-labeled log event data obtained in step 2, and performing log anomaly detection with the GRU network model:
For any given log sequence S = {e_1, …, e_t, …, e_n}, where e_t denotes the log event at the t-th position in the log sequence, the iteration process of the GRU network model is as follows:
For a time step t, the value z_t of the "update gate" unit and the value r_t of the "reset gate" unit are first calculated from the hidden state at time t-1, as shown in equation (3):
$$z_t = \sigma(W_z x_t + U_z h_{t-1}), \qquad r_t = \sigma(W_r x_t + U_r h_{t-1}) \tag{3}$$

where σ(·) denotes the sigmoid function.
Here x_t denotes the input variable at the current time, computed as in step 1 above (the vectorization of the log event data to be detected); h_{t-1} denotes the model's hidden state variable at time t-1, a quantized representation of the information contained in the sequence up to time t-1; and W_z, W_r, U_z, U_r are GRU network model parameters, numerical matrices distinguished by their subscripts. All GRU network model parameters are continuously adjusted according to the training data during the training of the GRU network model, and their initial values are generally randomly initialized according to a certain probability distribution (the method of the invention uses a normal distribution);
After the values z_t and r_t of the two gating units are obtained, the GRU network performs further calculations through them. The reset gate unit r_t determines how much of the hidden state at time t-1 should be recorded in the "memory" at time t, and merges it with the input x_t at time t, retaining the important content of the whole sequence by "resetting" the historical information. The update gate unit z_t determines the final hidden state at the current time: the hidden state at time t is obtained by controlling the proportions of the hidden state at time t-1 and the intermediate memory output by the reset gate. In the GRU network, the values of the update gate unit and the reset gate unit at any time t are decimals between 0 and 1, as shown in equation (4):
$$\tilde{h}_t = \tanh\!\left(W x_t + U \,(r_t \odot h_{t-1})\right), \qquad h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \tag{4}$$
where W and U are also GRU network model parameters, similar to the model parameters in the preceding gate computations but independent of them;
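By way of illustration only, the following minimal PyTorch sketch implements one iteration of equations (3) and (4); the dimensions and the 0.01 scaling of the normal initialization are assumptions of the example, not values fixed by the invention.

```python
import torch

d_in, d_h = 300, 128  # assumed input (log event vector) and hidden sizes
Wz, Wr, W = (torch.randn(d_h, d_in) * 0.01 for _ in range(3))  # normal init
Uz, Ur, U = (torch.randn(d_h, d_h) * 0.01 for _ in range(3))

def gru_step(x_t, h_prev):
    """One GRU iteration over a log event vector x_t (equations (3)-(4))."""
    z_t = torch.sigmoid(Wz @ x_t + Uz @ h_prev)          # update gate z_t
    r_t = torch.sigmoid(Wr @ x_t + Ur @ h_prev)          # reset gate r_t
    h_tilde = torch.tanh(W @ x_t + U @ (r_t * h_prev))   # candidate state
    return (1 - z_t) * h_prev + z_t * h_tilde            # new hidden state h_t

# Iterating over a sequence of event vectors yields h_1, ..., h_n:
#   h = torch.zeros(d_h)
#   for x_t in sequence: h = gru_step(x_t, h)
```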
After the hidden state values h_1, h_2, …, h_n at all time steps are obtained in the iterative process of the GRU, the hidden state values are combined into a single vector through a pooling layer, converted into a classification label through a fully connected layer, and normalized. Each hidden state of the GRU network represents the information contained in the sequence up to the corresponding time, and the pooling layer (Pooling()) assists in screening the "important" parts of the high-dimensional hidden vectors. For example, the max-pooling layer (maxPooling()) selects, from all hidden states, the maximum value in each dimension, representing the state in which the corresponding dimension has the greatest influence over the whole sequence; the selected values of all dimensions constitute the final sequence representation vector O. After pooling, the fully connected layer MLP() extracts and compresses the relations between the different dimensions of the high-dimensional vector and reduces its dimensionality to suit the overall classification task. Finally, to represent the probability more intuitively, the two-dimensional output is normalized with an activation function (e.g., softmax() or tanh(); the invention uses tanh(), converting all of its variables into decimals in the range -1 to 1), as shown in equation (5):
$$O = maxPooling(h_1, h_2, \dots, h_n), \qquad P = \tanh(MLP(O)) \tag{5}$$
Finally, the probabilities P that the given input log is classified as normal and as abnormal are obtained;
In the selection of the pooling layer of the final model, the invention uses a self-attention mechanism layer; the self-attention mechanism performs the pooling operation by calculating weights for the events at different positions and different weights for the dimensions of the hidden vectors, and these weights are learned and optimized together with the whole anomaly detection model;
Combining these two stages improves the robustness of the whole log anomaly detection method. The calculation formula of the self-attention mechanism is shown in equation (6):
$$V_S = \tanh(W_A H + \beta), \qquad H = [h_1, h_2, \dots, h_n] \tag{6}$$
where W_A and β denote the weight matrix and the bias vector learned by the self-attention mechanism, H denotes the final hidden-layer output of the GRU network model, and V_S denotes the vector representation of the final log sequence.
After the vector representation of the log sequence is obtained, the final classification is performed through a nonlinear transformation: V_S is multiplied by a linear transformation matrix W_Non-Linear and then passed through an activation function, as shown in equation (7):
$$P(normal, abnormal) = \tanh(W_{Non\text{-}Linear}\, V_S) \tag{7}$$
where tanh denotes the hyperbolic tangent function, also called an activation function; its purpose is to normalize the values in the final probability vector into the range -1 to 1, realizing the nonlinear transformation.
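To make the structure of step 3 concrete, the following is a minimal PyTorch sketch of the overall detector; the layer sizes, the use of nn.GRU, and the softmax-weighted reading of the attention pooling in equation (6) are assumptions of this illustration, not limitations of the method.

```python
import torch
import torch.nn as nn

class Detector(nn.Module):
    """Sketch: GRU encoder, self-attention pooling (eq. (6)),
    and nonlinear classification (eq. (7))."""
    def __init__(self, d_event=300, d_hidden=128):
        super().__init__()
        self.gru = nn.GRU(d_event, d_hidden, batch_first=True)
        self.attn = nn.Linear(d_hidden, 1)             # W_A and beta of eq. (6)
        self.out = nn.Linear(d_hidden, 2, bias=False)  # W_Non-Linear of eq. (7)

    def forward(self, x):                              # x: (batch, n, d_event)
        H, _ = self.gru(x)                             # hidden states h_1 ... h_n
        alpha = torch.softmax(torch.tanh(self.attn(H)), dim=1)  # per-event weights
        V_S = (alpha * H).sum(dim=1)                   # pooled sequence vector V_S
        return torch.tanh(self.out(V_S))               # P(normal, abnormal)
```

During training, the probability labels obtained in step 2 can serve as soft targets; the exact loss function is not prescribed by the description above.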
Compared with the prior art, the semi-supervised log anomaly detection method based on probability label estimation (PLELog) of the invention achieves the following positive technical effects:
1) the method can quickly and accurately detect anomalies that may arise while the software system is running, improving the stability of the system;
2) the constructed model has strong robustness and wide applicability;
3) by deep learning of the natural-language semantics in the logs, combined with the attention mechanism from natural language processing, the influence of log evolution is reduced; the false alarms and the invalidation of historical information caused by system evolution and log evolution are avoided, guaranteeing the effect of the whole log anomaly detection method.
Drawings
FIG. 1 is an overall flowchart of the semi-supervised log anomaly detection method based on probability label estimation (PLELog) of the present invention;
FIG. 2 is a schematic diagram of a specific implementation process of the semi-supervised log anomaly detection method based on probability label estimation (PLELog) of the present invention.
Detailed Description
The technical solution of the present invention is further described in detail below with reference to the accompanying drawings and specific embodiments.
The semi-supervised log anomaly detection method based on probability label estimation (PLELog) is implemented in the Python language, using the PyTorch framework as the support library for the deep learning model. The technical scheme of the method comprises three main steps, which are also its main innovations: 1) automatic labeling based on probability label estimation; 2) log anomaly detection based on a Gated Recurrent Unit (GRU) network; and 3) improving the robustness of the log anomaly detection model on the basis of the natural-language semantics of logs and a self-attention mechanism.
Fig. 1 is the overall flowchart of the semi-supervised log anomaly detection method based on probability label estimation of the present invention. The specific steps are described as follows:
Steps 1 to 3 of the specific embodiment are carried out exactly as described above in the Disclosure of Invention: the log data to be detected are vectorized according to equation (1); the given input logs are clustered with HDBSCAN and automatically labeled with probability labels according to equation (2); and the GRU network model is trained on the known normal log event data together with the probability-labeled log event data obtained in step 2 and used for log anomaly detection according to equations (3) to (7).
In order to verify the effectiveness of the log anomaly detection method of the invention and its improvement over existing log anomaly detection methods, a comparison experiment was performed on two public log datasets. Three of the most advanced methods in the field of log anomaly detection at home and abroad were selected for comparison: the LogCluster method proposed by Lin et al. in 2016, the LogAnomaly method proposed by Meng et al. in 2019, and the LogRobust method proposed by Zhang et al. in 2019. The LogCluster method, like many other unsupervised methods, represents log sequences by the frequency of occurrence of log events: on the basis of log occurrence frequencies, assisted by different weights for different log events, LogCluster represents the log sequences of the training set as fixed-length vectors. On these vectors, a multi-layer clustering algorithm performs unsupervised clustering to obtain the center of each cluster; when detecting unknown logs, the distances from a log's representation vector to the cluster centers determine its category. The LogAnomaly method is representative of the latest semi-supervised learning methods. Unlike LogCluster, LogAnomaly does not use occurrence counts as the basis of log representation, but represents log events through the natural-language semantics implied by the logs. LogAnomaly builds a sequence model of logs in the normal state by learning the order of log events in normal log sequences; any sequence under test that violates this model is considered to contain an anomaly. LogRobust is a typical representative of the application of recurrent neural networks, represented by LSTM, to log anomaly detection in recent years; with the help of pre-labeled normal and abnormal logs, it constructs a classification model by learning the sequential and semantic differences between them and uses it to detect unknown log sequences.
These three methods were selected for the comparison experiment because they represent the latest research progress in unsupervised, semi-supervised, and supervised log anomaly detection respectively, and comparison against them can sufficiently demonstrate the effectiveness of the invention in log anomaly detection. Table 1 shows the log anomaly detection effect of the three comparison methods and of the invention on HDFS and BGL.
TABLE 1
[Table 1 appears as images in the original publication; it reports the precision, recall, and F1 value of each method on the HDFS and BGL datasets.]
The HDFS data are log data generated by distributed tasks such as MapReduce running on a Hadoop cluster, where normal and abnormal log data were produced by means such as artificial anomaly injection. The BGL data are logs generated by the operation of a supercomputer; because operation naturally produces all kinds of data, a team of foreign experts manually labeled and open-sourced a portion of the data (spanning more than 200 days) and contributed it for research use. It should be noted that the BGL dataset has a long time span during which the logs evolved; the invention uses the BGL data as a representative of log evolution to verify its effectiveness in log evolution scenarios. By contrast, the HDFS dataset covers a short period during which the logs did not change, and is used to simulate the traditional anomaly detection scenario based on stable logs.
As for evaluation metrics, the effectiveness metrics of traditional classification methods are used to evaluate the method, namely precision, recall, and the F1 value. Precision refers to how many of the log anomalies detected by the model are genuinely abnormal logs. Recall refers to how many of all types of anomalies can be detected by the model. Precision and recall are to some extent two opposing metrics, because a classification model usually sets a threshold to determine the final category of the current data, and different settings of this threshold cause precision and recall to vary in opposite directions. In order to unify precision and recall into one index for quantitative analysis, researchers often use the F1 value, the harmonic mean of precision and recall, to measure the overall model. The F1 value is calculated as shown in equation (8):
$$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall} \tag{8}$$
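As a concrete illustration, the small helper below computes the three metrics from raw counts; the counts in the usage comment are made up for the example.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 (equation (8)) from true positives,
    false positives, and false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# e.g., 96 true anomalies detected, 4 false alarms, 2 missed anomalies:
# precision_recall_f1(96, 4, 2) -> (0.96, ~0.980, ~0.970)
```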
according to the results in table 1, the present invention can be found to have an average detection accuracy of more than 95% on two data sets. Meanwhile, the method can also find more than 96% of different types of abnormalities, and the overall performance is excellent.
Compared with existing semi-supervised and unsupervised methods, the invention achieves an obvious improvement on the HDFS dataset (a scenario where the logs are relatively stable), in both precision and recall. On the other hand, in the setting of the comparison experiment, the comparison with LogRobust shows that the semi-supervised method can achieve an effect similar to a fully supervised learning algorithm, greatly saving the resources required for manual labeling and improving the efficiency of log anomaly detection.
The results on the BGL dataset prove that the invention has very good applicability to log evolution scenarios. LogCluster attains relatively high precision for log anomaly detection under log evolution, but cannot fully cover the newly generated anomaly types. By contrast, the LogAnomaly method generates a large number of false alarms and its overall precision is very low, because the normal state learned from historical information cannot be applied to the scenario after the logs change. The invention combines the advantages of unsupervised and supervised learning algorithms and, through abstracting the natural-language semantics of logs, automatic labeling, and a GRU network model combined with a self-attention mechanism, solves the false alarm problem in log evolution scenarios. Compared with existing methods, the invention obviously improves the effect of log anomaly detection. On the other hand, the comparison with LogRobust further proves that the invention can achieve a log anomaly detection effect similar to fully supervised learning.
On this basis, in order to further prove that the invention can accurately and effectively detect anomalies generated in real production, experiments were also performed on the software logs of two domestic industrial systems; the results of the log anomaly detection experiments on the two real systems are shown in Table 2.
TABLE 2
[Table 2 appears as an image in the original publication; it reports the detection results on the software logs of the two real industrial systems.]
The experimental results show that the invention has very good detection capability for log anomalies generated in the real world; the recall rates reach 100% and 99.1% respectively, indicating that the invention can detect almost all types of anomalies. This set of experiments further validates the effectiveness of the invention.
In order to precisely attribute the effectiveness of the log anomaly detection technique based on probability label estimation, the invention adopts the idea of controlled variables and carries out an independent comparison test on each component of the proposed technique. To further highlight its effectiveness, experiments were performed on the BGL dataset using the technique of the invention, PLELog, and its variants. Table 3 shows the experimental effects of the different PLELog variants on the BGL dataset.
TABLE 3
[Table 3 appears as an image in the original publication; it reports the results of the PLELog variants on the BGL dataset.]
As can be seen from the experimental results in Table 3, probability label estimation effectively protects the proposed method from the "noise" introduced by automatic labeling, improving the precision of log anomaly detection by about 37%. On the other hand, the self-attention mechanism also helps counter the influence of log evolution, improving the overall effect by about 11%. This set of experimental results further verifies the validity of the proposed method and the necessity of its individual components.
The invention, for the first time, uses the idea of semi-supervised learning to combine a clustering method with a supervised learning method, providing a log anomaly detection method requiring little manual intervention. Meanwhile, the method combines probability label estimation with a self-attention mechanism to ensure that system anomalies can still be accurately detected when the logs evolve, improving the overall robustness of the method. The invention verifies its effectiveness on the key task of log anomaly detection by comparison with the most advanced existing methods on two public datasets, and further verifies its effect in real production scenarios through tests on two real industrial datasets.

Claims (1)

1. A semi-supervised log anomaly detection method based on probability label estimation is characterized by comprising the following steps:
step 1: vectorizing the given log event data to be detected:
firstly, extracting log events through log parsing, namely the natural-language descriptions of the system's running logic in the log; secondly, multiplying the frequency TF(w) of a word appearing in a log event by the inverse document frequency IDF(w) of the word appearing in all non-repeating log events to obtain a weight score, applying the weight score to the original semantic vector of the word, and finally summing the weighted vectors of all words in the log event to obtain the vector V_E of the log event;
for a log event E = {w_1, …, w_i, …, w_n}, where w_i denotes a word in the log event, the log event vector V_E is calculated as shown in equation (1):
$$V_E = \sum_{i=1}^{n} TF(w_i)\,IDF(w_i)\,embed(w_i), \qquad TF(w) = \frac{\#w}{len(E)}, \qquad IDF(w) = \log\frac{\#L}{\#L_w} \tag{1}$$
where #w denotes the number of occurrences of the target word in the log event, #L denotes the number of all non-repeating log events in the training set, #L_w denotes the number of log events in the training set containing the target word w, len(E) denotes the length of the whole log event, and embed(w) denotes the semantic vector of the target word w in natural language, finally yielding the vector V_E of the log event;
Step 2, clustering the given input logs and automatically labeling the probability labels:
for the data with unknown labels, clustering is performed together with the labeled normal logs by an unsupervised clustering algorithm to obtain the clustering result and the corresponding outlier scores, and the non-outlier degree predicted by HDBSCAN is calculated from the outlier score, as shown in equation (2):
$$P(x) = \left|\, y - \frac{o(x)}{2} \,\right| \tag{2}$$
according to equation (2), for data x to be labeled, based on the label y and the outlier score o(x) given after HDBSCAN clustering, the label value that is originally 0 or 1 is corrected to a number between 0 and 1 according to the outlier score, namely the probability label P;
then the probability label P of each cluster is analyzed: if a clustering result contains a known normal log, the cluster is a set of normal logs; otherwise, it is a set of abnormal logs;
for the unlabeled log event data in a cluster, a probability label given by equation (2) replaces the original cluster label;
step 3, training a GRU network model on training data comprising the known normal log event data and the probability-labeled log event data obtained in step 2, and performing log anomaly detection with the GRU network model:
for any given log sequence S = {e_1, …, e_t, …, e_n}, where e_t denotes the log event at the t-th position in the log sequence, the iteration process of the GRU network model is as follows:
for a time step t, the value z_t of the update gate unit and the value r_t of the reset gate unit are first calculated from the hidden state at time t-1, as shown in equation (3):
$$z_t = \sigma(W_z x_t + U_z h_{t-1}), \qquad r_t = \sigma(W_r x_t + U_r h_{t-1}) \tag{3}$$
where x_t denotes the input variable at the current time and h_{t-1} denotes the model's hidden state variable at time t-1, a quantized representation of the information contained in the sequence up to time t-1; W_z, W_r, U_z and U_r are GRU network model parameters, numerical matrices distinguished by their subscripts; all GRU network model parameters are continuously adjusted according to the training data during the training of the GRU network model, and their initial values are randomly initialized according to a certain probability distribution;
after the values z_t and r_t of the two gating units are obtained, the GRU network model performs further calculations through them; the value r_t of the reset gate unit determines how much of the hidden state at time t-1 should be recorded in the memory at time t, and merges it with the input x_t at time t, retaining the important content of the whole sequence by resetting the historical information; the value z_t of the update gate unit determines the final hidden state at the current time, which is obtained by controlling the proportions of the hidden state at time t-1 and the intermediate memory output by the reset gate unit; in the GRU network model, the values of the update gate unit and the reset gate unit at any time t are decimals between 0 and 1, as shown in equation (4):
$$\tilde{h}_t = \tanh\!\left(W x_t + U \,(r_t \odot h_{t-1})\right), \qquad h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \tag{4}$$
where W and U are also GRU network model parameters;
after the hidden state values h_1, h_2, …, h_n at all time steps are obtained in the iterative process of the GRU network model, the hidden state values are combined into a vector through a pooling layer, converted into a classification label through a fully connected layer MLP(), and normalized; the pooling layer comprises a max-pooling layer maxPooling(); each hidden state of the GRU network model represents the information contained in the sequence up to the corresponding time; the max-pooling layer maxPooling() selects, from all hidden states, the maximum value in each dimension, representing the state in which the corresponding dimension has the greatest influence over the whole sequence, and the selected values of all dimensions constitute the final sequence representation vector O; after the screening by maxPooling(), the fully connected layer MLP() extracts and compresses the relations between the different dimensions of the high-dimensional vector, and its output is a two-dimensional vector representing the probabilities that the given sequence is classified as normal and as abnormal respectively; finally, the two-dimensional vector is normalized using the tanh() function, converting all of its variables into decimals in the range -1 to 1, as shown in equation (5):
$$O = maxPooling(h_1, h_2, \dots, h_n), \qquad P = \tanh(MLP(O)) \tag{5}$$
finally, the probabilities P that the given input log is classified as normal and as abnormal are obtained;
in the selection of the pooling layer of the final model, a self-attention mechanism layer performs the pooling operation by calculating weights for the events at different positions and different weights for the dimensions of the hidden vectors;
the calculation formula of the self-attention mechanism layer is shown in formula (6):
$$V_S = \tanh(W_A H + \beta), \qquad H = [h_1, h_2, \dots, h_n] \tag{6}$$
where W_A and β denote the weight matrix and the bias vector learned by the self-attention mechanism, H denotes the final hidden-layer output of the GRU network model, and V_S denotes the vector representation of the final log sequence;
after the vector representation V_S of the log sequence is obtained, the final classification is performed through a nonlinear transformation, as shown in equation (7):
$$P(normal, abnormal) = \tanh(W_{Non\text{-}Linear}\, V_S) \tag{7}$$
where tanh denotes the activation function, whose purpose is to normalize the values in the final probability vector into the range -1 to 1, realizing the nonlinear transformation.
Application CN202110261887.XA, priority date 2021-03-10, filing date 2021-03-10: Semi-supervised log anomaly detection method based on probability label estimation. Granted as CN113312447B (status: Active).

Priority Applications (1)

Application Number: CN202110261887.XA; Priority Date: 2021-03-10; Filing Date: 2021-03-10; Title: Semi-supervised log anomaly detection method based on probability label estimation

Publications (2)

CN113312447A (en): published 2021-08-27
CN113312447B (en): published 2022-07-12

Family

ID=77371831






Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant