CN113312447A

CN113312447A - Semi-supervised log anomaly detection method based on probability label estimation

Info

Publication number: CN113312447A
Application number: CN202110261887.XA
Authority: CN
Inventors: 杨林; 于瑞国; 陈俊洁; 王赞; 王维靖; 姜佳君
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2021-03-10
Filing date: 2021-03-10
Publication date: 2021-08-27
Anticipated expiration: 2041-03-10
Also published as: CN113312447B

Abstract

The invention discloses a semi-supervised log anomaly detection method based on probability label estimation, which comprises the following steps of 1: vectorizing the given log event data to be detected; step 2, clustering given input logs and automatically labeling probability labels: and 3, training a GRU network model by using the training data including the known normal log event data and the log event data with probability label estimation obtained in the step 2, and detecting log abnormity by adopting the GRU model. Compared with the prior art, the method can quickly and accurately detect the possible abnormity of the software system during operation, and improves the stability of the system; 2) the constructed model has strong robustness and wide applicability; 3) the problems of false alarm and failure of historical information caused by system evolution and log evolution are avoided, and the effect of the whole log anomaly detection method is guaranteed.

Description

Semi-supervised log anomaly detection method based on probability label estimation

Technical Field

The invention relates to the field of operation and maintenance of computer software, in particular to a log-based system anomaly detection method.

Background

Regarding log analysis based system anomaly detection: systems often produce a large number of logs at runtime. The log mainly comprises the logic description of the system operation and the state description of the system operation. The logical description of the system runtime is represented as a series of system "events," such as calls to a module, accesses to an external interface, and interactions with a database, among others. The series of events describes the different phases that the system is in a certain service. The state description of the system operation is represented as a group of parameters, such as system CPU utilization rate, memory utilization rate, HTTP request volume and response volume, and the like. The series of parameters describe that when the system runs to a specific stage, the whole system is in a state of scholar, and the system is a quantitative description of the running of the system. When the system is in an online operating state, system anomalies may occur for a variety of reasons. The most common causes are: the method is characterized by comprising the following steps of receiving network attacks from the outside, carrying out traffic exception, and carrying out possible bugs or defects on software. Currently, the industry mainly adopts a log-based anomaly detection means. After the system is abnormal, the system abnormity detection based on the log is a widely adopted solution, and the staff in charge of system operation and maintenance manually extracts the relevant log in the system for manual analysis. And (4) through the analysis of log events and log variables, finding out the abnormal position of the system, and submitting the abnormal position to a related responsible team for next diagnosis and repair. Firstly, a large amount of domain knowledge is required for supporting log anomaly detection, and a worker can judge whether an anomaly occurs or not by analyzing the description of the running state of the system in the log only if the worker deeply knows the whole system; secondly, manual anomaly detection often has hysteresis due to the complexity of the system structure, and system anomalies often cause problems in the operation of the system after a period of time, and further inform maintenance personnel to analyze the location and cause of the anomaly. If an anomaly is delayed for a long time to be alleviated or repaired, a huge loss is caused to the company.

Machine learning methods on semi-supervised: machine learning methods can be currently divided into two broad categories, supervised and unsupervised. Supervised machine learning uses training data with labels to fit features of the training data by training to reduce differences in model prediction results and labels. In contrast, unsupervised learning uses training data without labels in order to learn static statistical features in the data, such as clustering, principal component analysis, and the like. Compared with the unsupervised learning method, the supervised learning method has better fitting to the training data and better final result. In contrast, the unsupervised learning method benefits from the fact that labeling of training data is not needed, and in some situations, such as log analysis, labeling of data causes a great deal of resource waste or is almost impossible to achieve, and the unsupervised learning method is considered preferentially.

In addition to the above two types of machine learning methods, many researchers have explored how to combine the advantages of both supervised and unsupervised methods, and developed semi-supervised machine learning methods in recent years. The core idea of the semi-supervised machine learning method is to establish a learner by using model hypothesis on data distribution and label unlabelled data. The advantages are that: only a small amount of labeled data is used, the model effect similar to that of the supervised learning method can be obtained, and therefore the dependence of the supervised learning method on data labeling is reduced. Meanwhile, the final goal of the semi-supervised learning method is to train a supervised learning model through a certain means, so that the overall model effect can be ensured.

Deep Learning (DL) is one of the very popular machine Learning methods in recent years, and has been widely used in many fields. In the field of software engineering, deep learning models have also been studied quite intensively. At present, in the fields of log analysis and the like, a widely used model is mainly a Long-Short Term Memory (LSTM). The LSTM belongs to a cyclic neural network, and state information learned by a model is reserved through steps of circulation, iteration and the like, and is mainly used for processing data related to natural language texts or the like. LSTM has found primary application in the field of log analysis due to the high degree of similarity and inherent potential relationship between logs and natural language.

Currently, there are three main types of implementation methods in the field of system anomaly detection based on log analysis: (1) unsupervised learning methods represented by PCA and LogCluster, (2) supervised learning algorithms represented by LogRobust, and (3) semi-supervised learning methods represented by DeepLog and loganomally. However, the research field of log anomaly detection still faces the following challenges:

1) the existing unsupervised learning method can not achieve high coverage rate on all abnormal types by taking the frequency of the log events appearing in the log sequence as an index for measuring whether the system is abnormal or not. The method for detecting the abnormal state of the system by judging the difference of the number and the type of the log events appearing in the normal log and the abnormal log in the static distribution has the following defects: firstly, the method ignores the sequence relation of the log events in the time dimension, the sequence relation represents the flow of the system in operation, and the method cannot detect the system abnormity caused by the sequence abnormity of the log events. Second, using the frequency of occurrence of log events as a feature does not accommodate the evolution of the log. When a new log appears in the log to be tested or the log changes, the method can only capture part of log events, so that accurate judgment cannot be made.

2) The existing semi-supervised learning method carries out clustering or learning of the normal state of the system through unmarked data or only marked normal log data, and the method has poor effect and often generates a large amount of false reports. The model may generate a large number of false positives, limited by the number and quality of normal logs. The situation of false alarm is more serious when the log evolves, and any newly generated log or log with changes can be considered as abnormal and reported, thereby even increasing the workload of system operation and maintenance personnel to a certain extent.

3) Supervised machine learning methods, represented by LogRobust, require a large amount of data with normal and abnormal labels to fit. The amount and quality of the training data directly determines the effect of the final model. However, high quality annotation data is very difficult to obtain in large quantities. The difficulty is mainly focused on two aspects: firstly, when an exception occurs, due to the complexity of the system and high concurrency of threads, a very large number of logs can be generated at the same time, and a very large amount of labor can be consumed when the logs are judged and labeled one by one; secondly, due to the evolution of the log, especially under the high-speed development of the current 'cloud service', the micro-service architecture is widely applied, the system evolution speed is further accelerated, the characteristics in the historical log are often invalid in a short time, and a large amount of continuous manual labeling is needed to be carried out in order to ensure the effectiveness of the supervised learning method.

How to solve the above three challenges is a technical problem to be solved urgently in the field.

Disclosure of Invention

The invention provides a semi-supervised log anomaly detection method based on probability label estimation, aiming at breaking through the dilemma that the existing log anomaly detection depends on a large amount of manually marked data and the challenge of model misinformation caused by log evolution.

The invention discloses a semi-supervised log anomaly detection method based on probability label estimation, which comprises the following steps of:

step 1: vectorizing the given log event data to be detected:

firstly, extracting a log event through log analysis, namely natural language description of system operation logic in the log; it is composed ofSecondly, multiplying the frequency TF (w) of a word appearing in the log event by the inverse document frequency IDF (w) of the word appearing in all nonrepeating log events to obtain a weight score, applying the weight score to the original semantic vector of the word, and finally summing the weight vectors of all words in the log event to obtain a vector V of the log event_E；

For log event E ═ w₁,,…w_i,…,w_n}，w_iRepresenting different words in a log event, log event vector V_EIs calculated as shown in equation (5):

wherein, # w represents the number of occurrences of the target word in the log event, # L represents the number of all non-repeating log events in the training set, # L_wRepresenting the number of log events containing a target word w in the training set, len (E) representing the length of the whole log event, and embed (w) representing the semantic vector of the target word w in the natural language to finally obtain a vector V of the log event_E；

Step 2, clustering given input logs and automatically labeling probability labels:

for the data of the unknown label, clustering is realized by using an unsupervised clustering algorithm together with the marked normal log to obtain a clustering result and a corresponding outlier, and a non-outlier predicted by the HDBSCAN is calculated according to the outlier, wherein a specific calculation formula is shown as the following formula:

by utilizing the formula, according to a label y and an outlier given by a certain data x to be marked after HDBSCAN clustering, the original label value of 0 or 1 as a label is corrected to a certain number between 0 and 1 according to the outlier, and the label is a probability label P;

then, respectively analyzing the probability label P of each cluster, wherein if a certain cluster result contains known normal logs, the cluster result has a high probability of being a set of normal logs, and otherwise, the cluster result is a set of abnormal logs;

for unmarked log event data in the cluster, giving a probability label to replace the original cluster label by the formula (1);

step 3, training a GRU network model by using the training data including the known normal log event data and the log event data with probability label estimation obtained in the step 2, and performing log anomaly detection by adopting the GRU model:

for any given log sequence, S-e₁,…，e_t,…e_nWherein e is_tRepresenting the log event at the t-th time in the log sequence, the iteration process of the GRU model is as follows:

for a certain time t, firstly, respectively calculating the value z of an 'updating gate' unit according to the hidden layer state at the time t-1_tAnd the value r of the "reset gate" unit_tAs shown in the following formula:

wherein x is_tThe input variable representing the current time is calculated as above in step 1: vectorizing the given log event data to be detected, h_t-1The model hidden state variable representing time t-1 is a quantized representation containing information contained from the sequence to time t-1, W_z、W_r、U_z、U_rAll the parameters of the GRU model are GRU model parameters and are digital matrixes, wherein the parameters represented by different subscripts are different, the parameters of all the GRU models are continuously adjusted according to training data in the training process of the GRU model, and the initial values of the parameters are generally initialized randomly according to certain probability distribution;

after obtaining the values z of the above two gating units_t、r_tThereafter, the GRU network proceeds one through the two gating unitsCalculating the steps; wherein the reset gate unit r_tDetermines how many hidden layer states at the time t-1 should be recorded in the 'memory' at the time t, and combines the hidden layer states with the input x at the time t_tMerging is performed to keep some important contents in the whole sequence by 'resetting' the history information, updating the gate unit z_tDetermining the final hidden layer state at the current moment, and obtaining the hidden layer state at the final t moment by controlling the hidden layer state at the t-1 moment and the proportion of the intermediate memory output by the reset gate, wherein the values of an update gate unit and a reset gate unit at a certain moment t are a decimal number between 0 and 1 in the GRU network, and the decimal number is represented by the following formula:

w, U are GRU model parameters, similar to those in the calculation of the previous gate unit values, but independent of each other;

obtaining hidden layer state value h of all moments in iterative process of GRU₁,h₂,…,h_nThen, combining the state values of all the hidden layers into a vector through a pooling layer, converting the vector into a classification label through a full-link layer, and obtaining a normalized classification label through a softmax function; each hidden layer state of the GRU network represents information contained from sequence ending to corresponding time, a Pooling layer (Pooling ()) has the function of assisting in screening important parts in high-dimensional hidden layer vectors, a maximum Pooling layer maxPooling () screens out the maximum value in each dimension from all hidden layer states to represent the state that the corresponding dimension has the most 'influence' in the whole sequence, and the value of each finally screened dimension forms a final sequence list representation vector O; after being screened by the pooling layer, the MLP () of the full connection layer has the functions of extracting and compressing the relation between different dimensions in a high-dimensional vector and reducing the vector latitude to adapt to the whole classification taskThe probability of a sequence being classified as normal and abnormal; finally, in order to more intuitively represent the probability, the method uses a tanh () function to normalize the two-dimensional vector and convert all variables into a decimal softmax () function and a tanh () function within the range of-1 to 1; as shown in the following formula:

finally, the probability P that the given input log is classified into normal and abnormal is obtained;

using a self-attention mechanism layer on the selection of the pooling layer of the final model; the self-attention mechanism performs pooling operation by calculating weights of different events at different positions and different weights of different hidden vector dimensions, and the weights can be learned and optimized along with the whole anomaly detection model; the two phases are combined, so that the robustness of the log anomaly detection method is integrally improved;

the calculation formula of the self-attention mechanism is shown in formula (5):

V_S＝tanh(W_AH+β),H＝[h₁,h₂,…,h_t] (5)

wherein, W_ABeta represents the weight matrix and error vector of the self-attention mechanism learning, H represents the final hidden layer output of the GRU model, and V_SA vector representation representing the final log sequence;

after the vector representation of the log sequence is obtained, the final classification is carried out through a nonlinear transformation, and the operation of the nonlinear transformation is that V is subjected to the operation of V_sMultiplying by a linear transformation matrix W_Non-LinearAnd then, performing an activation function again, as shown in formula (6):

p (normal, abnormal) ═ tanh (W)_Non-LinearV_s) (6)

Wherein, tanh represents a callouse function, and the value in the final probability vector is normalized to be in the range of-1 to 1 so as to realize nonlinear transformation.

Compared with the prior art, the semi-supervised log anomaly detection method (PLELog) based on probability label estimation can achieve the following positive technical effects:

1) the method can quickly and accurately detect possible abnormity during the operation of the software system, and improve the stability of the system;

2) the constructed model has strong robustness and wide applicability, can be deployed in the environment that a general system is difficult to work, such as poor universality of a perception technology, complex human-object interaction and the like, and improves the query efficiency;

3) the influence caused by log evolution is reduced by deeply learning the natural language semantics in the log and combining with an attention mechanism in natural language processing; the problems of false alarm and failure of historical information caused by system evolution and log evolution are avoided, and the effect of the whole log anomaly detection method is guaranteed.

Drawings

FIG. 1 is a flowchart illustrating a semi-supervised log anomaly detection method (PLELog) based on probabilistic tag estimation according to the present invention;

fig. 2 is a schematic diagram of a specific implementation process of a semi-supervised log anomaly detection method (PLELog) based on probabilistic label estimation according to the present invention.

Detailed Description

The technical solution of the present invention is further described in detail below with reference to the accompanying drawings and specific embodiments.

The semi-supervised log anomaly detection method (PLELog) based on probability label estimation is realized by adopting Python language, and a PyTorch framework is used as a support library of a deep learning model. The technical scheme of the method mainly comprises the following steps: 1) the method comprises three steps of automatic labeling based on probability label estimation, 2) a log anomaly detection method based on a Gated Recurrent Unit (GRU), and 3) a method for improving the robustness of a log anomaly detection model based on log natural language semantics and a self-attention mechanism, wherein the three steps are main innovation points of the method.

Fig. 1 is a flowchart illustrating an overall method for detecting an anomaly in a semi-supervised log based on probabilistic label estimation according to the present invention. The specific steps are described as follows:

step 1, vectorizing log data to be detected:

for a given log event, the invention first extracts the log event through log analysis, i.e. the natural language description of the system operating logic in the log. Secondly, by means of the preprocessing result of the word vector disclosed in the field of natural language processing research at home and abroad at present, the invention performs weighted summation of word frequency-inverse document frequency (TF-IDF) on each word in the log event. TF-IDF is originally a technology for extracting key words in natural language paragraphs, and the TF-IDF score is obtained by multiplying the occurrence frequency of words in a certain paragraph by the frequency of inverse documents of the words in the whole natural language corpus. The higher the TF-IDF score, the higher the weight of the corresponding word in the paragraph in which the word is located, which means that the word has a higher contribution to the semantics of the whole paragraph. For words in the log event, the invention multiplies the frequency of the word appearing in the log event by the inverse document frequency of the word appearing in all nonrepeating log events to obtain a weight score, the weight score is acted on the original semantic vector of the word, and finally, the weight vectors of all words in the log event are summed to obtain the vector of the log event. The specific processing and formula of log vectorization are as follows:

in order to accurately extract natural language semantics contained in a log, the influence of characteristic information such as keywords in the natural language on semantic content is combined, a keyword extraction technology 'TF-IDF' (term frequency-inverse document frequency) in natural language processing is used for reference, and improvement is carried out according to related characteristics of log events. For different words in the log event, combining the appearance frequency of the words in the log event with the appearance frequency of the words in all different logs in the training set to calculate the TF-IDF weight of the different words in the log event to the log event; and carrying out weighted summation on the semantic vectors of all words in the log event by using TF-IDF weight, and finally obtaining the semantic vector representation of the log event. For log event E ═ w₁,,…w_i,…,w_n}，w_iRepresenting different words in a log event, log event vector V_EIs calculated as shown in equation (1):

wherein, # w represents the number of occurrences of the target word in the log event, # L represents the number of all non-repeating log events in the training set, # L_wIndicating the number of log events containing a target word w in the training set, len (e) indicating the length of the whole log event, i.e. the number of total words contained, embed (w) indicating the semantic vector of the target word w in the natural language, and finally obtaining the vector of the log event, tf (w) indicating the frequency of the occurrence of a word in the log event, and idf (w) indicating the inverse document frequency of the occurrence of a word in all unrepeated log events.

Step 2, automatically labeling the probability label for the given input log:

step 2-1, analyzing and obtaining semantic information contained in a normal log by means of a log generated when a system operates normally, and separating the normal log from an abnormal log by means of an unsupervised clustering algorithm to obtain a marked normal log and a log sequence of 'suspected abnormality';

the unsupervised Clustering algorithm adopts a Hierarchical Density Clustering algorithm (HDBSCAN), the final state of the log is normal or abnormal, and for an unknown label, a neutral position is set between the log and the unknown label to represent the unknown label; and calculating the predicted non-outlier of the HDBSCAN according to the outlier, wherein a specific calculation formula is shown as a formula (2):

equation (2) represents: according to a label y and an outlier given after HDBSCAN clustering of certain data x to be marked, the original label value of 0 or 1 as a label is corrected to a certain number between 0 and 1 according to the outlier, and the label is a probability label P. When the outlier is 1, i.e. the point is the edge point of the cluster, then the point should be considered as the neutral point, and finally the probabilistic label value is taken to be 0.5.

The non-outlier predicted by the HDBSCAN obtained above is also the final label result of the HDBSCAN cluster.

Step 3, adopting a GRU model to detect log abnormity:

wherein x is_tThe input variable representing the current time is calculated as above in step 1: vectorizing the given log event data to be detected, h_t-1The model hidden state variable representing time t-1 is a quantized representation containing information contained from the sequence to time t-1, W_z、W_r、U_z、U_rAll the parameters are GRU model parameters and are digital matrixes, wherein different subscripts represent different parameters, the parameters of all the GRU models are continuously adjusted according to training data in the training process of the GRU models, and the initial values of the parameters are generally initialized randomly according to certain probability distribution (normal distribution is used in the method of the invention);

after obtaining the values z of the above two gating units_t、r_tThereafter, the GRU network performs further calculations through these two gating units. Wherein the reset gate unit r_tDetermines how many hidden layer states at the time t-1 should be recorded in the 'memory' at the time t, and combines the hidden layer states with the input x at the time t_tMerging is performed to preserve the entire sequence by "resetting" the history informationSome important matters of (2). Updating the gate cell z_tAnd determining the final hidden layer state at the current moment, and obtaining the final hidden layer state at the t moment by controlling the hidden layer state at the t-1 moment and the proportion occupied by the intermediate memory output by the reset gate. The values of the refresh gate unit and the reset gate unit at a certain time t are a fraction between 0 and 1 in the GRU network, as shown in the following formula:

obtaining hidden layer state value h of all moments in iterative process of GRU₁,h₂,…,h_nAnd then, combining the state values of all the hidden layers into a vector through a pooling layer, converting the vector into a classification label through a full-link layer, and obtaining the normalized classification label through a softmax function. Each hidden layer state of the GRU network represents information included until the sequence is cut off to a corresponding time, and the Pooling layer (Pooling ()) functions to assist in screening of "important" parts of the high-dimensional hidden layer vector. For example, the maximum pooling layer (maxPooling ()) screens out the maximum value in each dimension from all hidden layer states, which represents the state that the corresponding dimension has the most influence in the whole sequence, and the finally screened values of each dimension form a final sequence representation vector O. After being screened by the pooling layer, the MLP () of the full connection layer has the functions of extracting and compressing the relation between different dimensions in a high-dimensional vector and reducing the vector latitude to adapt to the whole classification task. Finally, in order to represent the probability more intuitively, the invention normalizes the two-dimensional vector by using a tanh () function, and converts all variables in the two-dimensional vector into decimal numbers in a range of-1 to 1 (such as a softmax () function, a tanh () function and the like).As shown in the following formula:

the invention uses a self-attention mechanism layer in the selection of the pooling layer of the final model; the self-attention mechanism performs pooling operation by calculating weights of different events at different positions and different weights of different hidden vector dimensions, and the weights can be learned and optimized along with the whole anomaly detection model;

the two phases are combined, so that the robustness of the log anomaly detection method can be integrally improved. The calculation formula of the self-attention mechanism is shown in formula (5):

V_S＝tanh(W_AH+β),H＝[h₁,h₂,…,h_t] (5)

wherein, W_ABeta represents the weight matrix and error vector of the self-attention mechanism learning, H represents the final hidden layer output of the GRU model, and V_SA vector representation representing the final log sequence.

After the vector representation of the log sequence is obtained, the final classification is carried out through a nonlinear transformation, and the operation of the nonlinear transformation is that V is subjected to the operation of V_sMultiplying by a linear transformation matrix W_Non-LinearAnd then, the activation is performed once by the activation function, as shown in formula (6):

p (normal, abnormal) ═ tanh (W)_Non-LinearV_s) (6)

Wherein, tanh represents a hyperbolic tangent function, also called an activation function, in order to normalize the value in the final probability vector to the range of-1 to 1, so as to realize nonlinear transformation.

In order to verify the effectiveness of the log anomaly detection method and improve the effect of the existing log anomaly detection method, the log anomaly detection method carries out comparison experiments on two public log data sets. Three most advanced methods in the field of log anomaly detection at home and abroad at present are selected for comparing effects. The three candidate methods are: the LogCluster method proposed by Lin et al in 2016, the LogAnomaly method proposed by Meng et al in 2019, and the LogRobust method proposed by Zhang et al in 2019. The LogCluster method, like many other unsupervised learning based methods, uses a log sequence represented based on the frequency of occurrence of log events. The LogCluster represents log sequences in a log set for training as vectors with fixed length by assisting different weights of different log events on the basis of the occurrence frequency of logs. On the basis of the vector, a multi-layer clustering algorithm is used for carrying out unsupervised clustering to respectively obtain the central point of each cluster. When the unknown logs are detected, the distance from the log expression vector to each cluster central point is respectively calculated to judge the category of the unknown logs. The loganomallly method is a representative of the latest semi-supervised learning method. Unlike LogCluster, the loganomallly method does not base the log occurrence number on the log representation, but represents the log event by the natural language semantics contained in the log. LogAnomaly builds a log sequence model of the log in a normal state by learning the sequence relation of the log events in the normal log sequence. Any sequence under test that violates this log sequence model is considered to contain an exception. LogRobust is a typical representative of application of a recurrent neural network represented by LSTM in the field of log anomaly detection in recent years, and constructs a classification model by learning sequence differences and semantic differences in normal and abnormal logs by means of normal and abnormal logs marked in advance to detect unknown log sequences.

The reason why the three methods are selected in the comparative experiment of the present invention is that they represent the latest research progress of the log abnormality detection methods based on unsupervised learning, semi-supervised learning, and supervised learning, respectively, and that their comparison can sufficiently prove the effectiveness of the present invention in the log abnormality detection. As shown in table 1, there are three comparison methods and the log anomaly detection effect of the present invention on HDFS and BGL.

TABLE 1

The HDFS data is log data generated by distributed tasks such as MapReduce and the like operated by a Hadoop cluster, and normal log data and abnormal log data are generated by means of artificial injection of abnormality and the like. The BGL data is a log generated by the running of the super computer, and because the running naturally generates various data, a foreign expert team manually marks and opens the source of a part of the data (the time span exceeds 200 days), and contributes the part of the data to research and use. It should be noted that, the BGL data set has a long time span, and the log evolves during the time span, and the present invention uses the BGL data as a representative of log evolution to verify the effectiveness of the present invention in a log evolution scenario. In contrast, the HDFS data set is short in overall time, the log is not changed, and the HDFS data set is used for simulating a traditional abnormal detection scene based on the stable log.

In the aspect of measurement indexes, the invention adopts the effect measurement indexes of the traditional classification method to evaluate the effectiveness of the method, namely the accuracy, the recall rate and the F1 value. The accuracy rate refers to how many log anomalies detected by the model are real anomaly logs. Recall refers to how many anomalies, of all types of anomalies, can be detected by the model. Accuracy and recall are somewhat two opposing metrics, as classification models typically set a set of thresholds to determine the final classification of the current data. This threshold is set differently, resulting in varying accuracy and recall. In order to unify the accuracy and the recall ratio into one index and facilitate quantitative analysis, researchers often use F1 values, i.e., geometric mean values of the accuracy and the recall ratio, to measure the overall model. The F1 value is calculated as shown in equation (8):

according to the results in table 1, the present invention can be found to have an average detection accuracy of more than 95% on two data sets. Meanwhile, the method can also find more than 96% of different types of abnormalities, and the overall performance is excellent.

Compared with the existing semi-supervised and unsupervised methods, the method has the advantages that on the basis of the HDFS data set (under the scene that logs are relatively stable), the accuracy and the recall rate are improved obviously. On the other hand, in the setting of the comparison experiment, the semi-supervised learning method provided by the invention is compared with the LogRobust to show that the semi-supervised learning method can achieve the effect similar to a complete supervised learning algorithm, so that various resources required by manual labeling are greatly saved, and the efficiency of log anomaly detection is improved.

The results on the data set of the BGL can prove that the method has very good applicability to the scenes of log evolution. The LogCluster can have relatively high accuracy for the log anomaly detection problem in the log evolution scene, but cannot completely cover the newly generated anomaly type. In contrast, the loganomally method generates a large number of false alarms, and the accuracy of the overall method is extremely low, because the normal state learned from the history information cannot be applied to the scene after the log changes, so a large number of false alarms are generated. The invention combines the advantages of an unsupervised learning algorithm and a supervised learning algorithm, and solves the problem of false alarm in a log evolution scene by abstracting log natural language semantics, automatically labeling, combining a GRU model of a self-attention mechanism and the like. Compared with the existing method, the method has the advantage that the effect of log anomaly detection is obviously improved. On the other hand, the log anomaly detection method is compared with LogRobust, and further proves that the log anomaly detection method can achieve log anomaly detection effect similar to completely supervised learning.

On the basis, in order to further prove that the method can accurately and effectively detect the abnormal data generated in real production life, the method also performs experiments on software logs in two domestic industries, and the results of the log abnormal detection effect experiment on two real systems are shown in table 2.

TABLE 2

The experimental results show that the method has very good detection capability on log anomalies generated in the real world, and meanwhile, the recall rates respectively reach 100% and 99.1%, which shows that the method can almost detect all types of anomalies. This set of experiments further validated the effectiveness of the present invention.

In order to accurately prove the effectiveness of the log anomaly detection technology based on probability label estimation, the invention adopts the idea of control variables and carries out independent comparison test on each component in the technology provided by the invention. To further highlight the effectiveness of the present invention, experiments were performed on the BGL data set using the technique of the present invention, PLELog and its variant techniques. As shown in table 3, experimental effects on the BGL dataset for different PLELog variants.

TABLE 3

As can be seen from the experimental results in table 3, the probability label estimation can effectively ensure that the method provided by the present invention is not affected by the "noise" data caused by automatic labeling, and the probability label estimation increases the accuracy of log anomaly detection by about 37%. On the other hand, the self-attention mechanism also well helps to solve the influence caused by log evolution, and the overall effect is improved by about 11%. This set of experimental results further verifies the validity of the proposed method and the necessity of the individual components.

The invention provides a log anomaly detection method with less manual intervention by utilizing the thought of semi-supervised learning and combining a clustering method and a supervised learning method for the first time. Meanwhile, the method also combines probability label estimation and a self-attention mechanism to ensure that the system abnormity can be accurately detected when the log evolves, and the overall robustness of the method is improved. The invention verifies the effectiveness of the invention on the key content of abnormal log detection by comparing the two public data sets with the most advanced method in the prior art. Meanwhile, the invention also performs tests on two real data sets in the industry, and further verifies the effect of the invention in the actual production and life scenes.

The invention adopts the cross-source knowledge graph information extraction and fusion technology, the knowledge graph has excellent effect in solving the challenges of huge data, complex relation and the like, and effectively expresses a complex relation network by adopting an entity-relation-entity triple structure and a node and edge link mode in a graph, thereby creating a human-object-space interaction model normal form in a complex environment such as a smart community and the like;

the invention adopts the traceable knowledge graph fragmentation fusion technology for the first time, and solves the problems of human-object-space information identification and alignment. In a complex human-object-space environment, data ambiguity is likely to occur in multi-source heterogeneous data, namely, a signal emission source and a signal cannot form a correct mapping relation, and signal identity management cannot be performed in a multi-source system. And generating human-object-space interaction knowledge graphs under different geographic positions by utilizing different knowledge graph fragment fusion and traceability technologies, wherein all people, object nodes and human-object interaction links in the geographic position are contained under each graph. A cluster of maps are formed at a plurality of places, and then the main chain and the side chain are connected through a chain-crossing fusion technology to form traceable multisource knowledge map fragmentation fusion.

The knowledge graph storage pressure under a human-object-space complex environment is solved by innovatively adopting a multilayer knowledge graph information abstraction technology, and meanwhile, the query efficiency is reduced. And for a plurality of fragment knowledge graphs generated by different geographic positions in the bottom layer, carrying out aggregation of similar fragment areas through feature clustering. And forming entities for fusing the knowledge maps from bottom to top, further abstracting the relationship among the entities, and finally forming the global knowledge map with highly abstract information at the top end so as to optimize the storage of the knowledge map and improve the query efficiency.

Claims

1. A semi-supervised log anomaly detection method based on probability label estimation is characterized by comprising the following steps:

step 1: vectorizing the given log event data to be detected:

firstly, extracting a log event through log analysis, namely natural language description of system operation logic in the log; secondly, multiplying the frequency TF (w) of a word appearing in the log event by the inverse document frequency IDF (w) of the word appearing in all nonrepeating log events to obtain a weight score, applying the weight score to the original semantic vector of the word, and finally summing the weight vectors of all words in the log event to obtain a vector V of the log event_E；

wherein, # w represents the number of occurrences of the target word in the log event, # L represents the number of all non-repeating log events in the training set, # L_wRepresenting the number of log events containing a target word w in a training set, len (E) representing the length of the whole log event, and embed (w) representing the semantic vector of the target word w in a natural language to finally obtain a vector V of the log event_E；

after obtaining the values z of the above two gating units_t、r_tThen, the GRU network performs further calculation through the two gating units; wherein the reset gate unit r_tDetermines how many hidden layer states at the time t-1 should be recorded in the 'memory' at the time t, and combines the hidden layer states with the input x at the time t_tMerging is performed to keep some important contents in the whole sequence by 'resetting' the history information, updating the gate unit z_tDetermining the final hidden layer state at the current moment, and obtaining the hidden layer state at the final moment t by controlling the hidden layer state at the moment t-1 and the proportion of the intermediate memory output by the reset gate, wherein the values of an update gate unit and a reset gate unit at a certain moment t are a decimal number between 0 and 1 in the GRU network, and the following formula is shown as follows:

obtaining hidden layer state value h of all moments in iterative process of GRU₁,h₂,…,h_nThen, combining the state values of all the hidden layers into a vector through a pooling layer, converting the vector into a classification label through a full-link layer, and obtaining a normalized classification label through a softmax function; each hidden state of the GRU network represents information included until a sequence is cut off to a corresponding time, the role of the Pooling layer (Pooling ()) is to assist in screening important parts in high-dimensional hidden vectors, the max Pooling layer maxPooling () screens out the maximum value in each dimension from all hidden states,representing the state that the corresponding dimension has the most influence in the whole sequence, and finally screening values of all dimensions to form a final sequence representation vector O; after being screened by the pooling layer, the MLP () of the full connection layer has the functions of extracting and compressing the relation between different dimensions in a high-dimensional vector and reducing the vector latitude to adapt to the whole classification task; finally, in order to more intuitively represent the probability, the method uses a tanh () function to normalize the two-dimensional vector and convert all variables into a decimal softmax () function and a tanh () function within the range of-1 to 1; as shown in the following formula:

V_S＝tanh(W_AH+β),H＝[h₁,h₂,…,h_t] (5)

after the vector representation of the log sequence is obtained, the final classification is carried out through a nonlinear transformation, and the operation of the nonlinear transformation is to carry out V_sIs multiplied by oneLinear transformation matrix W_Non-LinearAnd then, performing an activation function again, as shown in formula (6):

p (normal, abnormal) ═ tanh (W)_Non-LinearV_s) (6)