CN111611218A - Distributed abnormal log automatic identification method based on deep learning

Distributed abnormal log automatic identification method based on deep learning

Info

Publication number
CN111611218A
CN111611218A
Authority
CN
China
Prior art keywords
log
vector
word
model
knowledge
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010333973.2A
Other languages
Chinese (zh)
Inventor
玄跻峰
许宜森
张玉虎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU
Priority to CN202010333973.2A
Publication of CN111611218A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/1805 Append-only file systems, e.g. using logs or journals to store data
    • G06F16/1815 Journaling file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a deep-learning-based method for automatically identifying abnormal logs in distributed systems, comprising the following steps: 1) acquire log file data and preprocess it; 2) based on the preprocessed logs, train a word2vec model to obtain a word vector for each word in the logs; 3) convert the sentences in the log text into sentence vectors using the obtained word vectors; 4) input the sentence vectors into a long short-term memory (LSTM) neural network model and train it to obtain a binary classification model; 5) preprocess a new log file, input it into the trained LSTM model, and judge whether each input log is an abnormal log. The method establishes a deep-neural-network classification model that identifies abnormal logs automatically, converting the originally manual identification of abnormal logs into automatic identification; this reduces the error risk of manual identification and cuts the labor and time costs of manually inspecting logs.

Description

Distributed abnormal log automatic identification method based on deep learning
Technical Field
The invention relates to data mining technology, and in particular to a deep-learning-based method for automatically identifying distributed abnormal logs.
Background
Modern software is increasingly complex and large in scale, which drives up software maintenance costs. The widespread use of distributed and heterogeneous software systems makes it extremely difficult to manually monitor the operational status of the software and discover operational failures. Logs are an indispensable form of output during software operation. To find faults in a distributed system as early as possible and reduce the potential risk of downtime, many distributed systems record their runtime state through real-time log output, providing a data basis for maintenance personnel.
In modern distributed systems, maintenance personnel can manually check the runtime state of the software from the logs output by the system, and discover and analyze where faults lie. However, many distributed systems run around the clock and output a huge amount of log data every day, which makes manually analyzing the entire log very difficult.
To find faults and potential risks in software operation from the logs, maintenance personnel manually define, based on a set of normal logs, the log features that correct logs should match. A new log can then be matched against these features to decide whether it was output during normal program execution, that is, whether it reflects abnormal behavior. If it does not match, an operational fault or potential risk in the software is indicated, and further manual analysis can proceed from it. However, manually defining correct log features is time-consuming and error-prone, mainly because (1) logs themselves are complex, so manually defined log features are often incomplete; and (2) the continuous integration practices of modern software change software versions frequently, so the log feature definitions must be revised frequently. For these two reasons, the approach of manually defining log features and then identifying abnormal logs consumes considerable labor and time in practice.
Disclosure of Invention
The technical problem to be solved by the invention is to provide, in view of the defects of the prior art, a deep-learning-based method for automatically identifying distributed abnormal logs, which reduces the error risk caused by manually identifying abnormal logs.
The technical scheme adopted by the invention to solve this technical problem is as follows: a deep-learning-based distributed abnormal log automatic identification method, comprising the following steps:
1) acquiring normal and abnormal log sets and preprocessing them: intercepting the timestamp of each log, sorting the log messages in the log file by their timestamp strings, and then filtering the timestamp strings out of each log;
2) based on the preprocessed log, training by using a word2vec model to obtain a word vector of each word in the log;
3) converting sentences in the log text into sentence vectors by using the obtained word vectors;
4) inputting the sentence vectors into a long short-term memory neural network model and training to obtain a binary classification model;
5) preprocessing a new log file, extracting word vectors, converting the sentences in the log file into sentence vectors, inputting the sentence vectors into the trained long short-term memory (LSTM) neural network model, and judging whether the input log is an abnormal log; the new log file is one whose log messages occur later than those of the training log files.
according to the scheme, in the step 2), word vectors of each word in the log are obtained by using word2vec model training, the word2vec model training mode uses a skip-gram or CBOW word model calculation mode, and a negative sampling model is adopted for training to obtain the word vectors.
According to the scheme, the training process of the long short-term memory neural network model in step 4) is as follows:
4.1) each neural unit's input vector X is a sentence vector, and the sentence vectors are input into the long short-term memory neural network model sequentially in time order;
4.2) after the input vector of each neural unit is processed by the forget gate, input gate, and output gate, the knowledge information is stored in a knowledge base C; the knowledge processed by the current neural unit is output to h_{t+1}, and meanwhile the h_{t+1} knowledge output by the previous neural unit is input into the next neural unit;
the activation function of the forget gate is a sigmoid function, and the data remaining after the current vector's forgetting is taken as the vector inner product of the weight and the knowledge base, realizing partial forgetting of old knowledge;
the input gate operates on the combination of the input sentence vector and the output vector of the previous neural unit, specifically: first, the memory weight is obtained from the sigmoid function of the current vector as a vector inner product; second, the knowledge is obtained as the tanh value of the current vector; third, the latest, partially forgotten knowledge is obtained as the vector inner product of the memory weight and the knowledge; finally, the new knowledge is merged into the knowledge base;
the output gate produces the current neural unit's output by taking the tanh value of the knowledge base vector and computing its inner product with the weight of the input gate;
4.3) each neural unit yields an h_{t+1} vector, and all h_{t+1} vectors are input into an average pooling layer;
4.4) the vector from the average pooling layer is input into a regression classification layer, and the averaged vector is classified with a regression classification method, yielding the binary classification model of the long short-term memory neural network.
The invention has the following beneficial effects:
the invention establishes the classification model of automatic identification of the abnormal logs based on the deep neural network, automatically generates the classification model of the abnormal logs based on the long-term and short-term memory neural network model, converts the original manual identification of the abnormal logs into the automatic identification of the abnormal logs, reduces the error risk caused by the manual identification of the abnormal logs and reduces the labor and time costs of the manual identification of the logs.
Drawings
The invention will be further described with reference to the accompanying drawings and examples, in which:
FIG. 1 is a flow chart of a method of an embodiment of the present invention;
FIG. 2 is a schematic diagram of a skip-gram model according to an embodiment of the invention;
FIG. 3 is a schematic diagram of a CBOW model according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the LSTM neural unit model structure according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of LSTM classification according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As shown in FIG. 1, a method for automatically identifying distributed abnormal logs based on deep learning includes:
(1) Model training stage.
First, the original logs are preprocessed so that they meet the input requirements of the word2vec model: the timestamp of each log is intercepted, the log messages in the log file are sorted by their timestamp strings, and the timestamp strings are then filtered out of each log.
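A minimal sketch of this preprocessing step is given below; the timestamp pattern and the file path are assumptions, since real distributed-system logs use varied formats.

    import re

    # Assumed timestamp format, e.g. "2020-04-24 12:30:05,123" at the start of a line;
    # real logs may require a different pattern.
    TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}(?:[.,]\d+)?)\s*")

    def preprocess(lines):
        """Sort log messages by their timestamp strings, then strip the timestamps."""
        stamped = []
        for line in lines:
            match = TIMESTAMP.match(line)
            if match:  # keep only lines that carry a timestamp
                stamped.append((match.group(1), line[match.end():].rstrip()))
        stamped.sort(key=lambda pair: pair[0])      # order messages by timestamp string
        return [message for _, message in stamped]  # timestamps filtered out

    with open("raw.log", encoding="utf-8") as f:    # placeholder input path
        messages = preprocess(f)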
Based on the preprocessed logs, a word2vec model is trained to obtain a word vector for each word in the logs; the training uses the skip-gram or CBOW word model and adopts negative sampling. During word2vec training, a dictionary is built to represent all distinct words, each word in the dictionary is represented by an N-dimensional vector whose elements are 0 or 1, and these vectors are used in the computations during model training.
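One way to realize this training step is with the gensim library, as sketched below; the library choice and the window and negative-sample counts are assumptions (only the skip-gram/CBOW choice with negative sampling, and the 100-dimensional vectors mentioned later, come from the text).

    from gensim.models import Word2Vec

    # Tokenize each preprocessed log message; whitespace splitting is an assumption.
    corpus = [message.split() for message in messages]

    # sg=1 selects skip-gram (sg=0 would select CBOW); negative=5 enables negative
    # sampling with five noise words per positive pair.
    w2v = Word2Vec(corpus, vector_size=100, window=5, sg=1, negative=5, min_count=1)

    print(w2v.wv["ERROR"])  # 100-dimensional word vector, if "ERROR" occurs in the logs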
In the skip-gram model, the probabilities of the neighboring words are trained from a given word in a sentence (the central word w(t)) using a negative sampling model. The maximum distance from a neighboring word to the central word is called the window, as shown in FIG. 2.
In FIG. 2, w denotes a word and t denotes the position of the word w in a sentence. From the central word w(t), the probabilities of the words at positions t-2, t-1, t+1, and t+2 are computed by maximum likelihood estimation.
In the CBOW model, the opposite of the skip-gram model, the probability of the central word is trained from its neighboring words with a negative sampling model, as shown in FIG. 3.
Training the negative sampling model (taking the skip-gram model as an example): first assume that the probability of the central word w_c appearing in the same window as an in-window word w_o, and the probabilities of the central word not co-occurring with each out-of-window word w_k, are mutually independent; the probability calculation model is then formula (1):

P(w_o | w_c) = P(D=1 | w_o, w_c) · ∏_{k=1}^{K} P(D=0 | w_k, w_c) = σ(u_o · v_c) · ∏_{k=1}^{K} σ(−u_k · v_c)  (1)

where D is a flag for whether w_o and w_c lie within the same window (1 means they share a window, 0 means they do not), σ is the sigmoid function, u_o, u_k, and v_c are the vectors of w_o, w_k, and w_c, and K is the number of negative samples. That is, the probability of the in-window word w_o is computed from the central word w_c.
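A worked sketch of formula (1) under these definitions follows; NumPy and the variable names are illustrative assumptions.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def window_probability(v_c, u_o, negatives):
        """Formula (1): probability that w_o co-occurs with the central word w_c
        while each sampled out-of-window word w_k does not, assuming independence."""
        p = sigmoid(u_o @ v_c)        # P(D=1 | w_o, w_c)
        for u_k in negatives:         # K negative samples
            p *= sigmoid(-u_k @ v_c)  # P(D=0 | w_k, w_c)
        return p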
After the text data is input into the word2vec model, the model first counts the frequency of each word in the text, then computes a probability for each word according to formula (1), and finally outputs the vector representations of all words. For the word vectors obtained by word2vec training, the larger the cosine similarity between two different word vectors, the more similar the semantics of the two words.
The dimension of the word vectors may be set according to the amount of data and is typically set to 100. Using the obtained word vectors, the sentences in the log text are converted into sentence vectors, and the sentence vectors are input into the long short-term memory neural network model and trained to obtain the binary classification model.
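The text does not spell out how word vectors are composed into a sentence vector; averaging them, as in the sketch below, is one common choice and is assumed here.

    import numpy as np

    def sentence_vector(message, w2v, dim=100):
        """Average the word vectors of a log message; zero vector if no word is known."""
        vectors = [w2v.wv[word] for word in message.split() if word in w2v.wv]
        return np.mean(vectors, axis=0) if vectors else np.zeros(dim, dtype=np.float32)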
Training of the long short-term memory neural network model (as shown in FIG. 4): the sentence vectors are input into the long short-term memory neural network model sequentially in time order.
When the long short-term memory neural network model is trained, the vectors are input into the neural unit, and the knowledge information processed by the input gate, output gate, and forget gate is stored in the knowledge base C. The knowledge processed by the current neural unit is output to h_{t+1}, and this h_{t+1} is input into the next neural unit. The activation function of the forget gate is the sigmoid function, whose value range [0, 1] serves as the weight parameter: 0 represents complete forgetting, 1 represents complete remembering, and values in between represent partial forgetting. The data remaining after the current vector's forgetting is taken as the vector inner product of the weight and the knowledge base, realizing partial forgetting of old knowledge.
The input gate operates on the combination of the input sentence vector and the output vector of the previous neural unit. First, the memory weight is obtained from the sigmoid function of the current vector as a vector inner product. Second, the knowledge is obtained as the tanh value of the current vector. Third, the latest, partially forgotten knowledge is obtained as the vector inner product of the memory weight and the knowledge. Finally, the new knowledge is merged into the knowledge base. The output gate produces the current neural unit's output by taking the tanh value of the knowledge base vector and computing its inner product with the weight of the input gate.
As shown in FIG. 4, each neural unit yields an h_{t+1} vector. All h_{t+1} vectors are then input into an average pooling layer for further processing, as shown in FIG. 5.
The role of the average pooling layer in FIG. 5 is to pool all h_{t+1} vectors by averaging adjacent h_{t+1} vectors.
Finally, the vector from the average pooling layer is input into a regression classification layer, and the averaged vector is classified into two classes with a regression classification method.
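One plausible realization of the LSTM, average-pooling, and regression-classification layers is sketched below in PyTorch; the framework, the hidden size, and the optimizer are assumptions, as the text fixes only the LSTM, average pool, and two-class regression pipeline.

    import torch
    import torch.nn as nn

    class LogClassifier(nn.Module):
        """LSTM over a sequence of sentence vectors -> mean pooling -> 2-class logits."""
        def __init__(self, input_dim=100, hidden_dim=64):
            super().__init__()
            self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
            self.classifier = nn.Linear(hidden_dim, 2)  # regression classification layer

        def forward(self, x):                # x: (batch, seq_len, input_dim)
            outputs, _ = self.lstm(x)        # the h_{t+1} output of every neural unit
            pooled = outputs.mean(dim=1)     # average-value pooling over time steps
            return self.classifier(pooled)   # logits for normal vs. abnormal

    model = LogClassifier()
    criterion = nn.CrossEntropyLoss()  # trained with labeled normal/abnormal sequences
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)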
(2) Model application stage.
At this stage, a new log file is preprocessed, word vectors are extracted, the sentences in the log file are converted into sentence vectors, and the sentence vectors are input into the trained long short-term memory neural network model; a new log file is one whose log timestamps fall after those of the training logs. The trained long short-term memory neural network model judges whether each input log is an abnormal log, and the logs judged abnormal are finally output to a designated file for the operations and maintenance staff to review.
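Continuing the sketches above, the application stage might look as follows; the checkpoint name, the label convention (class 1 = abnormal), and treating each message as a length-1 sequence are assumptions.

    model.load_state_dict(torch.load("lstm_log_classifier.pt"))  # hypothetical checkpoint
    model.eval()

    with open("new.log", encoding="utf-8") as f:  # placeholder path for the new log file
        new_messages = preprocess(f)

    abnormal = []
    with torch.no_grad():
        for message in new_messages:
            vec = torch.as_tensor(sentence_vector(message, w2v), dtype=torch.float32)
            logits = model(vec.view(1, 1, -1))    # one message as a length-1 sequence
            if logits.argmax(dim=1).item() == 1:  # class 1 = abnormal (assumed)
                abnormal.append(message)

    with open("abnormal.log", "w", encoding="utf-8") as out:  # designated output file
        out.write("\n".join(abnormal))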
Software in the operating environment of a distributed system generates a large amount of log data every day. Depending on the enterprise architecture, the log data generated by the software is stored at different locations in the distributed system. The log data on the log server records the running state of the software, including software error information, correct-operation information, and interaction information.
The application scenario of the method mainly comprises (1) a distributed system, (2) a log server, and (3) a workstation. The data generated by the system can be analyzed on the server storing the distributed data, and a corresponding software working model extracted. Based on this model, maintenance personnel judge whether log information subsequently output by a program contains errors: if the program's state log information is not covered by the model established for the program, the program behaved abnormally at runtime.
The abnormal log classification model is generated automatically on the basis of the long short-term memory neural network model: the method trains all log information into numeric vectors with the word2vec model, inputs these vectors into the long short-term memory neural network model, and obtains the binary classification model by training.
The invention converts the originally manual identification of abnormal logs into automatic identification, reducing the error risk caused by manual identification and the labor and time costs of manual log inspection. The beneficial effects of the key technical points are as follows:
(1) the process of manually searching a log set for abnormal logs is simplified, and automatic log classification is realized;
(2) a classification model of the logs generated during software operation is established from massive log data;
(3) the text file output is convenient for maintenance personnel to check and understand manually;
(4) errors in manually screening abnormal logs are reduced;
(5) the labor and time costs of frequently updating log features are reduced.
It will be understood that modifications and variations can be made by persons skilled in the art in light of the above teachings and all such modifications and variations are intended to be included within the scope of the invention as defined in the appended claims.

Claims (3)

1. A distributed abnormal log automatic identification method based on deep learning is characterized by comprising the following steps:
1) acquiring normal and abnormal log sets and preprocessing them: intercepting the timestamp of each log, sorting the log messages in the log file by their timestamp strings, and then filtering the timestamp strings out of each log;
2) based on the preprocessed log, training by using a word2vec model to obtain a word vector of each word in the log;
3) converting sentences in the log text into sentence vectors by using the obtained word vectors;
4) inputting the sentence vectors into a long short-term memory neural network model and training to obtain a binary classification model;
5) preprocessing a new log file, extracting word vectors, converting the sentences in the log file into sentence vectors, inputting the sentence vectors into the trained long short-term memory neural network model, and judging whether the input log is an abnormal log; the new log file is one whose log messages occur later than those of the training log files.
2. The deep-learning-based distributed abnormal log automatic identification method of claim 1, wherein in step 2) the word vector of each word in the log is obtained by word2vec model training, the word2vec training uses the skip-gram or CBOW word model, and negative sampling is adopted to obtain the word vectors.
3. The deep-learning-based distributed abnormal log automatic identification method of claim 1, wherein the training process of the long short-term memory neural network model in step 4) is as follows:
4.1) each neural unit's input vector X is a sentence vector, and the sentence vectors are input into the long short-term memory neural network model sequentially in time order;
4.2) after the input vector of each neural unit is processed by the forget gate, input gate, and output gate, the knowledge information is stored in a knowledge base C; the knowledge processed by the current neural unit is output to h_{t+1}, and meanwhile the h_{t+1} knowledge output by the previous neural unit is input into the next neural unit;
the activation function of the forget gate is a sigmoid function, and the data remaining after the current vector's forgetting is taken as the vector inner product of the weight and the knowledge base, realizing partial forgetting of old knowledge;
the input gate operates on the combination of the input sentence vector and the output vector of the previous neural unit, specifically: first, the memory weight is obtained from the sigmoid function of the current vector as a vector inner product; second, the knowledge is obtained as the tanh value of the current vector; third, the latest, partially forgotten knowledge is obtained as the vector inner product of the memory weight and the knowledge; finally, the new knowledge is merged into the knowledge base;
the output gate produces the current neural unit's output by taking the tanh value of the knowledge base vector and computing its inner product with the weight of the input gate;
4.3) each neural unit yields an h_{t+1} vector, and all h_{t+1} vectors are input into an average pooling layer;
4.4) the vector from the average pooling layer is input into a regression classification layer, and the averaged vector is classified with a regression classification method, yielding the binary classification model of the long short-term memory neural network.
CN202010333973.2A 2020-04-24 2020-04-24 Distributed abnormal log automatic identification method based on deep learning Pending CN111611218A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010333973.2A CN111611218A (en) 2020-04-24 2020-04-24 Distributed abnormal log automatic identification method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010333973.2A CN111611218A (en) 2020-04-24 2020-04-24 Distributed abnormal log automatic identification method based on deep learning

Publications (1)

Publication Number Publication Date
CN111611218A (en) 2020-09-01

Family

ID=72194717

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010333973.2A Pending CN111611218A (en) 2020-04-24 2020-04-24 Distributed abnormal log automatic identification method based on deep learning

Country Status (1)

Country Link
CN (1) CN111611218A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112487406A (en) * 2020-12-02 2021-03-12 中国电子科技集团公司第三十研究所 Network behavior analysis method based on machine learning
CN112698977A (en) * 2020-12-29 2021-04-23 下一代互联网重大应用技术(北京)工程研究中心有限公司 Server fault positioning method, device, equipment and medium
CN112711665A (en) * 2021-01-18 2021-04-27 武汉大学 Log anomaly detection method based on density weighted integration rule
CN113239684A (en) * 2021-06-04 2021-08-10 清华大学 Method and device for automatically identifying abnormal log based on partial mark
CN113468035A (en) * 2021-07-15 2021-10-01 创新奇智(重庆)科技有限公司 Log anomaly detection method and device, training method and device and electronic equipment
US20230004750A1 (en) * 2021-06-30 2023-01-05 International Business Machines Corporation Abnormal log event detection and prediction
CN116069540A (en) * 2023-02-24 2023-05-05 北京关键科技股份有限公司 Acquisition, analysis and processing method and device for running state of software and hardware parts of system

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106156003A (en) * 2016-06-30 2016-11-23 北京大学 A kind of question sentence understanding method in question answering system
US20170068709A1 (en) * 2015-09-09 2017-03-09 International Business Machines Corporation Scalable and accurate mining of control flow from execution logs across distributed systems
CN106815639A (en) * 2016-12-27 2017-06-09 东软集团股份有限公司 The abnormal point detecting method and device of flow data
CN107239445A (en) * 2017-05-27 2017-10-10 中国矿业大学 The method and system that a kind of media event based on neutral net is extracted
US20180033144A1 (en) * 2016-09-21 2018-02-01 Realize, Inc. Anomaly detection in volumetric images
CN107885853A (en) * 2017-11-14 2018-04-06 同济大学 A kind of combined type file classification method based on deep learning
CN108399201A (en) * 2018-01-30 2018-08-14 武汉大学 A kind of Web user access path prediction technique based on Recognition with Recurrent Neural Network
US10049321B2 (en) * 2014-04-04 2018-08-14 Knowmtech, Llc Anti-hebbian and hebbian computing with thermodynamic RAM
CN108763542A (en) * 2018-05-31 2018-11-06 中国华戎科技集团有限公司 A kind of Text Intelligence sorting technique, device and computer equipment based on combination learning
US10416264B2 (en) * 2016-11-22 2019-09-17 Hyperfine Research, Inc. Systems and methods for automated detection in magnetic resonance images
CN110427298A (en) * 2019-07-10 2019-11-08 武汉大学 A kind of Automatic Feature Extraction method of distributed information log
CN110502389A (en) * 2019-07-01 2019-11-26 无锡天脉聚源传媒科技有限公司 A kind of server exception monitoring method, system, device and storage medium

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10049321B2 (en) * 2014-04-04 2018-08-14 Knowmtech, Llc Anti-hebbian and hebbian computing with thermodynamic RAM
US20170068709A1 (en) * 2015-09-09 2017-03-09 International Business Machines Corporation Scalable and accurate mining of control flow from execution logs across distributed systems
CN106156003A (en) * 2016-06-30 2016-11-23 北京大学 A kind of question sentence understanding method in question answering system
US20180033144A1 (en) * 2016-09-21 2018-02-01 Realize, Inc. Anomaly detection in volumetric images
US10416264B2 (en) * 2016-11-22 2019-09-17 Hyperfine Research, Inc. Systems and methods for automated detection in magnetic resonance images
CN106815639A (en) * 2016-12-27 2017-06-09 东软集团股份有限公司 The abnormal point detecting method and device of flow data
CN107239445A (en) * 2017-05-27 2017-10-10 中国矿业大学 The method and system that a kind of media event based on neutral net is extracted
CN107885853A (en) * 2017-11-14 2018-04-06 同济大学 A kind of combined type file classification method based on deep learning
CN108399201A (en) * 2018-01-30 2018-08-14 武汉大学 A kind of Web user access path prediction technique based on Recognition with Recurrent Neural Network
CN108763542A (en) * 2018-05-31 2018-11-06 中国华戎科技集团有限公司 A kind of Text Intelligence sorting technique, device and computer equipment based on combination learning
CN110502389A (en) * 2019-07-01 2019-11-26 无锡天脉聚源传媒科技有限公司 A kind of server exception monitoring method, system, device and storage medium
CN110427298A (en) * 2019-07-10 2019-11-08 武汉大学 A kind of Automatic Feature Extraction method of distributed information log

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Min Du et al., "DeepLog: Anomaly Detection and Diagnosis from System Logs", Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security. *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112487406A (en) * 2020-12-02 2021-03-12 中国电子科技集团公司第三十研究所 Network behavior analysis method based on machine learning
CN112698977A (en) * 2020-12-29 2021-04-23 下一代互联网重大应用技术(北京)工程研究中心有限公司 Server fault positioning method, device, equipment and medium
CN112698977B (en) * 2020-12-29 2024-03-29 赛尔网络有限公司 Method, device, equipment and medium for positioning server fault
CN112711665A (en) * 2021-01-18 2021-04-27 武汉大学 Log anomaly detection method based on density weighted integration rule
CN112711665B (en) * 2021-01-18 2022-04-15 武汉大学 Log anomaly detection method based on density weighted integration rule
CN113239684A (en) * 2021-06-04 2021-08-10 清华大学 Method and device for automatically identifying abnormal log based on partial mark
US20230004750A1 (en) * 2021-06-30 2023-01-05 International Business Machines Corporation Abnormal log event detection and prediction
CN113468035A (en) * 2021-07-15 2021-10-01 创新奇智(重庆)科技有限公司 Log anomaly detection method and device, training method and device and electronic equipment
CN113468035B (en) * 2021-07-15 2023-09-29 创新奇智(重庆)科技有限公司 Log abnormality detection method, device, training method, device and electronic equipment
CN116069540A (en) * 2023-02-24 2023-05-05 北京关键科技股份有限公司 Acquisition, analysis and processing method and device for running state of software and hardware parts of system

Similar Documents

Publication Publication Date Title
CN111611218A (en) Distributed abnormal log automatic identification method based on deep learning
US20220405592A1 (en) Multi-feature log anomaly detection method and system based on log full semantics
CN108256074B (en) Verification processing method and device, electronic equipment and storage medium
CN111309912A (en) Text classification method and device, computer equipment and storage medium
CN112270379A (en) Training method of classification model, sample classification method, device and equipment
CN111860981B (en) Enterprise national industry category prediction method and system based on LSTM deep learning
CN112036185B (en) Method and device for constructing named entity recognition model based on industrial enterprise
CN113468317B (en) Resume screening method, system, equipment and storage medium
CN113986864A (en) Log data processing method and device, electronic equipment and storage medium
US20210004603A1 (en) Method and apparatus for determining (raw) video materials for news
CN112579414A (en) Log abnormity detection method and device
CN115359799A (en) Speech recognition method, training method, device, electronic equipment and storage medium
CN113672732A (en) Method and device for classifying business data
CN113806198B (en) System state diagnosis method based on deep learning
CN115170027A (en) Data analysis method, device, equipment and storage medium
CN112579777B (en) Semi-supervised classification method for unlabeled text
CN112417852B (en) Method and device for judging importance of code segment
CN110866172B (en) Data analysis method for block chain system
CN115758211B (en) Text information classification method, apparatus, electronic device and storage medium
CN113407716B (en) Human behavior text data set construction and processing method based on crowdsourcing
CN114610613A (en) Online real-time micro-service call chain abnormity detection method
CN113900935A (en) Automatic defect identification method and device, computer equipment and storage medium
CN112698977B (en) Method, device, equipment and medium for positioning server fault
CN113872794B (en) IT operation and maintenance platform system based on cloud resource support and operation and maintenance method thereof
CN118093785A (en) Distributed collaboration-oriented avionic fault knowledge fusion method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20200901