Large-scale log data anomaly detection method and device and storage medium
Technical Field
The invention belongs to the technical field of data security detection, and particularly relates to a method and a device for detecting abnormality of large-scale log data and a storage medium.
Background
During the operation of a Hadoop cluster, a large volume of log information is generated, such as service logs and audit logs. These logs record the running state of the system, its security events, and the relationships among those events, so security-relevant information about the running system can be mined from them. Existing log anomaly detection methods are based on rule bases, mathematical statistics, machine learning algorithms, deep neural networks, and the like. Rule-base methods rely mainly on rule matching; they are accurate, but they are limited to specific scenarios, target only specific log types, and have difficulty analyzing unknown security events. Statistical methods determine a normal value range by computing statistics over real-time data; they can discover unknown security events, but the statistical thresholds are difficult to set and the types of security events are hard to distinguish. Machine learning methods build a mining model and adjust it iteratively; they reduce manual rule coding and reliance on experience, but the algorithms are more complex and harder to implement. Deep learning methods use a long short-term memory (LSTM) network, a type of recurrent neural network (RNN), to train on sequences of log entries, learn and update log patterns from normal execution paths, and flag log anomalies.
Log-based anomaly detection mainly comprises the following steps: log collection, log parsing, feature extraction, and anomaly detection. Log-template-based anomaly detection first performs basic cleaning of the logs, then clusters the logs by edit distance to form log templates, and finally performs anomaly detection separately on the log template information and on the parameter vector information contained in each log.
However, log anomaly detection faces several difficulties:
1. The log structure and semantics of different systems differ completely. Rule-base methods match logs against rules, which requires a great deal of domain expertise and cannot yield a general-purpose log anomaly detection system. A common log parsing method is needed to handle log patterns with different structures.
2. System concurrency poses another challenge. Because logs record the current state of the system and its changes in real time, when many threads or jobs execute concurrently, the logs they produce introduce a large deviation into a generative model.
3. Logs contain a large amount of information, and anomaly detection works better when different parts of that information are fused. Existing methods use only part of the information, which limits the types of anomalous log activity they can detect.
4. Existing methods do not consider the probability distribution of each log during large-scale log detection, which leads to low anomaly detection efficiency.
Disclosure of Invention
The invention aims to solve the problem that current anomaly detection does not take the probability distribution of each log into account during large-scale log detection, which makes anomaly detection inefficient.
In order to achieve the technical purpose, the invention adopts the following technical scheme.
The invention provides a method for detecting abnormality of large-scale log data, which comprises the following steps: inputting the selected log sequence with the set length into a pre-constructed machine learning prediction model, and outputting the conditional probability of each log template at the current position; screening the log templates according to the conditional probability of each log template to obtain a candidate log template set;
analyzing the log to be detected to obtain a log template of the log to be detected; and judging whether the log template corresponding to the log to be detected belongs to the candidate log template set, if so, judging the log to be normal, and if not, judging the log to be abnormal.
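The two steps above can be sketched in a few lines. This is a minimal illustration, not the claimed implementation: the function names, the example templates and probabilities, and the fixed cut-off `top_g` used as the screening rule are all assumptions made for the example.

```python
from typing import Dict, Set


def screen_candidates(cond_prob: Dict[str, float], top_g: int) -> Set[str]:
    """Keep the top_g most probable templates (placeholder screening rule)."""
    ranked = sorted(cond_prob, key=cond_prob.get, reverse=True)
    return set(ranked[:top_g])


def is_normal(template: str, candidates: Set[str]) -> bool:
    """A log is judged normal only if its template is in the candidate set."""
    return template in candidates


# Hypothetical model output for the current position.
probs = {
    "Receiving block src dest": 0.70,
    "PacketResponder terminating": 0.25,
    "Deleting block file": 0.05,
}
candidates = screen_candidates(probs, top_g=2)
print(is_normal("Receiving block src dest", candidates))
print(is_normal("Deleting block file", candidates))
```

The membership test itself is independent of how the candidate set is screened, so the placeholder rule can be swapped for another screening method without changing `is_normal`.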
Further, the machine learning prediction model is a long short-term memory (LSTM) neural network prediction model, and its construction method comprises the following steps:
parsing pre-collected original logs to obtain the parameters and the log template of each original log; selecting the log templates corresponding to correctly executed logs as a training set to train a pre-constructed LSTM neural network prediction model; calculating, based on the current context, the conditional probability of occurrence of every log template; and outputting the conditional probability of each log template at the current position.
Furthermore, in order to obtain a better training effect, the method further comprises reordering the original logs according to the order in which the task identifiers appear before parsing the pre-collected original logs.
Further, the log templates are screened according to the conditional probability of each log template by using the discounted cumulative gain (DCG) method to obtain the candidate log template set.
In a second aspect, the present invention provides an anomaly detection apparatus for large-scale log data, including:
the log analysis module is used for analyzing the log to be detected to obtain a log template;
the log template candidate set determining module is used for inputting the selected log sequence with the set length into a pre-constructed machine learning prediction model and outputting the conditional probability of each log template at the current position; screening the log templates according to the conditional probability of each log template to obtain a candidate log template set;
and the log abnormity detection module is used for judging whether the log template corresponding to the log to be detected obtained by the log analysis module belongs to the candidate log template set, if so, judging the log to be normal, and if not, judging the log to be abnormal.
Further, the log template candidate set determining module comprises a machine learning prediction model building module, and is configured to input a selected log sequence with a set length and output a conditional probability that each log template appears at a current position.
Further, the anomaly detection device further comprises a log collection module for collecting original logs.
Further, the log template candidate set determining module further comprises a log reordering module, which is configured to reorder the original logs according to an order of occurrence of the task identifiers before analyzing the original logs collected by the log collecting module in advance.
The invention also provides a computer-readable storage medium, in which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the method as provided in the above solution.
The beneficial technical effects are as follows:
The invention uses a neural network for anomaly detection and predicts the conditional probability of the subsequent log from the current log sequence. From the prediction result, the top-ranked logs are selected as the candidate log set; during online detection, if the actual log is in the candidate log set, it is judged to be normal, and otherwise it is judged to be abnormal. On this basis, the method also uses the information in the parameter values: a parameter value vector anomaly detection model is obtained by fitting the training data, and during online detection, if the actual data fall within the high-confidence interval of that model, the log is judged to be normal, and otherwise abnormal. Because the invention is a self-learning method, the anomaly detection model can be updated incrementally to adapt to new log patterns.
According to the method and the device, the original logs are reordered and parsed, and the machine learning model is trained on the log templates and parameters obtained from parsing, so the detection effect is better.
Drawings
FIG. 1 is a schematic diagram illustrating an anomaly detection method for large-scale log data according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating log parsing in an embodiment of the present invention;
FIG. 3 is a schematic diagram of log sequence ordering and conditional probability in an embodiment of the present invention;
FIG. 4 is a schematic diagram of an anomaly detection model in an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of an anomaly detection apparatus for large-scale log data according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings.
In a first embodiment, a method for detecting anomalies in large-scale log data, as shown in FIG. 1, includes the following steps: inputting the selected log sequence with the set length into a pre-constructed machine learning prediction model, and outputting the conditional probability of each log template at the current position; screening the log templates according to the conditional probability of each log template to obtain a candidate log template set;
analyzing the log to be detected to obtain a log template of the log to be detected; and judging whether the log template corresponding to the log to be detected belongs to the candidate log template set, if so, judging the log to be normal, and if not, judging the log to be abnormal.
Observation and study of log sequences show that, in a real environment, the number of possible subsequent logs differs greatly between log sequences: some log sequences may be followed by many different logs, while others have only one possible successor. For example, as shown in FIG. 1, when the number of candidate logs is small, the probability distribution calculated by the model is more concentrated, and when the number of candidate logs is large, the distribution is more even. Screening candidates according to this distribution, rather than with a fixed number of candidates, addresses this shortcoming of the prior art.
In this embodiment, the machine learning prediction model is a long short-term memory (LSTM) neural network prediction model, and its construction method includes the following training steps:
parsing pre-collected original logs to obtain the parameters and the log template of each original log; selecting the log templates corresponding to correctly executed logs as a training set to train a pre-constructed LSTM neural network prediction model; calculating, based on the current context, the conditional probability of occurrence of every log template; and outputting the conditional probability of each log template at the current position.
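To make the shape of the training data concrete, the sketch below cuts a sequence of log template identifiers into (window, next template) pairs and then estimates the conditional probability of the next template by simple counting. The counting step is a deliberate stand-in for the LSTM, used only so the example runs without a deep learning library; the integer template identifiers, the window length `h`, and the function names are assumptions.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Window = Tuple[int, ...]


def make_pairs(keys: List[int], h: int) -> List[Tuple[Window, int]]:
    """Slide a window of length h over the sequence; label = next template."""
    return [(tuple(keys[i:i + h]), keys[i + h]) for i in range(len(keys) - h)]


def empirical_conditional(
    pairs: List[Tuple[Window, int]],
) -> Dict[Window, Dict[int, float]]:
    """Count-based estimate of P(next template | window), not an LSTM."""
    counts: Dict[Window, Counter] = defaultdict(Counter)
    for window, nxt in pairs:
        counts[window][nxt] += 1
    return {
        window: {k: c / sum(ctr.values()) for k, c in ctr.items()}
        for window, ctr in counts.items()
    }


# Hypothetical template-id sequence from correctly executed logs.
keys = [1, 2, 3, 1, 2, 4]
pairs = make_pairs(keys, h=2)
cond = empirical_conditional(pairs)
print(pairs)
print(cond[(1, 2)])
```

In the embodiment the same (window, next template) pairs would be fed to the LSTM, whose output layer gives a probability for every template rather than only for those seen after a given window.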
The construction method of the LSTM neural network prediction model further comprises a verification stage: a verification data set is extracted to test the model, and the optimal hyper-parameters are selected according to the results to optimize the model. The construction method further comprises a testing stage: the effect of the model is tested using the trained model and the selected optimal hyper-parameters. The anomaly detection model is shown in FIG. 4.
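The verification stage can be pictured as a small grid search. The sketch below is only an illustration under several assumptions: the hyper-parameter being tuned is a candidate cut-off `g`, each validation sample is summarized by the rank of the actual next template in the model's ordering (1 is the most probable), and the F1 score is used as the selection criterion. None of these choices is specified in the source.

```python
from typing import List, Sequence


def f1(pred: Sequence[bool], truth: Sequence[bool]) -> float:
    """F1 score where True means 'abnormal'."""
    tp = sum(p and t for p, t in zip(pred, truth))
    fp = sum(p and not t for p, t in zip(pred, truth))
    fn = sum(t and not p for p, t in zip(pred, truth))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def select_cutoff(grid: List[int], ranks: List[int], truth: List[bool]) -> int:
    """Pick the cut-off g whose predictions score best on the validation set."""
    def score(g: int) -> float:
        # A log is flagged abnormal when its template ranks below the cut-off.
        return f1([r > g for r in ranks], truth)
    return max(grid, key=score)


# Hypothetical validation data: rank of the observed template, and its label.
val_ranks = [1, 2, 1, 5, 3, 7]
val_abnormal = [False, False, False, True, False, True]
best_g = select_cutoff([1, 2, 3, 5], val_ranks, val_abnormal)
print(best_g)
```

The testing stage would then evaluate the model once, with the selected value fixed, on data that played no part in the selection.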
The collection and processing of the original logs in this embodiment are as follows. First, component installation, log configuration, and log collection are completed for 10 Hadoop components (HDFS, Yarn, Hbase, MapReduce, Hive, Pig, Kafka, Zookeeper, Storm and Spark).
The component logs in the Hadoop system are divided into service logs and audit logs. Service logs are generated by the system by default and are the running logs of each component; for example, the running log generated by an Application Master records in detail the start time, running time, number of mappers, number of reducers, Counter information, and so on of a MapReduce job.
Because the log templates generated by different system platforms differ, before the collected logs are used as training data for the machine learning model, this embodiment parses each log into two parts, a log template and a parameter value vector, both of which can be used to train the machine learning model.
Log parsing is then performed on the collected original logs. Based on an analysis of the format and structure of the collected component logs and a prior survey of mainstream log parsing methods, this embodiment adopts a general-purpose log parsing program.
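A parsing program of this kind separates each log line into a template and a parameter value vector. The sketch below shows one simple way to do that with a token-level longest common subsequence (LCS): the tokens shared by two lines of the same cluster form the template, and the remaining tokens are the parameters. It is a simplified illustration, not the embodiment's parser: the tokenizer (whitespace split, trailing colons stripped), the second log line, and the function names are assumptions, and clustering and edit distance are left out.

```python
from typing import List


def tokenize(line: str) -> List[str]:
    """Split on whitespace and drop trailing colons (assumed tokenizer)."""
    return [tok.rstrip(":") for tok in line.split()]


def lcs(a: List[str], b: List[str]) -> List[str]:
    """Longest common subsequence of two token lists (dynamic programming)."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    out, i, j = [], 0, 0
    while i < m and j < n:
        if a[i] == b[j]:
            out.append(a[i])
            i, j = i + 1, j + 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return out


def split_params(line: str, template: List[str]) -> List[str]:
    """Tokens of the line that are not matched by the template, in order."""
    params, k = [], 0
    for tok in tokenize(line):
        if k < len(template) and tok == template[k]:
            k += 1
        else:
            params.append(tok)
    return params


log_a = "Receiving block blk_044 src: /10.251.215.16:55695 dest: /10.251.215.16:50010"
log_b = "Receiving block blk_117 src: /10.251.30.6:33145 dest: /10.251.30.6:50010"  # made up
template = lcs(tokenize(log_a), tokenize(log_b))
print(" ".join(template))
print(split_params(log_a, template))
```

With only two lines the template is whatever they have in common; a real parser would refine it over a whole cluster of lines and count how often each template occurs.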
The parsing program combines the longest common subsequence (LCS) algorithm and the minimum edit distance algorithm to cluster logs, and the log parsing module parses each original log into a log template and a parameter value vector. For example, from the log "Receiving block blk_044 src: /10.251.215.16:55695 dest: /10.251.215.16:50010", the log template "Receiving block src dest" and the parameter value vector "[blk_044, /10.251.215.16:55695, /10.251.215.16:50010]" are parsed out. Since the log parsing algorithm is not the focus of the present invention, the experiment of this embodiment implements the LCS algorithm and uses that program for log parsing. Specifically, logs are clustered based on the LCS algorithm, log templates are extracted, and the frequency of each template is counted. The parsed output consists of the parameters decomposed from each original log and the log template sequence corresponding to the original logs. Log parsing is illustrated schematically in FIG. 2.
An LSTM neural network is used to learn normal log sequences automatically; the conditional probability of every possible log is then calculated based on the current context, and the logs are sorted in descending order of probability. Finally, in the same way that the DCG (Discounted Cumulative Gain) algorithm is used to measure the quality of search engine results, the DCG algorithm is used here to determine the number of candidate logs, against which each log is judged.
The present invention considers the probability that a log occurs in the current context: the more strongly a log is correlated with the other normal logs, the more frequently it occurs, and vice versa. Logs with higher relevance can be considered normal, and those with lower relevance anomalous. This is similar to the design principle of search engines, where more relevant results are ranked higher. Because the probability distribution and contextual correlation of each log are fully considered during large-scale log detection, the efficiency of anomaly detection on large-scale log data is improved.
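The source does not spell out how DCG fixes the number of candidates, so the following is one plausible reading rather than the described method: treat each conditional probability as the relevance of its template, discount it by its rank with the standard factor 1 / log2(rank + 1), and keep the shortest prefix of the ranking whose cumulative gain reaches a fraction `ratio` of the total. The function name and the `ratio` parameter are assumptions.

```python
import math
from typing import Dict, List


def dcg_candidates(cond_prob: Dict[str, float], ratio: float = 0.9) -> List[str]:
    """Shortest ranked prefix whose discounted gain reaches ratio * total DCG."""
    ranked = sorted(cond_prob.items(), key=lambda kv: kv[1], reverse=True)
    # Rank positions start at 1, so position i (0-based) is discounted by log2(i + 2).
    gains = [p / math.log2(i + 2) for i, (_, p) in enumerate(ranked)]
    total = sum(gains)
    if total == 0:
        return []
    picked, running = [], 0.0
    for (template, _), gain in zip(ranked, gains):
        picked.append(template)
        running += gain
        if running >= ratio * total:
            break
    return picked


# A concentrated distribution yields few candidates, an even one yields many.
peaked = {"t1": 0.97, "t2": 0.02, "t3": 0.01}
even = {"t1": 0.25, "t2": 0.25, "t3": 0.25, "t4": 0.25}
print(dcg_candidates(peaked))
print(dcg_candidates(even))
```

Under this reading the size of the candidate set adapts to the shape of the distribution, which is the behaviour a fixed cut-off cannot provide.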
The second embodiment provides an anomaly detection method for large-scale log data based on the first embodiment; the method further includes reordering the original logs according to the order in which the task identifiers appear before parsing the pre-collected original logs. Anomaly detection based on sequence prediction works because a model trained on normal logs can mine and recognize the normal behavior patterns in the logs and thereby predict subsequent logs. However, many jobs execute concurrently in the system, so the logs of one task (uniquely identified by session_id) share the same session_id but are not contiguous in the original log. To obtain a better training effect, the original logs are therefore reordered according to the order in which the session_id of each task appears: the logs generated by one task are placed together, and the interleaved logs generated by multiple concurrent tasks are thereby separated and put in order.
Specifically, based on the log template sequence obtained in the previous step, the log template sequence is reordered by looking up the corresponding session_id in the original log (for example, the blk_id in Hadoop file system logs), and the logs generated by concurrent execution are arranged in the same interval. This decomposes the interleaved tasks so that the model can be trained better.
For the reordered (regular) logs, a data preprocessing program is first executed to extract the correctly executed logs as a training set, and model training starts once the model parameters have been set. To allow the model to be updated iteratively from user feedback later on, the module also includes a model updating program that can dynamically adjust the weights of the data.
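The reordering step amounts to a stable grouping by session identifier. In the sketch below, each entry is a (session identifier, template) pair; sessions are emitted in the order in which their identifier first appears, and the original order is preserved within each session. The pair representation, the example identifiers, and the use of first appearance as the ordering key are assumptions.

```python
from typing import Dict, List, Tuple

Entry = Tuple[str, str]  # (session id, log template)


def reorder_by_session(entries: List[Entry]) -> List[Entry]:
    """Group interleaved entries by session, keeping first-appearance order."""
    groups: Dict[str, List[Entry]] = {}
    for sid, template in entries:
        # dict preserves insertion order, so sessions stay in first-seen order.
        groups.setdefault(sid, []).append((sid, template))
    return [entry for group in groups.values() for entry in group]


# Two concurrent tasks whose logs are interleaved in the original file.
interleaved = [
    ("blk_1", "Receiving block src dest"),
    ("blk_2", "Receiving block src dest"),
    ("blk_1", "PacketResponder terminating"),
    ("blk_2", "PacketResponder terminating"),
]
for entry in reorder_by_session(interleaved):
    print(entry)
```

After reordering, a sliding window over the sequence sees only templates from a single task, apart from the windows that straddle a session boundary.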
In this embodiment, a Long Short-Term Memory (LSTM) neural network is likewise selected as the sequence prediction model; the prediction model is trained on normally executed log data and then used for online prediction. The model takes as input a log sequence of a given length (the window length) and outputs the conditional probability of each log template appearing at the current position (as shown in FIG. 3). Instead of determining the candidate log set with the maximum branch number algorithm of conventional methods, this embodiment determines the number of candidate logs with the Discounted Cumulative Gain (DCG) algorithm. During online detection, if the subsequent log is in the candidate log set, it is judged to be normal (the result is recorded as 1); otherwise, it is judged to be abnormal (the result is recorded as 0).
The third embodiment provides an anomaly detection method for large-scale log data based on the above embodiments, in which logs are judged item by item. In this embodiment, the object of log anomaly detection is an application or session uniquely identified by a block_id or session_id in the logs. The application (or session) is considered normal only when all logs in the session are judged to be normal; otherwise, it is judged to be abnormal. An anomaly detection model is provided accordingly.
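The session-level rule can be sketched as follows, using the convention from the text that a per-log result of 1 means normal and 0 means abnormal. The function name and the example identifiers are assumptions.

```python
from typing import Dict, Iterable, Tuple


def judge_sessions(results: Iterable[Tuple[str, int]]) -> Dict[str, bool]:
    """Map each session id to True (normal) only if every log result is 1."""
    verdict: Dict[str, bool] = {}
    for sid, flag in results:
        verdict[sid] = verdict.get(sid, True) and flag == 1
    return verdict


# Hypothetical per-log results: (session id, 1 = normal / 0 = abnormal).
per_log = [
    ("blk_1", 1), ("blk_1", 1), ("blk_1", 1),
    ("blk_2", 1), ("blk_2", 0), ("blk_2", 1),
]
print(judge_sessions(per_log))
```

A single abnormal log is enough to mark its whole session abnormal, so the rule favors catching anomalies over avoiding false alarms.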
The construction of the machine learning prediction model in the embodiment includes:
1. Training stage: normally executed log data are extracted and parsed, and the resulting log templates are used as the training set input to train a classification model, here an LSTM (long short-term memory) neural network. The method uses the neural network for anomaly detection and, by exploiting the characteristics of the LSTM, predicts the conditional probability of the subsequent log from the current log sequence. From the prediction result, the top-ranked logs are selected as the candidate log set; during online detection, if the actual log is in the candidate log set, the log is determined to be normal, and otherwise abnormal. On this basis, the method also uses the information in the parameter values: a parameter value vector anomaly detection model is obtained by fitting the training data, and during online detection, if the actual data fall within the high-confidence interval of that model, the log is judged to be normal, and otherwise abnormal;
2. Verification stage: a verification data set is extracted to test the model, and the optimal hyper-parameters are selected according to the results to optimize the model;
3. Testing stage: the effect of the model is tested using the trained model and the selected optimal hyper-parameters.
In this method, the machine learning model automatically predicts the conditional probability of the logs that follow a given log sequence, the candidate log set is selected using the optimal hyper-parameters from the previous stage, and the log is finally judged. If a log is predicted to be abnormal, a response is initiated and the abnormal execution path is reported, which improves the efficiency of log anomaly detection.
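The parameter value check mentioned in the training stage is not specified further in the source. As a rough sketch under stated assumptions, the example below fits a single numeric parameter (for instance an elapsed time) with a normal distribution and treats a band of `z` standard deviations around the mean as the high-confidence interval; `z = 2.576` corresponds to about 99% of a normal distribution. The choice of a Gaussian fit, the single parameter, and all names and values are assumptions.

```python
import statistics
from typing import List, Tuple


def fit_interval(values: List[float], z: float = 2.576) -> Tuple[float, float]:
    """Fit mean and sample standard deviation; return mean +/- z * std."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return mean - z * std, mean + z * std


def param_is_normal(x: float, interval: Tuple[float, float]) -> bool:
    """A parameter value is judged normal if it lies inside the interval."""
    low, high = interval
    return low <= x <= high


# Hypothetical parameter values taken from normally executed logs.
training_values = [10.0, 11.0, 9.0, 10.5, 9.5]
interval = fit_interval(training_values)
print(interval)
print(param_is_normal(10.2, interval))
print(param_is_normal(25.0, interval))
```

A log whose template passes the candidate-set check can still be flagged by this second check, which is how the template and parameter information complement each other.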
In a fourth embodiment, the present invention provides an apparatus for detecting an anomaly of large-scale log data (structure is shown in fig. 5), which includes a log parsing module, a log template candidate set determining module, and a log anomaly detecting module:
the log analysis module is used for analyzing the log to be detected to obtain a log template;
the log template candidate set determining module is used for inputting the selected log sequence with the set length into a pre-constructed machine learning prediction model and outputting the conditional probability of each log template at the current position; screening the log templates according to the conditional probability of each log template to obtain a candidate log template set;
and the log abnormity detection module is used for judging whether the log template corresponding to the log to be detected obtained by the log analysis module belongs to the candidate log template set, if so, judging the log to be normal, and if not, judging the log to be abnormal.
Further, the log template candidate set determining module comprises a machine learning prediction model building module, and is configured to input a selected log sequence with a set length and output a conditional probability that each log template appears at a current position.
Further, the anomaly detection device further comprises a log collection module for collecting original logs.
Further, the log template candidate set determining module further comprises a log reordering module, which is configured to reorder the original logs according to an order of occurrence of the task identifiers before analyzing the original logs collected by the log collecting module in advance.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.