CN112905380A - System anomaly detection method based on automatic monitoring log

Info

Publication number: CN112905380A
Application number: CN202110300903.1A
Authority: CN
Inventors: 王书敏; 任洪敏
Original assignee: Shanghai Maritime University
Current assignee: Shanghai Maritime University
Priority date: 2021-03-22
Filing date: 2021-03-22
Publication date: 2021-06-04

Abstract

The invention provides a system anomaly detection method based on an automatic monitoring log, comprising the following steps: acquiring original log data of a software system and extracting the effective information it contains according to a log template to obtain an initial log set; normalizing the log information of the initial log set to obtain a normalized log set, analyzing the characteristics of the normalized log set, and performing feature extraction to obtain a training log set; performing pattern training on the behavior sequences based on the training log set to generate corresponding behavior patterns; and performing abnormal behavior detection on the real-time log stream, calculating an anomaly index, and judging the system state by comparing the anomaly index with an anomaly threshold to obtain a log anomaly detection result. The invention addresses the low discrimination accuracy and weak generalization capability of existing anomaly detection methods, improves detection accuracy, mitigates to a certain extent the problems of processing large volumes of log data and non-uniform log structure, and can predict possible system anomalies accurately and in time.

Description

System anomaly detection method based on automatic monitoring log

Technical Field

The invention relates to the technical field of anomaly detection and fault early warning of software systems, in particular to a system anomaly detection method based on an automatic monitoring log.

Background

With the rapid development of computer technology and the internet, we have entered a big data era of extremely rich information and massive data. Today's software systems are becoming increasingly large and complex, and exceptions and errors are difficult to avoid. Software defects arise at every stage of software development and end up in the delivered software product. Analyzing the system log has become the most important means of determining whether a system is abnormal. The rich information contained in the system log helps system developers and maintainers understand system behavior and detect and locate system anomalies in production.

In the field of automatic anomaly detection for software systems, much research has accumulated in recent years, but the requirements of real application environments are still difficult to meet. Specifically: (1) software systems differ greatly in behavior, input, output and other aspects; the automatic anomaly detection methods proposed so far are often effective only for a certain type of system, and a universal anomaly detection method struggles to achieve a good detection effect on most systems; (2) modern software systems are built on a new generation of technology represented by cloud computing and scale out strongly, so distributed systems comprising thousands of computing nodes with very high concurrency are not uncommon, whereas traditional automatic anomaly detection methods mainly focus on detecting faults of a single node (service); (3) most current automatic anomaly detection methods are not intuitive, and few provide meaningful information about an anomaly, so they offer little help to operators diagnosing the anomaly after it is reported.

Log-based anomaly detection analyzes the unstructured data in the original log files. Existing research falls mainly into statistical methods, classification-based methods, cluster analysis methods, information-theoretic methods, and graph-model-based methods.

Statistical log anomaly detection is based on designing a statistical model: a model is first created for the data, and each object is evaluated according to how well it fits the model. If an incorrect model is selected, however, an object is likely to be wrongly judged as an anomaly.

Classification-based log anomaly detection is mainly supervised: an optimal model is trained from existing training samples, that is, known data and their corresponding outputs; the model then maps every input to an output, and a simple decision on that output achieves the classification. The main disadvantage of anomaly detection based on supervised learning is not technical: it needs a large amount of labeled training data, and the cost of acquiring labeled data is high, which greatly limits its range of application.

Clustering-based log anomaly detection groups similar data instances into one class. Clustering is a classic unsupervised machine learning method. It rests on the premise that normal log data instances belong to classes with many instances and high density, while abnormal log data instances belong to classes with few instances and low density.

The basic assumption of information-theoretic log anomaly detection is that abnormal log data cause irregularities in the information content of the log data as a whole. Different methods use different information-theoretic measures, such as Kolmogorov complexity, entropy, and relative entropy, to analyze the information content of the data set.

The key to graph-model-based log anomaly detection is to construct a finite state machine that represents normal behavior well and then match the log data against it; log data that the finite state machine cannot match are likely to be anomalous.

Disclosure of Invention

The invention aims to provide a system anomaly detection method based on an automatic monitoring log.

In order to achieve the purpose, the invention is realized by the following technical scheme:

A system anomaly detection method based on an automatic monitoring log comprises the following steps:

step S1: acquiring original log data of a software system, and extracting the effective information contained in the original log data according to a log template to obtain an initial log set;

step S2: normalizing the log information of the initial log set to obtain a normalized log set, analyzing the characteristics of the normalized log set, performing feature extraction to obtain a corresponding log feature set, and dividing the normalized log set into different types of behavior sequences according to the log features, which form the training log set;

step S3: performing pattern training on each behavior sequence based on the training log set to generate a corresponding behavior pattern;

step S4: performing abnormal behavior detection on the real-time log stream, calculating an anomaly index, and judging the system state by comparing the anomaly index with an anomaly threshold to obtain a log anomaly detection result.

Further, in step S2, the manner of normalizing the log information of the initial log set includes at least one of:

rearranging irregular log records;

de-parameterizing, that is, replacing numerical data with placeholders;

adjusting the log structure so that records spanning multiple lines are merged into one line;

removing redundant characters;

converting the log level to a numeric representation.

Further, in step S2, analyzing the characteristics of the normalized log set, performing feature extraction to obtain a corresponding log feature set, and dividing the normalized log set into different types of behavior sequences according to the log features, includes:

standardizing the log data and reducing its dimensionality, and selecting the most effective log features;

performing data conversion on the selected log features to form a log feature set;

selecting a similarity criterion, that is, finding the distance function best suited to the feature type or constructing a new distance function;

and executing a clustering algorithm to divide the normalized log set into different types of behavior sequences.

Further, the clustering algorithm clusters the log data with a hierarchical method in a bottom-up agglomerative manner.

Further, in step S3, performing pattern training on each behavior sequence based on the training log set and generating a corresponding behavior pattern includes:

assigning a type number to each log type; taking the training log set as input, reading each log record in turn, mapping the normalized log to its log type, and outputting the corresponding type number, so that the final result sequence contains the log timestamps and the corresponding type numbers; and converting the result sequence into a frequency sequence;

traversing the frequency sequence with a sliding window and extracting all frequency subsequences as behavior subsequences;

defining a similarity measure for the behavior subsequences, counting the numbers of identical and similar frequency subsequences, and taking the shape characteristics and occurrence frequencies of the different kinds of behavior subsequences as the behavior patterns of that type of behavior sequence.

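Further, in step S4, the anomaly index is calculated by the following formula:

$$A(L) = \sum_{i=1}^{N} \left[ \mathrm{dist}(d_i, \hat{p}_i) + \beta \, O(\hat{p}_i) \right]$$

where $L$ represents the log sequence formed by the real-time log stream; $N$ is the number of behavior sequences; $\mathrm{dist}(d_i, \hat{p}_i)$ represents the dissimilarity between behavior sequence $d_i$ and $\hat{p}_i$; each behavior sequence $d_i$ corresponds to a behavior pattern set $P_i$; $p_i^x$ represents the $x$-th behavior pattern in $P_i$; the behavior pattern in $P_i$ with the highest similarity to $d_i$ is recorded as $\hat{p}_i$; $O(\hat{p}_i)$ represents the outlier of behavior pattern $\hat{p}_i$; and $\beta$ represents a balance factor. (The formula images are not preserved in this text; the symbol names $A$, $\mathrm{dist}$, $O$ and $\hat{p}_i$, and the placement of $\beta$ on the outlier term, are reconstructed from the verbal definitions in the description.)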

Further, the update time of the behavior pattern is set, and when the update time is reached, the steps S1 to S3 are executed again to update the behavior pattern.

Compared with the prior art, the invention has the following beneficial effects:

The invention provides a log preprocessing method based on log normalization and hierarchical clustering. After preprocessing, the log data are classified into different types and have a well-formed structure, which makes it convenient to extract behavior patterns and judge abnormal behavior afterwards. It further provides a general log anomaly detection model based on behavior anomalies, which judges the system state according to the magnitude of the anomaly index and predicts system anomalies that may occur. The method improves the accuracy of system anomaly detection and, to a certain extent, addresses log analysis problems such as large log volume, complex and heterogeneous log structure, and diverse system fault types; it is a general-purpose log detection algorithm.

Drawings

In order to more clearly illustrate the technical solution of the present invention, the drawings used in the description will be briefly introduced, and it is obvious that the drawings in the following description are an embodiment of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts according to the drawings:

FIG. 1 is a flow chart of a system anomaly detection method based on an automated monitoring log according to the present invention;

FIG. 2 is a flowchart of the overall implementation of the present invention;

FIG. 3 is a flow chart of anomaly detection provided by the present invention;

FIG. 4 is a schematic diagram of the general architecture of system anomaly detection provided by the present invention.

Detailed Description

The technical solution proposed by the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments. The advantages and features of the present invention will become more apparent from the following description. Note that the drawings are in a very simplified form and use imprecise scales, solely to aid in describing the embodiments of the present invention conveniently and clearly. It should be understood that the structures, ratios, sizes, and the like shown in the drawings are only intended to accompany the disclosure of the specification for understanding and reading by those skilled in the art, and are not intended to limit the conditions under which the present invention can be implemented; they therefore have no substantive technical significance, and any structural modification, change of proportion, or adjustment of size that does not affect the efficacy and achievable purpose of the present invention should still fall within the scope of the technical content disclosed herein.

An analysis of the problems of the prior art shows the need for a real-time, efficient, general-purpose log monitoring system that can handle different log structures. Research and experiments were carried out on these problems, and a behavior anomaly detection method based on system log monitoring is proposed. The method improves detection accuracy, mitigates to a certain extent the problems of processing large amounts of data and non-uniform log structure, is a general log detection algorithm, and can predict possible system anomalies accurately and in time. It overcomes the drawbacks of machine-learning-based anomaly detection, namely low discrimination accuracy, weak generalization capability, and the inability to give early warning of faults that do not appear in the training samples, as well as the high time and labor costs of knowledge-based anomaly detection.

As shown in FIG. 1, an embodiment of the present invention provides a system anomaly detection method based on an automatic monitoring log, including:

step S1: acquiring original log data of a software system, and extracting the effective information contained in the original log data according to a log template to obtain an initial log set;

step S2: normalizing the log information of the initial log set to obtain a normalized log set, analyzing the characteristics of the normalized log set, performing feature extraction to obtain a corresponding log feature set, and dividing the normalized log set into different types of behavior sequences according to the log features to obtain a training log set;

step S3: performing pattern training on each behavior sequence based on the training log set to generate a corresponding behavior pattern;

step S4: acquiring real-time log data of the software system to detect abnormal behavior, calculating an anomaly index, and comparing the anomaly index with an anomaly threshold to judge the system state and obtain a log anomaly detection result.

The steps of the present invention are described in detail below with reference to fig. 2-4.

Step 1: Log data collection and parsing: the log data generated by each node are collected together, the source code is analyzed through an abstract syntax tree, unstructured data are converted into structured data, and the effective information contained in the log data is extracted to obtain the initial log set, which serves as input to the subsequent log division and feature mining.

The specific steps are as follows:

Step 1.1: Log messages are produced in two ways, string concatenation and method calls, and the invention uses a different method to generate the log template for each.

For string concatenation, the log template is obtained by splitting the addition expression. For method calls, the template is generated from the program's abstract syntax tree (AST). Log template recognition takes the program source code as input: an abstract syntax tree is first constructed and then traversed to find the methods whose return type is String, and a regular expression for each method's return value is obtained, from which the corresponding log template is generated. Because methods with the same name may exist in different classes, each method must be identified by its fully qualified name. Finally, the log templates generated by the two methods are merged.

Step 1.2: The effective information contained in the log data is extracted according to the log templates and used as input to the subsequent log division and feature mining.
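Step 1.2 can be illustrated with a minimal Python sketch. It assumes the templates from step 1.1 are already available as strings in which each variable part is marked with the placeholder `<*>`; the placeholder syntax, the function names, and the two sample templates are all hypothetical and do not come from the patent.

```python
import re

def template_to_regex(template):
    """Turn a log template with `<*>` placeholders into a compiled regex."""
    literal_parts = [re.escape(part) for part in template.split("<*>")]
    return re.compile("^" + "(.+?)".join(literal_parts) + "$")

def extract_fields(line, templates):
    """Return (template id, captured parameters) for the first matching template, else None."""
    for template_id, template in enumerate(templates):
        match = template_to_regex(template).match(line)
        if match:
            return template_id, list(match.groups())
    return None

# Hypothetical templates, as they might result from splitting concatenation expressions.
TEMPLATES = [
    "Received block <*> of size <*> from <*>",
    "Deleting block <*>",
]

parsed = extract_fields("Received block blk_123 of size 4096 from 10.0.0.5", TEMPLATES)
print(parsed)
```

The template id gives the structured event type, and the captured groups are the variable fields that later steps can either keep as features or discard.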

Step 2: Log data preprocessing: the initial log set is normalized to obtain a normalized log set, its characteristics are analyzed, feature extraction is performed on the set of log event sequences to obtain a corresponding feature set, and the normalized log set is divided into different types of behavior sequences according to the log features to obtain a training log set.

The specific steps are as follows:

Step 2.1: Log normalization: irregular log records are rearranged; de-parameterization is performed, replacing numerical data with placeholders; the log structure is adjusted so that records spanning multiple lines are merged into one line; and redundant characters are deleted. The normalized data are stored in a relational database, with each attribute of a tuple representing one item of the log record.

The log level is converted into a numeric representation, which facilitates similarity measurement during log clustering.

When an error occurs during system operation and an exception is thrown, the log records not only the time and location of the error but also the function that threw the exception and its call stack, so such a record spans multiple lines and is merged into one.

Logs of the same type may be mistaken for different types during log analysis because the parameter values in their messages differ. To solve this problem, de-parameterization is used: the numerical parameters appearing in each log are replaced with placeholders.
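The normalization operations of step 2.1 might look as follows in Python. This is a sketch under stated assumptions: the line layout (`YYYY-MM-DD HH:MM:SS LEVEL message`), the numeric level mapping, the `<NUM>` placeholder, and the ` | ` separator used when folding continuation lines are illustrative choices, not details given in the patent.

```python
import re

LEVELS = {"DEBUG": 0, "INFO": 1, "WARN": 2, "ERROR": 3, "FATAL": 4}  # assumed mapping
HEADER = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\w+) (.*)$")  # assumed layout

def merge_multiline(lines):
    """Fold continuation lines (e.g. stack traces) into the record that precedes them."""
    records = []
    for line in lines:
        if HEADER.match(line) or not records:
            records.append(line.rstrip())
        else:
            records[-1] += " | " + line.strip()
    return records

def normalize(record):
    """Return (timestamp, numeric level, de-parameterized message), or None without a header."""
    match = HEADER.match(record)
    if match is None:
        return None
    timestamp, level, message = match.groups()
    message = re.sub(r"\b\d+(?:\.\d+)*\b", "<NUM>", message)  # numeric parameters -> placeholder
    message = re.sub(r"\s+", " ", message).strip()              # remove redundant whitespace
    return timestamp, LEVELS.get(level.upper(), -1), message

raw = [
    "2021-03-22 10:00:01 ERROR Task 42 failed after 3.5 s",
    "    at Worker.run(Worker.java:88)",
    "2021-03-22 10:00:02 INFO  Task 43   started",
]
records = merge_multiline(raw)
normalized = [normalize(r) for r in records]
for row in normalized:
    print(row)
```

After this pass, "Task 42 started" and "Task 43 started" share the same message text, so they are no longer mistaken for different log types.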

Step 2.2: The behavior sequences are divided as follows:

Feature selection: the log data are standardized and reduced in dimensionality, and the N most effective log features are selected;

Feature extraction: data conversion is performed on the selected N log features to form a log feature set; the result can be expressed as a matrix whose rows represent samples and whose columns represent feature variables;

Similarity selection: a similarity criterion is selected, that is, the distance function best suited to the feature type is found or a new distance function is constructed;

Grouping: a clustering algorithm is executed to divide the normalized log set into different types of behavior sequences. The input of the algorithm is the sample matrix, and the output can be a dendrogram or a specific classification scheme, reflecting the classification at different granularities. A classification threshold is defined with the help of domain knowledge to obtain the final clustering result, and the validity of the clustering is evaluated.

Specifically, the log data are clustered with a hierarchical method in a bottom-up agglomerative manner. The log data are first divided into $n$ clusters, each containing one data point; the distances between clusters are calculated to obtain an $n \times n$ similarity matrix; and the two clusters with the smallest distance in the matrix are found:

$$(c_i, c_j) = \arg\min_{c_m \neq c_l} \mathrm{dist}(c_m, c_l)$$

where $c_m$, $c_l$ represent any two clusters in the matrix and $c_i$, $c_j$ represent the two nearest clusters. $c_i$ and $c_j$ are merged to form a new cluster, the similarity matrix is updated, and the above steps are repeated until the termination condition is met or all data points are in one cluster.

Complete linkage is adopted as the criterion for measuring the distance between clusters. Complete-linkage clustering uses the maximum distance $\mathrm{dist}_{\max}(c_i, c_j)$ as the distance criterion:

$$\mathrm{dist}_{\max}(c_i, c_j) = \max_{p \in c_i,\; p' \in c_j} |p - p'|$$

where $|p - p'|$ is the distance between two points $p$ and $p'$; the maximum distance is the largest distance between any point in one cluster and any point in the other. The clustering process stops when the maximum distance between the nearest pair of clusters exceeds a distance threshold.
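The bottom-up, complete-linkage procedure can be sketched in a few lines of Python. The function names and the toy one-dimensional data are illustrative assumptions; for brevity the sketch recomputes cluster distances on every pass rather than maintaining and updating the similarity matrix described above.

```python
from itertools import combinations

def complete_link(cluster_a, cluster_b, dist):
    """dist_max: the largest distance between a point of one cluster and a point of the other."""
    return max(dist(p, q) for p in cluster_a for q in cluster_b)

def agglomerate(points, dist, threshold):
    """Start from singleton clusters and merge the nearest pair until it exceeds the threshold."""
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        (i, j), best = min(
            (((a, b), complete_link(clusters[a], clusters[b], dist))
             for a, b in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best > threshold:
            break  # termination condition: nearest pair is too far apart
        clusters[i].extend(clusters[j])  # i < j, so deleting j keeps index i valid
        del clusters[j]
    return clusters

points = [10, 12, 11, 50, 53, 90]
clusters = agglomerate(points, lambda p, q: abs(p - q), threshold=10)
print(clusters)
```

With real log data, `points` would be feature vectors or normalized messages and `dist` the distance function chosen in the similarity selection step.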

Similarity calculation: an algorithm based on dynamic programming is adopted, which decomposes the problem to be solved into subproblems and obtains the solution of the original problem once the subproblems of each stage have been solved.
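The patent does not name the dynamic programming algorithm. One common choice, used here purely as an assumption, is the token-level edit (Levenshtein) distance: the distance between two prefixes is built up from the distances of shorter prefixes. The function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences, computed row by row."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # delete item_a
                               current[j - 1] + 1,       # insert item_b
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]

def log_distance(message_a, message_b):
    """Edit distance over whitespace-separated tokens, scaled to the range [0, 1]."""
    tokens_a, tokens_b = message_a.split(), message_b.split()
    longest = max(len(tokens_a), len(tokens_b))
    return edit_distance(tokens_a, tokens_b) / longest if longest else 0.0

print(log_distance("Task <NUM> started", "Task <NUM> failed"))
```

A function such as `log_distance` could serve as the distance passed to the clustering step, since it returns 0 for identical messages and 1 for messages with no tokens in common at matching positions.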

Agglomerative hierarchical clustering ends either at a predefined termination condition or when all data points have been merged into one cluster. The termination condition must be chosen so that logs of the same type are clustered together as far as possible, different types are kept apart, and the original information of the data is preserved.

Step 3: Pattern training: the log stream of the training log set is analyzed, and behavior patterns are generated for the different behavior sequences. When the time elapsed since the last training exceeds the update period, or an update command is received, the whole pattern training process is executed again. The specific steps are as follows:

Step 3.1: Conversion of the log stream into frequency sequences: the training log set is converted into frequency sequences according to its different log types.

A type number is assigned to each log type. Taking the training log set as input, each log record is read in turn, the normalized log is mapped to its log type, and the corresponding type number is output. A dictionary structure is built to speed up string lookup and improve search efficiency. The final result sequence contains the log timestamps and the corresponding type numbers; it is then converted into frequency sequences.
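A minimal Python sketch of step 3.1 follows. It assumes integer timestamps (for example seconds), a fixed unit interval, and an exact dictionary lookup from normalized message to type number; in the patent the type comes from the clustering result, so the lookup here is a simplification, and all names are hypothetical.

```python
from collections import Counter

def build_type_index(type_messages):
    """Dictionary from normalized message to type number, for fast string lookup."""
    return {message: number for number, message in enumerate(type_messages)}

def to_result_sequence(records, type_index):
    """Map (timestamp, normalized message) records to (timestamp, type number) pairs."""
    return [(ts, type_index[msg]) for ts, msg in records if msg in type_index]

def to_frequency_sequences(result_sequence, n_types, start, end, unit=1):
    """Per log type, count the records in each unit-length bucket of [start, end)."""
    n_buckets = (end - start + unit - 1) // unit
    counts = Counter(((ts - start) // unit, t)
                     for ts, t in result_sequence if start <= ts < end)
    return [[counts[(bucket, t)] for bucket in range(n_buckets)]
            for t in range(n_types)]

type_index = build_type_index(["Task <NUM> started", "Task <NUM> failed"])
records = [(0, "Task <NUM> started"), (0, "Task <NUM> started"),
           (1, "Task <NUM> failed"), (2, "Task <NUM> started"),
           (3, "Task <NUM> failed"), (3, "Task <NUM> failed")]
result_sequence = to_result_sequence(records, type_index)
frequency_sequences = to_frequency_sequences(result_sequence, n_types=2, start=0, end=4)
print(frequency_sequences)
```

Each inner list is one frequency sequence: the number of logs of that type observed in each consecutive unit interval.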

Step 3.2: Generating behavior patterns: the previous step yields frequency sequences for the different log types, namely behavior sequences. Let the set of frequency sequences consist of $T_i$, $i \in [1, N]$, where $N$ is the number of sequence types. A behavior pattern is extracted from each frequency sequence: a sliding window traverses each frequency sequence

$$T_i = (t_{i,1}, t_{i,2}, \ldots, t_{i,m})$$

($m$ is the sequence length), extracting all frequency subsequences, from which the behavior patterns are generated.

First, a sliding window of length $k$ is defined, and all frequency subsequences of the length-$m$ frequency sequence $T_i$ are extracted. These frequency subsequences are referred to as behavior subsequences:

$$s_{i,j} = (t_{i,j}, t_{i,j+1}, \ldots, t_{i,j+k-1}), \quad j \in [1, m-k+1]$$

Each frequency sequence $T_i$ of length $m$ contains $m-k+1$ behavior subsequences, and adjacent subsequences overlap in a part of length $k-1$. Thus $N$ behavior sequence sets $S_i$, $i \in [1, N]$, are obtained, each represented as:

$$S_i = \{ s_{i,1}, s_{i,2}, \ldots, s_{i,m-k+1} \}$$

(The element symbols $t_{i,j}$ and $s_{i,j}$ are notation reconstructed from the surrounding definitions; the original formula images are not preserved in this text.)
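The sliding-window extraction is short enough to show directly. The function name is hypothetical; windows are returned as tuples so that they can later be used as dictionary keys.

```python
def sliding_subsequences(sequence, k):
    """Return the m - k + 1 windows of length k of a length-m sequence, as tuples."""
    if k <= 0:
        raise ValueError("window length k must be positive")
    return [tuple(sequence[j:j + k]) for j in range(len(sequence) - k + 1)]

frequency_sequence = [2, 0, 1, 0, 2]          # m = 5
windows = sliding_subsequences(frequency_sequence, k=3)
print(windows)
```

Adjacent windows share k - 1 elements, which is the overlap described above; a sequence shorter than k yields no windows.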

Next, a similarity measure is defined for the behavior subsequences, the numbers of identical and similar frequency subsequences are counted, and the shape characteristics and occurrence frequencies of the different kinds of behavior subsequences are taken as the behavior patterns of that type of behavior sequence. Similar frequency subsequences may be grouped with the hierarchical clustering method used in log preprocessing. However, because of the feature extraction already performed, the periodicity and waveform of the behavior subsequences are quite regular, so a simpler approach can be adopted, such as simple clustering with a defined similarity threshold, or using the subsequence vector as a key and counting its occurrences. Finally, an outlier is defined for each behavior pattern in the behavior pattern set. The occurrence frequency of each behavior subsequence is used as a behavior pattern parameter; a behavior pattern that occurs less frequently should have a higher outlier. Therefore, the reciprocal of the occurrence frequency of the behavior subsequence is taken as the outlier of the behavior pattern $p_i^x$, recorded as:

$$O(p_i^x) = \frac{1}{f(p_i^x)}$$

where $f(p_i^x)$ denotes the occurrence frequency of the behavior subsequence corresponding to $p_i^x$.
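The simpler counting approach, with the subsequence vector as a key, can be sketched as follows. The sketch assumes that "occurrence frequency" means relative frequency (count divided by the total number of subsequences); using raw counts instead would only rescale the outliers. The function name is hypothetical.

```python
from collections import Counter

def train_patterns(subsequences):
    """Map each distinct behavior subsequence to its outlier, 1 / relative frequency."""
    counts = Counter(subsequences)
    total = sum(counts.values())
    return {pattern: total / count for pattern, count in counts.items()}

subsequences = [(1, 1), (1, 1), (1, 1), (5, 0)]
patterns = train_patterns(subsequences)
print(patterns)
```

The rare pattern `(5, 0)` receives a larger outlier than the common pattern `(1, 1)`, matching the requirement that less frequent behavior counts as more abnormal.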

Step 4: Anomaly detection: abnormal behavior detection is performed on the real-time log stream, an anomaly index is calculated, and the system state is judged by comparing the anomaly index with an anomaly threshold to obtain a log anomaly detection result. The whole analysis flow is shown in FIG. 3, and the specific steps are as follows:

Step 4.1: The log stream is segmented according to a predefined time interval and time window, and the logs are converted into frequency subsequences of different types according to the clustering result.

According to a predefined unit time interval and a sliding time window of length $k$, the log sequence $L$ of the latest $k$ unit times is intercepted in real time and divided into $N$ log subsequences by log type, $L = \{l_1, l_2, \ldots, l_N\}$. A conversion algorithm then converts the current log sequence into a behavior sequence set $D$ containing $N$ elements, denoted $D = \{d_1, d_2, \ldots, d_N\}$.
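One way to keep the latest k unit intervals of a live stream is a bounded queue of per-type counts. The class below is a hypothetical sketch: it assumes the caller has already mapped each incoming log to its type number and calls `close_interval` once per unit time.

```python
from collections import deque

class RealTimeWindow:
    """Keeps per-type log counts for the latest k unit intervals of a log stream."""

    def __init__(self, n_types, k):
        self.n_types = n_types
        self.buckets = deque(maxlen=k)  # the oldest interval is dropped automatically

    def close_interval(self, type_numbers):
        """Record the type numbers observed during one finished unit interval."""
        counts = [0] * self.n_types
        for type_number in type_numbers:
            counts[type_number] += 1
        self.buckets.append(counts)

    def behavior_sequences(self):
        """Return D = [d_1, ..., d_N], or None until k intervals have been seen."""
        if len(self.buckets) < self.buckets.maxlen:
            return None
        return [tuple(bucket[t] for bucket in self.buckets)
                for t in range(self.n_types)]

window = RealTimeWindow(n_types=2, k=3)
window.close_interval([0, 0, 1])
window.close_interval([1])
not_ready = window.behavior_sequences()
window.close_interval([0])
first_d = window.behavior_sequences()
window.close_interval([])
second_d = window.behavior_sequences()
print(first_d, second_d)
```

Each element of the returned list is one behavior sequence of length k, so it can be compared directly with the length-k patterns learned during training.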

Step 4.2: The log anomaly index is calculated according to the anomaly detection formula, taking the behavior sequences of the log stream and the log behavior patterns as parameters. If the update time of the behavior patterns has been reached, the behavior patterns are updated first and the calculation is performed after the update is completed.

The current sequence set $D$ contains $N$ behavior sequences. Each behavior sequence $d_i$ corresponds to a behavior pattern set $P_i$, which consists of $x$ behavior patterns $p_i^1, p_i^2, \ldots, p_i^x$; the pattern with the highest similarity to $d_i$ is recorded as $\hat{p}_i$.

The anomaly index of log subsequence $l_i$ is determined jointly by $d_i$ and $\hat{p}_i$. It is defined as the sum of the dissimilarity between $d_i$ and $\hat{p}_i$ and the outlier of the behavior pattern $\hat{p}_i$ itself, recorded as:

$$A(l_i) = \mathrm{dist}(d_i, \hat{p}_i) + \beta \, O(\hat{p}_i)$$

The anomaly index of the log sequence $L$ is defined as the sum of all subsequence anomaly indexes, i.e.

$$A(L) = \sum_{i=1}^{N} A(l_i)$$

The final anomaly index calculation formula is expressed as:

$$A(L) = \sum_{i=1}^{N} \left[ \mathrm{dist}(d_i, \hat{p}_i) + \beta \, O(\hat{p}_i) \right]$$

(The formula images are not preserved in this text; the symbol names $A$, $\mathrm{dist}$, $O$ and $\hat{p}_i$, and the placement of $\beta$ on the outlier term, are reconstructed from the verbal definitions above.)

When $d_i$ is far from its closest behavior pattern $\hat{p}_i$, i.e. when $\mathrm{dist}(d_i, \hat{p}_i)$ is large, the log subsequence $l_i$ is not similar to any behavior pattern, so the possibility of an anomaly is high. The anomaly of $l_i$ is also related to the outlier of $\hat{p}_i$ itself, because the nature of the closest behavior pattern largely represents the nature of the sequence. $\beta$ is a balance factor that can be adjusted during experiments to obtain the best results. The anomaly index is compared with the anomaly threshold, and if it is larger than the threshold, an anomaly warning is issued.
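The detection step can be sketched in Python under several assumptions that the text does not fix: Euclidean distance is used as the dissimilarity between a behavior sequence and a pattern, the balance factor β multiplies the outlier term, and each pattern set is a dictionary from pattern tuple to outlier. All names are hypothetical.

```python
import math

def euclidean(a, b):
    """Assumed dissimilarity between a behavior sequence and a pattern of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def anomaly_index(D, pattern_sets, beta=1.0, dist=euclidean):
    """Sum over log types of dist(d_i, closest pattern) + beta * outlier(closest pattern)."""
    total = 0.0
    for d_i, patterns in zip(D, pattern_sets):
        closest = min(patterns, key=lambda p: dist(d_i, p))  # most similar pattern
        total += dist(d_i, closest) + beta * patterns[closest]
    return total

def is_anomalous(D, pattern_sets, threshold, beta=1.0):
    """Raise a warning (True) when the anomaly index exceeds the anomaly threshold."""
    return anomaly_index(D, pattern_sets, beta) > threshold

# One log type with a common pattern (outlier 1.0) and a rare one (outlier 4.0).
pattern_sets = [{(1, 1): 1.0, (5, 0): 4.0}]
normal_score = anomaly_index([(1, 1)], pattern_sets)
rare_score = anomaly_index([(4, 0)], pattern_sets)
print(normal_score, rare_score, is_anomalous([(4, 0)], pattern_sets, threshold=3.0))
```

A sequence that matches a common pattern exactly scores only that pattern's small outlier, while a sequence near a rare pattern is penalized both for the distance and for the rarity of the pattern it resembles.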

It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

While the present invention has been described in detail with reference to the preferred embodiments, it should be understood that the above description should not be taken as limiting the invention. Various modifications and alterations to this invention will become apparent to those skilled in the art upon reading the foregoing description. Accordingly, the scope of the invention should be determined from the following claims.

Claims

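1. A system anomaly detection method based on an automatic monitoring log, characterized by comprising the following steps:

step S1: acquiring original log data of a software system, and extracting the effective information contained in the original log data according to a log template to obtain an initial log set;

step S2: normalizing the log information of the initial log set to obtain a normalized log set, analyzing the characteristics of the normalized log set, performing feature extraction to obtain a corresponding log feature set, and dividing the normalized log set into different types of behavior sequences according to the log features, which form the training log set;

step S3: performing pattern training on each behavior sequence based on the training log set to generate a corresponding behavior pattern;

step S4: performing abnormal behavior detection on the real-time log stream, calculating an anomaly index, and judging the system state by comparing the anomaly index with an anomaly threshold to obtain a log anomaly detection result.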

2. The system anomaly detection method based on an automatic monitoring log according to claim 1, wherein in step S2, the manner of normalizing the log information of the initial log set includes at least one of:

rearranging irregular log records;

de-parameterizing, that is, replacing numerical data with placeholders;

removing redundant characters;

converting the log level to a numeric representation.

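3. The method according to claim 1, wherein in step S2, analyzing the characteristics of the normalized log set, performing feature extraction to obtain a corresponding log feature set, and dividing the normalized log set into different types of behavior sequences according to the log features, comprises:

standardizing the log data and reducing its dimensionality, and selecting the most effective log features;

performing data conversion on the selected log features to form a log feature set;

selecting a similarity criterion, that is, finding the distance function best suited to the feature type or constructing a new distance function;

and executing a clustering algorithm to divide the normalized log set into different types of behavior sequences.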

4. The method according to claim 3, wherein the clustering algorithm is a bottom-up clustering method for clustering log data by using a hierarchical method.

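5. The system anomaly detection method based on an automatic monitoring log according to claim 1, wherein in step S3, performing pattern training on each behavior sequence based on the training log set and generating a corresponding behavior pattern includes:

assigning a type number to each log type; taking the training log set as input, reading each log record in turn, mapping the normalized log to its log type, and outputting the corresponding type number, so that the final result sequence contains the log timestamps and the corresponding type numbers; and converting the result sequence into a frequency sequence;

traversing the frequency sequence with a sliding window and extracting all frequency subsequences as behavior subsequences;

defining a similarity measure for the behavior subsequences, counting the numbers of identical and similar frequency subsequences, and taking the shape characteristics and occurrence frequencies of the different kinds of behavior subsequences as the behavior patterns of that type of behavior sequence.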

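6. The system anomaly detection method based on an automatic monitoring log according to claim 1, wherein in step S4, the anomaly index is calculated by the following formula:

$$A(L) = \sum_{i=1}^{N} \left[ \mathrm{dist}(d_i, \hat{p}_i) + \beta \, O(\hat{p}_i) \right]$$

where $L$ represents the log sequence formed by the real-time log stream; $N$ is the number of behavior sequences; $\mathrm{dist}(d_i, \hat{p}_i)$ represents the dissimilarity between behavior sequence $d_i$ and $\hat{p}_i$; each behavior sequence $d_i$ corresponds to a behavior pattern set $P_i$, in which the behavior pattern with the highest similarity to $d_i$ is recorded as $\hat{p}_i$; $O(\hat{p}_i)$ represents the outlier of behavior pattern $\hat{p}_i$; and $\beta$ represents a balance factor. (The formula images are not preserved in this text; the symbol names and the placement of $\beta$ on the outlier term are reconstructed from the verbal definitions in the description.)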

7. The method as claimed in claim 1, wherein the updating time of the behavior pattern is set, and when the updating time is reached, the steps S1-S3 are executed again to update the behavior pattern.