CN111984516A - Log anomaly detection system based on SGSE-ECC

Info

Publication number: CN111984516A
Application number: CN202010911782.XA
Authority: CN
Inventors: 汪祖民; 田纪宇; 秦静; 季长清
Original assignee: Dalian University
Current assignee: Dalian University
Priority date: 2020-09-02
Filing date: 2020-09-02
Publication date: 2020-11-24
Anticipated expiration: 2040-09-02
Also published as: CN111984516B

Abstract

A log anomaly detection system based on SGSE-ECC belongs to the field of log data processing and addresses the problem of log analysis. A time window division module determines the size of the time window according to the information system's response-time requirement; an SGSE data processing module turns the log data of each time window into sample data that the ECC log analysis algorithm can call; an ECC model training module trains the ECC log analysis model; and an ECC log analysis module analyzes, from the log data of each device in the information system, whether the state of the information system in the current time window is normal. The effect is that anomalies can be detected from the logs.

Description

Log anomaly detection system based on SGSE-ECC

Technical Field

The invention belongs to the field of log data processing, and relates to a log anomaly detection system based on SGSE-ECC.

Background

With the development of internet technology, the number of logs generated by each device in an information system keeps increasing, and analyzing logs that come from different devices and have different data characteristics is an important part of operation and maintenance work. Analyzing multi-source heterogeneous log data by automated means makes it possible to learn promptly whether the information system is running normally or abnormally, ensures its safe and stable operation, and further reduces the operation and maintenance cost of enterprises.

In the current technical method of multi-source heterogeneous log analysis, methods such as single analysis re-aggregation and correlation analysis are used. In the single analysis and aggregation method, a log generated in a single device in an information system is analyzed, the running state of each device is analyzed, and whether the whole information system has abnormal conditions or not is judged according to the state of each device according to a preset rule. However, when the method is used for analysis, the logs in different devices are not analyzed in a combined mode, but the logs are analyzed after the states of the different devices are judged separately, and the relationship among the logs of the different devices cannot be mined. In the correlation analysis method, characteristic events are generated according to the contents of all fields in the log, the events generated by different devices under a time window are clustered, then similarity comparison is carried out, and similar events are removed. Then, similar events of different devices are combined, and finally, statistical reports of various events are generated. However, in this method, for the purpose of generating statistical reports of various events, the relationship between various events cannot be deeply mined and directly presented to the user, and each event cannot be accurately classified only by using a clustering algorithm.

Current techniques for multi-source heterogeneous log analysis include single analysis re-aggregation [1] and correlation analysis [2], among others. In the single analysis and aggregation method, a log generated in a single device in an information system is analyzed, the running state of each device is analyzed, and whether the whole information system has abnormal conditions or not is judged according to the state of each device according to a preset rule. In the correlation analysis method, characteristic events are generated according to the contents of all fields in the log, similarity comparison is carried out on the events generated by different devices under a time window, and similar events are removed. Then, similar events of different devices are combined, and finally, statistical reports of various events are generated.

In the single analysis and aggregation method, the logs of different devices are not analyzed jointly; the state of each device is judged separately before the results are combined, so the relationships among the logs of different devices cannot be mined. In the correlation analysis method, the aim is to generate statistical reports of various events, so the relationships between events cannot be mined in depth and presented directly to the user.

Disclosure of Invention

In order to solve the problem of log analysis, the invention provides the following technical scheme: an SGSE-ECC-based log anomaly detection system, comprising:

the time window dividing module is used for determining the size of a time window according to the requirement of the information system on response time;

the SGSE data processing module is used for forming the log data into sample data for being called by an ECC log analysis algorithm according to a time window;

the ECC model training module is used for training an ECC log analysis model;

and the ECC log analysis module is used for analyzing whether the state of the information system under the current time window is normal or not according to the log data of each device in the information system.

Further, the SGSE data processing module comprises

The SG state generation submodule counts the log number of each device to generate a log number state subsequence; counting the number of each log type generated on each device in a time window to generate a user behavior state subsequence; counting the number of times of occurrence of types in some important fields of each device in a time window to generate a field state subsequence;

the SE sequence extraction sub-module is used for sequentially extracting one feature in one subsequence as a label under the same time window, combining the feature with all the features of other two subsequences into a sample, enabling any feature in any subsequence to have all the features in other two subsequences corresponding to the feature, and randomly selecting one feature of each subsequence to form three labels to represent the time window to be detected when one time window is analyzed;

Further, the ECC model training module comprises

The model training of the ECC model training submodule comprises the following steps:

Step 1: counting the number of logs in the multi-source heterogeneous log data of the normal and abnormal time windows to generate a log quantity state subsequence, counting the number of logs of each type generated on each device in the time window to generate a user behavior state subsequence, and counting the number of occurrences of each type in certain important fields of each device in the time window to generate a field state subsequence;

Step 2: generating (n + m + j) sample data sets from the n features of the log quantity state subsequence, the m features of the user behavior state subsequence and the j features of the field state subsequence in each time window;

Step 3: taking the sample data set labelled by a feature of the log quantity state subsequence in each normal and abnormal time window, and calculating difference values pairwise against the sample data sets in the other normal and abnormal time windows according to the ECC expression, the calculation expressions being as follows:

v1＝1-v2 (3)

f(tableα)＝v1*M′_tableα+v2*M″_tableα+bias (4)

tableα denotes the sample whose label is a feature of the log quantity state subsequence,

tableα′ denotes the sample whose label is a feature of the log quantity state subsequence in a normal time window,

tableα″ denotes the sample whose label is a feature of the log quantity state subsequence in an abnormal time window,

M′_tableα is the mean square error between the sample labelled by a feature of the log quantity state subsequence and the sample labelled by a feature of the log quantity state subsequence in the normal time window,

M″_tableα is the mean square error between the sample labelled by a feature of the log quantity state subsequence and the sample labelled by a feature of the log quantity state subsequence in the abnormal time window,

bias is the bias term,

v1 and v2 are respectively the variation coefficients of the mean square errors of the normal time window and the abnormal time window, and v1 is equal to v2 during training;

f(tableα) is the difference value calculated with a feature of the log quantity state subsequence in the training time window as the label;

Step 4: calculating, through formulas (1) - (4), the difference values between each normal time window and the other normal and abnormal time windows and storing them as a set U₁; calculating, through formulas (1) - (4), the difference values between each abnormal time window and the normal and abnormal time windows and storing them as a set P₁; and obtaining the confidence interval σ(α) of the log quantity state subsequence under a normal time window as

σ(α)＝[min(U1)，max(U1)]∩[min(P1)，max(P1)]

Step 5: taking the sample data set labelled by a feature of the user behavior state subsequence in each normal and abnormal time window, and calculating difference values pairwise against the sample data sets in the other normal and abnormal time windows according to the ECC expression, the calculation expressions being as follows:

v3＝1-v4 (7)

f(tableβ)＝v3*M′_tableβ+v4*M″_tableβ+bias (8)

tableβ denotes the sample whose label is a feature of the user behavior state subsequence,

tableβ′ denotes the sample whose label is a feature of the user behavior state subsequence in a normal time window,

tableβ″ denotes the sample whose label is a feature of the user behavior state subsequence in an abnormal time window,

M′_tableβ is the mean square error between the sample labelled by a feature of the user behavior state subsequence and the sample labelled by a feature of the user behavior state subsequence in the normal time window,

M″_tableβ is the mean square error between the sample labelled by a feature of the user behavior state subsequence and the sample labelled by a feature of the user behavior state subsequence in the abnormal time window,

bias is the bias term,

v3 and v4 are respectively the variation coefficients of the mean square errors of the normal time window and the abnormal time window, and v3 is equal to v4 during training,

f(tableβ) is the difference value calculated with a feature of the user behavior state subsequence in the training time window as the label;

Step 6: calculating, through formulas (5) - (8), the difference values between each normal time window and the other normal and abnormal time windows and storing them as a set U₂; calculating, through formulas (5) - (8), the difference values between each abnormal time window and the normal and abnormal time windows and storing them as a set P₂; and obtaining the confidence interval σ(β) of the user behavior state subsequence under a normal time window as

σ(β)＝[min(U2)，max(U2)]∩[min(P2)，max(P2)]

Step 7: taking the sample data set labelled by a feature of the field state subsequence in each normal and abnormal time window, and calculating difference values pairwise against the sample data sets in the other normal and abnormal time windows according to the ECC expression, the calculation expressions being as follows:

v5＝1-v6 (11)

f(tableγ)＝v5*M′_tableγ+v6*M″_tableγ+bias (12)

tableγ denotes the sample whose label is a feature of the field state subsequence,

tableγ′ denotes the sample whose label is a feature of the field state subsequence in a normal time window,

tableγ″ denotes the sample whose label is a feature of the field state subsequence in an abnormal time window,

M′_tableγ is the mean square error between the sample labelled by a feature of the field state subsequence and the sample labelled by a feature of the field state subsequence in the normal time window,

M″_tableγ is the mean square error between the sample labelled by a feature of the field state subsequence and the sample labelled by a feature of the field state subsequence in the abnormal time window,

bias is the bias term,

v5 and v6 are respectively the variation coefficients of the mean square errors of the normal time window and the abnormal time window, and v5 is equal to v6 during training,

f(tableγ) is the difference value calculated with a feature of the field state subsequence in the training time window as the label;

Step 8: calculating, through formulas (9) - (12), the difference values between each normal time window and the other normal and abnormal time windows and storing them as a set U₃; calculating, through formulas (9) - (12), the difference values between each abnormal time window and the normal and abnormal time windows and storing them as a set P₃; and obtaining the confidence interval σ(γ) of the field state subsequence under a normal time window as

σ(γ)＝[min(U3)，max(U3)]∩[min(P3)，max(P3)]。
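
Formula (4) and the confidence-interval intersection used in Steps 4, 6 and 8 can be illustrated with a minimal Python sketch. Formulas (1) and (2) are not reproduced in this text, so the mean square errors M′ and M″ are taken here as already-computed numbers; the function names `difference_value` and `confidence_interval` and the sample values are assumptions made for illustration only.

```python
def difference_value(m_normal, m_abnormal, v_abnormal=0.5, bias=0.0):
    """Formula (4): f = v1*M' + v2*M'' + bias, with v1 = 1 - v2 (formula (3))."""
    v_normal = 1.0 - v_abnormal
    return v_normal * m_normal + v_abnormal * m_abnormal + bias


def confidence_interval(u_values, p_values):
    """[min(U), max(U)] ∩ [min(P), max(P)]; None if the two ranges do not overlap."""
    low = max(min(u_values), min(p_values))
    high = min(max(u_values), max(p_values))
    return (low, high) if low <= high else None


# Hypothetical values: v1 = v2 = 0.5 as during training.
f = difference_value(m_normal=2.0, m_abnormal=6.0)         # 0.5*2 + 0.5*6 = 4.0
sigma = confidence_interval([1.0, 3.0, 4.0], [2.0, 5.0])   # [1, 4] ∩ [2, 5] = (2.0, 4.0)
```

The same two functions would apply unchanged to the user behavior and field state subsequences, since formulas (8) and (12) have the same form as formula (4).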

Further, the log analysis module comprises the following model analysis steps:

Step 1: when a time window under test is analyzed, randomly selecting one feature of each subsequence to form three labels, and forming three samples that represent the time window under test;

Step 2: setting the initial values v1 = v2, v3 = v4 and v5 = v6, and determining through formulas (1) to (12) whether the difference values f(tableα), f(tableβ) and f(tableγ) of the three samples in the time window under test are within the corresponding confidence intervals σ(α), σ(β) and σ(γ);

Step 3: if all three samples are within their confidence intervals, constraining v1, v2, v3, v4, v5 and v6 according to the following formulas:

v1＝a1*v1 (0≤a1＜1)

v3＝a2*v3 (0≤a2＜1)

v5＝a3*v5 (0≤a3＜1)

v1, v3 and v5 are the variation coefficients of the mean square error of the normal time window; reducing v1, v3 and v5 reduces the influence of the normal time window on the difference value and increases the influence of the abnormal time window. With the new v1, v2, v3, v4, v5 and v6 and the three samples, formulas (1) to (12) are used again to determine whether the difference values of the three samples in the time window under test are within the corresponding confidence intervals σ(α), σ(β) and σ(γ), and the number of repeated constraint rounds is determined by the required missed-report (false-negative) rate;

Step 4: if the difference values of the three samples in the time window under test are still within the confidence intervals after the repeated constraint is finished, the information system is considered normal in the time window under test; otherwise it is considered abnormal.
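
The four analysis steps can be sketched as a short loop. This is an illustration under stated assumptions, not the patented implementation: the mean square errors M′ and M″ of each sample are passed in precomputed, v2, v4 and v6 are assumed to follow from formula (3) after every shrink (v2 = 1 - v1, and likewise for the other pairs), and the name `detect_window` and its parameters are hypothetical.

```python
def detect_window(mse_pairs, intervals, shrink=(0.5, 0.5, 0.5), rounds=3, bias=0.0):
    """Return True if the window is judged normal, False if abnormal.

    mse_pairs: three (M', M'') tuples, one per sample (alpha, beta, gamma).
    intervals: three (low, high) confidence intervals.
    shrink:    constraint factors a1, a2, a3, each in [0, 1).
    rounds:    number of repeated constraint rounds.
    """
    v_normal = [0.5, 0.5, 0.5]  # initial v1 = v2, v3 = v4, v5 = v6
    for _ in range(rounds + 1):  # one unconstrained check, then `rounds` constrained checks
        for (m_n, m_a), (low, high), v in zip(mse_pairs, intervals, v_normal):
            f = v * m_n + (1.0 - v) * m_a + bias
            if not low <= f <= high:
                return False  # any sample outside its interval marks the window abnormal
        # Shrink the normal-window coefficients for the next round.
        v_normal = [a * v for a, v in zip(shrink, v_normal)]
    return True


pairs = [(2.0, 6.0)] * 3  # hypothetical (M', M'') values
wide = detect_window(pairs, [(2.0, 7.0)] * 3)
narrow = detect_window(pairs, [(2.0, 5.2)] * 3)
```

With these hypothetical numbers the difference value rises from 4.0 to 5.0, 5.5 and 5.75 as the normal-window coefficient shrinks, so the window passes against the interval (2.0, 7.0) but is flagged abnormal against (2.0, 5.2) in the second constrained round.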

Advantageous effects: the invention provides an SGSE algorithm that processes multi-source heterogeneous logs into multi-dimensional samples representing the state of the information system for algorithmic analysis. It also provides an ECC algorithm for multi-source heterogeneous logs that analyzes these multi-dimensional samples and determines the running state of the information system in a time window from the multi-source heterogeneous logs of that window.

Drawings

FIG. 1: flow chart of multi-source heterogeneous log analysis.

FIG. 2: sample generation diagram for the state of the time window under test.

Detailed Description

The invention provides a method for processing multi-source heterogeneous log data in an information system by using an SGSE (State Generation Sequential Extraction) algorithm, and provides a new ECC (Error Coefficient Constraint) algorithm for judging an operation State in the information system aiming at the characteristics of the multi-source heterogeneous log of the information system. Referring to fig. 1, the multi-source heterogeneous log analysis method of the present invention includes the following steps:

Step 1: the size of the time window is determined according to the response time required of the information system.

Step 2: the logs in each time window are processed with the SGSE algorithm, which turns the log data of each time window into samples.

Step 3: the ECC log analysis model is trained and used to analyze the time windows that need to be analyzed.

Step 4: the log analysis result is presented.
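
Steps 1 and 2 presuppose that every log record can be assigned to a fixed-size time window. A minimal sketch of that assignment follows; the record layout `(device, unix_timestamp)`, the parameter `window_seconds` and the function name `assign_windows` are assumptions made for illustration and are not specified by the patent.

```python
from collections import defaultdict


def assign_windows(records, window_seconds):
    """Group (device, unix_timestamp) records by fixed-size time window.

    Returns {window_index: [(device, timestamp), ...]}.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    windows = defaultdict(list)
    for device, ts in records:
        # Floor division maps each timestamp to the window that contains it.
        windows[int(ts // window_seconds)].append((device, ts))
    return dict(windows)


records = [("waf", 0.5), ("firewall", 59.9), ("waf", 60.0), ("nginx", 125.0)]
windows = assign_windows(records, window_seconds=60)
```

A shorter `window_seconds` gives faster detection at the cost of fewer logs per window, which is the trade-off the response-time requirement is meant to settle.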

In a preferred embodiment, the steps of the ECC log analysis model training are as follows:

v1＝1-v2 (3)

f(tableα)＝v1*M′_tableα+v2*M″_tableα+bias (4)

M′_tableα is the mean square error between the sample labelled by a feature of the log quantity state subsequence and the sample labelled by a feature of the log quantity state subsequence in the normal time window,

bias is the bias term,

σ(α)＝[min(U1)，max(U1)]∩[min(P1),max(P1)]

v3＝1-v4 (7)

f(tableβ)＝v3*M′_tableβ+v4*M″_tableβ+bias (8)

bias is the bias term,

σ(β)＝[min(U2)，max(U2)]∩[min(P2)，max(P2)]

v5＝1-v6 (11)

f(tableγ)＝v5*M′_tableγ+v6*M″_tableγ+bias (12)

bias is the bias term,

Step 8: calculating, through formulas (9) - (12), the difference values between each normal time window and the other normal and abnormal time windows and storing them as a set U₃; calculating, through formulas (9) - (12), the difference values between each abnormal time window and the normal and abnormal time windows and storing them as a set P₃; and obtaining the confidence interval σ(γ) of the field state subsequence under the normal time window as

σ(γ)＝[min(U3)，max(U3)]∩[min(P3)，max(P3)]。

In a preferred embodiment, the log analysis method comprises the following steps:

v1＝a1*v1 (0≤a1＜1)

v3＝a2*v3(0≤a2＜1)

v5＝a3*v5(0≤a3＜1)

The invention also provides a log anomaly detection system, which comprises:

A time window dividing module, used for determining the size of the time window according to the information system's response-time requirement.

An SGSE data processing module, used for processing the log data of each time window into sample data that the ECC log analysis model can call.

An ECC model training module, used for training the ECC log analysis model.

An ECC log analysis module, used for analyzing the time window under test with the ECC log analysis model and determining, from the logs of all devices in the information system, whether the state of the information system in that time window is normal.

In one scheme, the time window dividing module makes the time window as short as possible, consistent with the response time the user requires of the information system.

In one scheme, the SGSE data processing module is divided into an SG state generation submodule and an SE sequence extraction submodule.

The SG state generation submodule identifies the devices in the information system, such as the WAF, the load balancer and the firewall. It counts the number of logs of each device to generate the log quantity state subsequence, which contains features such as the WAF log count and the load balancer log count.

Table 1: log quantity status subsequence

Time window / device type	WAF log count	Load balancer log count	Firewall log count	……	Nginx log count
Time window N	α₁	α₂	α₃	……	α_n
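
The row of Table 1 can be produced by a simple per-device count. The sketch below is illustrative only: the record layout `(device, raw_line)`, the device names and the function name `log_count_subsequence` are assumptions, not part of the patent.

```python
from collections import Counter


def log_count_subsequence(window_records, devices):
    """Return [α1, ..., αn]: the number of logs per device in one time window."""
    counts = Counter(device for device, _line in window_records)
    # Devices that produced no logs in the window still get a feature, with value 0.
    return [counts.get(device, 0) for device in devices]


devices = ["waf", "load_balancer", "firewall", "nginx"]
window_records = [
    ("waf", "alert ..."),
    ("waf", "block ..."),
    ("firewall", "tcp ..."),
    ("nginx", "GET / 200"),
]
alpha = log_count_subsequence(window_records, devices)  # [2, 0, 1, 1]
```

Keeping the device order fixed across windows matters, because later steps compare the same feature position between windows.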

The number of logs of each category generated on each device in the time window is counted to obtain the user behavior state subsequence. Categories are obtained by combining the values of every field whose entropy is not 0. For example, the WAF device has an action field that records how the WAF handled an access, with the values alert and block, and an http_method field that records the HTTP status, with the four values 200, 404, 500 and 501; from these two fields the WAF device can produce 2 × 4 = 8 different categories of log. The subsequence contains features such as the counts for the WAF log categories and the firewall log categories.

Table 2: user behavior state subsequence

The number of occurrences of each type in certain important fields of each device in the time window is counted to generate the field state subsequence. The field state subsequence contains features such as the number of alert entries in the WAF action field and the proportion of TCP entries in the firewall protocol field.

Table 3: field state subsequence
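
The user behavior and field state subsequences described above can be sketched as two small counting functions. The dictionary-based log layout and the names `user_behavior_subsequence` and `field_state_count` are assumptions for illustration; the field names and values follow the WAF example in the text.

```python
from collections import Counter
from itertools import product


def user_behavior_subsequence(logs, field_values):
    """Count logs per category, where a category is one combination of field values."""
    fields = list(field_values)
    categories = list(product(*(field_values[f] for f in fields)))
    counts = Counter(tuple(log[f] for f in fields) for log in logs)
    # Every possible category gets a feature, including those with zero logs.
    return {category: counts.get(category, 0) for category in categories}


def field_state_count(logs, field, value):
    """Count how often one value occurs in one field."""
    return sum(1 for log in logs if log.get(field) == value)


waf_fields = {"action": ["alert", "block"], "http_method": ["200", "404", "500", "501"]}
waf_logs = [
    {"action": "alert", "http_method": "200"},
    {"action": "alert", "http_method": "404"},
    {"action": "block", "http_method": "404"},
    {"action": "alert", "http_method": "200"},
]
beta = user_behavior_subsequence(waf_logs, waf_fields)   # 2 x 4 = 8 categories
alert_count = field_state_count(waf_logs, "action", "alert")
```

A ratio feature, such as the share of TCP entries in a firewall protocol field, could be derived by dividing such a count by the number of logs in the window.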

The SE sequential extraction sub-module sequentially extracts a feature in one sub-sequence as a tag in the same time window, and combines the tag with all features of the other two sub-sequences to form a sample, as shown in the following table:

Table 4: Feature merging table with the WAF log quantity of the log quantity state subsequence as the label

Table 5: Feature merging table with WAF log type 1 of the user behavior state subsequence as the label

Table 6: Feature merging table with the WAF action-field alert count of the field state subsequence as the label

By the SE sequential extraction sub-module, any feature in any one sub-sequence can have all features in the other two sub-sequences corresponding to it. An association is made between the generated sample features. When a time window is analyzed, one feature of each subsequence is randomly selected to form three labels representing the time window to be measured.
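
The sequential extraction can be sketched as follows: each feature in turn becomes a label, and the features of the other two subsequences form the body of its sample, giving n + m + j samples per time window. The dictionary layout of a sample and the function name `se_extract` are assumptions made for illustration.

```python
def se_extract(alpha, beta, gamma):
    """Build one sample per feature: that feature is the label, and the
    features of the other two subsequences form the sample body."""
    subsequences = {"alpha": list(alpha), "beta": list(beta), "gamma": list(gamma)}
    samples = []
    for name, values in subsequences.items():
        # Concatenate the two subsequences that do not contain the label.
        others = [v for other, vals in subsequences.items() if other != name for v in vals]
        for index, label in enumerate(values):
            samples.append(
                {"source": name, "index": index, "label": label, "features": list(others)}
            )
    return samples


# Hypothetical window with n = 3, m = 2, j = 1 features.
samples = se_extract(alpha=[2, 0, 1], beta=[5, 3], gamma=[4])
```

For the hypothetical window above this yields 3 + 2 + 1 = 6 samples; the first has label 2 and features [5, 3, 4].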

In one scheme, the ECC model training module is divided into a submodule for determining a small number of normal and abnormal time windows and an ECC model training submodule.

The submodule for determining a small number of normal and abnormal time windows: a clustering algorithm alone cannot classify accurately, and inaccurate classification is amplified when the analysis model is trained, which affects the analysis result. Instead, a small number of time windows in the historical data of the existing information system can be identified accurately as normal or abnormal on the basis of professional knowledge.

The ECC model training submodule trains the model by the following steps:

Step 1: the multi-source heterogeneous log data in the normal and abnormal time windows is processed by the SG state generation submodule: the number of logs is counted to generate the log quantity state subsequence, the number of logs of each type generated on each device in the time window is counted to generate the user behavior state subsequence, and the number of occurrences of each type in certain important fields of each device in the time window is counted to generate the field state subsequence.

Step 2: the n features of the log quantity state subsequence, the m features of the user behavior state subsequence and the j features of the field state subsequence in the time window are processed by the SE sequence extraction submodule to generate (n + m + j) sample data sets.

v1＝1-v2 (3)

f(tableα)＝v1*M′_tableα+v2*M″_tableα+bias (4)

M″_tableα is the mean square error between the sample labelled by a feature of the log quantity state subsequence and the sample labelled by a feature of the log quantity state subsequence in the abnormal time window,

bias is the bias term,

σ(α)＝[min(U1)，max(U1)]∩[min(P1),max(P1)]

v3＝1-v4 (7)

f(tableβ)＝v3*M′_tableβ+v4*M″_tableβ+bias (8)

bias is the bias term,

σ(β)＝[min(U2),max(U2)]∩[min(P2),max(P2)]

v5＝1-v6 (11)

f(tableγ)＝v5*M′_tableγ+v6*M″_tableγ+bias (12)

bias is the bias term,

σ(γ)＝[min(U3),max(U3)]∩[min(P3),max(P3)]。
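
The pairwise calculation that yields the sets U and P for one subsequence can be sketched as follows. Because formulas (1), (2), (5), (6), (9) and (10) are not reproduced in this text, the sketch assumes that M′ and M″ are the average mean square error between a sample's feature vector and the samples of the other normal and abnormal windows respectively; that reading, and the names `mse`, `mean_mse` and `difference_sets`, are assumptions rather than the patent's definitions.

```python
def mse(a, b):
    """Mean square error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def mean_mse(sample, references):
    """Average MSE between one sample and a list of reference samples."""
    return sum(mse(sample, r) for r in references) / len(references)


def difference_sets(normal, abnormal, bias=0.0):
    """Return (U, P): difference values of the normal and of the abnormal windows.

    Needs at least two normal and two abnormal windows, because each window
    is compared only against the other windows.
    """
    def f(sample, skip_normal=None, skip_abnormal=None):
        normal_refs = [s for i, s in enumerate(normal) if i != skip_normal]
        abnormal_refs = [s for i, s in enumerate(abnormal) if i != skip_abnormal]
        # v1 = v2 = 0.5 during training, as the text specifies.
        return 0.5 * mean_mse(sample, normal_refs) + 0.5 * mean_mse(sample, abnormal_refs) + bias

    u = [f(s, skip_normal=i) for i, s in enumerate(normal)]
    p = [f(s, skip_abnormal=i) for i, s in enumerate(abnormal)]
    return u, p


# Hypothetical feature vectors for three normal and two abnormal windows.
normal = [[1.0, 1.0], [1.0, 3.0], [3.0, 1.0]]
abnormal = [[9.0, 9.0], [9.0, 11.0]]
U, P = difference_sets(normal, abnormal)
```

The confidence interval for the subsequence then follows from the minimum and maximum of U and P as in the σ expressions above.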

In one aspect, the ECC log analysis module comprises the following steps:

Step 1: when a time window under test is analyzed, one feature of each subsequence is randomly selected to form three labels, and three samples representing the time window under test are formed, as shown in FIG. 2.

Step 2: with the initial values set to v1 = v2, v3 = v4 and v5 = v6, the difference values f(tableα), f(tableβ) and f(tableγ) of the three samples in the window under test are calculated by formulas (1), (2), (3) and (4) of the ECC model training submodule, and each is checked against the corresponding confidence interval σ(α), σ(β) or σ(γ).

Step 3: if all three samples are within their confidence intervals, the detection accuracy is improved by constraining v1, v2, v3, v4, v5 and v6 with the constraint formulas

v1＝a1*v1 (0≤a1＜1)

v3＝a2*v3 (0≤a2＜1)

v5＝a3*v5 (0≤a3＜1)

v1, v3 and v5 are the variation coefficients of the mean square error of the normal time window; reducing v1, v3 and v5 reduces the influence of the normal time window on the difference value and increases the influence of the abnormal time window. The new v1, v2, v3, v4, v5 and v6 and the three samples are fed back into the ECC model training submodule, and formulas (1) to (12) are used to check whether the difference values are within the confidence intervals. The number of repeated constraint rounds is determined by the preset missed-report (false-negative) rate requirement: if the requirement is lenient, few constraint rounds are used; if the requirement is strict, many constraint rounds are used.

Step 4: if the difference values of the time window are still within the confidence intervals after the repeated constraint is finished, the information system is considered normal in the time window under test; otherwise it is considered abnormal.
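
How quickly the repeated constraint shifts weight from the normal to the abnormal time window can be seen from a small worked example. It assumes a single constraint factor `a` applied every round and that the abnormal-window coefficient follows from formula (3) as 1 minus the normal-window coefficient; the function name `coefficient_schedule` is hypothetical.

```python
def coefficient_schedule(a, rounds, v_normal=0.5):
    """Return [(v_normal, v_abnormal)] for the initial check and each constraint round."""
    schedule = []
    for _ in range(rounds + 1):
        schedule.append((v_normal, 1.0 - v_normal))
        v_normal *= a  # constraint: v1 = a * v1 with 0 <= a < 1
    return schedule


schedule = coefficient_schedule(a=0.5, rounds=3)
# [(0.5, 0.5), (0.25, 0.75), (0.125, 0.875), (0.0625, 0.9375)]
```

After k rounds the normal-window coefficient is 0.5 * a^k, so each extra round makes the check depend more heavily on the distance to the abnormal windows, which is why a stricter missed-report requirement calls for more rounds.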

In the single analysis and aggregation method, the logs of different devices are not analyzed jointly; the state of each device is judged separately before the results are combined, so the relationships among the logs of different devices cannot be mined. The SGSE algorithm provided by the invention processes the logs of different devices together and aggregates them into a sample that reflects the state of the information system in the time window, so the overall state of the information system is judged comprehensively rather than by aggregating the results of single-device analyses.

In the correlation analysis method, the aim is to generate statistical reports of various events, so the relationships between events cannot be mined in depth and presented directly to the user. The SGSE-ECC algorithm analyzes the multi-source heterogeneous logs in depth: rather than only generating statistical reports of various events through clustering, it mines the relationships among the logs and presents the state of the information system directly to the user.

In summary, the SGSE-ECC algorithm model performs data processing, sample generation and state analysis on multi-source heterogeneous logs, analyzes the logs automatically, and reduces operation and maintenance cost.

The SGSE algorithm provided by the invention processes the log data on each device and aggregates it into samples that reflect the state of the time window; the ECC algorithm analyzes these multi-dimensional samples, and the detection precision and the missed-report rate are adjusted by constraining the variation coefficients.

Claims

1. An SGSE-ECC-based log anomaly detection system, comprising:

the ECC model training module is used for training an ECC log analysis model;

2. The SGSE-ECC-based log anomaly detection system according to claim 1, wherein said SGSE data processing module comprises

and the SE sequence extraction sub-module is used for sequentially extracting one feature in one subsequence as a label in the same time window, combining the feature with all the features of the other two subsequences into a sample, enabling any feature in any one subsequence to correspond to all the features in the other two subsequences, and randomly selecting one feature of each subsequence to form three labels to represent the time window to be detected when one time window is analyzed.

3. The SGSE-ECC-based log anomaly detection system of claim 1, wherein the ECC model training module comprises

v1＝1-v2 (3)

f(tableα)＝v1*M′_tableα+v2*M″_tableα+bias (4)

bias is the bias term,

σ(α)＝[min(U1)，max(U1)]∩[min(P1)，max(P1)]

v3＝1-v4 (7)

f(tableβ)＝v3*M′_tableβ+v4*M″_tableβ+bias (8)

bias is the bias term,

σ(β)＝[min(U2)，max(U2)]∩[min(P2)，max(P2)]

v5＝1-v6 (11)

f(tableγ)＝v5*M′_tableγ+v6*M″_tableγ+bias (12)

M′_tableγ is the mean square error between the sample labelled by a feature of the field state subsequence and the sample labelled by a feature of the field state subsequence in the normal time window,

bias is the bias term,

σ(γ)＝[min(U3)，max(U3)]∩[min(P3)，max(P3)]。

4. The SGSE-ECC-based log anomaly detection system according to claim 1, wherein the log analysis module performs a model analysis by:

step 1: when a detected time window is analyzed, randomly selecting one feature of each subsequence to form three labels and forming three samples to represent the detected time window;

v1＝a1*v1 (0≤a1＜1)

v3＝a2*v3(0≤a2＜1)

v5＝a3*v5(0≤a3＜1)