CN111984515B

CN111984515B - Multi-source heterogeneous log analysis method

Info

Publication number: CN111984515B
Application number: CN202010911771.1A
Authority: CN
Inventors: 汪祖民; 田纪宇; 秦静; 季长清
Original assignee: Dalian University
Current assignee: Dalian University
Priority date: 2020-09-02
Filing date: 2020-09-02
Publication date: 2024-01-23
Anticipated expiration: 2040-09-02
Also published as: CN111984515A

Abstract

A multi-source heterogeneous log analysis method, belonging to the field of log data processing, aims to solve the problem of log analysis. The method comprises the following steps: step 1: determining the size of a time window according to the response time required by the information system; step 2: using the SGSE algorithm to process the log data in each time window into samples that can be consumed by the ECC log analysis algorithm; step 3: using the ECC log analysis model to train on and analyze whether each time window is normal; step 4: presenting the log analysis result. The effect is that anomaly analysis can be performed on the logs.

Description

Multi-source heterogeneous log analysis method

Technical Field

The invention belongs to the field of log data processing, and relates to a multi-source heterogeneous log analysis method and system.

Background

With the development of internet technology, the number of logs generated by each device in an information system is increasing, and analysis of logs generated by different devices and with different data characteristics is an important component of operation and maintenance work. The multi-source heterogeneous log data is analyzed by an automatic means, so that whether the running state of the information system is abnormal or normal can be timely obtained, the safe and stable running of the information system is ensured, and the running and maintenance cost of enterprises is further reduced.

In the current technical method of multi-source heterogeneous log analysis, methods of single analysis, re-aggregation, association analysis and the like are used. In the single analysis and aggregation method, logs generated in a single device in an information system are analyzed to analyze the running state of each device, and then whether the whole information system has abnormal conditions or not is judged according to the state of each device according to a preset rule. However, in the method, the logs in different devices are not analyzed in a combined way, but are analyzed after the states of the different devices are judged independently, so that the relation between the logs of the different devices cannot be mined. In the association analysis method, characteristic events are generated according to the content of each field in the log, and similarity comparison is carried out after events generated by different devices under a time window are clustered, so that similar events are removed. And then merging the similar events of different devices, and finally generating a statistical report of various events. However, in this method, with the purpose of generating a statistical report of various events, the relationships between the various events cannot be deeply mined and directly presented to the user, and each event cannot be accurately classified only by using a clustering algorithm.

In current techniques for multi-source heterogeneous log analysis, methods such as single analysis followed by aggregation ^[1] and association analysis ^[2] are used. In the single analysis and aggregation method, the logs generated by a single device in the information system are analyzed to determine the running state of that device, and whether the whole information system is abnormal is then judged from the states of the individual devices according to preset rules. In the association analysis method, characteristic events are generated from the contents of the fields in each log, the events generated by different devices within a time window are compared for similarity, and similar events are removed. The similar events of different devices are then merged, and finally a statistical report of the various events is generated.

In the single analysis and aggregation method, the logs of different devices are not analyzed jointly; the state of each device is judged independently before any combined analysis, so the relationships among the logs of different devices cannot be mined. In the association analysis method, because the purpose is to generate a statistical report of various events, the relationships between the events cannot be deeply mined and directly presented to the user.

Disclosure of Invention

In order to solve the problem of log analysis, the invention provides the following technical scheme: a multi-source heterogeneous log analysis method comprises the following steps:

step 1: determining the size of a time window according to the response time required by the information system;

step 2: processing the log data in each time window into samples which can be called by an ECC log analysis algorithm by using an SGSE algorithm;

step 3: training and analyzing whether the time window is normal or not by using an ECC log analysis model;

step 4: and presenting a log analysis result.

Further, the model training steps are as follows:

step 1: for the multi-source heterogeneous log data in the normal time windows and the abnormal time windows, generating a log quantity state subsequence by counting the number of logs in the time window, generating a user behavior state subsequence by counting the number of logs of each log category generated on each device in the time window, and generating a field state subsequence by counting the number of occurrences of each value type in certain important fields of each device in the time window;

step 2: generating (n+m+j) sample data sets from n features in the log quantity state subsequence, m features in the user behavior state subsequence and j features in the field state subsequence under each time window;

step 3: for each normal and abnormal time window, taking the sample data set in which a feature of the log quantity state subsequence serves as the label, and calculating, pairwise according to the ECC expression, its difference value against the sample data sets of the other normal and abnormal time windows, wherein the calculation expressions are as follows:

v1 = 1 - v2 (3)

f(table α) = v1 * M′_tableα + v2 * M″_tableα + bias (4)

table α denotes the sample obtained when a feature in the log quantity state subsequence is used as the label,

table α′ denotes the sample obtained when a feature in the log quantity state subsequence under a normal time window is used as the label,

table α″ denotes the sample obtained when a feature in the log quantity state subsequence under an abnormal time window is used as the label,

M′_tableα is the mean square error between table α and the sample table α′ of the representative normal time window,

M″_tableα is the mean square error between table α and the sample table α″ of the representative abnormal time window,

bias is the bias term,

v1 and v2 are the change coefficients of the mean square errors of the normal time window and the abnormal time window respectively, with v1 = v2 during training,

f(table α) is the difference value calculated for a trained time window with a feature in the log quantity state subsequence as the label;

step 4: calculating, through formulas (1)-(4), the difference values between each normal time window and the other normal and abnormal time windows and storing them as a set U1; calculating, through formulas (1)-(4), the difference values between each abnormal time window and the normal and abnormal time windows and storing them as a set P1; and obtaining the confidence interval σ(α) of the log quantity state subsequence under the normal time window as σ(α) = [min(U1), max(U1)] ∩ [min(P1), max(P1)];

step 5: for each normal and abnormal time window, taking the sample data set in which a feature of the user behavior state subsequence serves as the label, and calculating, pairwise according to the ECC expression, its difference value against the sample data sets of the other normal and abnormal time windows, wherein the calculation expressions are as follows:

v3 = 1 - v4 (7)

f(table β) = v3 * M′_tableβ + v4 * M″_tableβ + bias (8)

table β denotes the sample obtained when a feature in the user behavior state subsequence is used as the label,

table β′ denotes the sample obtained when a feature in the user behavior state subsequence under a normal time window is used as the label,

table β″ denotes the sample obtained when a feature in the user behavior state subsequence under an abnormal time window is used as the label,

M′_tableβ is the mean square error between table β and the sample table β′ of the representative normal time window,

M″_tableβ is the mean square error between table β and the sample table β″ of the representative abnormal time window,

bias is the bias term,

v3 and v4 are the change coefficients of the mean square errors of the normal time window and the abnormal time window respectively, with v3 = v4 during training,

f(table β) is the difference value calculated for a trained time window with a feature in the user behavior state subsequence as the label;

step 6: calculating, through formulas (5)-(8), the difference values between each normal time window and the other normal and abnormal time windows and storing them as a set U2; calculating, through formulas (5)-(8), the difference values between each abnormal time window and the normal time windows and storing them as a set P2; and obtaining the confidence interval σ(β) of the user behavior state subsequence under the normal time window as

σ(β) = [min(U2), max(U2)] ∩ [min(P2), max(P2)];

step 7: for each normal and abnormal time window, taking the sample data set in which a feature of the field state subsequence serves as the label, and calculating, pairwise according to the ECC expression, its difference value against the sample data sets of the other normal and abnormal time windows, wherein the calculation expressions are as follows:

v5 = 1 - v6 (11)

f(table γ) = v5 * M′_tableγ + v6 * M″_tableγ + bias (12)

table γ denotes the sample obtained when a feature in the field state subsequence is used as the label,

table γ′ denotes the sample obtained when a feature in the field state subsequence under a normal time window is used as the label,

table γ″ denotes the sample obtained when a feature in the field state subsequence under an abnormal time window is used as the label,

M′_tableγ is the mean square error between table γ and the sample table γ′ of the representative normal time window,

M″_tableγ is the mean square error between table γ and the sample table γ″ of the representative abnormal time window,

bias is the bias term,

v5 and v6 are the change coefficients of the mean square errors of the normal time window and the abnormal time window respectively, with v5 = v6 during training,

f(table γ) is the difference value calculated for a trained time window with a feature in the field state subsequence as the label;

step 8: calculating, through formulas (9)-(12), the difference values between each normal time window and the other normal time windows and storing them as a set U3; calculating, through formulas (9)-(12), the difference values between each abnormal time window and the normal time windows and storing them as a set P3; and obtaining the confidence interval σ(γ) of the field state subsequence under the normal time window as σ(γ) = [min(U3), max(U3)] ∩ [min(P3), max(P3)].
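The training computation for one subsequence can be illustrated with a minimal Python sketch. Formulas (1) and (2) are not reproduced in this text, so the sketch assumes that M′ and M″ are plain mean square errors between equal-length feature vectors and that a single reference sample stands for the representative normal and abnormal windows. The function names `mse`, `difference_value` and `confidence_interval` are hypothetical, not taken from the patent.

```python
def mse(a, b):
    """Mean square error between two equal-length feature vectors (assumed form of M' and M'')."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def difference_value(sample, normal_ref, abnormal_ref, v_normal=0.5, bias=0.0):
    """Difference value f = v1*M' + v2*M'' + bias, with v1 = 1 - v2 (formulas (3) and (4))."""
    v_abnormal = 1.0 - v_normal
    m_normal = mse(sample, normal_ref)      # stands in for M'
    m_abnormal = mse(sample, abnormal_ref)  # stands in for M''
    return v_normal * m_normal + v_abnormal * m_abnormal + bias


def confidence_interval(u, p):
    """Intersection [min(U), max(U)] with [min(P), max(P)]; None when the ranges do not overlap."""
    lo = max(min(u), min(p))
    hi = min(max(u), max(p))
    return (lo, hi) if lo <= hi else None


# Hypothetical difference values collected from normal (U1) and abnormal (P1) windows.
U1 = [1.0, 3.0]
P1 = [2.0, 5.0]
print(confidence_interval(U1, P1))
```

With v1 = v2 during training, as the description states, both coefficients equal 0.5, which is why `v_normal` defaults to 0.5 here.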

Further, the log analysis method comprises the following steps:

step 1: when analyzing a detected time window, randomly selecting a characteristic of each subsequence to form three labels and forming three samples to represent the detected time window;

step 2: assuming an initial value v1=v2, v3=v4, v5=v6, and calculating whether the difference values f (table α), f (table β), f (table γ) of three samples under the detected time window are within the corresponding confidence intervals σ (α), σ (β), σ (γ) according to formulas (1) - (12), respectively;

step 3: if the difference values of all three samples are within their confidence intervals, constraining v1, v2, v3, v4, v5 and v6, the constraint formulas being

v1 = a1 * v1 (0 ≤ a1 < 1)

v3 = a2 * v3 (0 ≤ a2 < 1)

v5 = a3 * v5 (0 ≤ a3 < 1)

v1, v3 and v5 are the change coefficients of the mean square errors of the normal time windows; reducing v1, v3 and v5 reduces the influence of the normal time windows on the difference values and increases the influence of the abnormal time windows. With the new v1, v2, v3, v4, v5 and v6 and the three samples, formulas (1)-(12) are applied again to calculate whether the difference values of the three samples under the detected time window are within the corresponding confidence intervals σ(α), σ(β) and σ(γ); the number of times the constraint is repeated is determined according to the requirement on the missing report rate (false-negative rate);

step 4: if the difference values of the three samples under the detected time window are still within the confidence intervals after the repeated constraints are finished, the information system is considered normal in the detected time window; otherwise, it is considered abnormal.
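The constraint loop in steps 2 to 4 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the mean square errors M′ and M″ of each sample are taken as already computed, the abnormal coefficient is re-derived from v1 = 1 - v2 after every constraint, and the names `stays_in_interval` and `window_is_normal`, the factor `a` and the round count are hypothetical.

```python
def stays_in_interval(m_normal, m_abnormal, interval, a, rounds, bias=0.0):
    """Check one sample: its difference value must stay inside the confidence
    interval at the initial coefficients and after each of `rounds` constraints."""
    lo, hi = interval
    v_normal = 0.5  # initial value, v1 = v2
    for _ in range(rounds + 1):
        v_abnormal = 1.0 - v_normal
        f = v_normal * m_normal + v_abnormal * m_abnormal + bias
        if not lo <= f <= hi:
            return False
        v_normal *= a  # constraint: v1 = a1 * v1 with 0 <= a1 < 1
    return True


def window_is_normal(samples, a=0.5, rounds=2):
    """A window is judged normal only if all three samples stay in their intervals."""
    return all(stays_in_interval(m_n, m_a, iv, a, rounds) for m_n, m_a, iv in samples)


# Hypothetical (M', M'', confidence interval) triples for the three samples.
samples = [(1.0, 1.0, (0.5, 1.5)), (1.0, 1.0, (0.5, 1.5)), (1.0, 3.0, (1.0, 2.5))]
print(window_is_normal(samples))
```

Checking each sample independently over all rounds gives the same verdict as constraining the three coefficients together, because the window is abnormal as soon as any sample leaves its interval in any round.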

The beneficial effects are that: the invention provides an SGSE algorithm for processing a multi-source heterogeneous log into a multi-dimensional sample which can represent the state of an information system for analysis by the algorithm. The invention provides an ECC algorithm for multi-source heterogeneous logs, which can analyze multi-dimensional samples and analyze the running state of an information system of the time window according to the multi-source heterogeneous logs under the time window.

Drawings

Fig. 1: a multi-source heterogeneous log analysis flow chart.

Fig. 2: sample generation diagram for the state of the window under test.

Detailed Description

The invention provides an SGSE (State Generation Sequential Extraction) algorithm for processing the multi-source heterogeneous log data in an information system and, in view of the characteristics of the multi-source heterogeneous logs of the information system, a new ECC (Error Coefficient Constraint) algorithm for judging the running state of the information system. Referring to fig. 1, the multi-source heterogeneous log analysis method of the invention includes the following steps:

Step 1: determining the size of the time window according to the response time required by the information system.

Step 2: processing the log data in each time window into samples by using the SGSE algorithm.

Step 3: training the ECC log analysis model and using it to analyze the time window to be analyzed.

Step 4: presenting the log analysis result.
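As an illustration of how logs might be split once the window size from Step 1 is fixed, the following Python sketch buckets timestamped records into fixed-size windows. The record layout `(timestamp, device, fields)`, the function name `assign_windows` and the 60-second window are assumptions made for the example; the patent does not prescribe them.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def assign_windows(records, start, window_seconds):
    """Group (timestamp, device, fields) records into fixed-size windows.

    Window k covers [start + k*window_seconds, start + (k+1)*window_seconds).
    """
    windows = defaultdict(list)
    for ts, device, fields in records:
        index = int((ts - start).total_seconds() // window_seconds)
        windows[index].append((device, fields))
    return dict(windows)


# Hypothetical logs from three devices, bucketed into 60-second windows.
start = datetime(2020, 9, 2, 0, 0, 0)
records = [
    (start + timedelta(seconds=0), "waf", {"action": "alert"}),
    (start + timedelta(seconds=59), "firewall", {"protocol": "TCP"}),
    (start + timedelta(seconds=60), "waf", {"action": "block"}),
    (start + timedelta(seconds=125), "nginx", {}),
]
windows = assign_windows(records, start, 60)
print({k: len(v) for k, v in sorted(windows.items())})
```

Each window then holds the logs of all devices together, which is what allows the later steps to analyze the devices jointly rather than one at a time.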

In a preferred embodiment, the training of the ECC log analysis model is as follows:

v1 = 1 - v2 (3)

f(table α) = v1 * M′_tableα + v2 * M″_tableα + bias (4)

bias is the bias term,

step 4: calculating, through formulas (1)-(4), the difference values between each normal time window and the other normal and abnormal time windows and storing them as a set U1; calculating, through formulas (1)-(4), the difference values between each abnormal time window and the normal time windows and storing them as a set P1; and obtaining the confidence interval σ(α) of the log quantity state subsequence under the normal time window as σ(α) = [min(U1), max(U1)] ∩ [min(P1), max(P1)]

v3＝1-v4 (7)

f(table β) = v3 * M′_tableβ + v4 * M″_tableβ + bias (8)

M′_tableβ is the mean square error between the sample obtained when a feature in the user behavior state subsequence is used as the label and the corresponding sample of the representative normal time window,

M″_tableβ is the mean square error between the sample obtained when a feature in the user behavior state subsequence is used as the label and the corresponding sample of the representative abnormal time window,

bias is the bias term,

σ(β)＝[min(U2)，max(U2)]∩[min(P2)，max(P2)]

v5＝1-v6 (11)

f(table γ) = v5 * M′_tableγ + v6 * M″_tableγ + bias (12)

M′_tableγ is the mean square error between the sample obtained when a feature in the field state subsequence is used as the label and the corresponding sample of the representative normal time window,

M″_tableγ is the mean square error between the sample obtained when a feature in the field state subsequence is used as the label and the corresponding sample of the representative abnormal time window,

bias is the bias term,

In a preferred embodiment, the method of log analysis comprises the steps of:

v1 = a1 * v1 (0 ≤ a1 < 1)

v3 = a2 * v3 (0 ≤ a2 < 1)

v5 = a3 * v5 (0 ≤ a3 < 1)

The invention also provides a log abnormality detection system, which comprises:

a time window dividing module, used for determining the size of the time window according to the information system's requirement on response time;

an SGSE data processing module, used for processing the log data, by time window, into sample data that can be consumed by the ECC log analysis model;

an ECC model training module, used for training the ECC log analysis model;

an ECC log analysis module, used for analyzing the detected time window with the ECC log analysis model and judging, from the logs of all the devices in the information system, whether the state of the information system in that time window is normal.

In one scheme, the time window dividing module determines the size of the time window according to the response time that the user requires of the information system, keeping the window as short as possible.

In one scheme, the SGSE data processing module is divided into an SG state generating sub-module and an SE sequence extracting sub-module.

The SG state generation sub-module first identifies the devices in the information system, such as the WAF, the load balancer and the firewall. It then counts the number of logs of each device to generate the log quantity state subsequence, which contains features such as the WAF device log quantity and the load balancing device log quantity.

Table 1: log quantity state subsequence

Time window / device type	WAF log quantity	Load balancing log quantity	Firewall log quantity	......	Nginx log quantity
Time window N	α₁	α₂	α₃	......	αₙ
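The row of Table 1 for one time window can be produced by a simple per-device count. The sketch below assumes each record in a window is a `(device, fields)` pair and that the device order is given explicitly; the function name `log_quantity_subsequence` and the device identifiers are hypothetical.

```python
from collections import Counter


def log_quantity_subsequence(window_records, devices):
    """Return [alpha_1, ..., alpha_n]: the number of logs per device in one window."""
    counts = Counter(device for device, _fields in window_records)
    return [counts.get(device, 0) for device in devices]


# Hypothetical window containing four log records.
window = [("waf", {}), ("waf", {}), ("firewall", {}), ("nginx", {})]
devices = ["waf", "load_balancer", "firewall", "nginx"]
print(log_quantity_subsequence(window, devices))
```

Devices that produced no logs in the window still get an entry of 0, so every window yields a vector of the same length n.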

The number of logs of each log category generated on each device in a time window is counted to obtain the user behavior state subsequence. The categories are determined by combining the value types of the fields whose entropy is not 0. For example, a WAF device has an action field that records how the WAF handled an access, with the types alert and block, and an http_method field that records the HTTP status, with the four types 200, 404, 500 and 501; from these two fields the WAF device can generate 2 × 4 = 8 different log categories. The subsequence contains features such as the number of logs in each WAF log category and the number of logs in each firewall log category.

Table 2: user behavior state subsequence
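The category construction described above (keep the fields whose entropy is not 0, then treat each combination of their values as a log category) could look like the following sketch. It assumes Shannon entropy over the values observed for one device and represents each log as a dictionary; the function names and the `vendor` field are hypothetical.

```python
import math
from collections import Counter


def field_entropy(values):
    """Shannon entropy (in bits) of the observed values of one field."""
    total = len(values)
    counts = Counter(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def informative_fields(device_logs, candidate_fields):
    """Keep the fields whose entropy is not 0, i.e. that take more than one value."""
    return [f for f in candidate_fields
            if field_entropy([log[f] for log in device_logs]) > 0]


def user_behavior_subsequence(device_logs, fields):
    """Count logs per category, a category being the tuple of values of `fields`."""
    return Counter(tuple(log[f] for f in fields) for log in device_logs)


# Hypothetical WAF logs; `vendor` never changes, so its entropy is 0.
waf_logs = [
    {"action": "alert", "http_method": "200", "vendor": "x"},
    {"action": "block", "http_method": "404", "vendor": "x"},
    {"action": "alert", "http_method": "200", "vendor": "x"},
]
fields = informative_fields(waf_logs, ["action", "http_method", "vendor"])
print(fields, dict(user_behavior_subsequence(waf_logs, fields)))
```

A field that always holds the same value carries no information about behavior, which is why it is dropped before the categories are formed.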

The number of occurrences of each value type in certain important fields of each device in the time window is counted to generate the field state subsequence. The field state subsequence contains features such as the number of alert entries in the WAF action field and the proportion of TCP entries in the firewall protocol field.

Table 3: field state subsequence
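The two example features named above, the alert count of the WAF action field and the TCP proportion of the firewall protocol field, can be computed as in this sketch. The record layout `(device, fields)` and the key name `protocol` are assumptions for the example; only the action field and its alert type come from the description.

```python
def field_state_subsequence(window_records):
    """Return [WAF alert count, firewall TCP ratio] for one time window."""
    waf_actions = [f["action"] for d, f in window_records
                   if d == "waf" and "action" in f]
    fw_protocols = [f["protocol"] for d, f in window_records
                    if d == "firewall" and "protocol" in f]
    alert_count = waf_actions.count("alert")
    # Guard against a window without firewall logs.
    tcp_ratio = fw_protocols.count("TCP") / len(fw_protocols) if fw_protocols else 0.0
    return [alert_count, tcp_ratio]


# Hypothetical window with two WAF logs and four firewall logs.
window = [
    ("waf", {"action": "alert"}),
    ("waf", {"action": "block"}),
    ("firewall", {"protocol": "TCP"}),
    ("firewall", {"protocol": "UDP"}),
    ("firewall", {"protocol": "TCP"}),
    ("firewall", {"protocol": "TCP"}),
]
print(field_state_subsequence(window))
```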

The SE sequence extraction submodule sequentially extracts one feature in one subsequence as a label under the same time window, and combines the feature with all features of other two subsequences to form a sample, as shown in the following table:

Table 4: feature merge table with the WAF log quantity in the log quantity state subsequence as the label

Table 5: feature merge table with WAF log category 1 in the user behavior state subsequence as the label

Table 6: feature merge table with the WAF action field alert count in the field state subsequence as the label

Through the SE sequence extraction sub-module, any feature in any one subsequence is paired with all the features of the other two subsequences, so that a correlation is established between the features of each generated sample. When a time window is analyzed, one feature of each subsequence is randomly selected, giving three labels that represent the detected time window.
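The extraction rule can be sketched in Python as follows: each feature of a subsequence becomes a label in turn and is merged with all features of the other two subsequences, so a window with n, m and j features yields n + m + j samples. The dictionary layout of a sample and the function name `se_extract` are assumptions made for illustration.

```python
def se_extract(alpha, beta, gamma):
    """Build one sample per feature: the feature is the label, and the
    features of the other two subsequences are the sample's inputs."""
    subsequences = [alpha, beta, gamma]
    samples = []
    for s, subsequence in enumerate(subsequences):
        others = [x for t, other in enumerate(subsequences) if t != s for x in other]
        for i, label in enumerate(subsequence):
            samples.append({"subsequence": s, "index": i,
                            "label": label, "features": list(others)})
    return samples


# Hypothetical subsequences with n = 2, m = 3 and j = 1 features.
alpha, beta, gamma = [10, 20], [1, 2, 3], [0.5]
samples = se_extract(alpha, beta, gamma)
print(len(samples), samples[0])
```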

In one scheme, the ECC model training module is divided into a sub-module for determining a small number of normal and abnormal time windows and an ECC model training sub-module.

The sub-module for determining a small number of normal and abnormal time windows: when the analysis model is trained, inaccurately classified windows are amplified and distort the analysis result. Therefore, from the existing historical data of the information system, a small number of time windows that can be accurately identified as normal or abnormal on the basis of professional knowledge are used.

The ECC model training sub-module performs the following steps:

step 1: processing the multi-source heterogeneous log data in the normal and abnormal time windows through the SG state generation sub-module: counting the number of logs to generate the log quantity state subsequence, counting the number of logs of each log category generated on each device in the time window to generate the user behavior state subsequence, and counting the number of occurrences of each value type in certain important fields of each device in the time window to generate the field state subsequence.

Step 2: through the SE sequence extraction sub-module, generating (n+m+j) sample data sets from the n features in the log quantity state subsequence, the m features in the user behavior state subsequence and the j features in the field state subsequence under each time window.

v1 = 1 - v2 (3)

f(table α) = v1 * M′_tableα + v2 * M″_tableα + bias (4)

bias is the bias term,

step 4: calculating, through formulas (1)-(4), the difference values between each normal time window and the other normal time windows and storing them as a set U1; calculating, through formulas (1)-(4), the difference values between each abnormal time window and the normal time windows and storing them as a set P1; and obtaining the confidence interval σ(α) of the log quantity state subsequence under the normal time window as σ(α) = [min(U1), max(U1)] ∩ [min(P1), max(P1)]

v3＝1-v4 (7)

f(tableβ)＝v3*M′ _tableβ +v4*M″ _tableβ +bias (8)

M′_tableβ is the mean square error between the sample obtained when a feature in the user behavior state subsequence is used as the label and the corresponding sample of the representative normal time window,

bias is the bias term,

σ(β)＝[min(U2)，max(U2)]∩[min(P2)，max(P2)]

v5＝1-v6 (11)

f(tableγ)＝v5*M′ _tableγ +v6*M″ _tableγ +bias (12)

bias is the bias term,

In one scheme, the ECC log analysis module comprises the following analysis steps:

step 1: when a detected time window is analyzed, one feature of each subsequence is randomly selected, giving three labels, and three samples are formed to represent the detected time window; the samples are shown in fig. 2.

Step 2: with initial values v1 = v2, v3 = v4 and v5 = v6, the difference values f(table α), f(table β) and f(table γ) of the three samples under the detected time window are calculated according to formulas (1)-(12) described for the ECC model training sub-module, and it is determined whether they are within the corresponding confidence intervals σ(α), σ(β) and σ(γ).

Step 3: if all three samples are within the confidence intervals, the detection precision is improved by constraining v1, v2, v3, v4, v5 and v6, the constraint formulas being

v1 = a1 * v1 (0 ≤ a1 < 1)

v3 = a2 * v3 (0 ≤ a2 < 1)

v5 = a3 * v5 (0 ≤ a3 < 1)

v1, v3 and v5 are the change coefficients of the normal time windows; when v1, v3 and v5 are reduced, the influence of the normal time windows on the difference value decreases and the influence of the abnormal time windows increases. The new v1, v2, v3, v4, v5 and v6 and the three samples are substituted again into formulas (1)-(12) described for the ECC model training sub-module to check whether the difference values are within the confidence intervals. The number of times the constraint is repeated is determined by the preset requirement on the missing report rate: if the requirement is lenient, few constraint rounds are applied; if a low missing report rate is required, more constraint rounds are applied.

Step 4: if the difference values of the time window are still within the confidence intervals after the repeated constraints are finished, the information system is considered normal in the detected time window; otherwise, it is considered abnormal.
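A short worked example shows how repeated constraints shift weight toward the abnormal windows. It assumes that v2 is re-derived from formula (3) after each constraint, and it uses hypothetical values: initial v1 = 0.5, constraint factor a1 = 0.8, M′ = 1 and M″ = 3, with bias = 0.

```latex
\[
v_1^{(k)} = a_1^{k}\, v_1^{(0)}, \qquad
v_2^{(k)} = 1 - a_1^{k}\, v_1^{(0)}, \qquad
f^{(k)} = v_1^{(k)} M' + v_2^{(k)} M''
\]
\[
\begin{array}{c|cccc}
k & 0 & 1 & 2 & 3 \\ \hline
v_1^{(k)} & 0.5 & 0.4 & 0.32 & 0.256 \\
v_2^{(k)} & 0.5 & 0.6 & 0.68 & 0.744 \\
f^{(k)} & 2.0 & 2.2 & 2.36 & 2.488
\end{array}
\]
```

Under these assumptions the difference value drifts toward M″ with every round, so a window whose sample is far from the abnormal reference is increasingly likely to leave the confidence interval as more rounds are applied.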

In the single analysis and aggregation method, the logs of different devices are not analyzed jointly; the state of each device is judged independently before any combined analysis, so the relationships among the logs of different devices cannot be mined. The SGSE algorithm provided by the invention effectively processes the logs of different devices and aggregates them into samples that represent the state of the information system in the time window, so that the overall state of the information system is judged comprehensively rather than by analyzing each device separately and then aggregating the results.

In the association analysis method, because the purpose is to generate a statistical report of various events, the relationships between the events cannot be deeply mined and directly presented to the user. With the SGSE-ECC algorithm, the multi-source heterogeneous logs are analyzed in depth, statistical reports of various events are generated through clustering, and the state of the information system is presented to the user intuitively by deeply mining the relationships among the logs.

In general, the invention uses the SGSE-ECC algorithm model to process data, generate samples, analyze states and automatically analyze logs for the multi-source heterogeneous logs, thereby reducing operation and maintenance cost.

The SGSE algorithm provided by the invention processes log data on each device, aggregates and generates samples capable of showing the state of a time window, and provides an ECC algorithm to analyze multidimensional samples, and the detection precision and the missing report rate are adjusted through constraint change coefficients.

Claims

1. The multi-source heterogeneous log analysis method is characterized by comprising the following steps of:

step 4: presenting a log analysis result;

the model training steps are as follows:

v1＝1-v2 (3)

f(tableα)＝v1*M′ _tableα +v2*M″ _tableα +bias (4)

bias is the bias term,

step 4: calculating, through formulas (1)-(4), the difference values between each normal time window and the other normal time windows and storing them as a set U1; calculating, through formulas (1)-(4), the difference values between each abnormal time window and the normal time windows and storing them as a set P1; and obtaining the confidence interval σ(α) of the log quantity state subsequence under the normal time window as

σ(α) = [min(U1), max(U1)] ∩ [min(P1), max(P1)]

v3＝1-v4 (7)

f(tableβ)＝v3*M′ _tableβ +v4*M″ _tableβ +bias (8)

M′_tableβ is the mean square error between the sample obtained when a feature in the user behavior state subsequence is used as the label and the corresponding sample of the representative normal time window,

bias is the bias term,

step 6: calculating, through formulas (5)-(8), the difference values between each normal time window and the other normal and abnormal time windows and storing them as a set U2; calculating, through formulas (5)-(8), the difference values between each abnormal time window and the normal time windows and storing them as a set P2; and obtaining the confidence interval σ(β) of the user behavior state subsequence under the normal time window as σ(β) = [min(U2), max(U2)] ∩ [min(P2), max(P2)]

v5＝1-v6 (11)

f(tableγ)＝v5*M′ _tableγ +v6*M″ _tableγ +bias (12)

bias is the bias term,

step 8: calculating, through formulas (9)-(12), the difference values between each normal time window and the other normal time windows and storing them as a set U3; calculating, through formulas (9)-(12), the difference values between each abnormal time window and the normal time windows and storing them as a set P3; and obtaining the confidence interval σ(γ) of the field state subsequence under the normal time window as

σ(γ) = [min(U3), max(U3)] ∩ [min(P3), max(P3)].

2. The multi-source heterogeneous log analysis method of claim 1, wherein the log analysis method comprises the steps of:

step 1, when analyzing a detected time window, randomly selecting a characteristic of each subsequence to form three labels and forming three samples to represent the detected time window;

v1 = a1 * v1 (0 ≤ a1 < 1)

v3 = a2 * v3 (0 ≤ a2 < 1)

v5 = a3 * v5 (0 ≤ a3 < 1)