CN111984516A - Log anomaly detection system based on SGSE-ECC - Google Patents

Publication number
CN111984516A
Authority
CN
China
Prior art keywords
time window
log
normal
label
subsequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010911782.XA
Other languages
Chinese (zh)
Other versions
CN111984516B (en)
Inventor
汪祖民
田纪宇
秦静
季长清
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University
Original Assignee
Dalian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian University
Priority to CN202010911782.XA
Publication of CN111984516A
Application granted
Publication of CN111984516B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/34Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3452Performance evaluation by statistical analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/34Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3466Performance evaluation by tracing or monitoring
    • G06F11/3476Data logging

Abstract

The log anomaly detection system based on SGSE-ECC belongs to the field of log data processing and aims to solve the problem of log analysis. A time window division module determines the size of the time window according to the information system's response-time requirement; an SGSE data processing module forms the log data in each time window into sample data that the ECC log analysis algorithm can call; an ECC model training module trains the ECC log analysis model; and an ECC log analysis module analyzes, from the log data of each device in the information system, whether the state of the information system in the current time window is normal. The effect is that anomalies can be detected from the logs.

Description

Log anomaly detection system based on SGSE-ECC
Technical Field
The invention belongs to the field of log data processing, and relates to a log anomaly detection system based on SGSE-ECC.
Background
With the development of internet technology, the number of logs generated by each device in an information system keeps increasing, and analyzing logs that come from different devices and have different data characteristics is an important part of operation and maintenance work. Analyzing multi-source heterogeneous log data by automated means makes it possible to learn promptly whether the information system is running normally or abnormally, which helps keep the information system safe and stable and further reduces enterprise operation and maintenance costs.
Current techniques for multi-source heterogeneous log analysis include single analysis followed by aggregation[1] and correlation analysis[2]. In the single-analysis-then-aggregation method, the logs generated by each individual device in the information system are analyzed to determine the running state of that device, and preset rules are then applied to the device states to judge whether the information system as a whole is abnormal. Because the logs of different devices are never analyzed jointly, only after each device's state has been judged separately, this method cannot mine the relationships among the logs of different devices.
In the correlation analysis method, characteristic events are generated from the contents of the fields in the logs; the events generated by different devices within a time window are clustered and compared for similarity, and similar events are removed. Similar events from different devices are then merged, and finally statistical reports of the various events are generated. Because the goal of this method is to generate statistical reports, it cannot deeply mine the relationships among events and present them directly to the user, and a clustering algorithm alone cannot classify each event accurately.
Disclosure of Invention
In order to solve the problem of log analysis, the invention provides the following technical scheme: an SGSE-ECC-based log anomaly detection system, comprising:
a time window dividing module, used for determining the size of the time window according to the information system's response-time requirement;
an SGSE data processing module, used for forming the log data in each time window into sample data that the ECC log analysis algorithm can call;
an ECC model training module, used for training the ECC log analysis model;
and an ECC log analysis module, used for analyzing, from the log data of each device in the information system, whether the state of the information system in the current time window is normal.
Further, the SGSE data processing module comprises:
an SG state generation submodule, which counts the logs of each device to generate the log quantity state subsequence; counts the number of logs of each log category generated on each device within a time window to generate the user behavior state subsequence; and counts the number of occurrences of each type in certain important fields of each device within a time window to generate the field state subsequence;
an SE sequence extraction submodule, which, within the same time window, takes each feature of one subsequence in turn as a label and merges it with all features of the other two subsequences into a sample, so that every feature of every subsequence has all features of the other two subsequences associated with it; when a time window is analyzed, one feature is randomly selected from each subsequence, giving three labels that represent the time window under test.
Further, the ECC model training module comprises an ECC model training submodule, whose model training comprises the following steps:
Step 1: Count the logs of the multi-source heterogeneous log data in the normal and abnormal time windows to generate the log quantity state subsequence; count the number of logs of each log category generated on each device within the time window to generate the user behavior state subsequence; and count the number of occurrences of each type in certain important fields of each device within the time window to generate the field state subsequence;
Step 2: In each time window, generate (n + m + j) sample data sets from the n features of the log quantity state subsequence, the m features of the user behavior state subsequence and the j features of the field state subsequence;
Step 3: Take the sample data set labeled with a feature of the log quantity state subsequence in each normal and abnormal time window, and compute its difference values pairwise against the sample data sets of the other normal and abnormal time windows according to the ECC expressions:
M′tableα=MSE(table α, table α′) (1)
M″tableα=MSE(table α, table α″) (2)
v1=1-v2 (3)
f(tableα)=v1*M′tableα+v2*M″tableα+bias (4)
where MSE(·,·) denotes the mean square error between two samples,
table α is the sample labeled with a feature of the log quantity state subsequence,
table α′ is the sample labeled with a feature of the log quantity state subsequence in a normal time window,
table α″ is the sample labeled with a feature of the log quantity state subsequence in an abnormal time window,
M′tableα is the mean square error between table α and table α′,
M″tableα is the mean square error between table α and table α″,
bias is the bias term,
v1 and v2 are the variation coefficients of the mean square errors for the normal and abnormal time windows respectively, with v1 equal to v2 during training (so v1 = v2 = 0.5 by formula (3)),
f(table α) is the difference value computed for the trained time window with a feature of the log quantity state subsequence as the label;
Step 4: Compute the difference values between each normal time window and the other normal and abnormal time windows by formulas (1)-(4) and store them as the set U1; compute the difference values between each abnormal time window and the normal and abnormal time windows by formulas (1)-(4) and store them as the set P1; the confidence interval σ(α) of the log quantity state subsequence in a normal time window is then
σ(α)=[min(U1),max(U1)]∩[min(P1),max(P1)]
Step 5: Take the sample data set labeled with a feature of the user behavior state subsequence in each normal and abnormal time window, and compute its difference values pairwise against the sample data sets of the other normal and abnormal time windows according to the ECC expressions:
M′tableβ=MSE(table β, table β′) (5)
M″tableβ=MSE(table β, table β″) (6)
v3=1-v4 (7)
f(tableβ)=v3*M′tableβ+v4*M″tableβ+bias (8)
where MSE(·,·) denotes the mean square error between two samples,
table β is the sample labeled with a feature of the user behavior state subsequence,
table β′ is the sample labeled with a feature of the user behavior state subsequence in a normal time window,
table β″ is the sample labeled with a feature of the user behavior state subsequence in an abnormal time window,
M′tableβ is the mean square error between table β and table β′,
M″tableβ is the mean square error between table β and table β″,
bias is the bias term,
v3 and v4 are the variation coefficients of the mean square errors for the normal and abnormal time windows respectively, with v3 equal to v4 during training,
f(table β) is the difference value computed for the trained time window with a feature of the user behavior state subsequence as the label;
Step 6: Compute the difference values between each normal time window and the normal and abnormal time windows by formulas (5)-(8) and store them as the set U2; compute the difference values between each abnormal time window and the normal and abnormal time windows by formulas (5)-(8) and store them as the set P2; the confidence interval σ(β) of the user behavior state subsequence in a normal time window is then
σ(β)=[min(U2),max(U2)]∩[min(P2),max(P2)]
Step 7: Take the sample data set labeled with a feature of the field state subsequence in each normal and abnormal time window, and compute its difference values pairwise against the sample data sets of the other normal and abnormal time windows according to the ECC expressions:
M′tableγ=MSE(table γ, table γ′) (9)
M″tableγ=MSE(table γ, table γ″) (10)
v5=1-v6 (11)
f(tableγ)=v5*M′tableγ+v6*M″tableγ+bias (12)
where MSE(·,·) denotes the mean square error between two samples,
table γ is the sample labeled with a feature of the field state subsequence,
table γ′ is the sample labeled with a feature of the field state subsequence in a normal time window,
table γ″ is the sample labeled with a feature of the field state subsequence in an abnormal time window,
M′tableγ is the mean square error between table γ and table γ′,
M″tableγ is the mean square error between table γ and table γ″,
bias is the bias term,
v5 and v6 are the variation coefficients of the mean square errors for the normal and abnormal time windows respectively, with v5 equal to v6 during training,
f(table γ) is the difference value computed for the trained time window with a feature of the field state subsequence as the label;
Step 8: Compute the difference values between each normal time window and the normal and abnormal time windows by formulas (9)-(12) and store them as the set U3; compute the difference values between each abnormal time window and the normal and abnormal time windows by formulas (9)-(12) and store them as the set P3; the confidence interval σ(γ) of the field state subsequence in a normal time window is then
σ(γ)=[min(U3),max(U3)]∩[min(P3),max(P3)].
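A minimal Python sketch of the difference value in formulas (1)-(4), and likewise (5)-(8) and (9)-(12), may help fix the notation. It assumes each sample is a plain list of numeric features and that the mean square error is taken feature by feature over two samples of equal length; the function names `mse` and `difference_value` are hypothetical and do not come from the patent.

```python
from statistics import fmean


def mse(a, b):
    """Mean square error between two equal-length feature vectors (assumed form)."""
    if len(a) != len(b):
        raise ValueError("samples must have the same number of features")
    return fmean((x - y) ** 2 for x, y in zip(a, b))


def difference_value(sample, normal_ref, abnormal_ref, v_normal=0.5, bias=0.0):
    """f = v1*M' + v2*M'' + bias, with v2 = 1 - v1 as in formula (3).

    v_normal defaults to 0.5 because v1 = v2 during training.
    """
    v_abnormal = 1.0 - v_normal
    m_normal = mse(sample, normal_ref)      # M' : error against a normal window
    m_abnormal = mse(sample, abnormal_ref)  # M'': error against an abnormal window
    return v_normal * m_normal + v_abnormal * m_abnormal + bias
```

With `v_normal=0.5` a sample identical to the normal reference contributes only half of its error against the abnormal reference, which is the training-time behavior the text describes.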
Further, the ECC log analysis module performs the following analysis steps:
Step 1: When a time window under test is analyzed, randomly select one feature from each subsequence to form three labels, and form the three corresponding samples to represent the time window under test;
Step 2: Set the initial values v1 = v2, v3 = v4 and v5 = v6, and determine by formulas (1)-(12) whether the difference values f(table α), f(table β) and f(table γ) of the three samples in the time window under test fall within the corresponding confidence intervals σ(α), σ(β) and σ(γ);
Step 3: If all three samples are within their confidence intervals, constrain v1, v2, v3, v4, v5 and v6 as follows:
v1=a1*v1 (0≤a1<1)
v3=a2*v3 (0≤a2<1)
v5=a3*v5 (0≤a3<1)
v1, v3 and v5 are the variation coefficients of the mean square error for the normal time window. Reducing v1, v3 and v5 reduces the influence of the normal time window on the difference value and increases the influence of the abnormal time window; v2, v4 and v6 follow from formulas (3), (7) and (11). With the new v1, v2, v3, v4, v5 and v6, formulas (1)-(12) are applied again to determine whether the difference values of the three samples in the time window under test fall within the corresponding confidence intervals σ(α), σ(β) and σ(γ). The number of times the constraint is repeated is determined by the required missed-report rate;
Step 4: If the difference values of the three samples in the time window under test are still within their confidence intervals after the repeated constraint is finished, the information system is considered normal in that time window; otherwise it is considered abnormal.
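The analysis steps above can be sketched as a short loop. This is only an illustration under stated assumptions: each of the three samples is represented by a hypothetical callable `diff_fn(v)` that returns its difference value for a given normal-window coefficient v, the coefficients start at 0.5, and `rounds` stands in for the repetition count that the text ties to the missed-report rate. None of these names appear in the patent.

```python
def within(value, interval):
    """True if value lies in the closed interval (lo, hi)."""
    lo, hi = interval
    return lo <= value <= hi


def window_is_normal(samples, rounds=3):
    """samples: list of (diff_fn, interval, a) triples, one per subsequence.

    diff_fn(v) -> difference value for normal-window coefficient v,
    interval   -> trained confidence interval (lo, hi),
    a          -> constraint factor with 0 <= a < 1.
    """
    weights = [0.5] * len(samples)      # initial v1 = v2, v3 = v4, v5 = v6
    for _ in range(rounds + 1):         # initial check plus `rounds` constrained re-checks
        for (diff_fn, interval, _a), v in zip(samples, weights):
            if not within(diff_fn(v), interval):
                return False            # any sample outside its interval: abnormal
        # constraint step: v <- a * v shrinks the normal-window influence
        weights = [a * v for (_f, _i, a), v in zip(samples, weights)]
    return True
```

A larger `rounds` makes the check stricter, because each round shifts more weight onto the error against abnormal windows before the interval test is repeated.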
Advantageous effects: The invention provides an SGSE algorithm that processes multi-source heterogeneous logs into multi-dimensional samples that represent the state of the information system and can be analyzed algorithmically. The invention also provides an ECC algorithm for multi-source heterogeneous logs, which analyzes the multi-dimensional samples and determines the running state of the information system in a time window from the multi-source heterogeneous logs in that time window.
Drawings
FIG. 1: Flow chart of multi-source heterogeneous log analysis.
FIG. 2: Diagram of state sample generation for the time window under test.
Detailed Description
The invention provides a method for processing multi-source heterogeneous log data in an information system using an SGSE (State Generation Sequential Extraction) algorithm, and provides a new ECC (Error Coefficient Constraint) algorithm, designed around the characteristics of the information system's multi-source heterogeneous logs, for judging the running state of the information system. Referring to FIG. 1, the multi-source heterogeneous log analysis method of the invention includes the following steps:
Step 1: Determine the size of the time window according to the response time required of the information system.
Step 2: Process the logs in each time window with the SGSE algorithm, turning the log data in each time window into samples.
Step 3: Train the ECC log analysis model and use it to analyze the time windows that need to be analyzed.
Step 4: Present the log analysis result.
In a preferred embodiment, the steps of the ECC log analysis model training are as follows:
Step 1: Count the logs of the multi-source heterogeneous log data in the normal and abnormal time windows to generate the log quantity state subsequence; count the number of logs of each log category generated on each device within the time window to generate the user behavior state subsequence; and count the number of occurrences of each type in certain important fields of each device within the time window to generate the field state subsequence;
Step 2: In each time window, generate (n + m + j) sample data sets from the n features of the log quantity state subsequence, the m features of the user behavior state subsequence and the j features of the field state subsequence;
Step 3: Take the sample data set labeled with a feature of the log quantity state subsequence in each normal and abnormal time window, and compute its difference values pairwise against the sample data sets of the other normal and abnormal time windows according to the ECC expressions:
M′tableα=MSE(table α, table α′) (1)
M″tableα=MSE(table α, table α″) (2)
v1=1-v2 (3)
f(tableα)=v1*M′tableα+v2*M″tableα+bias (4)
where MSE(·,·) denotes the mean square error between two samples,
table α is the sample labeled with a feature of the log quantity state subsequence,
table α′ is the sample labeled with a feature of the log quantity state subsequence in a normal time window,
table α″ is the sample labeled with a feature of the log quantity state subsequence in an abnormal time window,
M′tableα is the mean square error between table α and table α′,
M″tableα is the mean square error between table α and table α″,
bias is the bias term,
v1 and v2 are the variation coefficients of the mean square errors for the normal and abnormal time windows respectively, with v1 equal to v2 during training (so v1 = v2 = 0.5 by formula (3)),
f(table α) is the difference value computed for the trained time window with a feature of the log quantity state subsequence as the label;
Step 4: Compute the difference values between each normal time window and the other normal and abnormal time windows by formulas (1)-(4) and store them as the set U1; compute the difference values between each abnormal time window and the normal and abnormal time windows by formulas (1)-(4) and store them as the set P1; the confidence interval σ(α) of the log quantity state subsequence in a normal time window is then
σ(α)=[min(U1),max(U1)]∩[min(P1),max(P1)]
Step 5: Take the sample data set labeled with a feature of the user behavior state subsequence in each normal and abnormal time window, and compute its difference values pairwise against the sample data sets of the other normal and abnormal time windows according to the ECC expressions:
M′tableβ=MSE(table β, table β′) (5)
M″tableβ=MSE(table β, table β″) (6)
v3=1-v4 (7)
f(tableβ)=v3*M′tableβ+v4*M″tableβ+bias (8)
where MSE(·,·) denotes the mean square error between two samples,
table β is the sample labeled with a feature of the user behavior state subsequence,
table β′ is the sample labeled with a feature of the user behavior state subsequence in a normal time window,
table β″ is the sample labeled with a feature of the user behavior state subsequence in an abnormal time window,
M′tableβ is the mean square error between table β and table β′,
M″tableβ is the mean square error between table β and table β″,
bias is the bias term,
v3 and v4 are the variation coefficients of the mean square errors for the normal and abnormal time windows respectively, with v3 equal to v4 during training,
f(table β) is the difference value computed for the trained time window with a feature of the user behavior state subsequence as the label;
Step 6: Compute the difference values between each normal time window and the normal and abnormal time windows by formulas (5)-(8) and store them as the set U2; compute the difference values between each abnormal time window and the normal and abnormal time windows by formulas (5)-(8) and store them as the set P2; the confidence interval σ(β) of the user behavior state subsequence in a normal time window is then
σ(β)=[min(U2),max(U2)]∩[min(P2),max(P2)]
Step 7: Take the sample data set labeled with a feature of the field state subsequence in each normal and abnormal time window, and compute its difference values pairwise against the sample data sets of the other normal and abnormal time windows according to the ECC expressions:
M′tableγ=MSE(table γ, table γ′) (9)
M″tableγ=MSE(table γ, table γ″) (10)
v5=1-v6 (11)
f(tableγ)=v5*M′tableγ+v6*M″tableγ+bias (12)
where MSE(·,·) denotes the mean square error between two samples,
table γ is the sample labeled with a feature of the field state subsequence,
table γ′ is the sample labeled with a feature of the field state subsequence in a normal time window,
table γ″ is the sample labeled with a feature of the field state subsequence in an abnormal time window,
M′tableγ is the mean square error between table γ and table γ′,
M″tableγ is the mean square error between table γ and table γ″,
bias is the bias term,
v5 and v6 are the variation coefficients of the mean square errors for the normal and abnormal time windows respectively, with v5 equal to v6 during training,
f(table γ) is the difference value computed for the trained time window with a feature of the field state subsequence as the label;
Step 8: Compute the difference values between each normal time window and the normal and abnormal time windows by formulas (9)-(12) and store them as the set U3; compute the difference values between each abnormal time window and the normal and abnormal time windows by formulas (9)-(12) and store them as the set P3; the confidence interval σ(γ) of the field state subsequence in a normal time window is then
σ(γ)=[min(U3),max(U3)]∩[min(P3),max(P3)].
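The confidence interval of steps 4, 6 and 8 is the intersection of two closed ranges. A small Python sketch, assuming U and P are non-empty lists of difference values; the function name and the choice to return `None` for an empty intersection are assumptions, since the text does not say how an empty intersection is handled.

```python
def confidence_interval(u_values, p_values):
    """[min(U), max(U)] ∩ [min(P), max(P)] as a (lo, hi) tuple, or None if empty."""
    lo = max(min(u_values), min(p_values))  # larger of the two lower bounds
    hi = min(max(u_values), max(p_values))  # smaller of the two upper bounds
    return (lo, hi) if lo <= hi else None


# Hypothetical difference values, for illustration only
sigma = confidence_interval([1.0, 5.0, 2.5], [3.0, 8.0])
```

Here `sigma` is `(3.0, 5.0)`, the overlap between the range of U and the range of P.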
In a preferred embodiment, the log analysis method comprises the following steps:
Step 1: When a time window under test is analyzed, randomly select one feature from each subsequence to form three labels, and form the three corresponding samples to represent the time window under test;
Step 2: Set the initial values v1 = v2, v3 = v4 and v5 = v6, and determine by formulas (1)-(12) whether the difference values f(table α), f(table β) and f(table γ) of the three samples in the time window under test fall within the corresponding confidence intervals σ(α), σ(β) and σ(γ);
Step 3: If all three samples are within their confidence intervals, constrain v1, v2, v3, v4, v5 and v6 as follows:
v1=a1*v1 (0≤a1<1)
v3=a2*v3 (0≤a2<1)
v5=a3*v5 (0≤a3<1)
v1, v3 and v5 are the variation coefficients of the mean square error for the normal time window. Reducing v1, v3 and v5 reduces the influence of the normal time window on the difference value and increases the influence of the abnormal time window; v2, v4 and v6 follow from formulas (3), (7) and (11). With the new v1, v2, v3, v4, v5 and v6, formulas (1)-(12) are applied again to determine whether the difference values of the three samples in the time window under test fall within the corresponding confidence intervals σ(α), σ(β) and σ(γ). The number of times the constraint is repeated is determined by the required missed-report rate;
Step 4: If the difference values of the three samples in the time window under test are still within their confidence intervals after the repeated constraint is finished, the information system is considered normal in that time window; otherwise it is considered abnormal.
The invention also provides a log anomaly detection system, which comprises:
The time window dividing module is used for determining the size of the time window according to the information system's response-time requirement.
The SGSE data processing module is used for processing the log data in each time window into sample data that the ECC log analysis model can call.
The ECC model training module is used for training the ECC log analysis model.
The ECC log analysis module is used for judging, with the ECC log analysis model, whether the time window under test is normal; that is, it analyzes from the logs of all devices in the information system whether the state of the information system in that time window is normal.
In one scheme, the time window dividing module makes the time window as short as possible, subject to the response time the user requires of the information system.
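As an illustration of time window division, the sketch below groups log records into fixed-length windows. It assumes each record is a hypothetical `(timestamp_seconds, device, fields)` tuple and that windows are aligned to multiples of `window_seconds`; the patent does not specify a record layout or an alignment rule.

```python
from collections import defaultdict


def split_into_windows(records, window_seconds):
    """Group (timestamp_seconds, device, fields) records by time window index."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    windows = defaultdict(list)
    for ts, device, fields in records:
        # integer division maps every timestamp to the window that contains it
        windows[int(ts // window_seconds)].append((device, fields))
    return dict(windows)


# Hypothetical records, for illustration only
records = [(0, "waf", {}), (59, "firewall", {}), (60, "waf", {})]
by_window = split_into_windows(records, 60)
```

With a 60-second window the first two records fall into window 0 and the third into window 1.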
In one scheme, the SGSE data processing module is divided into an SG state generation submodule and an SE sequence extraction submodule.
The SG state generation submodule identifies the various devices in the information system, such as the WAF, load balancer and firewall. It counts the logs of each device to generate the log quantity state subsequence, whose features include the WAF log count, the load balancer log count, and so on.
Table 1: Log quantity state subsequence
Time window | WAF log count | Load balancer log count | Firewall log count | …… | Nginx log count
Time window N | α1 | α2 | α3 | …… | αn
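The row of Table 1 can be produced by a simple per-device count. A minimal sketch, assuming the logs of one time window are given as hypothetical `(device, fields)` pairs and that the device order is fixed in advance so that α1…αn always line up:

```python
from collections import Counter


def log_count_subsequence(window_logs, devices):
    """Return [α1, ..., αn]: the number of logs per device, in the given device order."""
    counts = Counter(device for device, _fields in window_logs)
    return [counts[d] for d in devices]  # Counter yields 0 for devices with no logs


# Hypothetical window contents, for illustration only
logs = [("waf", {}), ("waf", {}), ("firewall", {})]
alpha = log_count_subsequence(logs, ["waf", "load_balancer", "firewall"])
```

Here `alpha` is `[2, 0, 1]`: two WAF logs, no load balancer logs and one firewall log.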
Count the number of logs of each log category generated on each device within the time window to obtain the user behavior state subsequence. The categories are obtained by combining the types of every field whose entropy is not 0. For example, the WAF device has an action field that records how the WAF handled an access, with the types alert and block, and an http_method field that records the HTTP status, with the four types 200, 404, 500 and 501; from these two fields the WAF device can generate 2 × 4 = 8 different categories of logs. The subsequence contains features such as the count of each WAF log category and the count of each firewall log category.
Table 2: user behavior state subsequence
Figure BDA0002663560660000101
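The category counting behind the user behavior state subsequence can be sketched as follows. The sketch assumes the set of possible values per field is known in advance and passed in as `field_values`; the function name and the record layout are hypothetical, and the field names follow the WAF example in the text.

```python
from collections import Counter
from itertools import product


def behavior_subsequence(window_logs, device, field_values):
    """Count logs of one device per category, where a category is one
    combination of values of the chosen fields (e.g. 2 x 4 = 8 for the WAF)."""
    fields = list(field_values)
    categories = list(product(*(field_values[f] for f in fields)))
    counts = Counter(
        tuple(rec[f] for f in fields)
        for dev, rec in window_logs
        if dev == device
    )
    return categories, [counts[c] for c in categories]


# Hypothetical data, for illustration only
waf_fields = {"action": ["alert", "block"], "http_method": [200, 404, 500, 501]}
logs = [
    ("waf", {"action": "alert", "http_method": 200}),
    ("waf", {"action": "block", "http_method": 404}),
    ("waf", {"action": "alert", "http_method": 200}),
    ("firewall", {"action": "alert", "http_method": 200}),
]
categories, beta = behavior_subsequence(logs, "waf", waf_fields)
```

Enumerating the categories with `itertools.product` keeps the feature order stable across time windows even when a category has no logs in a given window.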
Count the number of occurrences of each type in certain important fields of each device within the time window to generate the field state subsequence. Its features include the number of alert values in the WAF action field, the share of the TCP protocol in the firewall protocol field, and so on.
Table 3: field state subsequence
Figure BDA0002663560660000102
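Two of the field state features named in the text, the WAF alert count and the firewall TCP share, can be computed as below. The record layout, the device names and the field name `protocol` are assumptions made for illustration, and returning 0.0 when there are no firewall logs is likewise a choice the text does not specify.

```python
def field_state_features(window_logs):
    """Return [WAF alert count, firewall TCP share] for one time window."""
    waf_alerts = sum(
        1 for dev, rec in window_logs
        if dev == "waf" and rec.get("action") == "alert"
    )
    firewall = [rec for dev, rec in window_logs if dev == "firewall"]
    tcp = sum(1 for rec in firewall if rec.get("protocol") == "TCP")
    tcp_share = tcp / len(firewall) if firewall else 0.0  # avoid dividing by zero
    return [waf_alerts, tcp_share]


# Hypothetical data, for illustration only
logs = [
    ("waf", {"action": "alert"}),
    ("waf", {"action": "block"}),
    ("firewall", {"protocol": "TCP"}),
    ("firewall", {"protocol": "UDP"}),
]
gamma = field_state_features(logs)
```

For this window `gamma` is `[1, 0.5]`.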
Within the same time window, the SE sequence extraction submodule takes each feature of one subsequence in turn as a label and combines it with all features of the other two subsequences to form a sample, as shown in the following tables:
Table 4: Merged features with the WAF log count of the log quantity state subsequence as the label
[Table 4 appears only as an image in the original publication.]
Table 5: Merged features with WAF log category 1 of the user behavior state subsequence as the label
[Table 5 appears only as an image in the original publication.]
Table 6: Merged features with the alert count of the WAF action field in the field state subsequence as the label
[Table 6 appears only as an image in the original publication.]
Through the SE sequence extraction submodule, every feature of any subsequence is paired with all features of the other two subsequences, which establishes associations among the features of the generated samples. When a time window is analyzed, one feature is randomly selected from each subsequence, giving three labels that represent the time window under test.
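The extraction rule can be sketched as below: each feature becomes a label once and is paired with every feature of the other two subsequences, so a window with n, m and j features yields n + m + j samples. The dictionary layout of a sample and the function name are assumptions made for illustration.

```python
def se_extract(alpha, beta, gamma):
    """Build one sample per feature: that feature as label, all features of the
    other two subsequences as the sample's feature vector."""
    subsequences = [list(alpha), list(beta), list(gamma)]
    samples = []
    for k, sub in enumerate(subsequences):
        # Concatenate the two subsequences that do not contain the label.
        others = [x for i, s in enumerate(subsequences) if i != k for x in s]
        for idx, label in enumerate(sub):
            samples.append(
                {"source": k, "index": idx, "label": label, "features": list(others)}
            )
    return samples

alpha = [2, 0, 1]    # n = 3 log quantity features
beta = [5, 7]        # m = 2 user behavior features
gamma = [2, 0.75]    # j = 2 field state features
samples = se_extract(alpha, beta, gamma)
print(len(samples))  # n + m + j
```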
In one scheme, the ECC model training module is divided into a submodule that determines a small number of normal and abnormal time windows and an ECC model training submodule.
The submodule for determining a small number of normal and abnormal time windows exists because a clustering algorithm alone cannot classify time windows accurately, and classification errors are amplified when the analysis model is trained, which distorts the analysis result. Instead, a small number of time windows in the historical data of the existing information system are labeled as normal or abnormal on the basis of expert knowledge.
The ECC model training submodule trains the model through the following steps:
Step 1: The SG state generation submodule processes the multi-source heterogeneous log data in the normal and abnormal time windows. It counts the logs of each device to generate the log quantity state subsequence, counts the logs of each category generated on each device in the time window to generate the user behavior state subsequence, and counts the occurrences of each type in certain important fields of each device in the time window to generate the field state subsequence.
Step 2: In each time window, the SE sequence extraction submodule generates (n + m + j) sample data sets from the n features in the log quantity state subsequence, the m features in the user behavior state subsequence and the j features in the field state subsequence.
Step 3: Taking the sample data set whose label is a given feature of the log quantity state subsequence in each normal and abnormal time window, calculate its difference value pairwise against the sample data sets of the other normal and abnormal time windows according to the ECC expressions, which are as follows:
[Formulas (1) and (2), which define the mean square errors M′tableα and M″tableα, appear only as images in the original publication.]
v1=1-v2 (3)
f(tableα)=v1*M′tableα+v2*M″tableα+bias (4)
where:
table α denotes the sample whose label is a given feature of the log quantity state subsequence,
table α′ denotes the corresponding sample in a normal time window,
table α″ denotes the corresponding sample in an abnormal time window,
M′tableα is the mean square error between table α and table α′,
M″tableα is the mean square error between table α and table α″,
bias is the bias term,
v1 and v2 are the variation coefficients of the mean square errors for the normal and abnormal time windows respectively; v1 equals v2 during training, so by formula (3) both are 0.5,
f(table α) is the difference value calculated for the time window with a given feature of the log quantity state subsequence as the label;
Step 4: Using formulas (1)-(4), calculate the difference values between each normal time window and the other normal and abnormal time windows and store them as the set U1; calculate the difference values between each abnormal time window and the normal and abnormal time windows and store them as the set P1. The confidence interval σ(α) of the log quantity state subsequence under a normal time window is then
σ(α)=[min(U1),max(U1)]∩[min(P1),max(P1)]
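The following sketch shows one way formulas (3) and (4) and the interval σ(α) could be evaluated. Formulas (1) and (2) are not legible in this text, so the sketch assumes that M′ and M″ are element-wise mean square errors averaged over the reference windows; that averaging, the flat-vector sample layout and all function names are assumptions, not the patent's definitions.

```python
def mse(a, b):
    """Element-wise mean square error between two equal-length samples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def difference_value(sample, normal_refs, abnormal_refs, v1=0.5, bias=0.0):
    """f = v1 * M' + v2 * M'' + bias, with v2 = 1 - v1 (formulas (3) and (4))."""
    v2 = 1.0 - v1
    # Assumed form of (1) and (2): average MSE against each group of windows.
    m_normal = sum(mse(sample, r) for r in normal_refs) / len(normal_refs)
    m_abnormal = sum(mse(sample, r) for r in abnormal_refs) / len(abnormal_refs)
    return v1 * m_normal + v2 * m_abnormal + bias

def confidence_interval(U, P):
    """[min(U), max(U)] intersected with [min(P), max(P)]; None if they are disjoint."""
    lo = max(min(U), min(P))
    hi = min(max(U), max(P))
    return (lo, hi) if lo <= hi else None

# Hypothetical samples (label followed by features) for labeled windows.
normal = [[2, 5, 7], [3, 5, 6], [2, 4, 7]]
abnormal = [[9, 1, 20], [10, 0, 22]]
f0 = difference_value(normal[0], normal[1:], abnormal)
print(f0)
print(confidence_interval([1, 5], [3, 8]))
```

During training the same `difference_value` call would be repeated for every labeled window against the remaining windows to fill U1 and P1 before `confidence_interval` is applied.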
Step 5: Taking the sample data set whose label is a given feature of the user behavior state subsequence in each normal and abnormal time window, calculate its difference value pairwise against the sample data sets of the other normal and abnormal time windows according to the ECC expressions, which are as follows:
[Formulas (5) and (6), which define the mean square errors M′tableβ and M″tableβ, appear only as images in the original publication.]
v3=1-v4 (7)
f(tableβ)=v3*M′tableβ+v4*M″tableβ+bias (8)
where:
table β denotes the sample whose label is a given feature of the user behavior state subsequence,
table β′ denotes the corresponding sample in a normal time window,
table β″ denotes the corresponding sample in an abnormal time window,
M′tableβ is the mean square error between table β and table β′,
M″tableβ is the mean square error between table β and table β″,
bias is the bias term,
v3 and v4 are the variation coefficients of the mean square errors for the normal and abnormal time windows respectively; v3 equals v4 during training, so by formula (7) both are 0.5,
f(table β) is the difference value calculated for the time window with a given feature of the user behavior state subsequence as the label;
Step 6: Using formulas (5)-(8), calculate the difference values between each normal time window and the other normal and abnormal time windows and store them as the set U2; calculate the difference values between each abnormal time window and the normal and abnormal time windows and store them as the set P2. The confidence interval σ(β) of the user behavior state subsequence under a normal time window is then
σ(β)=[min(U2),max(U2)]∩[min(P2),max(P2)]
Step 7: Taking the sample data set whose label is a given feature of the field state subsequence in each normal and abnormal time window, calculate its difference value pairwise against the sample data sets of the other normal and abnormal time windows according to the ECC expressions, which are as follows:
[Formulas (9) and (10), which define the mean square errors M′tableγ and M″tableγ, appear only as images in the original publication.]
v5=1-v6 (11)
f(tableγ)=v5*M′tableγ+v6*M″tableγ+bias (12)
where:
table γ denotes the sample whose label is a given feature of the field state subsequence,
table γ′ denotes the corresponding sample in a normal time window,
table γ″ denotes the corresponding sample in an abnormal time window,
M′tableγ is the mean square error between table γ and table γ′,
M″tableγ is the mean square error between table γ and table γ″,
bias is the bias term,
v5 and v6 are the variation coefficients of the mean square errors for the normal and abnormal time windows respectively; v5 equals v6 during training, so by formula (11) both are 0.5,
f(table γ) is the difference value calculated for the time window with a given feature of the field state subsequence as the label;
Step 8: Using formulas (9)-(12), calculate the difference values between each normal time window and the other normal and abnormal time windows and store them as the set U3; calculate the difference values between each abnormal time window and the normal and abnormal time windows and store them as the set P3. The confidence interval σ(γ) of the field state subsequence under a normal time window is then
σ(γ)=[min(U3),max(U3)]∩[min(P3),max(P3)]。
In one scheme, the ECC log analysis module performs the following steps:
Step 1: When a time window under test is analyzed, one feature of each subsequence is randomly selected as a label, giving three labels and three samples that represent the time window under test, as shown in Figure 2.
Step 2: With the initial values v1 = v2, v3 = v4 and v5 = v6, the difference values f(table α), f(table β) and f(table γ) of the three samples in the time window under test are calculated with formulas (1)-(12) of the ECC model training submodule, and each is checked against its confidence interval σ(α), σ(β) or σ(γ).
Step 3: If all three difference values lie within their confidence intervals, detection accuracy is improved by constraining v1, v2, v3, v4, v5 and v6 according to
v1 = a1*v1 (0 ≤ a1 < 1)
v3 = a2*v3 (0 ≤ a2 < 1)
v5 = a3*v5 (0 ≤ a3 < 1)
v1, v3 and v5 are the variation coefficients of the mean square error for the normal time window. Reducing them lowers the influence of the normal time windows on the difference value and, through formulas (3), (7) and (11), raises the influence of the abnormal time windows. The new v1, v2, v3, v4, v5 and v6 and the three samples are fed back into formulas (1)-(12) of the ECC model training submodule to check whether the difference values still lie within the confidence intervals. The number of repeated constraints is determined by the preset missed-detection rate requirement: a lenient requirement needs few repetitions, and a strict requirement needs many.
Step 4: If the difference values of the time window are still within the confidence intervals after the repeated constraints are finished, the information system is considered normal in the time window under test; otherwise it is considered abnormal.
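Steps 2 to 4 can be sketched as a loop that shrinks the normal-window coefficient and re-checks the three difference values. The sketch assumes the mean square errors M′ and M″ of each sample have already been computed, and that a window failing any check (including the initial one) is reported as abnormal; the function name, the tuple layout and the default values of `a` and `rounds` are hypothetical.

```python
def window_is_normal(samples, a=(0.8, 0.8, 0.8), rounds=3, bias=0.0):
    """samples: three tuples (m_normal, m_abnormal, (lo, hi)), one per subsequence.

    Returns True only if every difference value stays inside its confidence
    interval for the initial check and for each of `rounds` constrained checks.
    """
    v = [0.5, 0.5, 0.5]  # initial v1, v3, v5 (equal to v2, v4, v6)
    for _ in range(rounds + 1):
        for k, (m_normal, m_abnormal, (lo, hi)) in enumerate(samples):
            # v2, v4, v6 follow from formulas (3), (7) and (11).
            f = v[k] * m_normal + (1.0 - v[k]) * m_abnormal + bias
            if not lo <= f <= hi:
                return False
        # Constraint step: v1 = a1*v1, v3 = a2*v3, v5 = a3*v5.
        v = [ak * vk for ak, vk in zip(a, v)]
    return True

sample = (1.0, 3.0, (0.0, 2.5))
print(window_is_normal([sample] * 3, rounds=3))
print(window_is_normal([sample] * 3, rounds=4))
```

With these illustrative numbers the window passes three constraint rounds but fails the fourth, which shows how a larger number of repetitions makes the check stricter.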
Single-device analysis followed by aggregation does not analyze the logs of different devices jointly; it judges the state of each device independently and only then combines the results, so the relationships among the logs of different devices cannot be mined. The SGSE algorithm provided by the invention processes the logs of different devices together and aggregates them into samples that reflect the state of the information system in the time window, so the overall state of the information system is judged comprehensively rather than by aggregating per-device results.
Correlation analysis methods aim at generating statistical reports of various events; they cannot mine the relationships between events in depth or present them directly to the user. The SGSE-ECC algorithm analyzes the multi-source heterogeneous logs in depth: it not only generates statistical reports of various events through clustering but also mines the relationships among the logs and presents the state of the information system to the user visually.
In general, the SGSE-ECC algorithm model performs data processing, sample generation and state analysis on the multi-source heterogeneous logs, analyzes the logs automatically, and reduces operation and maintenance costs.
The SGSE algorithm provided by the invention processes the log data on each device and aggregates it into samples that reflect the state of the time window; the ECC algorithm then analyzes these multidimensional samples and tunes the detection accuracy and the missed-detection rate by constraining the variation coefficients.

Claims (4)

1. An SGSE-ECC-based log anomaly detection system, comprising:
the time window dividing module is used for determining the size of a time window according to the requirement of the information system on response time;
the SGSE data processing module is used for forming the log data into sample data for being called by an ECC log analysis algorithm according to a time window;
the ECC model training module is used for training an ECC log analysis model;
and the ECC log analysis module is used for analyzing whether the state of the information system under the current time window is normal or not according to the log data of each device in the information system.
2. The SGSE-ECC-based log anomaly detection system according to claim 1, wherein the SGSE data processing module comprises:
an SG state generation submodule, which counts the number of logs of each device to generate a log quantity state subsequence; counts the number of logs of each category generated on each device in a time window to generate a user behavior state subsequence; and counts the number of occurrences of each type in certain important fields of each device in a time window to generate a field state subsequence;
and an SE sequence extraction submodule, which, within the same time window, takes each feature of one subsequence in turn as a label and combines it with all features of the other two subsequences into a sample, so that any feature of any subsequence corresponds to all features of the other two subsequences, and which, when a time window is analyzed, randomly selects one feature of each subsequence to form three labels representing the time window under test.
3. The SGSE-ECC-based log anomaly detection system according to claim 1, wherein the ECC model training module comprises an ECC model training submodule, and the model training of the ECC model training submodule comprises the following steps:
step 1: counting the number of logs of multi-source heterogeneous log data in a normal time window and an abnormal time window to generate a log number state subsequence, counting the number of types of each log generated on each device in the time window to generate a user behavior state subsequence, and counting the number of times of types appearing in some important fields of each device in the time window to generate a field state subsequence;
step 2: generating (n + m + j) sample data sets from n features in the log quantity state subsequence, m features in the user behavior state subsequence and j features in the field state subsequence in each time window;
step 3: taking a certain feature in the log quantity state subsequence in each normal and abnormal time window as the label of a sample data set, and calculating difference values pairwise against the sample data sets in the other normal and abnormal time windows according to the ECC expressions, wherein the calculation expressions are as follows:
[Formulas (1) and (2), which define the mean square errors M′tableα and M″tableα, appear only as images in the original publication.]
v1=1-v2 (3)
f(tableα)=v1*M′tableα+v2*M″tableα+bias (4)
table alpha represents a sample corresponding to a certain feature in the log quantity state subsequence as a label,
table α' represents a sample corresponding to a feature in the log quantity state subsequence in the normal time window as a tag,
table α "represents a sample corresponding to a certain feature in the log quantity state subsequence in the abnormal time window as a tag,
Mtableα' is the mean square error between the sample corresponding to a feature in the log quantity state subsequence as a label and the sample corresponding to a feature in the log quantity state subsequence as a label in the normal time window,
Mtableα"is the mean square error between the sample corresponding to a certain feature in the log quantity state subsequence as a label and the sample corresponding to a certain feature in the log quantity state subsequence as a label in the abnormal time window,
bias is the bias term,
v1 and v2 are respectively the variation coefficients of the mean square errors of the normal time window and the abnormal time window, and v1 is equal to v2 during training;
f(table α) is the difference value calculated with a certain feature in the log quantity state subsequence in the trained time window as the label;
step 4: calculating, through formulas (1)-(4), the difference values between each normal time window and the other normal and abnormal time windows and storing them as a set U1; calculating, through formulas (1)-(4), the difference values between each abnormal time window and the normal and abnormal time windows and storing them as a set P1; and obtaining the confidence interval σ(α) of the log quantity state subsequence under a normal time window as
σ(α)=[min(U1),max(U1)]∩[min(P1),max(P1)]
step 5: taking a certain feature in the user behavior state subsequence in each normal and abnormal time window as the label of a sample data set, and calculating difference values pairwise against the sample data sets in the other normal and abnormal time windows according to the ECC expressions, wherein the calculation expressions are as follows:
[Formulas (5) and (6), which define the mean square errors M′tableβ and M″tableβ, appear only as images in the original publication.]
v3=1-v4 (7)
f(tableβ)=v3*M′tableβ+v4*M″tableβ+bias (8)
table beta represents a sample corresponding to a certain characteristic in the user behavior state subsequence as a label,
table beta' represents a sample corresponding to a certain characteristic in the user behavior state subsequence in the normal time window as a label,
table beta "represents a sample corresponding to a certain feature in the user behavior state subsequence in the abnormal time window as a label,
Mtableβ' is the mean square error between the sample corresponding to a certain feature in the user behavior state subsequence as a label and the sample corresponding to a certain feature in the user behavior state subsequence as a label in the normal time window,
Mtableβ"is the mean square error between the sample corresponding to a certain feature in the user behavior state subsequence as a label and the sample corresponding to a certain feature in the user behavior state subsequence as a label in the abnormal time window,
bias is the bias term,
v3 and v4 are respectively the variation coefficients of the mean square errors of the normal time window and the abnormal time window, and v3 is equal to v4 during training,
f(table β) is the difference value calculated with a certain feature in the user behavior state subsequence in the trained time window as the label;
step 6: calculating, through formulas (5)-(8), the difference values between each normal time window and the other normal and abnormal time windows and storing them as a set U2; calculating, through formulas (5)-(8), the difference values between each abnormal time window and the normal and abnormal time windows and storing them as a set P2; and obtaining the confidence interval σ(β) of the user behavior state subsequence under a normal time window as
σ(β)=[min(U2),max(U2)]∩[min(P2),max(P2)]
step 7: taking a certain feature in the field state subsequence in each normal and abnormal time window as the label of a sample data set, and calculating difference values pairwise against the sample data sets in the other normal and abnormal time windows according to the ECC expressions, wherein the calculation expressions are as follows:
[Formulas (9) and (10), which define the mean square errors M′tableγ and M″tableγ, appear only as images in the original publication.]
v5=1-v6 (11)
f(tableγ)=v5*M′tableγ+v6*M″tableγ+bias (12)
table gamma represents a sample corresponding to a certain feature in the field state subsequence as a label,
table γ' represents a sample corresponding to a certain feature in the field state subsequence in the normal time window as a tag,
table γ "represents a sample corresponding to a certain feature in the field state subsequence in the abnormal time window as a tag,
M′tableγ is the mean square error between the sample corresponding to a certain feature in the field state subsequence as a label and the sample corresponding to a certain feature in the field state subsequence as a label in the normal time window,
M″tableγ is the mean square error between the sample corresponding to a certain feature in the field state subsequence as a label and the sample corresponding to a certain feature in the field state subsequence as a label in the abnormal time window,
bias is the bias term,
v5 and v6 are respectively the variation coefficients of the mean square errors of the normal time window and the abnormal time window, and v5 is equal to v6 during training,
f(table γ) is the difference value calculated with a certain feature in the field state subsequence in the trained time window as the label;
step 8: calculating, through formulas (9)-(12), the difference values between each normal time window and the other normal and abnormal time windows and storing them as a set U3; calculating, through formulas (9)-(12), the difference values between each abnormal time window and the normal and abnormal time windows and storing them as a set P3; and obtaining the confidence interval σ(γ) of the field state subsequence under a normal time window as
σ(γ)=[min(U3),max(U3)]∩[min(P3),max(P3)]。
4. The SGSE-ECC-based log anomaly detection system according to claim 1, wherein the ECC log analysis module performs model analysis through the following steps:
step 1: when a time window under test is analyzed, randomly selecting one feature of each subsequence to form three labels and forming three samples to represent the time window under test;
step 2: setting the initial values v1 = v2, v3 = v4 and v5 = v6, and determining through formulas (1)-(12) whether the difference values f(table α), f(table β) and f(table γ) of the three samples in the time window under test are within the corresponding confidence intervals σ(α), σ(β) and σ(γ);
step 3: if all three samples are within the confidence intervals, constraining v1, v2, v3, v4, v5 and v6 according to the following formulas
v1 = a1*v1 (0 ≤ a1 < 1)
v3 = a2*v3 (0 ≤ a2 < 1)
v5 = a3*v5 (0 ≤ a3 < 1)
wherein v1, v3 and v5 are respectively the variation coefficients of the mean square error of the normal time window; reducing v1, v3 and v5 reduces the influence of the normal time window on the difference value and increases the influence of the abnormal time window on the difference value; whether the difference values of the three samples under the time window under test are within the corresponding confidence intervals σ(α), σ(β) and σ(γ) is determined again through formulas (1)-(12) from the new v1, v2, v3, v4, v5 and v6 and the three samples; and the number of repeated constraints is determined according to the missed-detection rate requirement;
step 4: if the difference values of the three samples under the time window under test are still within the confidence intervals after the repeated constraints are finished, the information system is considered normal in the time window under test; otherwise the information system is considered abnormal.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010911782.XA CN111984516B (en) 2020-09-02 2020-09-02 Log anomaly detection system based on SGSE-ECC

Publications (2)

Publication Number Publication Date
CN111984516A true CN111984516A (en) 2020-11-24
CN111984516B CN111984516B (en) 2024-01-05

Family

ID=73448654

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010911782.XA Active CN111984516B (en) 2020-09-02 2020-09-02 Log anomaly detection system based on SGSE-ECC

Country Status (1)

Country Link
CN (1) CN111984516B (en)

Also Published As

Publication number Publication date
CN111984516B (en) 2024-01-05

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant