CN111984516A - Log anomaly detection system based on SGSE-ECC - Google Patents
- Publication number
- CN111984516A; CN202010911782.XA
- Authority
- CN
- China
- Prior art keywords
- time window
- log
- normal
- label
- subsequence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/34—Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
- G06F11/3452—Performance evaluation by statistical analysis
- G06F11/3466—Performance evaluation by tracing or monitoring
- G06F11/3476—Data logging
Abstract
A log anomaly detection system based on SGSE-ECC belongs to the field of log data processing and addresses the problem of log analysis. The time window division module determines the size of the time window according to the information system's response-time requirement; the SGSE data processing module forms the log data, per time window, into sample data to be called by the ECC log analysis algorithm; the ECC model training module trains the ECC log analysis model; and the ECC log analysis module analyzes, from the log data of each device in the information system, whether the state of the information system in the current time window is normal. The effect is that anomalies can be detected from the logs.
Description
Technical Field
The invention belongs to the field of log data processing, and relates to a log anomaly detection system based on SGSE-ECC.
Background
With the development of internet technology, the number of logs generated by the devices in an information system keeps increasing, and analyzing logs that come from different devices and have different data characteristics is an important part of operation and maintenance work. Analyzing multi-source heterogeneous log data by automated means makes it possible to learn promptly whether the running state of the information system is normal or abnormal, helps keep the information system running safely and stably, and further reduces the operation and maintenance costs of enterprises.
Current techniques for multi-source heterogeneous log analysis include single analysis followed by aggregation, and correlation analysis. In the single-analysis-then-aggregation method, the logs generated by each individual device in the information system are analyzed to determine the running state of that device, and preset rules are then applied to the device states to judge whether the information system as a whole is abnormal. This method does not analyze the logs of different devices jointly; it judges the state of each device separately and only then combines the results, so the relationships among the logs of different devices cannot be mined. In the correlation analysis method, characteristic events are generated from the contents of the fields in the logs, the events generated by different devices within a time window are clustered and compared for similarity, and similar events are removed. Similar events from different devices are then merged, and finally statistical reports of the various events are generated. Because this method aims at generating statistical reports, it cannot mine the relationships among the events in depth and present them directly to the user, and a clustering algorithm alone cannot classify each event accurately.
Current techniques for multi-source heterogeneous log analysis include single analysis followed by aggregation [1], correlation analysis [2], and similar methods. In the single-analysis-then-aggregation method, the logs generated by each individual device in the information system are analyzed to determine the running state of that device, and preset rules are then applied to the device states to judge whether the information system as a whole is abnormal. In the correlation analysis method, characteristic events are generated from the contents of the fields in the logs, the events generated by different devices within a time window are compared for similarity, and similar events are removed. Similar events from different devices are then merged, and finally statistical reports of the various events are generated.
The single-analysis-then-aggregation method does not analyze the logs of different devices jointly; it judges the state of each device separately and only then combines the results, so the relationships among the logs of different devices cannot be mined. The correlation analysis method aims at generating statistical reports of the various events, so it cannot mine the relationships among the events in depth and present them directly to the user.
Disclosure of Invention
In order to solve the problem of log analysis, the invention provides the following technical scheme: an SGSE-ECC-based log anomaly detection system, comprising:
the time window dividing module is used for determining the size of a time window according to the requirement of the information system on response time;
the SGSE data processing module is used for forming the log data into sample data for being called by an ECC log analysis algorithm according to a time window;
the ECC model training module is used for training an ECC log analysis model;
and the ECC log analysis module is used for analyzing whether the state of the information system under the current time window is normal or not according to the log data of each device in the information system.
Further, the SGSE data processing module comprises:
an SG state generation submodule, which counts the number of logs of each device to generate a log quantity state subsequence; counts the number of logs of each log category generated on each device in the time window to generate a user behavior state subsequence; and counts the number of occurrences of each type in certain important fields of each device in the time window to generate a field state subsequence;
an SE sequential extraction submodule, which, within the same time window, extracts each feature of one subsequence in turn as a label and combines it with all the features of the other two subsequences into a sample, so that any feature in any subsequence has all the features of the other two subsequences corresponding to it; when a time window is analyzed, one feature is randomly selected from each subsequence, giving three labels that represent the time window under test.
Further, the ECC model training module comprises an ECC model training submodule, whose model training comprises the following steps:
step 1: counting the number of logs of multi-source heterogeneous log data in a normal time window and an abnormal time window to generate a log number state subsequence, counting the number of types of each log generated on each device in the time window to generate a user behavior state subsequence, and counting the number of times of types appearing in some important fields of each device in the time window to generate a field state subsequence;
step 2: generating (n + m + j) sample data sets from n features in the log quantity state subsequence, m features in the user behavior state subsequence and j features in the field state subsequence in each time window;
step 3: taking the sample data set whose label is a feature of the log quantity state subsequence in each normal and abnormal time window, and calculating, pairwise, its difference values against the sample data sets in the other normal and abnormal time windows according to the ECC expression, wherein the calculation expressions are as follows:
v1=1-v2 (3)
f(tableα)=v1*M′tableα+v2*M″tableα+bias (4)
table α denotes the sample whose label is a given feature of the log quantity state subsequence,
table α′ denotes the sample whose label is that feature of the log quantity state subsequence in a normal time window,
table α″ denotes the sample whose label is that feature of the log quantity state subsequence in an abnormal time window,
M′tableα is the mean square error between table α and table α′,
M″tableα is the mean square error between table α and table α″,
bias is the bias term,
v1 and v2 are the variation coefficients of the mean square errors for the normal and abnormal time windows respectively, with v1 equal to v2 during training;
f(table α) is the difference value calculated for the trained time window with that feature of the log quantity state subsequence as the label;
step 4: calculating, through formulas (1)-(4), the difference values between each normal time window and the other normal and abnormal time windows and storing them as a set U1; calculating, through formulas (1)-(4), the difference values between each abnormal time window and the normal and abnormal time windows and storing them as a set P1; and obtaining the confidence interval σ(α) of the log quantity state subsequence under a normal time window as
σ(α)=[min(U1),max(U1)]∩[min(P1),max(P1)]
step 5: taking the sample data set whose label is a feature of the user behavior state subsequence in each normal and abnormal time window, and calculating, pairwise, its difference values against the sample data sets in the other normal and abnormal time windows according to the ECC expression, wherein the calculation expressions are as follows:
v3=1-v4 (7)
f(tableβ)=v3*M′tableβ+v4*M″tableβ+bias (8)
table β denotes the sample whose label is a given feature of the user behavior state subsequence,
table β′ denotes the sample whose label is that feature of the user behavior state subsequence in a normal time window,
table β″ denotes the sample whose label is that feature of the user behavior state subsequence in an abnormal time window,
M′tableβ is the mean square error between table β and table β′,
M″tableβ is the mean square error between table β and table β″,
bias is the bias term,
v3 and v4 are the variation coefficients of the mean square errors for the normal and abnormal time windows respectively, with v3 equal to v4 during training;
f(table β) is the difference value calculated for the trained time window with that feature of the user behavior state subsequence as the label;
step 6: calculating, through formulas (5)-(8), the difference values between each normal time window and the other normal and abnormal time windows and storing them as a set U2; calculating, through formulas (5)-(8), the difference values between each abnormal time window and the normal and abnormal time windows and storing them as a set P2; and obtaining the confidence interval σ(β) of the user behavior state subsequence under a normal time window as
σ(β)=[min(U2),max(U2)]∩[min(P2),max(P2)]
step 7: taking the sample data set whose label is a feature of the field state subsequence in each normal and abnormal time window, and calculating, pairwise, its difference values against the sample data sets in the other normal and abnormal time windows according to the ECC expression, wherein the calculation expressions are as follows:
v5=1-v6 (11)
f(tableγ)=v5*M′tableγ+v6*M″tableγ+bias (12)
table γ denotes the sample whose label is a given feature of the field state subsequence,
table γ′ denotes the sample whose label is that feature of the field state subsequence in a normal time window,
table γ″ denotes the sample whose label is that feature of the field state subsequence in an abnormal time window,
M′tableγ is the mean square error between table γ and table γ′,
M″tableγ is the mean square error between table γ and table γ″,
bias is the bias term,
v5 and v6 are the variation coefficients of the mean square errors for the normal and abnormal time windows respectively, with v5 equal to v6 during training;
f(table γ) is the difference value calculated for the trained time window with that feature of the field state subsequence as the label;
step 8: calculating, through formulas (9)-(12), the difference values between each normal time window and the other normal and abnormal time windows and storing them as a set U3; calculating, through formulas (9)-(12), the difference values between each abnormal time window and the normal and abnormal time windows and storing them as a set P3; and obtaining the confidence interval σ(γ) of the field state subsequence under a normal time window as
σ(γ)=[min(U3),max(U3)]∩[min(P3),max(P3)].
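As an illustration of the difference value in formula (4), the following is a minimal Python sketch. Formulas (1) and (2) are not reproduced in the text above, so the sketch assumes they define M′ and M″ as ordinary mean square errors over equal-length feature vectors. The function names, the flat-list sample layout and the default bias of 0 are likewise assumptions made for the example, not part of the claimed method.

```python
def mean_square_error(a, b):
    """Mean square error between two equal-length numeric vectors (assumed form of M' and M'')."""
    if len(a) != len(b) or not a:
        raise ValueError("vectors must be non-empty and of equal length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def difference_value(sample, normal_ref, abnormal_ref, v_normal=0.5, bias=0.0):
    """f = v1 * M' + v2 * M'' + bias, with v2 = 1 - v1 as in formula (3)."""
    v_abnormal = 1.0 - v_normal
    m_normal = mean_square_error(sample, normal_ref)      # M': error against a normal window
    m_abnormal = mean_square_error(sample, abnormal_ref)  # M'': error against an abnormal window
    return v_normal * m_normal + v_abnormal * m_abnormal + bias


# Hypothetical samples: the window under comparison matches the normal reference exactly.
f_alpha = difference_value([1.0, 2.0], normal_ref=[1.0, 2.0], abnormal_ref=[3.0, 4.0])
```

With v1 = v2 = 0.5 (the training setting stated above), M′ = 0 and M″ = 4 in this example, so f_alpha is 2.0.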
Further, the ECC log analysis module performs model analysis through the following steps:
step 2: setting the initial values v1 = v2, v3 = v4 and v5 = v6, and determining through formulas (1)-(12) whether the difference values f(table α), f(table β) and f(table γ) of the three samples in the time window under test lie within the corresponding confidence intervals σ(α), σ(β) and σ(γ);
step 3: if all three difference values lie within their confidence intervals, constraining v1, v2, v3, v4, v5 and v6 as follows:
v1=a1*v1 (0≤a1<1)
v3=a2*v3 (0≤a2<1)
v5=a3*v5 (0≤a3<1)
v1, v3 and v5 are the variation coefficients of the mean square error of the normal time window; reducing them decreases the influence of the normal time window on the difference value and increases the influence of the abnormal time window. With the new v1, v2, v3, v4, v5 and v6, formulas (1)-(12) are applied again to the three samples to determine whether their difference values in the time window under test lie within the corresponding confidence intervals σ(α), σ(β) and σ(γ); the number of repeated constraints is determined by the required missed-detection rate;
step 4: if the difference values of the three samples in the time window under test are still within the confidence intervals after the repeated constraints are finished, the information system is considered normal in that time window; otherwise it is considered abnormal.
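The analysis steps above can be sketched as a short loop. This is only an illustrative Python sketch under several assumptions: the mean square errors M′ and M″ of each of the three samples are taken as already computed; each abnormal-window coefficient is recomputed as 1 minus the normal-window coefficient, following formula (3); the initial coefficient is 0.5, which follows from v1 = v2 together with formula (3); and the same factor is applied in every round. The function name and the `rounds` parameter are hypothetical.

```python
def is_window_normal(errors, intervals, factors, rounds, bias=0.0):
    """Repeated coefficient constraint over the three samples of one time window.

    errors:    list of (m_normal, m_abnormal) pairs, one per sample
    intervals: list of (low, high) confidence intervals, one per sample
    factors:   list of shrink factors a1, a2, a3 with 0 <= a < 1
    rounds:    number of constraint rounds after the initial check
    """
    weights = [0.5] * len(errors)  # v1 = v2, v3 = v4, v5 = v6 at the start
    for _ in range(rounds + 1):
        for (m_normal, m_abnormal), (low, high), v in zip(errors, intervals, weights):
            f = v * m_normal + (1.0 - v) * m_abnormal + bias
            if not low <= f <= high:
                return False  # a difference value left its confidence interval
        # Shrink the normal-window coefficients before the next check.
        weights = [a * v for a, v in zip(factors, weights)]
    return True


# Hypothetical numbers: each sample has M' = 1.0, M'' = 3.0 and interval [1.5, 2.6].
errors = [(1.0, 3.0)] * 3
intervals = [(1.5, 2.6)] * 3
factors = [0.5, 0.5, 0.5]
normal_after_one_round = is_window_normal(errors, intervals, factors, rounds=1)
normal_after_two_rounds = is_window_normal(errors, intervals, factors, rounds=2)
```

In this example the difference value moves from 2.0 to 2.5 and then to 2.75, so the window passes with one constraint round and fails with two. This reflects how more rounds make the check stricter, in line with tying the number of rounds to the missed-detection requirement.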
Advantageous effects: the invention provides an SGSE algorithm that processes multi-source heterogeneous logs into multi-dimensional samples which represent the state of the information system and can be analyzed algorithmically. The invention also provides an ECC algorithm for multi-source heterogeneous logs, which analyzes the multi-dimensional samples and determines the running state of the information system in a time window from the multi-source heterogeneous logs of that time window.
Drawings
FIG. 1: Flow chart of the multi-source heterogeneous log analysis.
FIG. 2: Diagram of state sample generation for the time window under test.
Detailed Description
The invention provides a method for processing multi-source heterogeneous log data in an information system by using an SGSE (State Generation Sequential Extraction) algorithm, and provides a new ECC (Error Coefficient Constraint) algorithm for judging the operating state of the information system, designed for the characteristics of the information system's multi-source heterogeneous logs. Referring to FIG. 1, the multi-source heterogeneous log analysis method of the invention includes the following steps:
Step 1: The size of the time window is determined according to the response time required of the information system.
Step 2: The logs in each time window are processed with the SGSE algorithm, which turns the log data of each time window into samples.
Step 3: The ECC log analysis model is trained and then used to analyze the time window that needs to be analyzed.
Step 4: The log analysis result is presented.
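Step 1 above amounts to bucketing log records by time. A minimal Python sketch follows, assuming each record is a hypothetical (timestamp in seconds, device name, field dictionary) tuple and that windows are fixed-size and non-overlapping; the text does not specify the record layout or whether windows may overlap.

```python
from collections import defaultdict


def split_into_windows(records, window_seconds):
    """Group (timestamp, device, fields) records into fixed-size time windows."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    windows = defaultdict(list)
    for timestamp, device, fields in records:
        index = int(timestamp // window_seconds)  # number of the window containing the record
        windows[index].append((device, fields))
    return dict(windows)


# Hypothetical records from two devices.
records = [
    (0.5, "waf", {"action": "alert"}),
    (59.0, "firewall", {"protocol": "TCP"}),
    (61.2, "waf", {"action": "block"}),
]
windows = split_into_windows(records, window_seconds=60)
```

Here the first two records fall into window 0 and the third into window 1. Each window's record list is what the SGSE processing in Step 2 would consume.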
In a preferred embodiment, the steps of the ECC log analysis model training are as follows:
step 1: counting the number of logs of multi-source heterogeneous log data in a normal time window and an abnormal time window to generate a log number state subsequence, counting the number of types of each log generated on each device in the time window to generate a user behavior state subsequence, and counting the number of times of types appearing in some important fields of each device in the time window to generate a field state subsequence;
step 2: generating (n + m + j) sample data sets from n features in the log quantity state subsequence, m features in the user behavior state subsequence and j features in the field state subsequence in each time window;
step 3: taking the sample data set whose label is a feature of the log quantity state subsequence in each normal and abnormal time window, and calculating, pairwise, its difference values against the sample data sets in the other normal and abnormal time windows according to the ECC expression, wherein the calculation expressions are as follows:
v1=1-v2 (3)
f(tableα)=v1*M′tableα+v2*M″tableα+bias (4)
table α denotes the sample whose label is a given feature of the log quantity state subsequence,
table α′ denotes the sample whose label is that feature of the log quantity state subsequence in a normal time window,
table α″ denotes the sample whose label is that feature of the log quantity state subsequence in an abnormal time window,
M′tableα is the mean square error between table α and table α′,
M″tableα is the mean square error between table α and table α″,
bias is the bias term,
v1 and v2 are the variation coefficients of the mean square errors for the normal and abnormal time windows respectively, with v1 equal to v2 during training;
f(table α) is the difference value calculated for the trained time window with that feature of the log quantity state subsequence as the label;
step 4: calculating, through formulas (1)-(4), the difference values between each normal time window and the other normal and abnormal time windows and storing them as a set U1; calculating, through formulas (1)-(4), the difference values between each abnormal time window and the normal and abnormal time windows and storing them as a set P1; and obtaining the confidence interval σ(α) of the log quantity state subsequence under a normal time window as
σ(α)=[min(U1),max(U1)]∩[min(P1),max(P1)]
step 5: taking the sample data set whose label is a feature of the user behavior state subsequence in each normal and abnormal time window, and calculating, pairwise, its difference values against the sample data sets in the other normal and abnormal time windows according to the ECC expression, wherein the calculation expressions are as follows:
v3=1-v4 (7)
f(tableβ)=v3*M′tableβ+v4*M″tableβ+bias (8)
table β denotes the sample whose label is a given feature of the user behavior state subsequence,
table β′ denotes the sample whose label is that feature of the user behavior state subsequence in a normal time window,
table β″ denotes the sample whose label is that feature of the user behavior state subsequence in an abnormal time window,
M′tableβ is the mean square error between table β and table β′,
M″tableβ is the mean square error between table β and table β″,
bias is the bias term,
v3 and v4 are the variation coefficients of the mean square errors for the normal and abnormal time windows respectively, with v3 equal to v4 during training;
f(table β) is the difference value calculated for the trained time window with that feature of the user behavior state subsequence as the label;
step 6: calculating, through formulas (5)-(8), the difference values between each normal time window and the other normal and abnormal time windows and storing them as a set U2; calculating, through formulas (5)-(8), the difference values between each abnormal time window and the normal and abnormal time windows and storing them as a set P2; and obtaining the confidence interval σ(β) of the user behavior state subsequence under a normal time window as
σ(β)=[min(U2),max(U2)]∩[min(P2),max(P2)]
step 7: taking the sample data set whose label is a feature of the field state subsequence in each normal and abnormal time window, and calculating, pairwise, its difference values against the sample data sets in the other normal and abnormal time windows according to the ECC expression, wherein the calculation expressions are as follows:
v5=1-v6 (11)
f(tableγ)=v5*M′tableγ+v6*M″tableγ+bias (12)
table γ denotes the sample whose label is a given feature of the field state subsequence,
table γ′ denotes the sample whose label is that feature of the field state subsequence in a normal time window,
table γ″ denotes the sample whose label is that feature of the field state subsequence in an abnormal time window,
M′tableγ is the mean square error between table γ and table γ′,
M″tableγ is the mean square error between table γ and table γ″,
bias is the bias term,
v5 and v6 are the variation coefficients of the mean square errors for the normal and abnormal time windows respectively, with v5 equal to v6 during training;
f(table γ) is the difference value calculated for the trained time window with that feature of the field state subsequence as the label;
step 8: calculating, through formulas (9)-(12), the difference values between each normal time window and the other normal and abnormal time windows and storing them as a set U3; calculating, through formulas (9)-(12), the difference values between each abnormal time window and the normal and abnormal time windows and storing them as a set P3; and obtaining the confidence interval σ(γ) of the field state subsequence under a normal time window as
σ(γ)=[min(U3),max(U3)]∩[min(P3),max(P3)]。
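The confidence intervals σ(α), σ(β) and σ(γ) in steps 4, 6 and 8 are each the intersection of two closed ranges. A minimal Python sketch follows, assuming U and P are plain lists of difference values. The function names and the choice to return None for an empty intersection are illustrative assumptions, since the text does not say how an empty intersection is handled.

```python
def confidence_interval(u_values, p_values):
    """Intersection [min(U), max(U)] ∩ [min(P), max(P)] as a (low, high) tuple, or None if empty."""
    low = max(min(u_values), min(p_values))
    high = min(max(u_values), max(p_values))
    if low > high:
        return None  # the two ranges do not overlap
    return (low, high)


def within(interval, value):
    """True when value lies inside the closed interval."""
    return interval is not None and interval[0] <= value <= interval[1]


# Hypothetical difference values from normal (U) and abnormal (P) training windows.
sigma = confidence_interval([1.0, 2.5, 4.0], [2.0, 3.0, 6.0])
```

For these values the two ranges are [1.0, 4.0] and [2.0, 6.0], so sigma is (2.0, 4.0).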
In a preferred embodiment, the log analysis method comprises the following steps:
step 2: setting the initial values v1 = v2, v3 = v4 and v5 = v6, and determining through formulas (1)-(12) whether the difference values f(table α), f(table β) and f(table γ) of the three samples in the time window under test lie within the corresponding confidence intervals σ(α), σ(β) and σ(γ);
step 3: if all three difference values lie within their confidence intervals, constraining v1, v2, v3, v4, v5 and v6 as follows:
v1=a1*v1 (0≤a1<1)
v3=a2*v3 (0≤a2<1)
v5=a3*v5 (0≤a3<1)
v1, v3 and v5 are the variation coefficients of the mean square error of the normal time window; reducing them decreases the influence of the normal time window on the difference value and increases the influence of the abnormal time window. With the new v1, v2, v3, v4, v5 and v6, formulas (1)-(12) are applied again to the three samples to determine whether their difference values in the time window under test lie within the corresponding confidence intervals σ(α), σ(β) and σ(γ); the number of repeated constraints is determined by the required missed-detection rate;
step 4: if the difference values of the three samples in the time window under test are still within the confidence intervals after the repeated constraints are finished, the information system is considered normal in that time window; otherwise it is considered abnormal.
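The effect of repeating the constraint can be written out for the log quantity sample. The derivation below is a sketch under two assumptions that the text implies but does not state outright: the same factor a1 is applied in every round, and v2 is recomputed from formula (3) after each update of v1. The superscript (k) denotes the value after k constraint rounds and is notation introduced here.

```latex
\begin{aligned}
v_1^{(k)} &= a_1^{\,k}\, v_1^{(0)}, \qquad v_2^{(k)} = 1 - v_1^{(k)},\\
f^{(k)}(\mathrm{table}_\alpha) &= v_1^{(k)} M'_{\mathrm{table}\alpha} + v_2^{(k)} M''_{\mathrm{table}\alpha} + \mathrm{bias}\\
&= M''_{\mathrm{table}\alpha} + \mathrm{bias} + a_1^{\,k}\, v_1^{(0)} \left( M'_{\mathrm{table}\alpha} - M''_{\mathrm{table}\alpha} \right).
\end{aligned}
```

Since 0 ≤ a1 < 1, the last term shrinks with every round and the difference value approaches M″ + bias, which is the sense in which the influence of the abnormal time window grows. The same reasoning applies to v3, v4 and v5, v6.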
The invention also provides a log anomaly detection system, which comprises:
a time window dividing module, used for determining the size of the time window according to the information system's response-time requirement;
an SGSE data processing module, used for processing the log data, per time window, into sample data that can be called by the ECC log analysis model;
an ECC model training module, used for training the ECC log analysis model;
an ECC log analysis module, used for analyzing the time window under test with the ECC log analysis model and judging, from the logs of all devices in the information system, whether the state of the information system in that time window is normal.
In one scheme, the time window dividing module makes the time window as short as possible, subject to the response time that the user requires of the information system.
In one scheme, the SGSE data processing module is divided into an SG state generation submodule and an SE sequential extraction submodule.
The SG state generation submodule identifies the devices in the information system, such as the WAF, load balancers and firewalls. It counts the number of logs of each device to generate the log quantity state subsequence, whose features include the WAF log count, the load balancer log count, and so on.
Table 1: log quantity status subsequence
Time window/device type | WAF Log quantity | Load balancing log number | Firewall log number | …… | Number of Nginx logs |
Time window N | α1 | α2 | α3 | …… | αn |
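A row of Table 1 can be produced by counting records per device. The following Python sketch assumes the records of one time window are hypothetical (device, fields) pairs and that the device order is fixed in advance; both are assumptions made for the example.

```python
from collections import Counter


def log_quantity_subsequence(window_records, devices):
    """Return [α1, ..., αn]: the number of logs per device, in the given device order."""
    counts = Counter(device for device, _fields in window_records)
    return [counts[device] for device in devices]  # Counter yields 0 for absent devices


# Hypothetical records of one time window.
window_records = [("waf", {}), ("waf", {}), ("firewall", {}), ("nginx", {})]
devices = ["waf", "load_balancer", "firewall", "nginx"]
alpha = log_quantity_subsequence(window_records, devices)
```

For these records alpha is [2, 0, 1, 1], one value per column of Table 1.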
The number of logs of each log category generated on each device in the time window is counted to obtain the user behavior state subsequence. Categories are determined by combining the types of every field whose entropy is not 0. For example, the WAF device has an action field that records how the WAF handled an access, with the types alert and block, and an http_method field that records the HTTP status, with the four types 200, 404, 500 and 501; from these two fields the WAF device can generate 2 × 4 = 8 different categories of log. The subsequence contains features such as the count of each WAF log category and the count of each firewall log category.
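The category counting described above can be illustrated as follows. This Python sketch assumes the fields with non-zero entropy and their possible types have already been identified and are passed in as a dictionary; the function name, the record layout and the sample values are hypothetical.

```python
from collections import Counter
from itertools import product


def user_behavior_subsequence(window_records, device, field_values):
    """Count logs of one device per category, where a category is one combination of field types."""
    fields = list(field_values)
    categories = list(product(*(field_values[name] for name in fields)))
    counts = Counter(
        tuple(record.get(name) for name in fields)
        for dev, record in window_records
        if dev == device
    )
    return categories, [counts[category] for category in categories]


# The two WAF fields from the example in the text: 2 x 4 = 8 categories.
waf_fields = {"action": ["alert", "block"], "http_method": [200, 404, 500, 501]}
window_records = [
    ("waf", {"action": "alert", "http_method": 200}),
    ("waf", {"action": "block", "http_method": 404}),
    ("waf", {"action": "alert", "http_method": 200}),
    ("firewall", {"protocol": "TCP"}),
]
categories, beta = user_behavior_subsequence(window_records, "waf", waf_fields)
```

The eight categories are enumerated in a fixed order, so each position of beta always refers to the same log category across time windows.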
Table 2: user behavior state subsequence
The number of occurrences of each type in certain important fields of each device in the time window is counted to generate the field state subsequence. The field state subsequence contains features such as the number of alert entries in the WAF action field and the proportion of TCP entries in the firewall protocol field.
Table 3: field state subsequence
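The two example features named for the field state subsequence, a count and a proportion, can be computed as in the sketch below. The helper names and the (device, fields) record layout are assumptions for illustration, and returning 0.0 when a device has no records with the field is a choice made here, not something the text specifies.

```python
def field_value_count(window_records, device, field, value):
    """Number of logs of a device whose field has the given value."""
    return sum(
        1 for dev, record in window_records
        if dev == device and record.get(field) == value
    )


def field_value_ratio(window_records, device, field, value):
    """Share of a device's logs carrying the field whose value matches (0.0 if none carry it)."""
    total = sum(1 for dev, record in window_records if dev == device and field in record)
    if total == 0:
        return 0.0
    return field_value_count(window_records, device, field, value) / total


# Hypothetical records of one time window.
window_records = [
    ("waf", {"action": "alert"}),
    ("waf", {"action": "block"}),
    ("waf", {"action": "alert"}),
    ("firewall", {"protocol": "TCP"}),
    ("firewall", {"protocol": "UDP"}),
    ("firewall", {"protocol": "TCP"}),
    ("firewall", {"protocol": "TCP"}),
]
gamma = [
    field_value_count(window_records, "waf", "action", "alert"),
    field_value_ratio(window_records, "firewall", "protocol", "TCP"),
]
```

For these records gamma is [2, 0.75]: two WAF alerts, and three of four firewall logs using TCP.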
Within the same time window, the SE sequential extraction submodule extracts each feature of one subsequence in turn as a label and combines it with all the features of the other two subsequences to form a sample, as shown in the following tables:
Table 4: Feature merge table with the WAF log quantity (log quantity state subsequence) as the label
Table 5: Feature merge table with WAF log type 1 (user behavior state subsequence) as the label
Table 6: Feature merge table with the WAF action-field alert count (field state subsequence) as the label
Through the SE sequential extraction submodule, any feature in any subsequence has all the features of the other two subsequences corresponding to it, so that associations are established among the generated sample features. When a time window is analyzed, one feature is randomly selected from each subsequence, giving three labels that represent the time window under test.
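The sequential extraction described above can be sketched in a few lines of Python. The dictionary-based sample layout and the names `se_extract`, `source` and `index` are assumptions made for the example. Only the rule itself, each feature as a label combined with all features of the other two subsequences, comes from the text.

```python
def se_extract(alpha, beta, gamma):
    """Build n + m + j samples: each feature is a label paired with the other two subsequences."""
    subsequences = {"alpha": alpha, "beta": beta, "gamma": gamma}
    samples = []
    for name, values in subsequences.items():
        # Features of the two other subsequences, concatenated in a fixed order.
        others = [v for other, vals in subsequences.items() if other != name for v in vals]
        for index, label in enumerate(values):
            samples.append(
                {"source": name, "index": index, "label": label, "features": list(others)}
            )
    return samples


# Hypothetical subsequences with n = 3, m = 2 and j = 2 features.
samples = se_extract(alpha=[2, 0, 1], beta=[2, 1], gamma=[2, 0.75])
```

This yields 3 + 2 + 2 = 7 samples for one time window. Selecting one sample per subsequence, for example at random, gives the three labelled samples that represent a time window under test.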
In one scheme, the ECC model training module is divided into a sub-module for determining a small number of normal and abnormal time windows and an ECC model training sub-module.
Sub-module for determining a small number of normal and abnormal time windows: a clustering algorithm alone cannot classify time windows accurately, and any misclassification is amplified when the analysis model is trained, which degrades the analysis result. Instead, a small number of time windows in the historical data of the existing information system are reliably labeled as normal or abnormal on the basis of expert knowledge.
The model training of the ECC model training sub-module comprises the following steps:
Step 1: The multi-source heterogeneous log data in the normal and abnormal time windows is processed by the SG state generation sub-module: the number of logs of each device is counted to generate the log quantity state subsequence; the number of logs of each category generated on each device in the time window is counted to generate the user behavior state subsequence; and the number of times each value appears in certain important fields of each device in the time window is counted to generate the field state subsequence.
Step 2: In each time window, the n features of the log quantity state subsequence, the m features of the user behavior state subsequence and the j features of the field state subsequence are extracted by the SE sequential extraction sub-module to generate (n + m + j) sample data sets.
Step 3: For each normal and abnormal time window, take the sample data set whose label is a given feature of the log quantity state subsequence and compute its difference values pairwise against the sample data sets of the other normal and abnormal time windows according to the ECC expression:
$v_1 = 1 - v_2$ (3)
$f(\mathrm{table}_\alpha) = v_1 \cdot M'_{\mathrm{table}_\alpha} + v_2 \cdot M''_{\mathrm{table}_\alpha} + \mathrm{bias}$ (4)
where
$\mathrm{table}_\alpha$ is the sample whose label is a given feature of the log quantity state subsequence,
$\mathrm{table}_\alpha'$ is the sample whose label is a given feature of the log quantity state subsequence in a normal time window,
$\mathrm{table}_\alpha''$ is the sample whose label is a given feature of the log quantity state subsequence in an abnormal time window,
$M'_{\mathrm{table}_\alpha}$ is the mean square error between $\mathrm{table}_\alpha$ and $\mathrm{table}_\alpha'$,
$M''_{\mathrm{table}_\alpha}$ is the mean square error between $\mathrm{table}_\alpha$ and $\mathrm{table}_\alpha''$,
bias is the bias term,
$v_1$ and $v_2$ are the variation coefficients of the mean square errors for the normal and abnormal time windows respectively; during training $v_1 = v_2$, which together with formula (3) gives $v_1 = v_2 = 0.5$,
$f(\mathrm{table}_\alpha)$ is the difference value computed for a training time window when a given feature of the log quantity state subsequence is used as the label.
Step 4: Using formulas (1) to (4), compute the difference values between each normal time window and the other normal and abnormal time windows and store them as the set $U_1$; compute the difference values between each abnormal time window and the other normal and abnormal time windows and store them as the set $P_1$. The confidence interval $\sigma(\alpha)$ of the log quantity state subsequence under a normal time window is then
$\sigma(\alpha) = [\min(U_1), \max(U_1)] \cap [\min(P_1), \max(P_1)]$
Step 5: For each normal and abnormal time window, take the sample data set whose label is a given feature of the user behavior state subsequence and compute its difference values pairwise against the sample data sets of the other normal and abnormal time windows according to the ECC expression:
$v_3 = 1 - v_4$ (7)
$f(\mathrm{table}_\beta) = v_3 \cdot M'_{\mathrm{table}_\beta} + v_4 \cdot M''_{\mathrm{table}_\beta} + \mathrm{bias}$ (8)
where
$\mathrm{table}_\beta$ is the sample whose label is a given feature of the user behavior state subsequence,
$\mathrm{table}_\beta'$ is the sample whose label is a given feature of the user behavior state subsequence in a normal time window,
$\mathrm{table}_\beta''$ is the sample whose label is a given feature of the user behavior state subsequence in an abnormal time window,
$M'_{\mathrm{table}_\beta}$ is the mean square error between $\mathrm{table}_\beta$ and $\mathrm{table}_\beta'$,
$M''_{\mathrm{table}_\beta}$ is the mean square error between $\mathrm{table}_\beta$ and $\mathrm{table}_\beta''$,
bias is the bias term,
$v_3$ and $v_4$ are the variation coefficients of the mean square errors for the normal and abnormal time windows respectively; during training $v_3 = v_4$, which together with formula (7) gives $v_3 = v_4 = 0.5$,
$f(\mathrm{table}_\beta)$ is the difference value computed for a training time window when a given feature of the user behavior state subsequence is used as the label.
Step 6: Using formulas (5) to (8), compute the difference values between each normal time window and the other normal and abnormal time windows and store them as the set $U_2$; compute the difference values between each abnormal time window and the other normal and abnormal time windows and store them as the set $P_2$. The confidence interval $\sigma(\beta)$ of the user behavior state subsequence under a normal time window is then
$\sigma(\beta) = [\min(U_2), \max(U_2)] \cap [\min(P_2), \max(P_2)]$
Step 7: For each normal and abnormal time window, take the sample data set whose label is a given feature of the field state subsequence and compute its difference values pairwise against the sample data sets of the other normal and abnormal time windows according to the ECC expression:
$v_5 = 1 - v_6$ (11)
$f(\mathrm{table}_\gamma) = v_5 \cdot M'_{\mathrm{table}_\gamma} + v_6 \cdot M''_{\mathrm{table}_\gamma} + \mathrm{bias}$ (12)
where
$\mathrm{table}_\gamma$ is the sample whose label is a given feature of the field state subsequence,
$\mathrm{table}_\gamma'$ is the sample whose label is a given feature of the field state subsequence in a normal time window,
$\mathrm{table}_\gamma''$ is the sample whose label is a given feature of the field state subsequence in an abnormal time window,
$M'_{\mathrm{table}_\gamma}$ is the mean square error between $\mathrm{table}_\gamma$ and $\mathrm{table}_\gamma'$,
$M''_{\mathrm{table}_\gamma}$ is the mean square error between $\mathrm{table}_\gamma$ and $\mathrm{table}_\gamma''$,
bias is the bias term,
$v_5$ and $v_6$ are the variation coefficients of the mean square errors for the normal and abnormal time windows respectively; during training $v_5 = v_6$, which together with formula (11) gives $v_5 = v_6 = 0.5$,
$f(\mathrm{table}_\gamma)$ is the difference value computed for a training time window when a given feature of the field state subsequence is used as the label.
Step 8: Using formulas (9) to (12), compute the difference values between each normal time window and the other normal and abnormal time windows and store them as the set $U_3$; compute the difference values between each abnormal time window and the other normal and abnormal time windows and store them as the set $P_3$. The confidence interval $\sigma(\gamma)$ of the field state subsequence under a normal time window is then
$\sigma(\gamma) = [\min(U_3), \max(U_3)] \cap [\min(P_3), \max(P_3)]$.
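The training steps for one label can be sketched as follows. The mean square error formulas themselves are not reproduced in the text here, so this sketch assumes that each sample is a numeric vector and that M′ and M″ are the mean square errors against the other normal and abnormal windows, averaged over those windows. The function names and the data are hypothetical, and at least two normal and two abnormal windows are assumed.

```python
def mse(a, b):
    """Mean square error between two equal-length numeric vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def mean_mse(sample, references):
    """Assumed form of M' / M'': average MSE against the reference samples."""
    return sum(mse(sample, r) for r in references) / len(references)

def difference_value(sample, normal_refs, abnormal_refs, v_normal=0.5, bias=0.0):
    """f = v1 * M' + v2 * M'' + bias, with v2 = 1 - v1."""
    v_abnormal = 1.0 - v_normal
    return (v_normal * mean_mse(sample, normal_refs)
            + v_abnormal * mean_mse(sample, abnormal_refs) + bias)

def confidence_interval(normal, abnormal, bias=0.0):
    """Intersect the ranges of difference values from normal (U) and abnormal (P) windows."""
    U = [difference_value(s, normal[:i] + normal[i + 1:], abnormal, bias=bias)
         for i, s in enumerate(normal)]
    P = [difference_value(s, normal, abnormal[:i] + abnormal[i + 1:], bias=bias)
         for i, s in enumerate(abnormal)]
    lo, hi = max(min(U), min(P)), min(max(U), max(P))
    return (lo, hi) if lo <= hi else None  # None when the two ranges do not overlap

# Illustrative samples for one label: three normal and two abnormal windows.
normal = [[10, 1], [12, 1], [11, 1]]
abnormal = [[30, 5], [34, 5]]
sigma = confidence_interval(normal, abnormal)
print(sigma)
```

The same routine would be run once per label type to obtain σ(α), σ(β) and σ(γ).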
In one scheme, the analysis performed by the ECC log analysis module comprises the following steps:
Step 1: When a time window under test is analyzed, one feature is randomly selected from each subsequence to form three labels, and the three corresponding samples represent the time window under test.
Step 2: With the initial values $v_1 = v_2$, $v_3 = v_4$ and $v_5 = v_6$, the difference values $f(\mathrm{table}_\alpha)$, $f(\mathrm{table}_\beta)$ and $f(\mathrm{table}_\gamma)$ of the three samples in the time window under test are calculated with formulas (1) to (12) of the ECC model training sub-module, and each is checked against its corresponding confidence interval $\sigma(\alpha)$, $\sigma(\beta)$ or $\sigma(\gamma)$.
Step 3: If all three difference values lie within their confidence intervals, the detection accuracy is improved by constraining $v_1$, $v_2$, $v_3$, $v_4$, $v_5$ and $v_6$ with the following formulas:
$v_1 = a_1 \cdot v_1 \quad (0 \le a_1 < 1)$
$v_3 = a_2 \cdot v_3 \quad (0 \le a_2 < 1)$
$v_5 = a_3 \cdot v_5 \quad (0 \le a_3 < 1)$
$v_1$, $v_3$ and $v_5$ are the variation coefficients of the mean square error for the normal time window. Reducing them decreases the influence of the normal time windows on the difference value and increases the influence of the abnormal time windows. The new $v_1$, $v_2$, $v_3$, $v_4$, $v_5$ and $v_6$ and the three samples are fed back into formulas (1) to (12) of the ECC model training sub-module to check whether the difference values still lie within the confidence intervals. The number of repeated constraints is set according to the preset missed-report rate requirement: a lenient requirement needs few rounds, whereas a strict requirement needs many.
Step 4: If the difference values of the time window are still within the confidence intervals after the repeated constraints are finished, the information system is considered normal in the time window under test; otherwise it is considered abnormal.
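The analysis steps above can be sketched as a small loop. This is an illustration under stated assumptions rather than the patented procedure: the mean square errors M′ and M″ of the three samples are taken as already computed, the abnormal-window coefficient is recomputed as one minus the normal-window coefficient after every constraint, and the names `difference_value` and `is_window_normal` as well as all numbers are hypothetical.

```python
def difference_value(m_normal, m_abnormal, v_normal, bias=0.0):
    """f = v_normal * M' + (1 - v_normal) * M'' + bias."""
    return v_normal * m_normal + (1.0 - v_normal) * m_abnormal + bias

def is_window_normal(errors, intervals, factors, rounds, bias=0.0):
    """errors: three (M', M'') pairs for the alpha, beta and gamma samples.
    intervals: three (low, high) confidence intervals.
    factors: constraint factors a1, a2, a3, each in [0, 1).
    rounds: number of repeated constraints after the initial check."""
    v = [0.5, 0.5, 0.5]  # initial values: normal and abnormal coefficients are equal
    for _ in range(rounds + 1):
        for (m_n, m_a), (low, high), v_n in zip(errors, intervals, v):
            f = difference_value(m_n, m_a, v_n, bias)
            if not low <= f <= high:
                return False  # a difference value left its confidence interval
        # Constrain the normal-window coefficients before the next check.
        v = [a * v_n for a, v_n in zip(factors, v)]
    return True

# Illustrative values only.
errors = [(1.0, 201.0)] * 3
intervals = [(90.0, 130.0)] * 3
factors = (0.8, 0.8, 0.8)
print(is_window_normal(errors, intervals, factors, rounds=1))
print(is_window_normal(errors, intervals, factors, rounds=2))
```

With these numbers the window passes after one constraint round but fails after two, which illustrates why more rounds make the check stricter.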
In the approach of single-device analysis followed by aggregation, the logs of different devices are not analyzed jointly; the state of each device is judged independently and the results are aggregated afterwards, so the relationships among the logs of different devices cannot be mined. The SGSE algorithm provided by the invention processes the logs of different devices together and aggregates them into samples that reflect the state of the information system in the time window, so the overall state of the information system is judged comprehensively rather than by aggregating per-device results.
Correlation analysis methods aim at generating statistical reports of various events; they cannot mine the relationships among those events in depth or present them directly to the user. The SGSE-ECC algorithm analyzes the multi-source heterogeneous logs in depth: it does not merely generate statistical reports of events through clustering, but mines the relationships among the logs and presents the state of the information system to the user intuitively.
In general, the SGSE-ECC algorithm model performs data processing, sample generation and state analysis on multi-source heterogeneous logs, so the logs are analyzed automatically and operation and maintenance costs are reduced.
The SGSE algorithm provided by the invention processes the log data of each device and aggregates it into samples that reflect the state of the time window; the ECC algorithm is provided to analyze these multidimensional samples, and the detection precision and the missed-report rate are adjusted by constraining the variation coefficients.
Claims (4)
1. An SGSE-ECC-based log anomaly detection system, comprising:
the time window dividing module is used for determining the size of a time window according to the requirement of the information system on response time;
the SGSE data processing module is used for forming the log data into sample data for being called by an ECC log analysis algorithm according to a time window;
the ECC model training module is used for training an ECC log analysis model;
and the ECC log analysis module is used for analyzing whether the state of the information system under the current time window is normal or not according to the log data of each device in the information system.
2. The SGSE-ECC-based log anomaly detection system according to claim 1, wherein said SGSE data processing module comprises
The SG state generation submodule counts the log number of each device to generate a log number state subsequence; counting the number of each log type generated on each device in a time window to generate a user behavior state subsequence; counting the number of times of occurrence of types in some important fields of each device in a time window to generate a field state subsequence;
and the SE sequence extraction sub-module is used for sequentially extracting one feature in one subsequence as a label in the same time window, combining the feature with all the features of the other two subsequences into a sample, enabling any feature in any one subsequence to correspond to all the features in the other two subsequences, and randomly selecting one feature of each subsequence to form three labels to represent the time window to be detected when one time window is analyzed.
3. The SGSE-ECC-based log anomaly detection system of claim 1, wherein the ECC model training module comprises
The model training of the ECC model training submodule comprises the following steps:
step 1: counting the number of logs of multi-source heterogeneous log data in a normal time window and an abnormal time window to generate a log number state subsequence, counting the number of types of each log generated on each device in the time window to generate a user behavior state subsequence, and counting the number of times of types appearing in some important fields of each device in the time window to generate a field state subsequence;
step 2: generating (n + m + j) sample data sets from n features in the log quantity state subsequence, m features in the user behavior state subsequence and j features in the field state subsequence in each time window;
step 3: for each normal and abnormal time window, taking the sample data set whose label is a given feature of the log quantity state subsequence and computing its difference values pairwise against the sample data sets of the other normal and abnormal time windows according to the ECC expression:
$v_1 = 1 - v_2$ (3)
$f(\mathrm{table}_\alpha) = v_1 \cdot M'_{\mathrm{table}_\alpha} + v_2 \cdot M''_{\mathrm{table}_\alpha} + \mathrm{bias}$ (4)
wherein
$\mathrm{table}_\alpha$ is the sample whose label is a given feature of the log quantity state subsequence,
$\mathrm{table}_\alpha'$ is the sample whose label is a given feature of the log quantity state subsequence in a normal time window,
$\mathrm{table}_\alpha''$ is the sample whose label is a given feature of the log quantity state subsequence in an abnormal time window,
$M'_{\mathrm{table}_\alpha}$ is the mean square error between $\mathrm{table}_\alpha$ and $\mathrm{table}_\alpha'$,
$M''_{\mathrm{table}_\alpha}$ is the mean square error between $\mathrm{table}_\alpha$ and $\mathrm{table}_\alpha''$,
bias is the bias term,
$v_1$ and $v_2$ are respectively the variation coefficients of the mean square errors of the normal time window and the abnormal time window, and $v_1$ is equal to $v_2$ during training,
$f(\mathrm{table}_\alpha)$ is the difference value calculated for a training time window with a given feature of the log quantity state subsequence as the label;
step 4: calculating, through formulas (1) to (4), the difference values between each normal time window and the other normal and abnormal time windows and storing them as the set $U_1$; calculating the difference values between each abnormal time window and the other normal and abnormal time windows and storing them as the set $P_1$; and obtaining the confidence interval $\sigma(\alpha)$ of the log quantity state subsequence under a normal time window as
$\sigma(\alpha) = [\min(U_1), \max(U_1)] \cap [\min(P_1), \max(P_1)]$
step 5: for each normal and abnormal time window, taking the sample data set whose label is a given feature of the user behavior state subsequence and computing its difference values pairwise against the sample data sets of the other normal and abnormal time windows according to the ECC expression:
$v_3 = 1 - v_4$ (7)
$f(\mathrm{table}_\beta) = v_3 \cdot M'_{\mathrm{table}_\beta} + v_4 \cdot M''_{\mathrm{table}_\beta} + \mathrm{bias}$ (8)
wherein
$\mathrm{table}_\beta$ is the sample whose label is a given feature of the user behavior state subsequence,
$\mathrm{table}_\beta'$ is the sample whose label is a given feature of the user behavior state subsequence in a normal time window,
$\mathrm{table}_\beta''$ is the sample whose label is a given feature of the user behavior state subsequence in an abnormal time window,
$M'_{\mathrm{table}_\beta}$ is the mean square error between $\mathrm{table}_\beta$ and $\mathrm{table}_\beta'$,
$M''_{\mathrm{table}_\beta}$ is the mean square error between $\mathrm{table}_\beta$ and $\mathrm{table}_\beta''$,
bias is the bias term,
$v_3$ and $v_4$ are respectively the variation coefficients of the mean square errors of the normal time window and the abnormal time window, and $v_3$ is equal to $v_4$ during training,
$f(\mathrm{table}_\beta)$ is the difference value calculated for a training time window with a given feature of the user behavior state subsequence as the label;
step 6: calculating, through formulas (5) to (8), the difference values between each normal time window and the other normal and abnormal time windows and storing them as the set $U_2$; calculating the difference values between each abnormal time window and the other normal and abnormal time windows and storing them as the set $P_2$; and obtaining the confidence interval $\sigma(\beta)$ of the user behavior state subsequence under a normal time window as
$\sigma(\beta) = [\min(U_2), \max(U_2)] \cap [\min(P_2), \max(P_2)]$
step 7: for each normal and abnormal time window, taking the sample data set whose label is a given feature of the field state subsequence and computing its difference values pairwise against the sample data sets of the other normal and abnormal time windows according to the ECC expression:
$v_5 = 1 - v_6$ (11)
$f(\mathrm{table}_\gamma) = v_5 \cdot M'_{\mathrm{table}_\gamma} + v_6 \cdot M''_{\mathrm{table}_\gamma} + \mathrm{bias}$ (12)
wherein
$\mathrm{table}_\gamma$ is the sample whose label is a given feature of the field state subsequence,
$\mathrm{table}_\gamma'$ is the sample whose label is a given feature of the field state subsequence in a normal time window,
$\mathrm{table}_\gamma''$ is the sample whose label is a given feature of the field state subsequence in an abnormal time window,
$M'_{\mathrm{table}_\gamma}$ is the mean square error between $\mathrm{table}_\gamma$ and $\mathrm{table}_\gamma'$,
$M''_{\mathrm{table}_\gamma}$ is the mean square error between $\mathrm{table}_\gamma$ and $\mathrm{table}_\gamma''$,
bias is the bias term,
$v_5$ and $v_6$ are respectively the variation coefficients of the mean square errors of the normal time window and the abnormal time window, and $v_5$ is equal to $v_6$ during training,
$f(\mathrm{table}_\gamma)$ is the difference value calculated for a training time window with a given feature of the field state subsequence as the label;
step 8: calculating, through formulas (9) to (12), the difference values between each normal time window and the other normal and abnormal time windows and storing them as the set $U_3$; calculating the difference values between each abnormal time window and the other normal and abnormal time windows and storing them as the set $P_3$; and obtaining the confidence interval $\sigma(\gamma)$ of the field state subsequence under a normal time window as
$\sigma(\gamma) = [\min(U_3), \max(U_3)] \cap [\min(P_3), \max(P_3)]$.
4. The SGSE-ECC-based log anomaly detection system according to claim 1, wherein the log analysis module performs a model analysis by:
step 1: when a detected time window is analyzed, randomly selecting one feature of each subsequence to form three labels and forming three samples to represent the detected time window;
step 2: setting the initial values $v_1 = v_2$, $v_3 = v_4$ and $v_5 = v_6$, and calculating through formulas (1) to (12) whether the difference values $f(\mathrm{table}_\alpha)$, $f(\mathrm{table}_\beta)$ and $f(\mathrm{table}_\gamma)$ of the three samples in the detected time window are within the corresponding confidence intervals $\sigma(\alpha)$, $\sigma(\beta)$ and $\sigma(\gamma)$;
step 3: if all three samples are within their confidence intervals, constraining $v_1$, $v_2$, $v_3$, $v_4$, $v_5$ and $v_6$ with the following formulas:
$v_1 = a_1 \cdot v_1 \quad (0 \le a_1 < 1)$
$v_3 = a_2 \cdot v_3 \quad (0 \le a_2 < 1)$
$v_5 = a_3 \cdot v_5 \quad (0 \le a_3 < 1)$
wherein $v_1$, $v_3$ and $v_5$ are respectively the variation coefficients of the mean square error of the normal time window; reducing $v_1$, $v_3$ and $v_5$ reduces the influence of the normal time window on the difference value and increases the influence of the abnormal time window; with the new $v_1$, $v_2$, $v_3$, $v_4$, $v_5$ and $v_6$ and the three samples, whether the difference values of the three samples in the detected time window are within the corresponding confidence intervals $\sigma(\alpha)$, $\sigma(\beta)$ and $\sigma(\gamma)$ is calculated again through formulas (1) to (12), and the number of repeated constraints is determined according to the missed-report rate requirement;
step 4: if the difference values of the three samples in the detected time window are still within the confidence intervals after the repeated constraints are finished, the information system is considered normal in the detected time window; otherwise it is considered abnormal.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010911782.XA CN111984516B (en) | 2020-09-02 | 2020-09-02 | Log anomaly detection system based on SGSE-ECC |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010911782.XA CN111984516B (en) | 2020-09-02 | 2020-09-02 | Log anomaly detection system based on SGSE-ECC |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111984516A true CN111984516A (en) | 2020-11-24 |
CN111984516B CN111984516B (en) | 2024-01-05 |
Family
ID=73448654
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010911782.XA Active CN111984516B (en) | 2020-09-02 | 2020-09-02 | Log anomaly detection system based on SGSE-ECC |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111984516B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2011065440A (en) * | 2009-09-17 | 2011-03-31 | Mitsubishi Denki Information Technology Corp | Log data analysis device and log data analysis method of the same, and log data analysis program |
CN103546312A (en) * | 2013-08-27 | 2014-01-29 | 中国航天科工集团第二研究院七〇六所 | Massive multi-source isomerism log correlation analyzing method |
CN105653427A (en) * | 2016-03-04 | 2016-06-08 | 上海交通大学 | Log monitoring method based on abnormal behavior detection |
CN109714187A (en) * | 2018-08-17 | 2019-05-03 | 平安普惠企业管理有限公司 | Log analysis method, device, equipment and storage medium based on machine learning |
KR20190090126A (en) * | 2018-01-24 | 2019-08-01 | 주식회사 오픈시스넷 | Vritualization error checking method of target system |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2011065440A (en) * | 2009-09-17 | 2011-03-31 | Mitsubishi Denki Information Technology Corp | Log data analysis device and log data analysis method of the same, and log data analysis program |
CN103546312A (en) * | 2013-08-27 | 2014-01-29 | 中国航天科工集团第二研究院七〇六所 | Massive multi-source isomerism log correlation analyzing method |
CN105653427A (en) * | 2016-03-04 | 2016-06-08 | 上海交通大学 | Log monitoring method based on abnormal behavior detection |
KR20190090126A (en) * | 2018-01-24 | 2019-08-01 | 주식회사 오픈시스넷 | Vritualization error checking method of target system |
CN109714187A (en) * | 2018-08-17 | 2019-05-03 | 平安普惠企业管理有限公司 | Log analysis method, device, equipment and storage medium based on machine learning |
Non-Patent Citations (1)
Title |
---|
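Ren Ming; Song Yunkui: "Anomaly detection method for cloud computing systems based on deep learning" (基于深度学习的云计算系统异常检测方法), 计算机技术与发展 (Computer Technology and Development), no. 05 * |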
Also Published As
Publication number | Publication date |
---|---|
CN111984516B (en) | 2024-01-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20050283337A1 (en) | System and method for correlation of time-series data | |
WO2020038353A1 (en) | Abnormal behavior detection method and system | |
CN114024837B (en) | Fault root cause positioning method of micro-service system | |
US8078913B2 (en) | Automated identification of performance crisis | |
CN109934268B (en) | Abnormal transaction detection method and system | |
US20170140309A1 (en) | Database analysis device and database analysis method | |
US20180349801A1 (en) | Log analysis system, log analysis method, and log analysis program | |
Sullivan et al. | Step-down analysis for changes in the covariance matrix and other parameters | |
US20160255109A1 (en) | Detection method and apparatus | |
CN104216349A (en) | Yield analysis system and method using sensor data of fabrication equipment | |
Khan et al. | Guidelines for assessing the accuracy of log message template identification techniques | |
CN108491991B (en) | Constraint condition analysis system and method based on industrial big data product construction period | |
CN111796957A (en) | Transaction abnormal root cause analysis method and system based on application log | |
CN113535454A (en) | Method and device for detecting log data abnormity | |
CN115576738A (en) | Method and system for realizing equipment fault determination based on chip analysis | |
CA3186873A1 (en) | Activity level measurement using deep learning and machine learning | |
CN114757468A (en) | Root cause analysis method for flow execution abnormity in flow mining | |
CN111984515B (en) | Multi-source heterogeneous log analysis method | |
CN109145764B (en) | Method and device for identifying unaligned sections of multiple groups of detection waveforms of comprehensive detection vehicle | |
CN114611604A (en) | User screening method based on electric drive assembly load characteristic fusion and clustering | |
CN112215307B (en) | Method for automatically detecting signal abnormality of earthquake instrument by machine learning | |
WO2022047659A1 (en) | Multi-source heterogeneous log analysis method | |
CN110852860A (en) | Vehicle maintenance reimbursement behavior abnormity detection method, equipment and storage medium | |
CN111984516B (en) | Log anomaly detection system based on SGSE-ECC | |
CN114756420A (en) | Fault prediction method and related device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||