CN111984515B - Multi-source heterogeneous log analysis method - Google Patents


Info

Publication number
CN111984515B
Authority
CN
China
Prior art keywords
time window
feature
log
normal
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010911771.1A
Other languages
Chinese (zh)
Other versions
CN111984515A (en)
Inventor
汪祖民
田纪宇
秦静
季长清
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University
Original Assignee
Dalian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian University
Priority to CN202010911771.1A
Publication of CN111984515A
Application granted
Publication of CN111984515B
Legal status: Active
Anticipated expiration


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/34Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3452Performance evaluation by statistical analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/34Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3466Performance evaluation by tracing or monitoring
    • G06F11/3476Data logging

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Debugging And Monitoring (AREA)

Abstract

A multi-source heterogeneous log analysis method, belonging to the field of log data processing and intended to solve the problem of log analysis, comprises the following steps: step 1: determining the size of the time window according to the response time required of the information system; step 2: using the SGSE algorithm to process the log data in each time window into samples that can be consumed by the ECC log analysis algorithm; step 3: using the ECC log analysis model for training and for analyzing whether a time window is normal; step 4: presenting the log analysis result. The effect is that anomaly analysis can be performed on the logs.

Description

Multi-source heterogeneous log analysis method
Technical Field
The invention belongs to the field of log data processing, and relates to a multi-source heterogeneous log analysis method and system.
Background
With the development of internet technology, the number of logs generated by each device in an information system keeps increasing, and analyzing logs that come from different devices and have different data characteristics is an important part of operation and maintenance work. Analyzing multi-source heterogeneous log data by automated means makes it possible to learn in time whether the running state of the information system is normal or abnormal, which helps ensure the safe and stable operation of the information system and reduces the operation and maintenance cost of enterprises.
Current techniques for multi-source heterogeneous log analysis include individual analysis followed by aggregation [1] and association analysis [2]. In the individual analysis and aggregation method, the logs generated by a single device in the information system are analyzed to determine the running state of that device, and whether the whole information system is abnormal is then judged from the states of the devices according to preset rules. In the association analysis method, characteristic events are generated from the content of each field in the log; the events generated by different devices within a time window are clustered and compared for similarity, and similar events are removed. Similar events from different devices are then merged, and a statistical report of the various events is finally generated.
Both methods have limitations. In the individual analysis and aggregation method, the logs of different devices are not analyzed jointly; the states of the devices are judged independently before being combined, so the relationships between the logs of different devices cannot be mined. The association analysis method aims at generating a statistical report of events, so the relationships between events cannot be deeply mined and presented directly to the user, and events cannot be classified accurately by a clustering algorithm alone.
Disclosure of Invention
In order to solve the problem of log analysis, the invention provides the following technical scheme: a multi-source heterogeneous log analysis method comprising the following steps:
Step 1: determining the size of the time window according to the response time required of the information system;
Step 2: using the SGSE algorithm to process the log data in each time window into samples that can be consumed by the ECC log analysis algorithm;
Step 3: using the ECC log analysis model for training and for analyzing whether a time window is normal;
Step 4: presenting the log analysis result.
Further, the model is trained as follows:
Step 1: count the logs of the multi-source heterogeneous log data in the normal and abnormal time windows to generate a log quantity state subsequence; count the number of logs of each log category generated on each device in the time window to generate a user behavior state subsequence; and count the number of occurrences of each type in selected important fields of each device in the time window to generate a field state subsequence;
Step 2: for each time window, generate (n+m+j) sample data sets from the n features of the log quantity state subsequence, the m features of the user behavior state subsequence and the j features of the field state subsequence;
Step 3: for each normal and abnormal time window, take the sample data set whose label is a feature of the log quantity state subsequence and compute, pairwise, its difference value against the sample data sets of the other normal and abnormal time windows according to the ECC expression:
$v_1 = 1 - v_2 \quad (3)$
$f(table_{\alpha}) = v_1 M'_{table_{\alpha}} + v_2 M''_{table_{\alpha}} + bias \quad (4)$
where
$table_{\alpha}$ is the sample obtained when a feature of the log quantity state subsequence is used as the label,
$table_{\alpha}'$ is the sample obtained when a feature of the log quantity state subsequence in a normal time window is used as the label,
$table_{\alpha}''$ is the sample obtained when a feature of the log quantity state subsequence in an abnormal time window is used as the label,
$M'_{table_{\alpha}}$ is the mean square error between $table_{\alpha}$ and the corresponding sample $table_{\alpha}'$ of the representative normal time window,
$M''_{table_{\alpha}}$ is the mean square error between $table_{\alpha}$ and the corresponding sample $table_{\alpha}''$ of the representative abnormal time window,
$bias$ is the bias term,
$v_1$ and $v_2$ are the variation coefficients of the mean square errors for the normal and abnormal time windows respectively, with $v_1 = v_2$ during training,
$f(table_{\alpha})$ is the difference value computed for a training time window when a feature of the log quantity state subsequence is used as the label;
Step 4: compute the difference values of each normal time window against the other normal and abnormal time windows by formulas (1)-(4) and store them as the set $U_1$; compute the difference values of each abnormal time window against the normal and abnormal time windows by formulas (1)-(4) and store them as the set $P_1$; the confidence interval $\sigma(\alpha)$ of the log quantity state subsequence under normal time windows is then $\sigma(\alpha) = [\min(U_1), \max(U_1)] \cap [\min(P_1), \max(P_1)]$;
Step 5: for each normal and abnormal time window, take the sample data set whose label is a feature of the user behavior state subsequence and compute, pairwise, its difference value against the sample data sets of the other normal and abnormal time windows according to the ECC expression:
$v_3 = 1 - v_4 \quad (7)$
$f(table_{\beta}) = v_3 M'_{table_{\beta}} + v_4 M''_{table_{\beta}} + bias \quad (8)$
where
$table_{\beta}$ is the sample obtained when a feature of the user behavior state subsequence is used as the label,
$table_{\beta}'$ is the sample obtained when a feature of the user behavior state subsequence in a normal time window is used as the label,
$table_{\beta}''$ is the sample obtained when a feature of the user behavior state subsequence in an abnormal time window is used as the label,
$M'_{table_{\beta}}$ is the mean square error between $table_{\beta}$ and the corresponding sample $table_{\beta}'$ of the representative normal time window,
$M''_{table_{\beta}}$ is the mean square error between $table_{\beta}$ and the corresponding sample $table_{\beta}''$ of the representative abnormal time window,
$bias$ is the bias term,
$v_3$ and $v_4$ are the variation coefficients of the mean square errors for the normal and abnormal time windows respectively, with $v_3 = v_4$ during training,
$f(table_{\beta})$ is the difference value computed for a training time window when a feature of the user behavior state subsequence is used as the label;
Step 6: compute the difference values of each normal time window against the normal and abnormal time windows by formulas (5)-(8) and store them as the set $U_2$; compute the difference values of each abnormal time window against the normal time windows by formulas (5)-(8) and store them as the set $P_2$; the confidence interval $\sigma(\beta)$ of the user behavior state subsequence under normal time windows is then
$\sigma(\beta) = [\min(U_2), \max(U_2)] \cap [\min(P_2), \max(P_2)]$;
Step 7: for each normal and abnormal time window, take the sample data set whose label is a feature of the field state subsequence and compute, pairwise, its difference value against the sample data sets of the other normal and abnormal time windows according to the ECC expression:
$v_5 = 1 - v_6 \quad (11)$
$f(table_{\gamma}) = v_5 M'_{table_{\gamma}} + v_6 M''_{table_{\gamma}} + bias \quad (12)$
where
$table_{\gamma}$ is the sample obtained when a feature of the field state subsequence is used as the label,
$table_{\gamma}'$ is the sample obtained when a feature of the field state subsequence in a normal time window is used as the label,
$table_{\gamma}''$ is the sample obtained when a feature of the field state subsequence in an abnormal time window is used as the label,
$M'_{table_{\gamma}}$ is the mean square error between $table_{\gamma}$ and the corresponding sample $table_{\gamma}'$ of the representative normal time window,
$M''_{table_{\gamma}}$ is the mean square error between $table_{\gamma}$ and the corresponding sample $table_{\gamma}''$ of the representative abnormal time window,
$bias$ is the bias term,
$v_5$ and $v_6$ are the variation coefficients of the mean square errors for the normal and abnormal time windows respectively, with $v_5 = v_6$ during training,
$f(table_{\gamma})$ is the difference value computed for a training time window when a feature of the field state subsequence is used as the label;
Step 8: compute the difference values of each normal time window against the normal time windows by formulas (9)-(12) and store them as the set $U_3$; compute the difference values of each abnormal time window against the normal time windows by formulas (9)-(12) and store them as the set $P_3$; the confidence interval $\sigma(\gamma)$ of the field state subsequence under normal time windows is then $\sigma(\gamma) = [\min(U_3), \max(U_3)] \cap [\min(P_3), \max(P_3)]$.
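The ECC difference value and the confidence interval described in the training steps can be illustrated with a minimal Python sketch. It rests on several assumptions that are not in the source: each sample is a flat numeric vector, the mean square error is the ordinary element-wise MSE (formulas (1)-(2) are not reproduced in this text), and the names `mse`, `difference_value` and `confidence_interval` are hypothetical.

```python
def mse(a, b):
    """Ordinary mean square error between two equal-length numeric vectors (assumed form)."""
    if len(a) != len(b):
        raise ValueError("samples must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def difference_value(sample, normal_ref, abnormal_ref, v_normal=0.5, bias=0.0):
    """Difference value f = v1 * M' + v2 * M'' + bias, following formulas (3)-(4)."""
    v_abnormal = 1.0 - v_normal  # formula (3): v1 = 1 - v2
    m_normal = mse(sample, normal_ref)      # M': error against a normal-window sample
    m_abnormal = mse(sample, abnormal_ref)  # M'': error against an abnormal-window sample
    return v_normal * m_normal + v_abnormal * m_abnormal + bias


def confidence_interval(u_values, p_values):
    """Intersection [min(U), max(U)] ∩ [min(P), max(P)], or None if it is empty."""
    low = max(min(u_values), min(p_values))
    high = min(max(u_values), max(p_values))
    return (low, high) if low <= high else None
```

With the default `v_normal=0.5` the two errors are weighted equally, which matches the training condition v1 = v2. The same functions would apply unchanged to the user behavior and field state subsequences.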
Further, the log analysis method comprises the following steps:
Step 1: when analyzing a time window under test, randomly select one feature from each subsequence to form three labels, and form three samples to represent the time window under test;
Step 2: set the initial values $v_1 = v_2$, $v_3 = v_4$, $v_5 = v_6$, and determine by formulas (1)-(12) whether the difference values $f(table_{\alpha})$, $f(table_{\beta})$, $f(table_{\gamma})$ of the three samples of the time window under test lie within the corresponding confidence intervals $\sigma(\alpha)$, $\sigma(\beta)$, $\sigma(\gamma)$;
Step 3: if all three difference values lie within their confidence intervals, constrain $v_1$, $v_2$, $v_3$, $v_4$, $v_5$ and $v_6$ by
$v_1 = a_1 v_1 \quad (0 \le a_1 < 1)$
$v_3 = a_2 v_3 \quad (0 \le a_2 < 1)$
$v_5 = a_3 v_5 \quad (0 \le a_3 < 1)$
Here $v_1$, $v_3$ and $v_5$ are the variation coefficients of the mean square errors for the normal time window. Reducing them reduces the influence of the normal time window on the difference value and increases the influence of the abnormal time window. With the new $v_1$ to $v_6$ and the three samples, formulas (1)-(12) are applied again to check whether the difference values of the three samples of the time window under test lie within the corresponding confidence intervals $\sigma(\alpha)$, $\sigma(\beta)$ and $\sigma(\gamma)$. The number of times the constraint is repeated is determined by the required missed-detection (false negative) rate;
Step 4: if, after the repeated constraints are finished, the difference values of the three samples of the time window under test are still within the confidence intervals, the information system is considered normal in that time window; otherwise it is considered abnormal.
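The analysis steps above can be sketched as a short loop. This is only an illustration under stated assumptions: the mean square errors M' and M'' of each of the three samples are taken as already computed, a window that fails the initial check is treated as abnormal, and the function name `is_window_normal` and its parameters `a`, `rounds` and `bias` are hypothetical.

```python
def is_window_normal(mse_pairs, intervals, a=(0.5, 0.5, 0.5), rounds=3, bias=0.0):
    """Return True if all three difference values stay inside their intervals.

    mse_pairs: three (M', M'') pairs, one per subsequence (alpha, beta, gamma).
    intervals: three (low, high) confidence intervals sigma(alpha/beta/gamma).
    a:         constraint factors a1, a2, a3, each in [0, 1).
    rounds:    number of repeated constraints (chosen from the miss-rate requirement).
    """
    v_normal = [0.5, 0.5, 0.5]  # initial v1 = v2, v3 = v4, v5 = v6
    for _ in range(rounds + 1):  # one initial check plus `rounds` constrained checks
        for (m_n, m_a), (low, high), v in zip(mse_pairs, intervals, v_normal):
            f = v * m_n + (1.0 - v) * m_a + bias  # formulas (4), (8), (12)
            if not (low <= f <= high):
                return False  # outside a confidence interval: abnormal
        # Constrain v1, v3, v5; v2, v4, v6 follow from v_even = 1 - v_odd.
        v_normal = [a_i * v for a_i, v in zip(a, v_normal)]
    return True
```

For example, a sample whose error against the abnormal reference is large passes the initial check with equal weights but leaves the interval once the weight of the normal window is reduced, which is the behavior the repeated constraint is meant to expose.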
Beneficial effects: the invention provides an SGSE algorithm that processes multi-source heterogeneous logs into multi-dimensional samples which represent the state of the information system and can be analyzed by the algorithm. The invention also provides an ECC algorithm for multi-source heterogeneous logs, which analyzes the multi-dimensional samples and determines the running state of the information system in a time window from the multi-source heterogeneous logs within that time window.
Drawings
Fig. 1: Flow chart of the multi-source heterogeneous log analysis.
Fig. 2: Sample generation diagram for the state of the time window under test.
Detailed Description
The invention provides an SGSE (State Generation Sequential Extraction) algorithm for processing the multi-source heterogeneous log data in an information system and, in view of the characteristics of the multi-source heterogeneous logs of the information system, a new ECC (Error Coefficient Constraint) algorithm for judging the running state of the information system. Referring to Fig. 1, the multi-source heterogeneous log analysis method of the invention includes the following steps:
Step 1: the size of the time window is determined according to the response time required of the information system.
Step 2: the SGSE algorithm is used to process the log data in each time window into samples.
Step 3: the ECC log analysis model is trained and then used to analyze the time window to be analyzed.
Step 4: the log analysis result is presented.
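As an illustration of Step 1 and the start of Step 2, the following minimal Python sketch groups log records into fixed-size time windows. The record layout `(timestamp, device, fields)` and the function name `bucket_by_window` are assumptions made for this example and do not come from the source.

```python
from collections import defaultdict
from datetime import datetime, timezone


def bucket_by_window(records, window_seconds):
    """Group (timestamp, device, fields) records by fixed-size time window index."""
    windows = defaultdict(list)
    for ts, device, fields in records:
        # Integer window index: all records in the same window share it.
        index = int(ts.timestamp()) // window_seconds
        windows[index].append((device, fields))
    return dict(windows)


# Hypothetical records from two devices, bucketed into 60-second windows.
records = [
    (datetime(2020, 9, 2, 10, 0, 5, tzinfo=timezone.utc), "WAF", {"action": "alert"}),
    (datetime(2020, 9, 2, 10, 0, 50, tzinfo=timezone.utc), "firewall", {"protocol": "TCP"}),
    (datetime(2020, 9, 2, 10, 1, 10, tzinfo=timezone.utc), "WAF", {"action": "block"}),
]
windows = bucket_by_window(records, window_seconds=60)
```

The window size would be chosen from the response time the user requires; 60 seconds here is only a placeholder value.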
In a preferred embodiment, the ECC log analysis model is trained as follows:
Step 1: count the logs of the multi-source heterogeneous log data in the normal and abnormal time windows to generate a log quantity state subsequence; count the number of logs of each log category generated on each device in the time window to generate a user behavior state subsequence; and count the number of occurrences of each type in selected important fields of each device in the time window to generate a field state subsequence;
Step 2: for each time window, generate (n+m+j) sample data sets from the n features of the log quantity state subsequence, the m features of the user behavior state subsequence and the j features of the field state subsequence;
Step 3: for each normal and abnormal time window, take the sample data set whose label is a feature of the log quantity state subsequence and compute, pairwise, its difference value against the sample data sets of the other normal and abnormal time windows according to the ECC expression:
$v_1 = 1 - v_2 \quad (3)$
$f(table_{\alpha}) = v_1 M'_{table_{\alpha}} + v_2 M''_{table_{\alpha}} + bias \quad (4)$
where
$table_{\alpha}$ is the sample obtained when a feature of the log quantity state subsequence is used as the label,
$table_{\alpha}'$ is the sample obtained when a feature of the log quantity state subsequence in a normal time window is used as the label,
$table_{\alpha}''$ is the sample obtained when a feature of the log quantity state subsequence in an abnormal time window is used as the label,
$M'_{table_{\alpha}}$ is the mean square error between $table_{\alpha}$ and the corresponding sample $table_{\alpha}'$ of the representative normal time window,
$M''_{table_{\alpha}}$ is the mean square error between $table_{\alpha}$ and the corresponding sample $table_{\alpha}''$ of the representative abnormal time window,
$bias$ is the bias term,
$v_1$ and $v_2$ are the variation coefficients of the mean square errors for the normal and abnormal time windows respectively, with $v_1 = v_2$ during training,
$f(table_{\alpha})$ is the difference value computed for a training time window when a feature of the log quantity state subsequence is used as the label;
Step 4: compute the difference values of each normal time window against the other normal and abnormal time windows by formulas (1)-(4) and store them as the set $U_1$; compute the difference values of each abnormal time window against the normal time windows by formulas (1)-(4) and store them as the set $P_1$; the confidence interval $\sigma(\alpha)$ of the log quantity state subsequence under normal time windows is then $\sigma(\alpha) = [\min(U_1), \max(U_1)] \cap [\min(P_1), \max(P_1)]$;
Step 5: for each normal and abnormal time window, take the sample data set whose label is a feature of the user behavior state subsequence and compute, pairwise, its difference value against the sample data sets of the other normal and abnormal time windows according to the ECC expression:
$v_3 = 1 - v_4 \quad (7)$
$f(table_{\beta}) = v_3 M'_{table_{\beta}} + v_4 M''_{table_{\beta}} + bias \quad (8)$
where
$table_{\beta}$ is the sample obtained when a feature of the user behavior state subsequence is used as the label,
$table_{\beta}'$ is the sample obtained when a feature of the user behavior state subsequence in a normal time window is used as the label,
$table_{\beta}''$ is the sample obtained when a feature of the user behavior state subsequence in an abnormal time window is used as the label,
$M'_{table_{\beta}}$ is the mean square error between $table_{\beta}$ and the corresponding sample $table_{\beta}'$ of the representative normal time window,
$M''_{table_{\beta}}$ is the mean square error between $table_{\beta}$ and the corresponding sample $table_{\beta}''$ of the representative abnormal time window,
$bias$ is the bias term,
$v_3$ and $v_4$ are the variation coefficients of the mean square errors for the normal and abnormal time windows respectively, with $v_3 = v_4$ during training,
$f(table_{\beta})$ is the difference value computed for a training time window when a feature of the user behavior state subsequence is used as the label;
Step 6: compute the difference values of each normal time window against the normal and abnormal time windows by formulas (5)-(8) and store them as the set $U_2$; compute the difference values of each abnormal time window against the normal time windows by formulas (5)-(8) and store them as the set $P_2$; the confidence interval $\sigma(\beta)$ of the user behavior state subsequence under normal time windows is then
$\sigma(\beta) = [\min(U_2), \max(U_2)] \cap [\min(P_2), \max(P_2)]$;
Step 7: for each normal and abnormal time window, take the sample data set whose label is a feature of the field state subsequence and compute, pairwise, its difference value against the sample data sets of the other normal and abnormal time windows according to the ECC expression:
$v_5 = 1 - v_6 \quad (11)$
$f(table_{\gamma}) = v_5 M'_{table_{\gamma}} + v_6 M''_{table_{\gamma}} + bias \quad (12)$
where
$table_{\gamma}$ is the sample obtained when a feature of the field state subsequence is used as the label,
$table_{\gamma}'$ is the sample obtained when a feature of the field state subsequence in a normal time window is used as the label,
$table_{\gamma}''$ is the sample obtained when a feature of the field state subsequence in an abnormal time window is used as the label,
$M'_{table_{\gamma}}$ is the mean square error between $table_{\gamma}$ and the corresponding sample $table_{\gamma}'$ of the representative normal time window,
$M''_{table_{\gamma}}$ is the mean square error between $table_{\gamma}$ and the corresponding sample $table_{\gamma}''$ of the representative abnormal time window,
$bias$ is the bias term,
$v_5$ and $v_6$ are the variation coefficients of the mean square errors for the normal and abnormal time windows respectively, with $v_5 = v_6$ during training,
$f(table_{\gamma})$ is the difference value computed for a training time window when a feature of the field state subsequence is used as the label;
Step 8: compute the difference values of each normal time window against the normal time windows by formulas (9)-(12) and store them as the set $U_3$; compute the difference values of each abnormal time window against the normal time windows by formulas (9)-(12) and store them as the set $P_3$; the confidence interval $\sigma(\gamma)$ of the field state subsequence under normal time windows is then $\sigma(\gamma) = [\min(U_3), \max(U_3)] \cap [\min(P_3), \max(P_3)]$.
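As a short worked consequence of the training condition, combining formula (3) with $v_1 = v_2$ fixes both coefficients, and the same holds for the pairs $(v_3, v_4)$ and $(v_5, v_6)$. The derivation below uses only the symbols defined in the steps above:

```latex
\begin{aligned}
v_1 = 1 - v_2,\quad v_1 = v_2 \;&\Longrightarrow\; v_1 = v_2 = \tfrac{1}{2},\\
f(table_{\alpha}) &= \tfrac{1}{2}\left(M'_{table_{\alpha}} + M''_{table_{\alpha}}\right) + bias .
\end{aligned}
```

During training the difference value is therefore the average of the two mean square errors plus the bias term.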
In a preferred embodiment, the log analysis method comprises the following steps:
Step 1: when analyzing a time window under test, randomly select one feature from each subsequence to form three labels, and form three samples to represent the time window under test;
Step 2: set the initial values $v_1 = v_2$, $v_3 = v_4$, $v_5 = v_6$, and determine by formulas (1)-(12) whether the difference values $f(table_{\alpha})$, $f(table_{\beta})$, $f(table_{\gamma})$ of the three samples of the time window under test lie within the corresponding confidence intervals $\sigma(\alpha)$, $\sigma(\beta)$, $\sigma(\gamma)$;
Step 3: if all three difference values lie within their confidence intervals, constrain $v_1$, $v_2$, $v_3$, $v_4$, $v_5$ and $v_6$ by
$v_1 = a_1 v_1 \quad (0 \le a_1 < 1)$
$v_3 = a_2 v_3 \quad (0 \le a_2 < 1)$
$v_5 = a_3 v_5 \quad (0 \le a_3 < 1)$
Here $v_1$, $v_3$ and $v_5$ are the variation coefficients of the mean square errors for the normal time window. Reducing them reduces the influence of the normal time window on the difference value and increases the influence of the abnormal time window. With the new $v_1$ to $v_6$ and the three samples, formulas (1)-(12) are applied again to check whether the difference values of the three samples of the time window under test lie within the corresponding confidence intervals $\sigma(\alpha)$, $\sigma(\beta)$ and $\sigma(\gamma)$. The number of times the constraint is repeated is determined by the required missed-detection (false negative) rate;
Step 4: if, after the repeated constraints are finished, the difference values of the three samples of the time window under test are still within the confidence intervals, the information system is considered normal in that time window; otherwise it is considered abnormal.
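The effect of repeating the constraint can be written out explicitly. The following worked form assumes that the same factor $a_1$ is applied in every round and that $v_2$ is recomputed from formula (3) after each constraint; the superscript $(k)$, denoting the value after $k$ constraints, is notation introduced here for illustration.

```latex
\begin{aligned}
v_1^{(k)} &= a_1^{k}\, v_1^{(0)} = \tfrac{1}{2}\, a_1^{k}, \\
v_2^{(k)} &= 1 - v_1^{(k)}, \\
f^{(k)}(table_{\alpha}) &= v_1^{(k)} M'_{table_{\alpha}} + v_2^{(k)} M''_{table_{\alpha}} + bias .
\end{aligned}
```

For instance, with $a_1 = 0.5$ and $k = 2$, $v_1^{(2)} = 0.125$ and $v_2^{(2)} = 0.875$, so the error against the abnormal time window dominates the difference value.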
The invention also provides a log anomaly detection system, which comprises:
a time window dividing module, used to determine the size of the time window according to the response time required of the information system;
an SGSE data processing module, used to process the log data, by time window, into sample data that can be consumed by the ECC log analysis model;
an ECC model training module, used to train the ECC log analysis model;
an ECC log analysis module, used to analyze the time window under test with the ECC log analysis model and judge whether it is normal, that is, to determine from the logs of all devices in the information system whether the state of the information system in that time window is normal.
In one scheme, the time window dividing module determines the size of the time window according to the response time the user requires of the information system, keeping the window as short as possible.
In one scheme, the SGSE data processing module is divided into an SG state generation sub-module and an SE sequence extraction sub-module.
The SG state generation sub-module identifies each device in the information system, such as the WAF, the load balancer and the firewall. It counts the logs of each device to generate the log quantity state subsequence, which contains features such as the WAF log count and the load balancer log count.
Table 1: Log quantity state subsequence
Time window / device type | WAF log number | Load balancing log number | Firewall log number | ...... | Nginx log number
Time window N | α1 | α2 | α3 | ...... | αn
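A minimal sketch of how the log quantity state subsequence of one time window might be computed is shown below. The input layout (a list of `(device, fields)` pairs) and the function name `log_quantity_subsequence` are assumptions made for this example.

```python
from collections import Counter


def log_quantity_subsequence(window_logs, devices):
    """Return [alpha_1, ..., alpha_n]: the log count per device, in a fixed device order."""
    counts = Counter(device for device, _fields in window_logs)
    # Devices that produced no logs in this window contribute a count of 0.
    return [counts.get(device, 0) for device in devices]


# Hypothetical window with two WAF logs and one firewall log.
window_logs = [("WAF", {}), ("WAF", {}), ("firewall", {})]
alpha = log_quantity_subsequence(window_logs, ["WAF", "load_balancer", "firewall"])
```

Keeping the device order fixed matters because the same position must refer to the same device in every time window.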
The number of logs of each log category generated on each device in the time window is counted to obtain the user behavior state subsequence. The categories are determined by combining the types that occur in every field whose entropy is not 0. For example, a WAF device has an action field that records how the WAF handled an access, with the types alert and block, and an http_method field that records the http state, with the four types 200, 404, 500 and 501; from these two fields the WAF device can generate 2 × 4 = 8 different categories of logs. The subsequence contains features such as the number of logs in each WAF log category and in each firewall log category.
Table 2: user behavior state subsequence
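The category construction described above can be sketched as follows. The sketch assumes that the entropy of a field is the Shannon entropy of its values over the logs passed in, and that each log is a dictionary of field values; the function names and the sample field `vendor` are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (in bits) of the observed values; 0 when all values are equal."""
    total = len(values)
    counts = Counter(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def user_behavior_subsequence(device_logs, field_names):
    """Count logs per category, where a category combines all non-zero-entropy fields."""
    informative = [
        name for name in field_names
        if shannon_entropy([log[name] for log in device_logs]) > 0
    ]
    categories = Counter(tuple(log[name] for name in informative) for log in device_logs)
    return informative, categories


# Hypothetical WAF logs: `vendor` never varies, so it is excluded from the categories.
waf_logs = [
    {"action": "alert", "http_method": "200", "vendor": "x"},
    {"action": "block", "http_method": "404", "vendor": "x"},
    {"action": "alert", "http_method": "200", "vendor": "x"},
]
informative, categories = user_behavior_subsequence(
    waf_logs, ["action", "http_method", "vendor"]
)
```

Each entry of `categories` would become one feature of the user behavior state subsequence for that device.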
The number of occurrences of each type in selected important fields of each device in the time window is counted to generate the field state subsequence. The field state subsequence contains features such as the number of alert entries in the WAF action field and the proportion of TCP entries in the firewall protocol field.
Table 3: field state subsequence
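A field state feature such as "number of alert entries in the WAF action field" or "proportion of TCP entries in the firewall protocol field" might be computed as in the sketch below. The dictionary-based log layout and the function name `field_state_feature` are assumptions for illustration.

```python
def field_state_feature(device_logs, field, value):
    """Return (count, ratio) of logs whose `field` equals `value` in one time window."""
    total = len(device_logs)
    hits = sum(1 for log in device_logs if log.get(field) == value)
    ratio = hits / total if total else 0.0  # avoid division by zero for empty windows
    return hits, ratio


# Hypothetical firewall logs in one time window.
firewall_logs = [{"protocol": "TCP"}, {"protocol": "UDP"}, {"protocol": "TCP"}, {"protocol": "TCP"}]
tcp_count, tcp_ratio = field_state_feature(firewall_logs, "protocol", "TCP")
```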
Within the same time window, the SE sequence extraction sub-module takes each feature of one subsequence in turn as the label and combines it with all features of the other two subsequences to form a sample, as shown in the following tables:
Table 4: Feature merge table with the WAF log number of the log quantity state subsequence as the label
Table 5: Feature merge table with WAF log category 1 of the user behavior state subsequence as the label
Table 6: Feature merge table with the WAF action field alert number of the field state subsequence as the label
Through the SE sequence extraction sub-module, every feature of any one subsequence is paired with all features of the other two subsequences, so that a correlation is established among the features of each generated sample. When a time window is analyzed, one feature is randomly selected from each subsequence to form the three labels that represent the time window under test.
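The extraction rule can be sketched in a few lines of Python. The sketch assumes each subsequence is a plain list of numbers; the function name `extract_samples` and the dictionary layout of a sample are hypothetical.

```python
def extract_samples(alpha, beta, gamma):
    """Build n + m + j samples: each feature is a label paired with the other two subsequences."""
    subsequences = [alpha, beta, gamma]
    samples = []
    for s, subsequence in enumerate(subsequences):
        # All features of the two subsequences that do not contain the label.
        others = [x for t, other in enumerate(subsequences) if t != s for x in other]
        for k, label in enumerate(subsequence):
            samples.append(
                {"subsequence": s, "index": k, "label": label, "features": list(others)}
            )
    return samples


# Hypothetical subsequences with n = 2, m = 3 and j = 1 features.
samples = extract_samples([3, 5], [1, 2, 4], [7])
```

Here `samples[0]` uses the first log quantity feature as its label and carries all user behavior and field state features, mirroring the layout that Table 4 describes.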
In one scheme, the ECC model training module is divided into a sub-module for determining a small number of normal and abnormal time windows and an ECC model training sub-module.
The sub-module for determining a small number of normal and abnormal time windows: if training windows are classified inaccurately, the error is amplified when the analysis model is trained and distorts the analysis result. Therefore, from the existing historical data of the information system, a small number of time windows that can be accurately identified as normal or abnormal on the basis of professional knowledge are selected.
The ECC model training sub-module performs the following steps:
Step 1: the multi-source heterogeneous log data in the normal and abnormal time windows is processed by the SG state generation sub-module: the number of logs is counted to generate the log quantity state subsequence, the number of logs of each category generated on each device in the time window is counted to generate the user behavior state subsequence, and the number of occurrences of each type in certain important fields of each device in the time window is counted to generate the field state subsequence.
Step 2: for each time window, the SE sequence extraction sub-module generates (n+m+j) sample data sets from the n features of the log quantity state subsequence, the m features of the user behavior state subsequence and the j features of the field state subsequence;
Step 3: for each normal and abnormal time window, the sample data set that takes a feature of the log quantity state subsequence as its label is compared pairwise with the sample data sets of the other normal and abnormal time windows, and the difference value is calculated according to the ECC expression:
$v_1 = 1 - v_2$ (3)
$f(table_\alpha) = v_1 \cdot M'_{table_\alpha} + v_2 \cdot M''_{table_\alpha} + bias$ (4)
$table_\alpha$ denotes the sample obtained when a feature of the log quantity state subsequence is used as the label,
$table_\alpha'$ denotes the sample obtained when a feature of the log quantity state subsequence under a normal time window is used as the label,
$table_\alpha''$ denotes the sample obtained when a feature of the log quantity state subsequence under an abnormal time window is used as the label,
$M'_{table_\alpha}$ is the mean square error between $table_\alpha$ and the sample $table_\alpha'$ of a representative normal time window,
$M''_{table_\alpha}$ is the mean square error between $table_\alpha$ and the sample $table_\alpha''$ of a representative abnormal time window,
bias is the bias term,
$v_1$ and $v_2$ are the change coefficients of the mean square error of the normal time window and of the abnormal time window respectively, with $v_1 = v_2$ during training (so $v_1 = v_2 = 0.5$ by formula (3));
$f(table_\alpha)$ is the difference value calculated for a training time window when a feature of the log quantity state subsequence is used as the label;
Step 4: the difference values between each normal time window and the other normal time windows are calculated according to formulas (1)-(4) and stored as the set $U_1$; the difference values between the abnormal time windows and the normal time windows are calculated according to formulas (1)-(4) and stored as the set $P_1$. The confidence interval $\sigma(\alpha)$ of the log quantity state subsequence under normal time windows is then
$\sigma(\alpha) = [\min(U_1), \max(U_1)] \cap [\min(P_1), \max(P_1)]$
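A minimal sketch of formulas (3)-(4) and of the interval in step 4 follows. Formulas (1) and (2) are not reproduced in this text, so two things are assumptions here: the mean square error is taken as the plain mean of squared feature differences, and the errors against several reference windows are averaged. The window data is hypothetical.

```python
def mse(a, b):
    """Mean square error between two equally long feature vectors (assumed form)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def difference_value(sample, normal_refs, abnormal_refs, v_normal=0.5, bias=0.0):
    """f = v1 * M' + v2 * M'' + bias with v2 = 1 - v1, as in formulas (3)-(4)."""
    v_abnormal = 1.0 - v_normal
    # Assumption: M' and M'' average the error over all reference windows.
    m_normal = sum(mse(sample, r) for r in normal_refs) / len(normal_refs)
    m_abnormal = sum(mse(sample, r) for r in abnormal_refs) / len(abnormal_refs)
    return v_normal * m_normal + v_abnormal * m_abnormal + bias

def confidence_interval(u, p):
    """[min(U), max(U)] intersected with [min(P), max(P)]; None if they do not overlap."""
    low, high = max(min(u), min(p)), min(max(u), max(p))
    return (low, high) if low <= high else None

# Hypothetical samples from expert-labelled windows.
normal = [[10, 2], [12, 2], [11, 3]]
abnormal = [[30, 9], [28, 8]]
U1 = [difference_value(w, [r for r in normal if r is not w], abnormal) for w in normal]
P1 = [difference_value(w, normal, [r for r in abnormal if r is not w]) for w in abnormal]
sigma_alpha = confidence_interval(U1, P1)
```

The intersection can be empty for some data, which is why the sketch returns `None` in that case; the text does not say how such a case should be handled.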
Step 5: for each normal and abnormal time window, the sample data set that takes a feature of the user behavior state subsequence as its label is compared pairwise with the sample data sets of the other normal and abnormal time windows, and the difference value is calculated according to the ECC expression:
$v_3 = 1 - v_4$ (7)
$f(table_\beta) = v_3 \cdot M'_{table_\beta} + v_4 \cdot M''_{table_\beta} + bias$ (8)
$table_\beta$ denotes the sample obtained when a feature of the user behavior state subsequence is used as the label,
$table_\beta'$ denotes the sample obtained when a feature of the user behavior state subsequence under a normal time window is used as the label,
$table_\beta''$ denotes the sample obtained when a feature of the user behavior state subsequence under an abnormal time window is used as the label,
$M'_{table_\beta}$ is the mean square error between $table_\beta$ and the sample $table_\beta'$ of a representative normal time window,
$M''_{table_\beta}$ is the mean square error between $table_\beta$ and the sample $table_\beta''$ of a representative abnormal time window,
bias is the bias term,
$v_3$ and $v_4$ are the change coefficients of the mean square error of the normal time window and of the abnormal time window respectively, with $v_3 = v_4$ during training,
$f(table_\beta)$ is the difference value calculated for a training time window when a feature of the user behavior state subsequence is used as the label;
Step 6: the difference values between each normal time window and the other normal time windows are calculated according to formulas (5)-(8) and stored as the set $U_2$; the difference values between the abnormal time windows and the normal time windows are calculated according to formulas (5)-(8) and stored as the set $P_2$. The confidence interval $\sigma(\beta)$ of the user behavior state subsequence under normal time windows is then
$\sigma(\beta) = [\min(U_2), \max(U_2)] \cap [\min(P_2), \max(P_2)]$
Step 7: for each normal and abnormal time window, the sample data set that takes a feature of the field state subsequence as its label is compared pairwise with the sample data sets of the other normal and abnormal time windows, and the difference value is calculated according to the ECC expression:
$v_5 = 1 - v_6$ (11)
$f(table_\gamma) = v_5 \cdot M'_{table_\gamma} + v_6 \cdot M''_{table_\gamma} + bias$ (12)
$table_\gamma$ denotes the sample obtained when a feature of the field state subsequence is used as the label,
$table_\gamma'$ denotes the sample obtained when a feature of the field state subsequence under a normal time window is used as the label,
$table_\gamma''$ denotes the sample obtained when a feature of the field state subsequence under an abnormal time window is used as the label,
$M'_{table_\gamma}$ is the mean square error between $table_\gamma$ and the sample $table_\gamma'$ of a representative normal time window,
$M''_{table_\gamma}$ is the mean square error between $table_\gamma$ and the sample $table_\gamma''$ of a representative abnormal time window,
bias is the bias term,
$v_5$ and $v_6$ are the change coefficients of the mean square error of the normal time window and of the abnormal time window respectively, with $v_5 = v_6$ during training,
$f(table_\gamma)$ is the difference value calculated for a training time window when a feature of the field state subsequence is used as the label;
Step 8: the difference values between each normal time window and the other normal time windows are calculated according to formulas (9)-(12) and stored as the set $U_3$; the difference values between the abnormal time windows and the normal time windows are calculated according to formulas (9)-(12) and stored as the set $P_3$. The confidence interval $\sigma(\gamma)$ of the field state subsequence under normal time windows is then
$\sigma(\gamma) = [\min(U_3), \max(U_3)] \cap [\min(P_3), \max(P_3)]$.
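Steps 3 to 8 apply the same expression to each of the three subsequences. As a compact restatement, formulas (3)-(4), (7)-(8) and (11)-(12) can be read as a single form; the index $k$ and the symbol $w_k$ are notation introduced here for illustration and do not appear in the patent.

```latex
\begin{aligned}
f(table_k) &= w_k \, M'_{table_k} + (1 - w_k)\, M''_{table_k} + bias,
  \qquad k \in \{\alpha, \beta, \gamma\},\\
(w_\alpha, w_\beta, w_\gamma) &= (v_1, v_3, v_5),
  \qquad (1 - w_\alpha,\; 1 - w_\beta,\; 1 - w_\gamma) = (v_2, v_4, v_6),\\
\text{training: } w_k = 1 - w_k \;&\Longrightarrow\; w_k = \tfrac{1}{2},
  \qquad f(table_k) = \tfrac{1}{2}\left(M'_{table_k} + M''_{table_k}\right) + bias .
\end{aligned}
```

During training the normal and abnormal mean square errors therefore contribute equally to the difference value.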
In one scheme, the ECC log analysis module performs the following analysis steps:
Step 1: when a time window under test is analyzed, one feature of each subsequence is randomly selected as a label, giving three labels and three samples that represent the time window under test; the samples are shown in fig. 2.
Step 2: starting from the initial values v1=v2, v3=v4 and v5=v6, the difference values f(table α), f(table β) and f(table γ) of the three samples of the window under test are calculated according to formulas (1)-(12) of the ECC model training sub-module, and it is determined whether each of them lies within the corresponding confidence interval σ(α), σ(β), σ(γ).
Step 3: if all three samples lie within their confidence intervals, the detection precision is improved by constraining v1, v2, v3, v4, v5 and v6. The constraint formulas are
v1=a1*v1 (0≤a1<1)
v3=a2*v3 (0≤a2<1)
v5=a3*v5 (0≤a3<1)
v1, v3 and v5 are the change coefficients of the mean square error of the normal time window. When v1, v3 and v5 are reduced, the influence of the normal time window on the difference value decreases and the influence of the abnormal time window increases. The new v1, v2, v3, v4, v5 and v6 (with v2, v4 and v6 following from formulas (3), (7) and (11)) and the three samples are substituted back into formulas (1)-(12) of the ECC model training sub-module to check whether the difference values still lie within the confidence intervals. The number of constraint repetitions is determined by the preset missed-report rate requirement: if the requirement is lenient, few repetitions are needed; if a low missed-report rate is required, more repetitions are performed.
Step 4: if the difference values of the time window still lie within the confidence intervals after the repeated constraints are finished, the information system is considered normal in the time window under test; otherwise it is considered abnormal.
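The analysis steps above can be sketched as a small loop. All function names, the shrink factors (standing in for a1, a2, a3), the number of rounds and the numeric values are hypothetical. The mean square errors M′ and M″ are passed in as precomputed numbers because formulas (1), (2), (5), (6), (9) and (10) are not reproduced in this text.

```python
import random

def pick_labels(subsequences, rng):
    """Step 1: randomly choose one feature name per subsequence as the label."""
    return {name: rng.choice(sorted(features)) for name, features in subsequences.items()}

def make_difference(m_normal, m_abnormal, bias=0.0):
    """Return f(v) = v * M' + (1 - v) * M'' + bias for fixed M' and M''."""
    def difference(v_normal):
        return v_normal * m_normal + (1.0 - v_normal) * m_abnormal + bias
    return difference

def analyze_window(differences, intervals, shrink=(0.5, 0.5, 0.5), rounds=2):
    """Steps 2-4: check all three samples, shrink v1, v3, v5 and re-check."""
    v = [0.5, 0.5, 0.5]  # initial values: v1 = v2, v3 = v4, v5 = v6
    for round_index in range(rounds + 1):
        if round_index > 0:
            v = [a * v_k for a, v_k in zip(shrink, v)]  # v_k = a_k * v_k
        for difference, (low, high), v_k in zip(differences, intervals, v):
            if not (low <= difference(v_k) <= high):
                return "abnormal"
    return "normal"

labels = pick_labels(
    {"log_count": {"waf": 1, "firewall": 2}, "behavior": {"c1": 3}, "field": {"alert": 4}},
    random.Random(0),
)
benign = make_difference(2.0, 2.5)
suspicious = make_difference(1.0, 10.0)
intervals = [(1.0, 3.0), (1.0, 3.0), (0.0, 6.0)]
lenient = analyze_window([benign, benign, suspicious], intervals, rounds=0)
strict = analyze_window([benign, benign, suspicious], intervals, rounds=2)
```

In this example the suspicious sample passes the initial check but leaves its interval once the weight of the abnormal mean square error grows, which illustrates why more constraint rounds correspond to a stricter missed-report requirement.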
In the single analysis and aggregation method, logs from different devices are not analyzed jointly; the state of each device is judged independently and the results are aggregated afterwards, so the relations among the logs of different devices cannot be mined. The SGSE algorithm provided by the invention processes the logs of different devices together and aggregates them into samples that represent the state of the information system within the time window, so that the overall state of the information system is judged comprehensively, rather than analyzing each device separately and then aggregating the results.
The association analysis method aims only at generating statistical reports of various events; it cannot deeply mine the relations between events or present them directly to the user. The invention analyzes the multi-source heterogeneous logs in depth with the SGSE-ECC algorithm, generates statistical reports of various events through clustering, and visually presents the state of the information system to the user by deeply mining the relations among the logs.
In general, the invention uses the SGSE-ECC algorithm model to process the data of multi-source heterogeneous logs, generate samples, analyze states and analyze the logs automatically, thereby reducing operation and maintenance costs.
The SGSE algorithm provided by the invention processes the log data on each device and aggregates it into samples that represent the state of a time window, and the ECC algorithm analyzes these multidimensional samples; detection precision and the missed-report rate are adjusted by constraining the change coefficients.

Claims (2)

1. A multi-source heterogeneous log analysis method, characterized by comprising the following steps:
step 1: determining the size of the time window according to the response time required by the information system;
step 2: using the SGSE algorithm to process the log data in each time window into samples that can be called by the ECC log analysis algorithm;
step 3: training the ECC log analysis model and using it to analyze whether each time window is normal;
step 4: presenting the log analysis result;
the model training steps are as follows:
step 1: generating a log quantity state subsequence by counting the number of logs of the multi-source heterogeneous log data in the normal and abnormal time windows, generating a user behavior state subsequence by counting the number of logs of each category generated on each device in the time window, and generating a field state subsequence by counting the number of occurrences of each type in certain important fields of each device in the time window;
step 2: generating (n+m+j) sample data sets from the n features of the log quantity state subsequence, the m features of the user behavior state subsequence and the j features of the field state subsequence under each time window;
step 3: for each normal and abnormal time window, comparing the sample data set that takes a feature of the log quantity state subsequence as its label pairwise with the sample data sets of the other normal and abnormal time windows, and calculating the difference value according to the ECC expression:
$v_1 = 1 - v_2$ (3)
$f(table_\alpha) = v_1 \cdot M'_{table_\alpha} + v_2 \cdot M''_{table_\alpha} + bias$ (4)
$table_\alpha$ denotes the sample obtained when a feature of the log quantity state subsequence is used as the label,
$table_\alpha'$ denotes the sample obtained when a feature of the log quantity state subsequence under a normal time window is used as the label,
$table_\alpha''$ denotes the sample obtained when a feature of the log quantity state subsequence under an abnormal time window is used as the label,
$M'_{table_\alpha}$ is the mean square error between $table_\alpha$ and the sample $table_\alpha'$ of a representative normal time window,
$M''_{table_\alpha}$ is the mean square error between $table_\alpha$ and the sample $table_\alpha''$ of a representative abnormal time window,
bias is the bias term,
$v_1$ and $v_2$ are the change coefficients of the mean square error of the normal time window and of the abnormal time window respectively, with $v_1 = v_2$ during training;
$f(table_\alpha)$ is the difference value calculated for a training time window when a feature of the log quantity state subsequence is used as the label;
step 4: calculating the difference values between each normal time window and the other normal time windows according to formulas (1)-(4) and storing them as the set $U_1$, calculating the difference values between the abnormal time windows and the normal time windows according to formulas (1)-(4) and storing them as the set $P_1$, and obtaining the confidence interval $\sigma(\alpha)$ of the log quantity state subsequence under normal time windows as
$\sigma(\alpha) = [\min(U_1), \max(U_1)] \cap [\min(P_1), \max(P_1)]$
step 5: for each normal and abnormal time window, comparing the sample data set that takes a feature of the user behavior state subsequence as its label pairwise with the sample data sets of the other normal and abnormal time windows, and calculating the difference value according to the ECC expression:
$v_3 = 1 - v_4$ (7)
$f(table_\beta) = v_3 \cdot M'_{table_\beta} + v_4 \cdot M''_{table_\beta} + bias$ (8)
$table_\beta$ denotes the sample obtained when a feature of the user behavior state subsequence is used as the label,
$table_\beta'$ denotes the sample obtained when a feature of the user behavior state subsequence under a normal time window is used as the label,
$table_\beta''$ denotes the sample obtained when a feature of the user behavior state subsequence under an abnormal time window is used as the label,
$M'_{table_\beta}$ is the mean square error between $table_\beta$ and the sample $table_\beta'$ of a representative normal time window,
$M''_{table_\beta}$ is the mean square error between $table_\beta$ and the sample $table_\beta''$ of a representative abnormal time window,
bias is the bias term,
$v_3$ and $v_4$ are the change coefficients of the mean square error of the normal time window and of the abnormal time window respectively, with $v_3 = v_4$ during training,
$f(table_\beta)$ is the difference value calculated for a training time window when a feature of the user behavior state subsequence is used as the label;
step 6: calculating the difference values between each normal time window and the other normal time windows according to formulas (5)-(8) and storing them as the set $U_2$, calculating the difference values between the abnormal time windows and the normal time windows according to formulas (5)-(8) and storing them as the set $P_2$, and obtaining the confidence interval $\sigma(\beta)$ of the user behavior state subsequence under normal time windows as $\sigma(\beta) = [\min(U_2), \max(U_2)] \cap [\min(P_2), \max(P_2)]$
step 7: for each normal and abnormal time window, comparing the sample data set that takes a feature of the field state subsequence as its label pairwise with the sample data sets of the other normal and abnormal time windows, and calculating the difference value according to the ECC expression:
$v_5 = 1 - v_6$ (11)
$f(table_\gamma) = v_5 \cdot M'_{table_\gamma} + v_6 \cdot M''_{table_\gamma} + bias$ (12)
$table_\gamma$ denotes the sample obtained when a feature of the field state subsequence is used as the label,
$table_\gamma'$ denotes the sample obtained when a feature of the field state subsequence under a normal time window is used as the label,
$table_\gamma''$ denotes the sample obtained when a feature of the field state subsequence under an abnormal time window is used as the label,
$M'_{table_\gamma}$ is the mean square error between $table_\gamma$ and the sample $table_\gamma'$ of a representative normal time window,
$M''_{table_\gamma}$ is the mean square error between $table_\gamma$ and the sample $table_\gamma''$ of a representative abnormal time window,
bias is the bias term,
$v_5$ and $v_6$ are the change coefficients of the mean square error of the normal time window and of the abnormal time window respectively, with $v_5 = v_6$ during training,
$f(table_\gamma)$ is the difference value calculated for a training time window when a feature of the field state subsequence is used as the label;
step 8: calculating the difference values between each normal time window and the other normal time windows according to formulas (9)-(12) and storing them as the set $U_3$, calculating the difference values between the abnormal time windows and the normal time windows according to formulas (9)-(12) and storing them as the set $P_3$, and obtaining the confidence interval $\sigma(\gamma)$ of the field state subsequence under normal time windows as
$\sigma(\gamma) = [\min(U_3), \max(U_3)] \cap [\min(P_3), \max(P_3)]$.
2. The multi-source heterogeneous log analysis method of claim 1, wherein the log analysis comprises the following steps:
step 1: when analyzing a detected time window, randomly selecting one feature of each subsequence as a label, giving three labels and three samples that represent the detected time window;
step 2: starting from the initial values v1=v2, v3=v4, v5=v6, calculating according to formulas (1)-(12) whether the difference values f(table α), f(table β), f(table γ) of the three samples under the detected time window lie within the corresponding confidence intervals σ(α), σ(β), σ(γ);
step 3: if all three samples lie within their confidence intervals, constraining v1, v2, v3, v4, v5 and v6, the constraint formulas being
v1=a1*v1 (0≤a1<1)
v3=a2*v3 (0≤a2<1)
v5=a3*v5 (0≤a3<1)
v1, v3 and v5 are the change coefficients of the mean square error of the normal time window; reducing v1, v3 and v5 reduces the influence of the normal time window on the difference value and increases the influence of the abnormal time window; with the new v1, v2, v3, v4, v5 and v6 and the three samples, formulas (1)-(12) are applied again to calculate whether the difference values of the three samples under the detected time window lie within the corresponding confidence intervals σ(α), σ(β), σ(γ), and the number of constraint repetitions is determined according to the missed-report rate requirement;
step 4: if the difference values of the three samples under the detected time window still lie within the confidence intervals after the repeated constraints are finished, the information system is considered normal in the detected time window; otherwise it is considered abnormal.
CN202010911771.1A 2020-09-02 2020-09-02 Multi-source heterogeneous log analysis method Active CN111984515B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010911771.1A CN111984515B (en) 2020-09-02 2020-09-02 Multi-source heterogeneous log analysis method

Publications (2)

Publication Number Publication Date
CN111984515A CN111984515A (en) 2020-11-24
CN111984515B true CN111984515B (en) 2024-01-23

Family

ID=73447981

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010911771.1A Active CN111984515B (en) 2020-09-02 2020-09-02 Multi-source heterogeneous log analysis method

Country Status (1)

Country Link
CN (1) CN111984515B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant