CN111565192A - Credibility-based multi-model cooperative defense method for internal network security threats - Google Patents

Credibility-based multi-model cooperative defense method for internal network security threats

Info

Publication number
CN111565192A
CN111565192A (application number CN202010382950.0A)
Authority
CN
China
Prior art keywords
log
training
inconsistency
algorithm
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010382950.0A
Other languages
Chinese (zh)
Inventor
王志
陈炜嘉
付晏升
王雨奇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nankai University
Original Assignee
Nankai University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nankai University filed Critical Nankai University
Priority to CN202010382950.0A priority Critical patent/CN111565192A/en
Publication of CN111565192A publication Critical patent/CN111565192A/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00Network architectures or network communication protocols for network security
    • H04L63/14Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1425Traffic logging, e.g. anomaly detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00Network architectures or network communication protocols for network security
    • H04L63/14Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416Event detection, e.g. attack signature detection

Abstract

A credibility-based multi-model cooperative defense method for intranet security threats. The method comprises: 1, extracting heterogeneous log template sets from massive logs with thirteen log parsing algorithms such as LogSig, partitioning by block_id to generate feature matrices, learning the feature matrices with three machine learning algorithms such as SVM, and building thirty-nine log detection models; 2, computing, with a statistical learning algorithm, the credibility of each detection model's prediction for the log under test; and 3, fusing the multi-model prediction results according to the computed credibility, thereby realizing cooperative defense across heterogeneous models. Unlike analysis based on a threshold and a single model, the method generates log detection models from thirteen log parsing algorithms and three machine learning algorithms to achieve multi-model cooperation, and uses statistical learning to improve the detection of abnormal logs.

Description

Credibility-based multi-model cooperative defense method for internal network security threats
Technical Field
The present invention belongs to the field of computer network security.
Background
With the continuous development of computer networks, network security has become both a focus of attention and a challenge. Intranet security threats are a central concern within network security: intranet attacks are frequent and highly aggressive. At the same time, devices generate ever larger volumes of logs that are difficult to analyze manually, and a single model degrades over time, so a comprehensive and accurate detection result cannot be obtained. A system is therefore needed that can analyze logs by machine and defend through multi-model cooperation in order to discover threats.
The system combines multiple log parsing algorithms, machine learning algorithms and statistical learning algorithms to analyze the intranet attack problem cooperatively, improving accuracy and stability.
Disclosure of Invention
The invention aims to solve the following problems: when an intranet generates massive logs, the logs are easily tampered with and cannot be used jointly, and models degrade over time, so that predictions cannot yield comprehensive and accurate results. It therefore provides an intelligent analysis method for intranet security threats based on statistical learning. The method replaces manual analysis with machine learning to analyze the logs and to merge the logs generated by different devices on the intranet; it supports multiple log parsing models and realizes multi-model cooperation; and it uses machine learning and statistical learning algorithms to improve the recognition of abnormal logs.
Technical scheme of the invention
A credibility-based multi-model cooperative defense method for intranet security threats involves the following basic concepts:
(1) log: text recording valuable application and system runtime information, whose content includes a timestamp and other fields;
(2) log stream: a log with a standard format, usable as input to a log parsing algorithm;
(3) log block: a series of logs generated by the same behavior;
(4) log template: the part of a log stream consisting only of its constant tokens; it can represent a group of log streams;
(5) machine learning: the study of how a computer can simulate or realize human learning behavior in order to acquire new knowledge or skills and reorganize existing knowledge structures so as to keep improving its performance;
(6) inconsistency metric function: evaluates, through a score, the similarity between the log stream under test and a set of known log streams; it takes a log template set and the log stream under test as input and outputs a numerical value, the inconsistency score; the higher the score, the more similar the log stream under test is to the group of log streams, and the lower the score, the more dissimilar it is;
(7) P-Value: a statistic measuring the significance of the log stream under test within the known log stream set, used to compare the credibility of the multi-model prediction results;
the method comprises the following specific steps:
1, calculating a multi-model inconsistency score, which comprises the following steps:
step 1.1, generating a log template set to obtain a characteristic value of a log;
1.1.1, the original logs are first preprocessed to generate a log stream set; the log stream set is then processed with a log parsing algorithm f to obtain a log template set T;
1.1.2, input of log parsing: original log stream set X, log parsing algorithm set F:
① log stream set X containing n log blocks x_j, j ∈ {1, 2, …, n}, X = {x_1, …, x_n};
② log parsing algorithm set F containing m log parsing algorithms f_k, k ∈ {1, 2, …, m}, F = {f_1, …, f_m}; the input of each algorithm is the log stream set, and the return value is a log template set T.
1.1.3, output of log parsing: m log template sets T_k, k ∈ {1, 2, …, m}, T = {T_1, …, T_m}, where each log template set T_k contains q log templates t_kb, b ∈ {1, 2, …, q}, T_k = {t_k1, …, t_kq}.
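The individual parsing algorithms (LogSig, Drain, IPLoM and the others) are described in the embodiment below and are not reproduced here. As a purely illustrative, hypothetical stand-in for what any log parsing algorithm f does, namely turning raw log lines into constant-only templates, a minimal Python sketch could look as follows (the regular expression and the function name are assumptions, not part of the invention):

```python
import re

# assumption: tokens that look like paths, numbers, hex values or HDFS block
# ids are treated as variable parts and masked with the wildcard <*>
VARIABLE_TOKEN = re.compile(r"(\d+(\.\d+)*|0x[0-9a-fA-F]+|blk_-?\d+|/[\w./-]+)")

def toy_log_parser(log_stream):
    """Hypothetical stand-in for a log parsing algorithm f: map each raw
    log line to a template made only of constant tokens, and return the
    resulting log template set T."""
    templates = set()
    for line in log_stream:
        tokens = line.strip().split()
        masked = ["<*>" if VARIABLE_TOKEN.fullmatch(t) else t for t in tokens]
        templates.add(" ".join(masked))
    return sorted(templates)

if __name__ == "__main__":
    logs = [
        "Received block blk_123 of size 67108864 from 10.0.0.1",
        "Received block blk_456 of size 67108864 from 10.0.0.2",
        "Deleting block blk_123 file /data/blk_123",
    ]
    for template in toy_log_parser(logs):
        print(template)
```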
Step 1.2, template preprocessing, namely partitioning the log according to different events;
1.2.1, for the log template set generated by each heterogeneous log parsing algorithm, first partition by the block_id contained in the log template content, then divide into a training set and a test set according to the split ratio and split method given by the user;
1.2.2, input of preprocessing: log template set T containing m log template sets T_k, k ∈ {1, 2, …, m}, T = {T_1, …, T_m}, where each log template set T_k contains q log templates t_kb, b ∈ {1, 2, …, q}, T_k = {t_k1, …, t_kq};
1.2.3, output of preprocessing: log block sets B_k partitioned by block_id, k ∈ {1, 2, …, m}, where each log block set contains r log blocks b_ki, i ∈ {1, 2, …, r}, B_k = {b_k1, …, b_kr}; each log block set is further divided into a training set C_k, k ∈ {1, 2, …, m}, where each training set contains h log blocks c_ki, i ∈ {1, 2, …, h}, C_k = {c_k1, …, c_kh}, and a test set D_k, k ∈ {1, 2, …, m}, where each test set contains l log blocks d_ki, i ∈ {1, 2, …, l}, D_k = {d_k1, …, d_kl}, with B_k = {C_k, D_k}.
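A minimal sketch of this preprocessing step, assuming each parsed log line carries an HDFS-style block id recoverable with a regular expression; the helper names and the 0.8 split ratio are illustrative assumptions, since the patent leaves the split ratio and split method to the user:

```python
import re
from collections import defaultdict

BLOCK_ID = re.compile(r"blk_-?\d+")  # assumption: HDFS-style block ids

def partition_by_block_id(parsed_lines):
    """Group parsed log lines into log blocks b_ki keyed by block_id."""
    blocks = defaultdict(list)
    for line in parsed_lines:
        match = BLOCK_ID.search(line)
        if match:
            blocks[match.group(0)].append(line)
    return dict(blocks)

def split_blocks(blocks, train_ratio=0.8):
    """Split one log block set B_k into a training set C_k and a test set
    D_k according to a user-chosen ratio (0.8 is only an example)."""
    block_ids = sorted(blocks)
    cut = int(len(block_ids) * train_ratio)
    train = {bid: blocks[bid] for bid in block_ids[:cut]}
    test = {bid: blocks[bid] for bid in block_ids[cut:]}
    return train, test
```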
1.3, extracting features to generate a feature matrix;
1.3.1, performing feature extraction on a training set and a test set in each heterogeneous log block set to generate a matrix for training a machine learning model;
1.3.2, input of feature extraction: training set C, test set D:
① training set C containing m training sets C_i, i ∈ {1, 2, …, m};
② test set D containing m test sets D_i, i ∈ {1, 2, …, m};
1.3.3, output of feature extraction: training matrix E, test matrix F:
① training matrix E consisting of m training matrices E_i, i ∈ {1, 2, …, m};
② test matrix F consisting of m test matrices F_i, i ∈ {1, 2, …, m};
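The patent does not fix the exact features, so the sketch below assumes the common event-count representation: one row per log block, one column per template of the parser's template set T_k, each cell counting how often that template occurs in the block. The match_fn argument, which decides whether a line instantiates a template, is parser-specific and left as an assumption:

```python
import numpy as np

def build_feature_matrix(blocks, templates, match_fn):
    """Build an event-count matrix for one parser: rows are log blocks,
    columns are templates from T_k.  Returns the row order (block ids)
    and the matrix, which can serve as training matrix E_k or test
    matrix F_k depending on the blocks passed in."""
    block_ids = sorted(blocks)
    matrix = np.zeros((len(block_ids), len(templates)), dtype=np.int64)
    for row, bid in enumerate(block_ids):
        for line in blocks[bid]:
            for col, template in enumerate(templates):
                if match_fn(line, template):
                    matrix[row, col] += 1
                    break  # count each line against one template only
    return block_ids, matrix
```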
Step 1.4, training set inconsistency measurement
1.4.1, for each heterogeneous training set C_i, i ∈ {1, 2, …, m}, an inconsistency score α is computed with a machine learning algorithm g according to the trained model;
1.4.2, input of the training set inconsistency measure: training set C, machine learning algorithm set (inconsistency metric functions) G:
① training set C containing m training sets C_i, i ∈ {1, 2, …, m}, where each training set contains h log blocks c_ki, i ∈ {1, 2, …, h}, C_k = {c_k1, …, c_kh};
② machine learning algorithm set G containing p machine learning algorithms g_k, k ∈ {1, 2, …, p}, G = {g_1, …, g_p}; the return value of each algorithm is the probability that a log block is normal or abnormal, i.e. the inconsistency score;
1.4.3, output of the training set inconsistency measure: a set of inconsistency scores;
1.4.4, algorithm flow:
[Algorithm flow presented as an image in the original publication.]
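As a hedged illustration of the training-set scoring in steps 1.4.1 to 1.4.4, the sketch below fits one machine learning model g on a training matrix and records an inconsistency score for every training log block; using the predicted probability of the "normal" class as that score is an assumption consistent with the return value described in 1.4.2 ②, and the decision tree could be swapped for any other algorithm in G:

```python
from sklearn.tree import DecisionTreeClassifier  # any algorithm g in G would do

def train_and_score(E_k, y_train, normal_label=0):
    """Fit one model g on training matrix E_k (labels y_train) and return
    the model together with the inconsistency score alpha of every
    training log block, taken here as the predicted probability of the
    'normal' class (normal_label=0 is an illustrative convention)."""
    model = DecisionTreeClassifier(random_state=0).fit(E_k, y_train)
    normal_col = list(model.classes_).index(normal_label)
    alpha_train = model.predict_proba(E_k)[:, normal_col]
    return model, alpha_train
```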
step 1.5, carrying out inconsistency measurement on the log template set to be measured by the plurality of models respectively;
1.5.1, for each heterogeneous test set d_i, an inconsistency score α is computed with the machine learning algorithm g according to the trained model; the inconsistency scores given by heterogeneous models are not comparable, so the quality of the models' prediction results cannot be compared directly from the inconsistency scores;
1.5.2, input of the test set inconsistency metric: test set D, machine learning algorithm set (inconsistency metric functions) G:
① test set D containing m test sets D_i, i ∈ {1, 2, …, m}, where each test set contains l log blocks d_ki, i ∈ {1, 2, …, l}, D_k = {d_k1, …, d_kl};
② machine learning algorithm set G containing p machine learning algorithms g_k, k ∈ {1, 2, …, p}, G = {g_1, …, g_p}; the return value of each algorithm is the probability that a log block is normal or abnormal, i.e. the inconsistency score;
1.5.3, output of the test set inconsistency metric: a set of inconsistency scores;
1.5.4, algorithm flow:
[Algorithm flow presented as an image in the original publication.]
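Correspondingly, a sketch of the test-set scoring: each held-out log block is scored with the model already trained on the matching training set, on the same assumed probability-of-normal scale; the resulting scores are kept per (parser, model) pair because, as noted in 1.5.1, scores from heterogeneous models are not directly comparable:

```python
def score_test_blocks(model, F_k, normal_label=0):
    """Inconsistency scores alpha for every log block in test matrix F_k,
    using the same probability-of-normal convention as the training side."""
    normal_col = list(model.classes_).index(normal_label)
    return model.predict_proba(F_k)[:, normal_col]
```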
2, calculating the credibility of each model's prediction result using a statistical learning algorithm;
the P-Value of each model's prediction result is calculated, and the credibility of the prediction result is finally obtained from the P-Value;
2.1, for the feature matrix of t_kb, the inconsistency measure is applied to the feature vectors x_j generated from the original log blocks in the training set and the test set, giving the corresponding set of inconsistency scores {α_kb_1, α_kb_2, …, α_kb_j};
2.2, the inconsistency score α_kb_i of a test-set log block y_j is placed into the training set's set of inconsistency scores; the P-Value is then the ratio of the number of log blocks whose inconsistency score is less than or equal to that of the test-set log block y_j to the total number of log blocks;
2.3, the larger the P-Value, the higher the significance of the log block under test y_j within this class;
2.4, input: the set of inconsistency scores of the training-set log blocks;
2.5, output: the P-Value of the log block under test y_j;
2.6, algorithm flow:
[Algorithm flow presented as an image in the original publication.]
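A minimal sketch of the P-Value computation in 2.1 to 2.3: the score of the block under test is added to the training-score set, and the P-Value is the fraction of scores less than or equal to it (ties counted as "less than or equal", following 2.2):

```python
import numpy as np

def p_value(alpha_train, alpha_test):
    """P-Value of one log block under test for one (parser, model) pair:
    the ratio of log blocks whose inconsistency score is <= the test
    block's score to the total number of blocks, after the test score has
    been placed into the training score set."""
    scores = np.append(np.asarray(alpha_train, dtype=float), alpha_test)
    return float(np.sum(scores <= alpha_test)) / len(scores)
```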
3, performing cooperative defense with multiple models, which jointly predict the malicious degree of a certain behavior;
based on the P-Value results, the multi-model prediction results are fused by simple voting against the maximum error probability given by the user, achieving cooperative defense.
3.1, for all the P-Value sets obtained in the previous step, find the P-Values smaller than the maximum error probability and the training models they correspond to; if the number of such training models is smaller than the configured model count s, the log block under test y_j is predicted to be a normal log block, otherwise it is predicted to be an abnormal log block;
3.2, input: the set of P-Values of the test-set log block y_j; an acceptable maximum error probability, provided by the user, indicating the maximum error probability the user can accept;
3.3, output: the prediction result;
3.4, algorithm flow:
[Algorithm flow presented as an image in the original publication.]
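The fusion step can be sketched as a simple vote over one block's P-Values across all (parser, classifier) pairs; the parameter names below mirror the text (maximum error probability, model count s), and the model names in the usage example are purely illustrative:

```python
def fuse_predictions(p_values, max_error_probability, s):
    """Multi-model cooperative decision for one log block (step 3.1):
    count the models whose P-Value is smaller than the user-given maximum
    error probability; if fewer than s models do so, predict 'normal',
    otherwise predict 'abnormal'."""
    flagged = [name for name, p in p_values.items() if p < max_error_probability]
    return ("normal" if len(flagged) < s else "abnormal"), flagged

# illustrative usage with hypothetical (parser, classifier) names and scores
decision, flagged = fuse_predictions(
    {"LogSig-DT": 0.02, "Drain-SVM": 0.40, "Spell-LR": 0.65},
    max_error_probability=0.05,
    s=2,
)
```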
the invention has the advantages and positive effects that:
the invention provides an intranet multi-model defense system based on credibility. The method utilizes machine learning analysis instead of manual analysis to realize analysis of the log, thereby improving the analysis efficiency; the method realizes the merging use of the logs generated by different equipment of the internal network, and improves the efficiency of detecting the abnormity; the system obtains the credibility of each model by utilizing statistical learning, and performs fusion of a plurality of models based on the credibility, thereby changing the traditional analysis mode based on a threshold value; the method supports various heterogeneous log analysis models, and realizes multi-model cooperative defense by using machine learning and statistical learning methods.
Drawings
Fig. 1 is a flow chart of an intranet security threat intelligent analysis method based on statistical learning.
FIG. 2 is a log template set obtained by using IPLoM algorithm for industrial control logs.
FIG. 3 is a log template set obtained by using a Drain algorithm for an industrial control log.
FIG. 4 is a log template set obtained by using a LogSig algorithm for an industrial control log.
FIG. 5 shows part of the logs grouped by block_id.
Fig. 6 is a partial matrix generated by extracting features from a log block set.
FIG. 7 is a P-Value set obtained by using a LogSig algorithm and a decision tree model for an industrial control log.
FIG. 8 is a determination of an anomaly log by a multi-model system.
FIG. 9 is a decision of normal logs for a multi-model system.
FIG. 10 shows the accuracy of the multi-model approach versus single models at a threshold of 0.7 and a ratio value of 13.
FIG. 11 shows the recall of the multi-model approach versus single models at a threshold of 0.7 and a ratio value of 13.
FIG. 12 shows the F1_measure of the multi-model approach versus single models at a threshold of 0.7 and a ratio value of 13.
FIG. 13 shows the specificity of the multi-model approach versus single models at a threshold of 0.7 and a ratio value of 13.
Detailed Description
The invention is described below using the detection of abnormal log blocks as a concrete example. Any log parsing algorithm that can produce a log template set from an original log stream set, and any machine learning algorithm, can be used in the method. The flow of the method is shown in FIG. 1. This embodiment uses six log parsing algorithms, AEL, Drain, IPLoM, LogSig, SHISO and Spell, and three machine learning algorithms, decision tree, support vector machine and logistic regression, described as follows:
AEL (Abstracting Execution Logs) is a log parsing algorithm. It parses logs in four steps. The first step is anonymization, in which heuristics identify the tokens in each log line that correspond to dynamically variable parts; the second step is tokenization, in which the anonymized log lines are divided into groups according to the number of words and parameters in each line; the third step compares the log lines within each group and abstracts them into corresponding execution events; the fourth step re-examines all execution events to identify events that should be merged, producing the final log templates.
IPLoM (Iterative Partitioning Log Mining) is a log parsing algorithm. It parses logs in four steps, taking all logs as input at once. The first step partitions the original logs into groups by length. The second step further partitions each length group: the words at every position are counted, the position with the fewest unique words is found, and the logs are split according to the word at that position, so that logs sharing the same unique word form a group. The third step operates on each group from the previous step, splitting the original logs into groups according to the correspondence between words. The fourth step processes each final group: where the words at the same position differ they are replaced with a wildcard, and identical words are written down directly, yielding the log template.
Drain is a log parsing algorithm. The algorithm is divided into three steps when the log is analyzed, and one log is input at a time. The purpose of the first step is to group original logs with the same length into a group; the second step is to classify the logs with the same token and the same length into a group, wherein the token in the method can be the first word of the log, the last word of the log and a wildcard; the third step is to gather the similar logs in the original logs with the same length and the same token together to obtain a log template which can represent the set.
LogSig is a log parsing algorithm. The algorithm is divided into three steps when the log is analyzed, and all the log is input at the beginning. The first step is that each log stream is cut into a plurality of word pairs to form word pair groups; the second step is that all logs are randomly divided into groups with fixed number, and then each log is transferred to a proper group; the third step is to obtain a suitable log template for each group obtained in the previous step.
SHISO is a method that mines log formats and retrieves log types and parameters in an online manner. It continuously refines the log formats in real time by building a structured tree from nodes generated out of the log messages. The refined formats can be searched for new log messages and used to extract log parameters. The method can quickly parse logs of unknown format and extract their parameters, makes it easy to understand the log types and parameters in a large volume of system logs, and allows a system log analysis tool to recognize and import unstructured log messages as soon as they arrive.
Spell is a structured streaming parser for event logs based on the LCS (longest common subsequence) method. It parses logs in three steps, taking one log at a time, and runs in a streaming manner with an initially empty LCSMap. The first step uses a set of delimiters to parse a new log entry e_i into a token sequence s_i; the second step compares s_i with the LCSseq of every LCSObject in the current LCSMap to decide whether s_i "matches" an existing LCSseq (in which case the line id of e_i is added to the corresponding LCSObject) or a new LCSObject must be created in the LCSMap; the third step computes the new LCS.
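As an aside, the LCS comparison at the heart of Spell can be sketched with standard dynamic programming; this only illustrates the matching criterion, not Spell's streaming LCSMap bookkeeping, and the "more than half the sequence length" threshold is the commonly cited choice rather than something fixed by this patent:

```python
def lcs_length(seq_a, seq_b):
    """Length of the longest common subsequence of two token sequences."""
    dp = [[0] * (len(seq_b) + 1) for _ in range(len(seq_a) + 1)]
    for i, a in enumerate(seq_a, 1):
        for j, b in enumerate(seq_b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if a == b else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def matches(new_seq, lcsseq):
    """Decide whether a new token sequence 'matches' an existing LCSseq."""
    return lcs_length(new_seq, lcsseq) > len(new_seq) / 2
```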
Logistic Regression is a statistical model widely used for classification and a form of generalized linear regression analysis. Faced with a regression or classification problem, a cost function is established, the optimal model parameters are solved iteratively by an optimization method, and the quality of the resulting model is tested and verified. To determine the state of an instance, logistic regression estimates the probability p of every possible state (normal or abnormal). The probability p is computed by a logistic function built on the labeled training data. When a new instance arrives, the logistic function computes the probability p (0 < p < 1) of every possible state, and the state with the largest probability is output as the classification.
A Decision Tree (Decision Tree) is a Tree structure diagram that uses branches to illustrate the prediction state of each instance. The model can efficiently classify unknown data, and is a graphical method for intuitively applying probability analysis. Decision tree models are often used to solve classification and regression problems. A decision tree is a predictive model that is constructed in a top-down manner using training data and represents a mapping between object attributes and object values. Each tree node is created using the current "best" attribute, which is selected by the information gain of the attribute. Its branches represent objects that meet the node conditions.
A Support Vector Machine (Support Vector Machine) is a supervised classification learning method. Given a set of training instances, each labeled as belonging to one or the other of two classes, the SVM training algorithm creates a model that assigns the new instance to one of the two classes, making it a non-probabilistic binary linear classifier. In SVMs, the hyperplane is structured to separate instances of different classes in a high dimensional space. Finding the hyperplane is an optimization problem that maximizes the distance between the hyperplane and the nearest data points in the different classes, reducing the generalization error of the classifier.
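A hedged sketch of how the three classifiers named in this embodiment could be instantiated with scikit-learn; the hyper-parameters are illustrative defaults rather than values specified by the patent, and probability=True makes the SVM expose class probabilities like the other two models:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

def build_classifiers():
    """The three inconsistency metric functions g used in this embodiment:
    decision tree, support vector machine and logistic regression."""
    return {
        "decision_tree": DecisionTreeClassifier(random_state=0),
        "svm": SVC(probability=True, random_state=0),
        "logistic_regression": LogisticRegression(max_iter=1000),
    }

def fit_all(classifiers, E_k, y_train):
    """Fit every classifier on one parser's training matrix E_k."""
    return {name: clf.fit(E_k, y_train) for name, clf in classifiers.items()}
```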
Each log analysis algorithm can process the original log set to obtain a template set T capable of representing the log set. For the log x to be tested, an inconsistency score set can be calculated by utilizing an inconsistency measurement function (namely a machine learning algorithm) g according to the obtained log template set T. The inconsistency scores given by the heterogeneous models are not comparable, and the quality of the model prediction result cannot be directly compared according to the inconsistency scores, so that the logs to be tested need to be predicted based on statistical learning after the inconsistency scores are obtained by the heterogeneous models.
1. Get a set of log templates
This embodiment is tested on an HDFS log set, open-source data published by LogPai on GitHub. The data set is 1.5 GB in size and can be divided into 575,061 different log blocks according to different events; the log blocks are labeled by experts in the field, with 558,223 normal log blocks and 16,838 abnormal log blocks in total. The logs span from 2008/11/09 20:35:18 to 2008/11/11 11:16:28. By processing the log stream set with the six log parsing algorithms AEL, Drain, IPLoM, LogSig, SHISO and Spell, each log message is converted into a specific event template associated with key parameters for subsequent analysis. The IPLoM algorithm combines the log length, the position of word tokens and the mapping between word tokens to generate log event templates, as shown in FIG. 2. The Drain algorithm uses a directed acyclic graph that can be automatically generated and updated to generate log event templates, and is used mainly in online and distributed systems, as shown in FIG. 3. The LogSig algorithm first converts each log into word pairs consisting of two words and the position between them, then groups them, and finally obtains the log event templates, as shown in FIG. 4.
2. Preprocessing a set of log templates
In this embodiment, the log template set generated by each heterogeneous log parsing algorithm is first partitioned by the block_id contained in the log template content, as shown in FIG. 5; each log block contains several log templates. The set is then divided into a training set and a test set according to the split ratio and split method given by the user.
3. Feature extraction for training set and test set
In the present embodiment, feature extraction is performed on a training set and a test set in each heterogeneous log block set, and a matrix for training a machine learning model is generated, as shown in fig. 6.
4. Computing a training set inconsistency score
In this embodiment, the inconsistency scores of the training set are computed with three machine learning algorithms: decision tree, support vector machine and logistic regression. Each training set therefore yields as many inconsistency score sets as there are machine learning models used.
5. Computing a test set inconsistency score
In this embodiment, for each heterogeneous test set, the inconsistency score is calculated by using the model trained in the previous step and using a machine learning algorithm. Since the inconsistency scores given by different models are not comparable, the quality of the model prediction result cannot be directly compared according to the inconsistency scores.
6. Calculating P-Value
In this embodiment, one of the machine learning algorithms is used to score the feature matrix generated from one training set; the corresponding set of inconsistency scores is denoted S, and a trained machine learning model is obtained. The model is then used to score the feature vector x generated from a test-set log block, giving an inconsistency score α. The inconsistency score α of the feature vector x is placed into the inconsistency score set S of the training set; the P-Value is the ratio of the number of log blocks in S whose inconsistency score is less than or equal to α to the total number of log blocks. The same log block under test is scored with the different machine learning models, yielding the P-Value set corresponding to each log block. FIG. 7 shows an example of the P-Value set obtained after a test set is scored by a decision tree model trained on a training set generated by the LogSig algorithm.
7. Detecting log streams based on statistical learning
For all the P-Value sets obtained in the previous step, find the P-Values smaller than the maximum error probability and the training models they correspond to; if the number of such training models is smaller than the configured model count s, the log block under test y_i is predicted to be a normal log block, otherwise it is predicted to be an abnormal log block.
Fig. 8 shows a case where a log block is determined to be abnormal by the method, and it can be seen that, although the log block is abnormal, both the Spell-LR model and the Spell-SVM model determine that the log block records a normal behavior with a relatively high p-value. Similarly, fig. 9 shows a case where the method determines a log block as normal, and although the log block is normal, there are 6 models that erroneously determine it as an abnormal log block.
FIGS. 10 to 13 show the comparison of the multiple models used by the method with single models in terms of accuracy, recall, F1_measure, and specificity.
8. General algorithm flow
(1) Input: log stream set X, log blocks under test Y, log parsing algorithm set F, machine learning algorithm set G:
① log stream set X containing n logs x_j, j ∈ {1, 2, …, n}, X = {x_1, …, x_n};
② log blocks under test Y containing s original logs y_j, j ∈ {1, 2, …, s};
③ log parsing algorithm set F containing m log parsing algorithms f_k, k ∈ {1, 2, …, m}, F = {f_1, …, f_m}; the input of each algorithm is the log stream set, and the return value is a log template set;
④ log template set T containing m log template sets T_k, k ∈ {1, 2, …, m}, T = {T_1, …, T_m};
⑤ log template sets grouped by log block, containing m log block sets B_k partitioned by block_id, k ∈ {1, 2, …, m}, where each set contains r log blocks b_ki, i ∈ {1, 2, …, r}, B_k = {b_k1, …, b_kr};
⑥ machine learning algorithm set G containing p machine learning algorithms g_k, k ∈ {1, 2, …, p}, G = {g_1, …, g_p}; the return value of each algorithm is the probability that a log block is normal or abnormal, i.e. the inconsistency score; the input of each function is a feature vector v_i from the feature matrix generated from the log stream set, and the return value is two real numbers indicating the similarity of the feature vector to the log stream set;
⑦ acceptable maximum error probability, provided by the user, indicating the maximum error probability the user can accept;
⑧ model number ratio, provided by the user: if the number of training models whose P-Value is smaller than the maximum error probability is less than the configured model number ratio, the log block under test y_i is predicted to be a normal log block; otherwise it is predicted to be an abnormal log block.
(2) Output:
the prediction result, i.e. whether the log block under test y is normal or abnormal.
(3) The algorithm flow is as follows:
[Overall algorithm flow presented as images in the original publication.]

Claims (8)

1. an intelligent analysis method for intranet security threats is characterized by comprising the following steps:
1, calculating a multi-model inconsistency score, which comprises the following steps:
step 1.1, generating a log template set to obtain a characteristic value of a log;
step 1.2, template preprocessing, namely partitioning the log according to different events;
1.3, extracting features to generate a feature matrix;
1.4, calculating the inconsistency score of the training set;
step 1.5, carrying out inconsistency measurement on the log template set to be measured by the plurality of models respectively;
2, calculating the credibility of each model's prediction result using a statistical learning algorithm;
the P-Value of each model's prediction result is calculated, and the credibility of the prediction result is finally obtained from the P-Value;
3, performing cooperative defense with multiple models, which jointly predict the malicious degree of a certain behavior;
based on the P-Value results, the multi-model prediction results are fused by simple voting against the maximum error probability given by the user, achieving cooperative defense.
2. The intelligent analysis method for intranet security threats according to claim 1, wherein the step 1.1 comprises:
1.1.1, firstly preprocessing an original log to generate a log stream set, and processing the log stream set by using a log analysis algorithm f to obtain a log template set T;
1.1.2, input of log parsing: original log stream set X, log parsing algorithm set F:
① log stream set X containing n log blocks x_j, j ∈ {1, 2, …, n}, X = {x_1, …, x_n};
② log parsing algorithm set F containing m log parsing algorithms f_k, k ∈ {1, 2, …, m}, F = {f_1, …, f_m}; the input of each algorithm is the log stream set, and the return value is a log template set T;
1.1.3, output: m log template sets T_k, k ∈ {1, 2, …, m}, T = {T_1, …, T_m}, where each log template set T_k contains q log templates t_kb, b ∈ {1, 2, …, q}, T_k = {t_k1, …, t_kq}.
3. The intelligent analysis method for intranet security threats according to claim 2, wherein the step 1.2 comprises:
1.2.1, for the log template set generated by each heterogeneous log parsing algorithm, first partition by the block_id contained in the log template content, then divide into a training set and a test set according to the split ratio and split method given by the user;
1.2.2, input of preprocessing: log template set T containing m log template sets T_k, k ∈ {1, 2, …, m}, T = {T_1, …, T_m}, where each log template set T_k contains q log templates t_kb, b ∈ {1, 2, …, q}, T_k = {t_k1, …, t_kq};
1.2.3, output of preprocessing: log block sets B_k partitioned by block_id, k ∈ {1, 2, …, m}, where each log block set contains r log blocks b_ki, i ∈ {1, 2, …, r}, B_k = {b_k1, …, b_kr}; each log block set is further divided into a training set C_k, k ∈ {1, 2, …, m}, where each training set contains h log blocks c_ki, i ∈ {1, 2, …, h}, C_k = {c_k1, …, c_kh}, and a test set D_k, k ∈ {1, 2, …, m}, where each test set contains l log blocks d_ki, i ∈ {1, 2, …, l}, D_k = {d_k1, …, d_kl}, with B_k = {C_k, D_k}.
4. The intelligent analysis method for intranet security threats according to claim 3, wherein the step 1.3 comprises:
1.3.1, performing feature extraction on a training set and a test set in each heterogeneous log block set to generate a matrix for training a machine learning model;
1.3.2, input of feature extraction: training set C, testing set D:
① training set C containing m training sets C_i, i ∈ {1, 2, …, m};
② test set D containing m test sets D_i, i ∈ {1, 2, …, m};
1.3.3, output of feature extraction: training matrix E, test matrix F:
① training matrix E consisting of m training matrices E_i, i ∈ {1, 2, …, m};
② test matrix F consisting of m test matrices F_i, i ∈ {1, 2, …, m}.
5. The intelligent analysis method for intranet security threats according to claim 4, wherein the step 1.4 comprises:
1.4.1, for each heterogeneous training set C_i, i ∈ {1, 2, …, m}, an inconsistency score α is computed with a machine learning algorithm g according to the trained model;
1.4.2, input of the training set inconsistency measure: training set C, machine learning algorithm set, i.e. the inconsistency metric function set G:
① training set C containing m training sets C_i, i ∈ {1, 2, …, m}, where each training set contains h log blocks c_ki, i ∈ {1, 2, …, h}, C_k = {c_k1, …, c_kh};
② machine learning algorithm set G containing p machine learning algorithms g_k, k ∈ {1, 2, …, p}, G = {g_1, …, g_p}; the return value of each algorithm is the probability that a log block is normal or abnormal, i.e. the inconsistency score;
1.4.3, output of training set inconsistency measure: a set of inconsistency scores;
1.4.4, algorithm flow:
let c_ki ∈ C_k, C_k = {c_k1, …, c_kh}, C_k ∈ C, C = {C_1, …, C_m}; g_j ∈ G, G = {g_1, …, g_p}
[Algorithm flow presented as an image in the original publication.]
6. The intelligent analysis method for intranet security threats according to claim 5, wherein the step 1.5 comprises:
1.5.1, for each heterogeneous test set d_i, an inconsistency score α is computed with the machine learning algorithm g according to the trained model; the inconsistency scores given by heterogeneous models are not comparable, so the quality of the models' prediction results cannot be compared directly from the inconsistency scores;
1.5.2, input of the test set inconsistency metric: test set D, machine learning algorithm set, i.e. the inconsistency metric function set G:
① test set D containing m test sets D_i, i ∈ {1, 2, …, m}, where each test set contains l log blocks d_ki, i ∈ {1, 2, …, l}, D_k = {d_k1, …, d_kl};
② machine learning algorithm set G containing p machine learning algorithms g_k, k ∈ {1, 2, …, p}, G = {g_1, …, g_p}; the return value of each algorithm is the probability that a log block is normal or abnormal, i.e. the inconsistency score;
1.5.3, output of test set inconsistency metric: a set of inconsistency scores;
1.5.4, algorithm flow:
let d_ki ∈ D_k, D_k = {d_k1, …, d_kl}, D_k ∈ D, D = {D_1, …, D_m}; g_j ∈ G, G = {g_1, …, g_p}
[Algorithm flow presented as an image in the original publication.]
7. The intelligent analysis method for intranet security threats according to claim 6, wherein the credibility calculation of step 2 comprises:
2.1, for the feature matrix of t_kb, the inconsistency measure is applied to the feature vectors x_j generated from the original log blocks in the training set and the test set, giving the corresponding set of inconsistency scores {α_kb_1, α_kb_2, …, α_kb_j};
2.2, the inconsistency score α_kb_i of a test-set log block y_j is placed into the training set's set of inconsistency scores; the P-Value is then the ratio of the number of log blocks whose inconsistency score is less than or equal to that of the test-set log block y_j to the total number of log blocks;
2.3, the larger the P-Value, the higher the significance of the log block under test y_j within this class;
2.4, input: the set of inconsistency scores of the training-set log blocks;
2.5, output: the P-Value of the log block under test y_j;
2.6, algorithm flow:
[Algorithm flow presented as an image in the original publication.]
8. the intelligent analysis method for intranet security threats according to claim 7, wherein the step 3 comprises:
3.1, for all the P-Value sets obtained in the previous step, find the P-Values smaller than the maximum error probability and the training models they correspond to; if the number of such training models is smaller than the configured model count s, the log block under test y_i is predicted to be a normal log block, otherwise it is predicted to be an abnormal log block;
3.2, input: the set of P-Values of the test-set log block y_i; an acceptable maximum error probability, provided by the user, indicating the maximum error probability the user can accept;
3.3, output: the prediction result;
3.4, algorithm flow:
[Algorithm flow presented as images in the original publication.]
CN202010382950.0A 2020-05-08 2020-05-08 Credibility-based multi-model cooperative defense method for internal network security threats Pending CN111565192A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010382950.0A CN111565192A (en) 2020-05-08 2020-05-08 Credibility-based multi-model cooperative defense method for internal network security threats

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010382950.0A CN111565192A (en) 2020-05-08 2020-05-08 Credibility-based multi-model cooperative defense method for internal network security threats

Publications (1)

Publication Number Publication Date
CN111565192A true CN111565192A (en) 2020-08-21

Family

ID=72073245

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010382950.0A Pending CN111565192A (en) 2020-05-08 2020-05-08 Credibility-based multi-model cooperative defense method for internal network security threats

Country Status (1)

Country Link
CN (1) CN111565192A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112732655A (en) * 2021-01-13 2021-04-30 北京六方云信息技术有限公司 Online analysis method and system for unformatted logs

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104361010A (en) * 2014-10-11 2015-02-18 北京中搜网络技术股份有限公司 Automatic classification method for correcting news classification
CN106878314A (en) * 2017-02-28 2017-06-20 南开大学 Network malicious act detection method based on confidence level
WO2018159362A1 (en) * 2017-03-03 2018-09-07 日本電信電話株式会社 Log analysis apparatus, log analysis method, and log analysis program
CN110011990A (en) * 2019-03-22 2019-07-12 南开大学 Intranet security threatens intelligent analysis method
CN110213287A (en) * 2019-06-12 2019-09-06 北京理工大学 A kind of double mode invasion detecting device based on ensemble machine learning algorithm

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Ola Spjuth et al.: "Aggregating Predictions on Multiple Non-disclosed Datasets Using Conformal Prediction", Elsevier *
Yitong Ren et al.: "System Log Detection Model Based on Conformal Prediction", Electronics *
Zhang Yongsheng et al.: "Multi-model collaborative detection method for Android malicious code based on credibility", Journal of Guangxi Normal University (Natural Science Edition) *
Gu Zhaojun et al.: "Intranet log detection model based on conformal prediction algorithm", Technical Research *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112732655A (en) * 2021-01-13 2021-04-30 北京六方云信息技术有限公司 Online analysis method and system for unformatted logs
CN112732655B (en) * 2021-01-13 2024-02-06 北京六方云信息技术有限公司 Online analysis method and system for format-free log

Similar Documents

Publication Publication Date Title
CN111428054B (en) Construction and storage method of knowledge graph in network space security field
CN107992746B (en) Malicious behavior mining method and device
CN107391353B (en) Method for detecting abnormal behavior of complex software system based on log
CN111639497B (en) Abnormal behavior discovery method based on big data machine learning
CN111047173B (en) Community credibility evaluation method based on improved D-S evidence theory
CN110011990B (en) Intelligent analysis method for intranet security threats
CN111143838B (en) Database user abnormal behavior detection method
Maakoul et al. Towards evaluating the COVID’19 related fake news problem: case of morocco
Rahman et al. New biostatistics features for detecting web bot activity on web applications
CN114036531A (en) Multi-scale code measurement-based software security vulnerability detection method
CN117195250A (en) Data security management method and system
Wang et al. An improved clustering method for detection system of public security events based on genetic algorithm and semisupervised learning
CN110674288A (en) User portrait method applied to network security field
CN111565192A (en) Credibility-based multi-model cooperative defense method for internal network security threats
Singh et al. User behaviour based insider threat detection in critical infrastructures
Liang et al. Automatic security classification based on incremental learning and similarity comparison
Li et al. Glad: Content-aware dynamic graphs for log anomaly detection
CN116361788A (en) Binary software vulnerability prediction method based on machine learning
Chen et al. Unsupervised Anomaly Detection Based on System Logs.
CN114528908A (en) Network request data classification model training method, classification method and storage medium
Li et al. Learned bloom filter for multi-key membership testing
Zhao et al. TAElog: A Novel Transformer AutoEncoder-Based Log Anomaly Detection Method
Dong et al. Security Situation Assessment Algorithm for Industrial Control Network Nodes Based on Improved Text SimHash
Pokharel Information Extraction Using Named Entity Recognition from Log Messages
CN113516189B (en) Website malicious user prediction method based on two-stage random forest algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200821