CN112711665B - Log anomaly detection method based on density weighted integration rule - Google Patents

Log anomaly detection method based on density weighted integration rule

Info

Publication number
CN112711665B
CN112711665B (application CN202110063328.8A)
Authority
CN
China
Prior art keywords
classification
cluster
rule
csnum
word
Prior art date
Legal status
Active
Application number
CN202110063328.8A
Other languages
Chinese (zh)
Other versions
CN112711665A (en)
Inventor
应时
刘祥瑞
王冰明
黄浩
Current Assignee
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date
Filing date
Publication date
Application filed by Wuhan University (WHU)
Priority to CN202110063328.8A
Publication of CN112711665A
Application granted
Publication of CN112711665B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G06F 16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/216 Parsing using statistical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a log anomaly detection method based on density weighted integration rules. The method ingests a number of software logs and constructs a word frequency vector for each log. Using an improved spectral clustering method on the word frequency vectors, normal clusters and abnormal clusters are obtained, from which a normal log set and an abnormal log set are computed and a balanced log set is constructed. Base classifiers are trained on the balanced log set and assembled into a multi-base classifier; each base classifier produces a classification probability vector for the sample to be classified. From these classification probability vectors, five classification results are obtained through five new integration rules, and the classification result with the highest frequency is selected as the final classification result. The advantages of the invention are that sample balance is guaranteed while the distribution of the original data is preserved, the new integration rules take the relation between the sample to be classified and the historical data into account, and the accuracy of the classification result is improved.

Description

Log anomaly detection method based on density weighted integration rule
Technical Field
The invention belongs to the field of log anomaly detection, and particularly relates to a log anomaly detection method based on a density weighted integration rule.
Background
Modern systems evolve at large scale: they either scale horizontally into complex systems built on thousands of commodity machines (e.g., Spark), or scale vertically into supercomputers with thousands of processors (such as Blue Gene/L). These systems have become a core part of the IT industry, and the occurrence of faults and their impact on system performance and operating costs has become a very important research issue. Complex software and systems not only contain more bugs, but are also difficult to understand and analyze. In addition, the quality of these systems degrades as they age. These problems can lead to software crashes or system downtime.
Logs can be used to obtain software information for detecting and locating anomalies. Traditionally, system administrators examine the log data generated by a system to gain insight into its behavior. However, due to the increasing size and complexity of systems, a large number of logs are generated every day. When a problem occurs, it is very time consuming for an operator to find it by manually checking a large number of log messages. Therefore, the need for automated tools for log anomaly detection is increasing.
In log data, normal logs record the normal state of the system or software and abnormal logs record its abnormal state, and the number of logs describing the normal state is much larger than the number describing abnormal states, so an imbalanced data distribution is a characteristic of log data. Today's standard machine learning algorithms assume balanced data and usually perform poorly on imbalanced samples. Classifiers based on traditional machine learning algorithms typically ignore the minority classes, because these classification algorithms tend to maximize the overall classification accuracy; their accuracy is therefore not good enough for imbalanced classification problems. Ensemble learning can mitigate this problem by combining the classification results of multiple base classifiers. However, each individual result of a base classifier is still not accurate, because it is still trained on imbalanced data sampled proportionally from the original imbalanced data. Therefore, special sampling methods are combined with ensemble frameworks such as Bagging; such methods include UnderBagging, XGBoost and SMOTE-Bagging. These ensemble learning methods use sampling to turn the imbalanced data into balanced samples, train a base classifier on the balanced samples, and produce multiple classification results, which are then merged using specific integration rules. However, when ensemble learning based methods are used to detect anomalies in log data, two problems remain:
The imbalanced sample handling problem. In general, ensemble learning methods use Bootstrap, a sampling method with replacement, to randomly obtain a balanced data set. This changes the distribution of the raw data or overfits the classifier. Therefore, when base classifiers are trained on samples obtained by these sampling methods, the problem of low accuracy remains.
The integration rule problem. There are five traditional integration rules: the Max Rule, Min Rule, Product Rule, Majority Rule and Sum Rule. However, the sample to be classified is usually most relevant to the samples of a particular class in the historical data, and these traditional integration rules simply merge all classification results. The accuracy of ensemble learning based anomaly detection can be improved if the relationship between the sample to be classified and the historical data is taken into account.
Disclosure of Invention
In view of the above background and problems, the invention provides a log anomaly detection method based on density weighted integration rules.
Step 1: introducing a plurality of software logs, segmenting and analyzing each software log according to separators to obtain a software word data set, performing union processing on a plurality of software word data sets, further performing word de-duplication processing to obtain a word set, counting the frequency of each word in the word set in each log, and further constructing a software log word frequency vector;
step 2: according to the word frequency vector, extracting and determining an initial central point and an initial class number by using a multi-granularity master curve method based on improved complex distribution data spectral clustering, clustering the word frequency vectors and obtaining accurate clusters, and simultaneously obtaining the central point of each cluster, marking all samples in the cluster according to the state of the central point of each cluster, determining the states of all samples according to the state of the central point to obtain a normal cluster and an abnormal cluster, and counting the number of abnormal clusters, counting the number of samples of the abnormal clusters, calculating to obtain the number of samples of a new normal cluster, sampling the normal clusters to obtain new normal clusters, calculating the number of samples of the new abnormal clusters according to the number of samples of the new normal clusters, sampling the abnormal clusters to obtain new abnormal clusters, obtaining a normal log set and an abnormal log set, and constructing a balanced log set through the normal log set and the abnormal log set;
Step 3: the base classifiers take the balanced log set as a training set for optimization training, the trained base classifiers are used to construct a multi-base classifier, the multi-base classifier is used to classify the samples to be classified, and each base classifier of the multi-base classifier generates a classification probability vector;
Step 4: according to the classification probability vector generated by each base classifier of the multi-base classifier, classification results are obtained through the five integration rules MaxNW, MinNW, MajNW, ProdNW and SumNW; the five classification results are traversed and, whenever identical classification results are found, the frequency of that classification result is increased by one, yielding the frequency of each classification result; the classification result with the maximum frequency is selected as the final classification result, and if several classification results share the maximum frequency, one of them is randomly selected as the final classification result;
Preferably, each software log in step 1 is:
Logi
i∈[1,M]
where Logi is the i-th software log and M is the number of software logs;
the software word data set in step 1 is:
Datai={Wordi,1,Wordi,2,...,Wordi,Ni}
where Datai is the software word data set of the i-th software log, Wordi,j is the j-th software word in the software word data set of the i-th software log, Ni is the number of software words in the software word data set of the i-th software log, and j∈[1,Ni];
in step 1, the union of the software word data sets {Data1,Data2,...,DataM} is taken;
the word set obtained by word de-duplication in step 1 is:
WordSet={Word1,Word2,...,WordL}
where Wordk is the k-th word in the word set, L is the number of words in the word set, and k∈[1,L];
in step 1, the frequency of each word of the word set in each log is counted as:
Freqk={Fk,1,Fk,2,...,Fk,M}
where Freqk is the set of occurrence frequencies of the k-th word of the word set in the software word data sets, Fk,i is the frequency of the k-th word of the word set in the software word data set of the i-th software log, L is the number of words in the word set, and k∈[1,L];
the software log word frequency vector constructed in step 1 is:
Vectori={F1,i,F2,i,...,FL,i}
i∈[1,M]
where Vectori is the word frequency vector of the i-th software log and M is the number of software logs;
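As a non-limiting illustration of step 1, the following Python sketch builds word frequency vectors from raw log lines; splitting on whitespace and common punctuation separators and the function names are assumptions made for this example, not the exact parsing of the invention.

import re
from collections import Counter

def build_word_frequency_vectors(logs):
    """Split each log on separators, build the de-duplicated word set WordSet,
    and count how often each word of the set occurs in every log (Vectori)."""
    # Step 1a: split every log Log_i into its software word data set Data_i.
    datasets = []
    for log in logs:
        words = [w for w in re.split(r"[\s,;:=()\[\]]+", log.strip()) if w]
        datasets.append(words)
    # Step 1b: union of all word data sets, then de-duplication -> WordSet.
    word_set = sorted(set().union(*[set(d) for d in datasets]))
    # Step 1c: word frequency vector Vector_i = {F_1,i, ..., F_L,i}.
    vectors = []
    for words in datasets:
        counts = Counter(words)
        vectors.append([counts.get(word, 0) for word in word_set])
    return word_set, vectors

# Tiny hypothetical example with two log lines.
logs = ["INFO block received from node-1", "WARN block lost on node-2"]
word_set, vectors = build_word_frequency_vectors(logs)
print(word_set)
print(vectors)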
Preferably, the initial center point and the classes in step 2 are respectively:
CenterPoint0
Classes={Class1,Class2,...,ClassClaNum}
where CenterPoint0 is the initial center point, Classclanum is the clanum-th class, and clanum∈[1,ClaNum];
the precise clusters in step 2 are:
Clusters={Cluster1,Cluster2,...,ClusterCluNum}
where Clusterclunum is the clunum-th cluster and clunum∈[1,CluNum];
the center point of each cluster in step 2 is:
CenterPoints={CenterPoint0,CenterPoint1,...,CenterPointCluNum}
where CenterPointclunum is the center point of the clunum-th cluster; in particular, CenterPoint0 is the initial center point;
the state of the center point of each cluster in step 2 is:
CenterPointStates={CPState1,CPState2,...,CPStateCluNum}
where CPStateclunum is the state of the center point of the clunum-th cluster, CPStateclunum∈[0,CluNum-1]; CPStateclunum=0 indicates that the center point of the clunum-th cluster is normal, and CPStateclunum≠0 indicates that the center point of the clunum-th cluster is abnormal; there is only 1 normal state and there are CluNum-1 abnormal states;
the samples in a cluster in step 2 are:
Clusterclunum={Sampleclunum,1,Sampleclunum,2,...,Sampleclunum,SamNumclunum}
where Clusterclunum is the clunum-th cluster, Sampleclunum,samnum is the samnum-th sample of the clunum-th cluster, SamNumclunum is the number of samples of the clunum-th cluster, and samnum∈[1,SamNumclunum];
in step 2, the states of all samples are determined from the states of the center points as:
SamStatesclunum={SamStateclunum,1,SamStateclunum,2,...,SamStateclunum,SamNumclunum}
where SamStatesclunum is the sample state set of the clunum-th cluster and SamStateclunum,samnum is the state of the samnum-th sample of the clunum-th cluster; the sample states are related to CPStateclunum as follows:
if CPStateclunum=0
then SamStateclunum,samnum=0 for all samnum∈[1,SamNumclunum]
that is, if the state of the center point of the clunum-th cluster is normal, the states of all samples in the clunum-th cluster are normal and the clunum-th cluster is a normal cluster;
if CPStateclunum=x (x∈[1,CluNum-1])
then SamStateclunum,samnum=x for all samnum∈[1,SamNumclunum]
that is, if the state of the center point of the clunum-th cluster is abnormal, the states of all samples in the clunum-th cluster are abnormal and the clunum-th cluster is an abnormal cluster;
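For illustration, labeling all samples of a cluster with the state of its center point can be written as a short loop; the container layout (one list of samples per cluster, one state per cluster) is hypothetical.

def label_samples_by_center(clusters, center_states):
    """Assign every sample the state of its cluster's center point:
    state 0 -> normal cluster, any other state -> abnormal cluster."""
    normal_clusters, abnormal_clusters = [], []
    for cluster, state in zip(clusters, center_states):
        labeled = [(sample, state) for sample in cluster]
        if state == 0:
            normal_clusters.append(labeled)
        else:
            abnormal_clusters.append(labeled)
    return normal_clusters, abnormal_clusters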
The normal cluster set in step 2 is:
NorClusters0={NorCluster0,1}
where NorClusters0 denotes the normal cluster set and NorCluster0,1 is the 1st normal cluster of the normal cluster set; there is only 1 normal cluster;
the abnormal cluster set in step 2 is:
AbnorClusters0={AbnorCluster0,1,AbnorCluster0,2,...,AbnorCluster0,Nabnor}
where AbnorClusters0 denotes the abnormal cluster set, AbnorCluster0,nabnor is the nabnor-th abnormal cluster of the abnormal cluster set, nabnor∈[1,Nabnor], and Nabnor is the number of abnormal clusters;
the numbers of samples of the abnormal clusters in step 2 are:
Figure BDA0002903523190000053
where
Figure BDA0002903523190000054
is the number of samples of the nabnor-th abnormal cluster;
the number of samples of the new normal cluster in the step 2 is as follows:
Figure BDA0002903523190000055
the number of samples of the new abnormal cluster in the step 2 is as follows:
Figure BDA0002903523190000061
Nnor samples are extracted from each normal cluster, resulting in new normal clusters:
NorClusters1={NorCluster1,1}
where NorClusters1 is the normal cluster set after 1 round of sampling and NorCluster1,1 is the 1st normal cluster after 1 round of sampling;
the computed number of samples is extracted from each abnormal cluster, obtaining new abnormal clusters:
AbnorClusters1={AbnorCluster1,1,AbnorCluster1,2,...,AbnorCluster1,Nabnor}
where AbnorClusters1 is the abnormal cluster set after 1 round of sampling, AbnorCluster1,nabnor is the nabnor-th abnormal cluster after 1 round of sampling, nabnor∈[1,Nabnor], and Nabnor is the number of abnormal clusters; this process is repeated N times;
the normal log sample set in step 2 is:
NorSet={NorClusters1,NorClusters2,...,NorClustersN}
where NorClustersn is the normal cluster set after the n-th round of sampling, n∈[1,N];
the abnormal log sample set in step 2 is:
AbnorSet={AbnorClusters1,AbnorClusters2,...,AbnorClustersN}
where AbnorClustersn is the abnormal cluster set after the n-th round of sampling;
the balanced log set in step 2 is:
BalanceSet={BS1,BS2,...,BSN}
where BSn is the balanced log set of the n-th round of sampling, BSn={AbnorClustersn,NorClustersn}.
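The sampling of step 2 can be sketched as follows. Because the exact sample-count formulas appear only as figures in the text, this sketch assumes that the new normal cluster receives as many samples as all abnormal clusters together and that each new abnormal cluster contributes an equal share; sampling without replacement is also an assumption of the example.

import random

def build_balanced_log_sets(normal_cluster, abnormal_clusters, rounds):
    """Build BS_1..BS_N: each round draws a new normal cluster and new
    abnormal clusters so that normal and abnormal samples are balanced."""
    # Assumed balance: the total abnormal sample count defines the normal draw,
    # split evenly over the abnormal clusters for the abnormal draw.
    n_nor = sum(len(c) for c in abnormal_clusters)
    n_per_abnormal = max(1, n_nor // len(abnormal_clusters))

    balance_set = []
    for _ in range(rounds):
        new_normal = random.sample(normal_cluster, min(n_nor, len(normal_cluster)))
        new_abnormal = [random.sample(c, min(n_per_abnormal, len(c)))
                        for c in abnormal_clusters]
        balance_set.append({"NorClusters": new_normal, "AbnorClusters": new_abnormal})
    return balance_set

# Hypothetical usage: one normal cluster of 100 vectors, 3 abnormal clusters.
normal = [[i, 0] for i in range(100)]
abnormal = [[[i, 1] for i in range(5)], [[i, 2] for i in range(4)], [[i, 3] for i in range(6)]]
balance_sets = build_balanced_log_sets(normal, abnormal, rounds=10)
print(len(balance_sets), len(balance_sets[0]["NorClusters"]))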
Preferably, the multi-base classifier in step 3 is:
MulClassifier={CS1,CS2,...,CSCSNum}
where MulClassifier is the multi-base classifier, CScsnum is the csnum-th base classifier, and csnum∈[1,CSNum];
the sample to be classified in step 3 is:
S
the classification probability vector generated by each base classifier of the multi-base classifier in step 3 is:
CScsnum={probcsnum,1*|MKNN(S,1)|,probcsnum,2*|MKNN(S,2)|,...,probcsnum,ClaNum*|MKNN(S,ClaNum)|}
where probcsnum,clanum is the probability that the csnum-th base classifier classifies the sample S to be classified into the clanum-th class, MKNN(S,clanum) (clanum∈[1,ClaNum]) denotes the mutual nearest neighbor samples of the sample S to be classified within the sample cluster of the clanum-th class, and |MKNN(S,clanum)| denotes the number of mutual nearest neighbor samples of the sample S to be classified within the sample cluster of the clanum-th class;
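A minimal sketch of the density weighted classification probability vector of step 3 is given below. Each class probability of a base classifier is multiplied by |MKNN(S, class)|, the number of mutual nearest neighbors that the sample S has inside the historical sample cluster of that class. The neighborhood size k and the use of Euclidean distance are assumptions of this example, not values fixed by the invention.

import numpy as np

def knn_indices(query, points, k):
    """Indices of the k points of `points` closest to `query` (Euclidean)."""
    dists = np.linalg.norm(points - query, axis=1)
    return set(np.argsort(dists)[:k].tolist())

def mknn_count(s, class_cluster, k=5):
    """|MKNN(S, class)|: points of the class cluster that are k-nearest
    neighbors of S and for which S is in turn a k-nearest neighbor."""
    cluster = np.asarray(class_cluster, dtype=float)
    s = np.asarray(s, dtype=float)
    near_s = knn_indices(s, cluster, k)
    extended = np.vstack([cluster, s])        # the cluster plus S itself
    s_index = len(extended) - 1
    count = 0
    for idx in near_s:
        # k+1 because every point is trivially its own nearest neighbor here.
        if s_index in knn_indices(cluster[idx], extended, k + 1):
            count += 1
    return count

def weighted_probability_vector(prob_vector, s, class_clusters, k=5):
    """CS_csnum = {prob_csnum,c * |MKNN(S, c)| for c = 1..ClaNum}."""
    return [p * mknn_count(s, class_clusters[c], k)
            for c, p in enumerate(prob_vector)]

In the setting of the invention, the class clusters would be the normal and abnormal clusters obtained in step 2, which serve as the historical data of each class.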
preferably, the MaxNW rule in step 4 is:
obtaining a classification probability vector generated by each base classifier of the multi-base classifiers, obtaining the respective maximum classification probability of each base classifier from the generated classification probability vectors, splicing the maximum classification probabilities into a maximum classification probability set, selecting the maximum classification probability from the maximum classification probability set, and searching to obtain a class corresponding to the maximum classification probability, namely a classification result under a MaxNW rule;
the MaxNW rule describes that the respective maximum classification probability is:
max(CScsnum)
where CScsnum is the csnum-th base classifier and csnum∈[1,CSNum];
The maximum classification probability set of the MaxNW rule is:
MaxPro={max(CS1),max(CS2),...,max(CSCSNum)}
where max(CScsnum) is the maximum classification probability of the csnum-th base classifier;
the maximum classification probability in the maximum classification probability set of the MaxNW rule is:
max(MaxPro)
the maximum classification probability obtained by the search according to the MaxNW rule is:
max(MaxPro)=probcsnum,a1*|MKNN(S,a1)|
where probcsnum,a1 is the probability that the csnum-th base classifier classifies the sample S to be classified into the a1-th class, a1∈[1,ClaNum];
the classification result under the MaxNW rule is:
Result1=Classa1
step 4, the MinNW rule is:
obtaining a classification probability set generated by each base classifier of the multiple base classifiers, obtaining the respective minimum classification probability of each base classifier from the generated classification probability vectors, splicing the minimum classification probabilities into a minimum classification probability set, selecting the minimum classification probability from the minimum classification probability set, and searching to obtain a class corresponding to the minimum classification probability, namely a classification result under a MinNW rule;
the MinNW rule states that the respective minimum classification probabilities are:
min(CScsnum)
where CScsnum is the csnum-th base classifier and csnum∈[1,CSNum];
The MinNW rule the minimum set of classification probabilities is:
MinPro={min(CS1),min(CS2),...,min(CSCSNum)}
where min(CScsnum) is the minimum classification probability of the csnum-th base classifier;
the MinNW rule specifies the minimum classification probability in the minimum classification probability set as:
min(MinPro)
the minimum classification probability obtained by the search according to the MinNW rule is:
min(MinPro)=probcsnum,a2*|MKNN(S,a2)|
where probcsnum,a2 is the probability that the csnum-th base classifier classifies the sample S to be classified into the a2-th class, a2∈[1,ClaNum];
the classification result under the MinNW rule is:
Result2=Classa2
step 4, the MajNW rule is:
obtaining a classification probability vector generated by each base classifier of the multiple base classifiers, obtaining the respective maximum classification probability of each base classifier from the generated classification probability vectors, splicing the maximum classification probabilities into a maximum classification probability set, counting the frequency of occurrence of the class corresponding to each maximum classification probability in the maximum classification probability set to obtain a frequency set, and searching the frequency set again to obtain the class corresponding to the maximum frequency, namely the classification result under the MajNW rule;
the frequency set of the MajNW rule is:
Count={count1,count2,...,countClaNum}
where countclanum is the frequency of occurrence of the clanum-th class in the maximum classification probability set, clanum∈[1,ClaNum];
MajNW rule the maximum frequency is:
max(Count)
the maximum frequency obtained by the search according to the MajNW rule is:
max(Count)=counta3
where a3∈[1,ClaNum] denotes the class with the maximum frequency;
the classification result under the MajNW rule is:
Result3=Classa3
step 4, the ProdNW rule is as follows:
obtaining a classification probability vector generated by each base classifier of the multi-base classifiers, obtaining a classification probability set of each class according to the classification probability vector, obtaining a whole classification probability set according to the classification probability set, sequentially multiplying the classification probabilities of the classification probability sets of each class to obtain a product of the classification probability of each class, obtaining a set of products of the classification probabilities according to the product, obtaining a maximum product of the classification probability from the set, and searching the class corresponding to the maximum product of the classification probability, namely a classification result under ProdNW rules;
ProdNW rule the entire set of classification probabilities:
ClassProbability={CP1,CP2,...,CPClaNum}
where CPclanum is the classification probability set of the clanum-th class and clanum∈[1,ClaNum];
The ProdNW rule sets the classification probability of each class as:
CPclanum={prob1,clanum,prob2,clanum,...,probCSNum,clanum}
where probcsnum,clanum is the probability that the csnum-th base classifier classifies the sample S to be classified into the clanum-th class;
the product of the classification probabilities of each class under the ProdNW rule is:
produceclanum=prob1,clanum*prob2,clanum*...*probCSNum,clanum
the set of products of the ProdNW rule classification probabilities is:
ProdPro={produce1,produce2,...,produceClaNum}
the maximum product of classification probabilities under the ProdNW rule is:
max(ProdPro)
the maximum product of classification probabilities obtained by the search according to the ProdNW rule is:
max(ProdPro)=producea4
where a4∈[1,ClaNum] denotes the class with the maximum product of classification probabilities;
the classification result under the ProdNW rule is:
Result4=Classa4
step 4 the SumNW rule is:
obtaining a classification probability vector generated by each base classifier of a multi-base classifier, obtaining a classification probability set of each class according to the classification probability vector, obtaining a whole classification probability set according to the classification probability set, sequentially adding the classification probabilities of the classification probability sets of each class to obtain a sum of the classification probabilities of each class, obtaining a set of the sums of the classification probabilities according to the sum of the classification probabilities, obtaining the maximum sum of the classification probabilities from the set of the classification probabilities, and searching the class corresponding to the maximum sum of the classification probabilities, wherein the class is a classification result under the SumNW rule;
the sum of the classification probabilities of each class under the SumNW rule is:
sumclanum=prob1,clanum+prob2,clanum+...+probCSNum,clanum
the set of sums of classification probabilities under the SumNW rule is:
SumPro={sum1,sum2,...,sumClaNum}
the maximum sum of classification probabilities under the SumNW rule is:
max(SumPro)
the maximum sum of classification probabilities obtained by the search according to the SumNW rule is:
max(SumPro)=suma5
where a5∈[1,ClaNum] denotes the class with the maximum sum of classification probabilities;
the classification result under the SumNW rule is:
Result5=Classa5
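The five integration rules of step 4 can be summarized compactly; the input is the list of classification probability vectors CS1..CSCSNum produced by the base classifiers, one row per classifier. This is a sketch of the rules exactly as described above, using 0-based class indices; it is not the patented implementation itself.

import numpy as np
from collections import Counter

def integrate(cs_vectors):
    """Return the class chosen by MaxNW, MinNW, MajNW, ProdNW and SumNW."""
    cs = np.asarray(cs_vectors, dtype=float)           # shape (CSNum, ClaNum)

    # MaxNW: class of the largest entry over all classifiers.
    max_nw = int(np.unravel_index(np.argmax(cs), cs.shape)[1])
    # MinNW: class of the smallest entry over all classifiers.
    min_nw = int(np.unravel_index(np.argmin(cs), cs.shape)[1])
    # MajNW: most frequent per-classifier maximum-probability class.
    votes = Counter(int(np.argmax(row)) for row in cs)
    maj_nw = votes.most_common(1)[0][0]
    # ProdNW: class whose probabilities have the largest product.
    prod_nw = int(np.argmax(np.prod(cs, axis=0)))
    # SumNW: class whose probabilities have the largest sum.
    sum_nw = int(np.argmax(np.sum(cs, axis=0)))
    return [max_nw, min_nw, maj_nw, prod_nw, sum_nw]

# Hypothetical example with 3 base classifiers and 2 classes.
print(integrate([[0.2, 0.8], [0.6, 0.4], [0.1, 0.9]]))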
In step 4, the classification results generated by the five integration rules are:
Results={Result1,Result2,...,Result5}
where Results is the classification result set, Resultr is the classification result obtained by the r-th integration rule, r∈[1,5], and the integration rules are ordered as: MaxNW, MinNW, MajNW, ProdNW and SumNW;
the frequencies of the classification results in step 4 are:
ResNums={ResNum1,ResNum2,...,ResNumClaNum}
where ResNums is the frequency set of the classification results generated by the integration rules and ResNumclanum indicates how many times the clanum-th class appears in the classification result set Results;
the classification results with the maximum frequency in step 4 are:
MaxResults={MaxResult1,MaxResult2,...,MaxResultMR}
where MaxResults is the set of classification results with the maximum frequency, MaxResultmr represents the mr-th classification result with the maximum frequency, mr∈[1,MR], and 1≤MR≤5; if MR=1, the final classification result is MaxResult1; if MR≠1, one classification result is randomly selected from the set as the final classification result;
the final classification result in step 4 is:
random=rand(1,MR)
FinalResult=MaxResultrandom
where random=rand(1,MR) randomly selects an integer from the interval [1,MR], and FinalResult=MaxResultrandom means that the randomly selected classification result with the maximum frequency is taken as the final classification result.
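The final decision therefore keeps the class that occurs most often among the five rule results, breaking ties at random; a minimal sketch:

import random
from collections import Counter

def final_result(rule_results):
    """Most frequent class among the five rule results; ties broken at random."""
    counts = Counter(rule_results)
    best = max(counts.values())
    candidates = [cls for cls, num in counts.items() if num == best]
    return random.choice(candidates)

# Example: MaxNW, MinNW, MajNW, ProdNW and SumNW voted classes 2, 0, 2, 2, 1.
print(final_result([2, 0, 2, 2, 1]))   # prints 2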
The invention has the advantages that sample balance is guaranteed while the distribution of the original data is taken into account, the new integration rules also consider the relation between the sample to be classified and the historical data, and the accuracy of the classification result is improved.
Drawings
FIG. 1 is a flow chart of step 1 of the invention;
FIG. 2 is a flow chart of step 2 of the invention;
FIG. 3 is a flow chart of step 3 of the invention;
FIG. 4 is a flow chart of step 4 of the invention;
FIG. 5 is a flow chart of the method of the invention.
Detailed Description
In order to facilitate the understanding and implementation of the present invention by those of ordinary skill in the art, the present invention is further described in detail below with reference to the accompanying drawings and examples. It is to be understood that the embodiments described herein are merely illustrative and explanatory of the present invention and are not restrictive thereof.
The following describes an embodiment of the present invention with reference to fig. 1 to 5, which is a log anomaly detection method based on density weighted integration rules, and specifically includes the following steps:
step 1: introducing a plurality of software logs, segmenting and analyzing each software log according to separators to obtain a software word data set, performing union processing on a plurality of software word data sets, further performing word de-duplication processing to obtain a word set, counting the frequency of each word in the word set in each log, and further constructing a software log word frequency vector, as shown in fig. 1.
In step 1, each software log is:
Logi
i∈[1,M]
where Logi is the i-th software log and M is the number of software logs; M=1024;
the software word data set in step 1 is:
Datai={Wordi,1,Wordi,2,...,Wordi,Ni}
where Datai is the software word data set of the i-th software log, Wordi,j is the j-th software word in the software word data set of the i-th software log, Ni is the number of software words in the software word data set of the i-th software log, and j∈[1,Ni]; Ni=50;
in step 1, the union of the software word data sets {Data1,Data2,...,DataM} is taken;
the word set obtained by word de-duplication in step 1 is:
WordSet={Word1,Word2,...,WordL}
where Wordk is the k-th word in the word set, L is the number of words in the word set, and k∈[1,L]; L=1024;
in step 1, the frequency of each word of the word set in each log is counted as:
Freqk={Fk,1,Fk,2,...,Fk,M}
where Freqk is the set of occurrence frequencies of the k-th word of the word set in the software word data sets, Fk,i is the frequency of the k-th word of the word set in the software word data set of the i-th software log, L is the number of words in the word set, and k∈[1,L];
the software log word frequency vector constructed in step 1 is:
Vectori={F1,i,F2,i,...,FL,i}
i∈[1,M]
where Vectori is the word frequency vector of the i-th software log and M is the number of software logs;
step 2: after the word frequency vectors obtained in the step 1 are obtained, an initial central point and an initial class number are extracted and determined by using a multi-granularity master curve method based on improved complex distribution data spectral clustering, the word frequency vectors are clustered to obtain accurate clusters, the central point of each cluster is obtained at the same time, all samples in the clusters are marked according to the state of the central point of each cluster, the states of all samples are determined according to the state of the central point to obtain normal clusters and abnormal clusters, the number of the abnormal clusters is counted, the number of the samples of the new normal clusters is calculated, the new normal clusters are obtained by sampling the normal clusters, the number of the samples of the new abnormal clusters is calculated through the number of the samples of the new normal clusters, the new abnormal clusters are obtained by sampling the abnormal clusters, a normal log set and an abnormal log set are obtained, a balanced log set is constructed through the normal log set and the abnormal log set, as shown in fig. 2.
In step 2, the initial center point and the classes are respectively:
CenterPoint0
Classes={Class1,Class2,...,ClassClaNum}
where CenterPoint0 is the initial center point, Classclanum is the clanum-th class, and clanum∈[1,ClaNum]; ClaNum=15;
the precise clusters in step 2 are:
Clusters={Cluster1,Cluster2,...,ClusterCluNum}
where Clusterclunum is the clunum-th cluster and clunum∈[1,CluNum]; CluNum=15;
the center point of each cluster in step 2 is:
CenterPoints={CenterPoint0,CenterPoint1,...,CenterPointCluNum}
where CenterPointclunum is the center point of the clunum-th cluster; in particular, CenterPoint0 is the initial center point;
the state of the center point of each cluster in step 2 is:
CenterPointStates={CPState1,CPState2,...,CPStateCluNum}
where CPStateclunum is the state of the center point of the clunum-th cluster, CPStateclunum∈[0,CluNum-1]; CPStateclunum=0 indicates that the center point of the clunum-th cluster is normal, and CPStateclunum≠0 indicates that the center point of the clunum-th cluster is abnormal; there is only 1 normal state and there are CluNum-1 abnormal states;
the samples in a cluster in step 2 are:
Clusterclunum={Sampleclunum,1,Sampleclunum,2,...,Sampleclunum,SamNumclunum}
where Clusterclunum is the clunum-th cluster, Sampleclunum,samnum is the samnum-th sample of the clunum-th cluster, SamNumclunum is the number of samples of the clunum-th cluster, and samnum∈[1,SamNumclunum];
in step 2, the states of all samples are determined from the states of the center points as:
SamStatesclunum={SamStateclunum,1,SamStateclunum,2,...,SamStateclunum,SamNumclunum}
where SamStatesclunum is the sample state set of the clunum-th cluster and SamStateclunum,samnum is the state of the samnum-th sample of the clunum-th cluster; the sample states are related to CPStateclunum as follows:
if CPStateclunum=0
then SamStateclunum,samnum=0 for all samnum∈[1,SamNumclunum]
that is, if the state of the center point of the clunum-th cluster is normal, the states of all samples in the clunum-th cluster are normal and the clunum-th cluster is a normal cluster;
if CPStateclunum=x (x∈[1,CluNum-1])
then SamStateclunum,samnum=x for all samnum∈[1,SamNumclunum]
that is, if the state of the center point of the clunum-th cluster is abnormal, the states of all samples in the clunum-th cluster are abnormal and the clunum-th cluster is an abnormal cluster;
the normal cluster set in step 2 is:
NorClusters0={NorCluster0,1}
where NorClusters0 denotes the normal cluster set and NorCluster0,1 is the 1st normal cluster of the normal cluster set; there is only 1 normal cluster;
the abnormal cluster set in step 2 is:
AbnorClusters0={AbnorCluster0,1,AbnorCluster0,2,...,AbnorCluster0,Nabnor}
where AbnorClusters0 denotes the abnormal cluster set, AbnorCluster0,nabnor is the nabnor-th abnormal cluster of the abnormal cluster set, nabnor∈[1,Nabnor], and Nabnor is the number of abnormal clusters; Nabnor=14;
the numbers of samples of the abnormal clusters in step 2 are:
Figure BDA0002903523190000145
where
Figure BDA0002903523190000146
is the number of samples of the nabnor-th abnormal cluster;
the number of samples of the new normal cluster in the step 2 is as follows:
Figure BDA0002903523190000151
the number of samples of the new abnormal cluster in the step 2 is as follows:
Figure BDA0002903523190000152
Nnor samples are extracted from each normal cluster, resulting in new normal clusters:
NorClusters1={NorCluster1,1}
where NorClusters1 is the normal cluster set after 1 round of sampling and NorCluster1,1 is the 1st normal cluster after 1 round of sampling;
the computed number of samples is extracted from each abnormal cluster, obtaining new abnormal clusters:
AbnorClusters1={AbnorCluster1,1,AbnorCluster1,2,...,AbnorCluster1,Nabnor}
where AbnorClusters1 is the abnormal cluster set after 1 round of sampling, AbnorCluster1,nabnor is the nabnor-th abnormal cluster after 1 round of sampling, nabnor∈[1,Nabnor], and Nabnor is the number of abnormal clusters; this process is repeated N times;
the normal log sample set in step 2 is:
NorSet={NorClusters1,NorClusters2,...,NorClustersN}
where NorClustersn is the normal cluster set after the n-th round of sampling, n∈[1,N]; N=10;
the abnormal log sample set in step 2 is:
AbnorSet={AbnorClusters1,AbnorClusters2,...,AbnorClustersN}
where AbnorClustersn is the abnormal cluster set after the n-th round of sampling;
the balanced log set in step 2 is:
BalanceSet={BS1,BS2,...,BSN}
where BSn is the balanced log set of the n-th round of sampling, BSn={AbnorClustersn,NorClustersn}.
Step 3: the base classifiers take the balanced log set as a training set for optimization training, the trained base classifiers are used to construct a multi-base classifier, the multi-base classifier is used to classify the samples to be classified, and each base classifier of the multi-base classifier generates a classification probability vector, as shown in FIG. 3.
In step 3, the multi-base classifier is:
MulClassifier={CS1,CS2,...,CSCSNum}
where MulClassifier is the multi-base classifier, CScsnum is the csnum-th base classifier, and csnum∈[1,CSNum]; CSNum=6;
the sample to be classified in step 3 is:
S
the classification probability vector generated by each base classifier of the multi-base classifier in step 3 is:
CScsnum={probcsnum,1*|MKNN(S,1)|,probcsnum,2*|MKNN(S,2)|,...,probcsnum,ClaNum*|MKNN(S,ClaNum)|}
where probcsnum,clanum is the probability that the csnum-th base classifier classifies the sample S to be classified into the clanum-th class, MKNN(S,clanum) (clanum∈[1,ClaNum]) denotes the mutual nearest neighbor samples of the sample S to be classified within the sample cluster of the clanum-th class, and |MKNN(S,clanum)| denotes the number of mutual nearest neighbor samples of the sample S to be classified within the sample cluster of the clanum-th class;
and step 3: according to the classification probability vector generated by each base classifier of the multi-base classifier, classification results generated by five integration rules are respectively obtained through five integration rules of MaxNW, MinNW, MajNW, ProdNW and SumNW, the five classification results are traversed, if the same classification results exist, the frequency of the classification results is increased by one, the frequency of the classification results is obtained, the classification result with the maximum frequency is selected as the final classification result, and if a plurality of classification results with the maximum frequency exist, one classification result is randomly selected as the final classification result;
step 4, the MaxNW rule is: obtaining a classification probability vector generated by each base classifier of the multi-base classifiers, obtaining the respective maximum classification probability of each base classifier from the generated classification probability vectors, splicing the maximum classification probabilities into a maximum classification probability set, selecting the maximum classification probability from the maximum classification probability set, and searching to obtain a class corresponding to the maximum classification probability, namely a classification result under a MaxNW rule;
the MaxNW rule describes that the respective maximum classification probability is:
max(CScsnum)
where CScsnum is the csnum-th base classifier and csnum∈[1,CSNum];
The maximum classification probability set of the MaxNW rule is:
MaxPro={max(CS1),max(CS2),...,max(CSCSNum)}
where max(CScsnum) is the maximum classification probability of the csnum-th base classifier;
the maximum classification probability in the maximum classification probability set of the MaxNW rule is:
max(MaxPro)
the maximum classification probability obtained by the search according to the MaxNW rule is:
max(MaxPro)=probcsnum,a1*|MKNN(S,a1)|
where probcsnum,a1 is the probability that the csnum-th base classifier classifies the sample S to be classified into the a1-th class, a1∈[1,ClaNum];
the classification result under the MaxNW rule is:
Result1=Classa1
step 4, the MinNW rule is: obtaining a classification probability set generated by each base classifier of the multiple base classifiers, obtaining the respective minimum classification probability by each base classifier from the generated classification probability vectors, splicing the minimum classification probabilities into a minimum classification probability set, selecting the minimum classification probability from the minimum classification probability set, and searching to obtain a class corresponding to the minimum classification probability, namely a classification result under a MinNW rule, as shown in FIG. 4.
The MinNW rule states that the respective minimum classification probabilities are:
min(CScsnum)
where CScsnum is the csnum-th base classifier and csnum∈[1,CSNum];
The MinNW rule the minimum set of classification probabilities is:
MinPro={min(CS1),min(CS2),...,min(CSCSNum)}
where min(CScsnum) is the minimum classification probability of the csnum-th base classifier;
the MinNW rule specifies the minimum classification probability in the minimum classification probability set as:
min(MinPro)
the minimum classification probability obtained by the search according to the MinNW rule is:
min(MinPro)=probcsnum,a2*|MKNN(S,a2)|
where probcsnum,a2 is the probability that the csnum-th base classifier classifies the sample S to be classified into the a2-th class, a2∈[1,ClaNum];
the classification result under the MinNW rule is:
Result2=Classa2
step 4, the MajNW rule is: obtaining a classification probability vector generated by each base classifier of the multiple base classifiers, obtaining the respective maximum classification probability of each base classifier from the generated classification probability vectors, splicing the maximum classification probabilities into a maximum classification probability set, counting the frequency of occurrence of the class corresponding to each maximum classification probability in the maximum classification probability set to obtain a frequency set, and searching the frequency set again to obtain the class corresponding to the maximum frequency, namely the classification result under the MajNW rule;
the frequency set of the MajNW rule is:
Count={count1,count2,...,countClaNum}
where countclanum is the frequency of occurrence of the clanum-th class in the maximum classification probability set, clanum∈[1,ClaNum];
MajNW rule the maximum frequency is:
max(Count)
the maximum frequency obtained by the search according to the MajNW rule is:
max(Count)=counta3
where a3∈[1,ClaNum] denotes the class with the maximum frequency;
the classification result under the MajNW rule is:
Result3=Classa3
step 4, the ProdNW rule is as follows: obtaining a classification probability vector generated by each base classifier of the multi-base classifiers, obtaining a classification probability set of each class according to the classification probability vector, obtaining a whole classification probability set according to the classification probability set, sequentially multiplying the classification probabilities of the classification probability sets of each class to obtain a product of the classification probability of each class, obtaining a set of products of the classification probabilities according to the product, obtaining a maximum product of the classification probability from the set, and searching the class corresponding to the maximum product of the classification probability, namely a classification result under ProdNW rules;
ProdNW rule the entire set of classification probabilities:
ClassProbability={CP1,CP2,...,CPClaNum}
where CPclanum is the classification probability set of the clanum-th class and clanum∈[1,ClaNum];
The ProdNW rule sets the classification probability of each class as:
CPclanum={prob1,clanum,prob2,clanum,...,probCSNum,clanum}
where probcsnum,clanum is the probability that the csnum-th base classifier classifies the sample S to be classified into the clanum-th class;
the product of the classification probabilities of each class under the ProdNW rule is:
produceclanum=prob1,clanum*prob2,clanum*...*probCSNum,clanum
the set of products of the ProdNW rule classification probabilities is:
ProdPro={produce1,produce2,...,produceClaNum}
the maximum product of classification probabilities under the ProdNW rule is:
max(ProdPro)
the maximum product of classification probabilities obtained by the search according to the ProdNW rule is:
max(ProdPro)=producea4
where a4∈[1,ClaNum] denotes the class with the maximum product of classification probabilities;
the classification result under the ProdNW rule is:
Result4=Classa4
step 4 the SumNW rule is: obtaining a classification probability vector generated by each base classifier of a multi-base classifier, obtaining a classification probability set of each class according to the classification probability vector, obtaining a whole classification probability set according to the classification probability set, sequentially adding the classification probabilities of the classification probability sets of each class to obtain a sum of the classification probabilities of each class, obtaining a set of the sums of the classification probabilities according to the sum of the classification probabilities, obtaining the maximum sum of the classification probabilities from the set of the classification probabilities, and searching the class corresponding to the maximum sum of the classification probabilities, wherein the class is a classification result under the SumNW rule;
the sum of the classification probabilities of each class under the SumNW rule is:
sumclanum=prob1,clanum+prob2,clanum+...+probCSNum,clanum
the set of sums of classification probabilities under the SumNW rule is:
SumPro={sum1,sum2,...,sumClaNum}
the maximum sum of classification probabilities under the SumNW rule is:
max(SumPro)
the maximum sum of classification probabilities obtained by the search according to the SumNW rule is:
max(SumPro)=suma5
where a5∈[1,ClaNum] denotes the class with the maximum sum of classification probabilities;
the classification result under the SumNW rule is:
Result5=Classa5
In step 4, the classification results generated by the five integration rules are:
Results={Result1,Result2,...,Result5}
where Results is the classification result set, Resultr is the classification result obtained by the r-th integration rule, r∈[1,5], and the integration rules are ordered as: MaxNW, MinNW, MajNW, ProdNW and SumNW;
the frequencies of the classification results in step 4 are:
ResNums={ResNum1,ResNum2,...,ResNumClaNum}
where ResNums is the frequency set of the classification results generated by the integration rules and ResNumclanum indicates how many times the clanum-th class appears in the classification result set Results;
the classification results with the maximum frequency in step 4 are:
MaxResults={MaxResult1,MaxResult2,...,MaxResultMR}
where MaxResults is the set of classification results with the maximum frequency, MaxResultmr represents the mr-th classification result with the maximum frequency, mr∈[1,MR], and 1≤MR≤5; if MR=1, the final classification result is MaxResult1; if MR≠1, one classification result is randomly selected from the set as the final classification result;
the final classification result in step 4 is:
random=rand(1,MR)
FinalResult=MaxResultrandom
where random=rand(1,MR) randomly selects an integer from the interval [1,MR], and FinalResult=MaxResultrandom means that the randomly selected classification result with the maximum frequency is taken as the final classification result.
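To make the embodiment concrete, the following self-contained toy run strings the steps together on synthetic word frequency vectors. scikit-learn's SpectralClustering and DecisionTreeClassifier are used here only as stand-ins for the improved spectral clustering and the base classifiers of the invention, the MKNN density weights are omitted for brevity, and all data, parameters and names are hypothetical.

import numpy as np
from collections import Counter
from sklearn.cluster import SpectralClustering      # stand-in clustering
from sklearn.tree import DecisionTreeClassifier     # assumed base learner

rng = np.random.default_rng(0)

# Toy word frequency vectors: 200 normal logs and 20 abnormal logs.
normal_vecs = rng.poisson(3.0, size=(200, 10)).astype(float)
abnormal_vecs = rng.poisson(3.0, size=(20, 10)).astype(float)
abnormal_vecs[:, -1] += 8.0                          # abnormal logs use one extra word heavily
vectors = np.vstack([normal_vecs, abnormal_vecs])

# Step 2 (stand-in): cluster the vectors; the largest cluster is taken as normal.
labels = SpectralClustering(n_clusters=3, affinity="nearest_neighbors",
                            random_state=0).fit_predict(vectors)
normal_label = Counter(labels.tolist()).most_common(1)[0][0]
nor = vectors[labels == normal_label]
abnor = vectors[labels != normal_label]

# Step 3: six balanced training sets (CSNum=6 in the embodiment), one tree each.
classifiers = []
for _ in range(6):
    idx = rng.choice(len(nor), size=min(len(nor), len(abnor)), replace=False)
    X = np.vstack([nor[idx], abnor])
    y = np.array([0] * len(idx) + [1] * len(abnor))
    classifiers.append(DecisionTreeClassifier(random_state=0).fit(X, y))

# Step 4 (simplified, without the MKNN weights): five rules plus the final vote.
s = abnormal_vecs[0]
cs = np.array([clf.predict_proba([s])[0] for clf in classifiers])
results = [int(np.unravel_index(np.argmax(cs), cs.shape)[1]),
           int(np.unravel_index(np.argmin(cs), cs.shape)[1]),
           Counter(int(np.argmax(row)) for row in cs).most_common(1)[0][0],
           int(np.argmax(cs.prod(axis=0))),
           int(np.argmax(cs.sum(axis=0)))]
print("rule results:", results, "final:", Counter(results).most_common(1)[0][0])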
The results show that the method provided by the invention achieves a better anomaly detection effect.
It should be understood that parts of the specification not set forth in detail are well within the prior art.
It should be understood that the above description of the preferred embodiments is given for clarity and not for any purpose of limitation, and that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (2)

1. A log anomaly detection method based on density weighted integration rules is characterized by comprising the following steps:
step 1: introducing a plurality of software logs, segmenting and analyzing each software log according to separators to obtain a software word data set, performing union processing on a plurality of software word data sets, further performing word de-duplication processing to obtain a word set, counting the frequency of each word in the word set in each log, and further constructing a software log word frequency vector;
step 2: according to the word frequency vector, extracting and determining an initial central point and an initial class number by using a multi-granularity master curve method based on improved complex distribution data spectral clustering, clustering the word frequency vectors and obtaining accurate clusters, and simultaneously obtaining the central point of each cluster, marking all samples in the cluster according to the state of the central point of each cluster, determining the states of all samples according to the state of the central point to obtain a normal cluster and an abnormal cluster, and counting the number of abnormal clusters, counting the number of samples of the abnormal clusters, calculating to obtain the number of samples of a new normal cluster, sampling the normal clusters to obtain new normal clusters, calculating the number of samples of the new abnormal clusters according to the number of samples of the new normal clusters, sampling the abnormal clusters to obtain new abnormal clusters, obtaining a normal log set and an abnormal log set, and constructing a balanced log set through the normal log set and the abnormal log set;
Step 3: the base classifiers take the balanced log set as a training set for optimization training, a multi-base classifier is constructed from the trained base classifiers, the samples to be classified are classified by the multi-base classifier, and each base classifier of the multi-base classifier generates a classification probability vector;
Step 4: according to the classification probability vector generated by each base classifier of the multi-base classifier, classification results are obtained through the five integration rules MaxNW, MinNW, MajNW, ProdNW and SumNW; the five classification results are traversed and, whenever identical classification results are found, the frequency of that classification result is increased by one, yielding the frequency of each classification result; the classification result with the maximum frequency is selected as the final classification result, and if several classification results share the maximum frequency, one of them is randomly selected as the final classification result;
In step 2, the initial center point and the classes are respectively:
CenterPoint0
Classes={Class1,Class2,...,ClassClaNum}
where CenterPoint0 is the initial center point, Classclanum is the clanum-th class, and clanum∈[1,ClaNum];
the precise clusters in step 2 are:
Clusters={Cluster1,Cluster2,...,ClusterCluNum}
where Clusterclunum is the clunum-th cluster and clunum∈[1,CluNum];
the center point of each cluster in step 2 is:
CenterPoints={CenterPoint0,CenterPoint1,...,CenterPointCluNum}
where CenterPointclunum is the center point of the clunum-th cluster; in particular, CenterPoint0 is the initial center point;
the state of the center point of each cluster in step 2 is:
CenterPointStates={CPState1,CPState2,...,CPStateCluNum}
where CPStateclunum is the state of the center point of the clunum-th cluster, CPStateclunum∈[0,CluNum-1]; CPStateclunum=0 indicates that the center point of the clunum-th cluster is normal, and CPStateclunum≠0 indicates that the center point of the clunum-th cluster is abnormal; there is only 1 normal state and there are CluNum-1 abnormal states;
the samples in a cluster in step 2 are:
Clusterclunum={Sampleclunum,1,Sampleclunum,2,...,Sampleclunum,SamNumclunum}
where Clusterclunum is the clunum-th cluster, Sampleclunum,samnum is the samnum-th sample of the clunum-th cluster, SamNumclunum is the number of samples of the clunum-th cluster, and samnum∈[1,SamNumclunum];
in step 2, the states of all samples are determined from the states of the center points as:
SamStatesclunum={SamStateclunum,1,SamStateclunum,2,...,SamStateclunum,SamNumclunum}
where SamStatesclunum is the sample state set of the clunum-th cluster and SamStateclunum,samnum is the state of the samnum-th sample of the clunum-th cluster; the sample states are related to CPStateclunum as follows:
if CPStateclunum=0
then SamStateclunum,samnum=0 for all samnum∈[1,SamNumclunum]
that is, if the state of the center point of the clunum-th cluster is normal, the states of all samples in the clunum-th cluster are normal and the clunum-th cluster is a normal cluster;
if CPStateclunum=x (x∈[1,CluNum-1])
then SamStateclunum,samnum=x for all samnum∈[1,SamNumclunum]
that is, if the state of the center point of the clunum-th cluster is abnormal, the states of all samples in the clunum-th cluster are abnormal and the clunum-th cluster is an abnormal cluster;
the normal cluster set is:
NorClusters0={NorCluster0,1}
where NorClusters0 denotes the normal cluster set and NorCluster0,1 is the 1st normal cluster of the normal cluster set; there is only 1 normal cluster;
the abnormal cluster set is:
AbnorClusters0={AbnorCluster0,1,AbnorCluster0,2,...,AbnorCluster0,Nabnor}
where AbnorClusters0 denotes the abnormal cluster set, AbnorCluster0,nabnor is the nabnor-th abnormal cluster of the abnormal cluster set, nabnor∈[1,Nabnor], and Nabnor is the number of abnormal clusters;
the numbers of samples of the abnormal clusters in step 2 are:
Figure FDA0003533751220000034
where
Figure FDA0003533751220000035
is the number of samples of the nabnor-th abnormal cluster;
the number of samples of the new normal cluster in the step 2 is as follows:
Figure FDA0003533751220000036
the number of samples of the new abnormal cluster in the step 2 is as follows:
Figure FDA0003533751220000037
Nnor samples are extracted from each normal cluster, resulting in new normal clusters:
NorClusters1={NorCluster1,1}
where NorClusters1 is the normal cluster set after 1 round of sampling and NorCluster1,1 is the 1st normal cluster after 1 round of sampling;
the computed number of samples is extracted from each abnormal cluster, obtaining new abnormal clusters:
AbnorClusters1={AbnorCluster1,1,AbnorCluster1,2,...,AbnorCluster1,Nabnor}
where AbnorClusters1 is the abnormal cluster set after 1 round of sampling, AbnorCluster1,nabnor is the nabnor-th abnormal cluster after 1 round of sampling, nabnor∈[1,Nabnor], and Nabnor is the number of abnormal clusters; this process is repeated N times;
the normal log sample set in step 2 is:
NorSet={NorClusters1,NorClusters2,...,NorClustersN}
where NorClustersn is the normal cluster set after the n-th round of sampling, n∈[1,N];
the abnormal log sample set in step 2 is:
AbnorSet={AbnorClusters1,AbnorClusters2,...,AbnorClustersN}
where AbnorClustersn is the abnormal cluster set after the n-th round of sampling;
the balanced log set in step 2 is:
BalanceSet={BS1,BS2,...,BSN}
where BSn is the balanced log set of the n-th round of sampling,
BSn={AbnorClustersn,NorClustersn};
in step 3, the multi-base classifier is:
MulClassifier={CS1,CS2,...,CSCSNum}
where MulClassifier is the multi-base classifier, CScsnum is the csnum-th base classifier, and csnum∈[1,CSNum];
the sample to be classified in step 3 is:
S
the classification probability vector generated by each base classifier of the multi-base classifier in step 3 is:
CScsnum={probcsnum,1*|MKNN(S,1)|,probcsnum,2*|MKNN(S,2)|,...,probcsnum,ClaNum*|MKNN(S,ClaNum)|}
where probcsnum,clanum is the probability that the csnum-th base classifier classifies the sample S to be classified into the clanum-th class, MKNN(S,clanum) (clanum∈[1,ClaNum]) denotes the mutual nearest neighbor samples of the sample S to be classified within the sample cluster of the clanum-th class, and |MKNN(S,clanum)| denotes the number of mutual nearest neighbor samples of the sample S to be classified within the sample cluster of the clanum-th class;
step 4, the MaxNW rule is:
obtaining a classification probability vector generated by each base classifier of the multi-base classifiers, obtaining the respective maximum classification probability of each base classifier from the generated classification probability vectors, splicing the maximum classification probabilities into a maximum classification probability set, selecting the maximum classification probability from the maximum classification probability set, and searching to obtain a class corresponding to the maximum classification probability, namely a classification result under a MaxNW rule;
the MaxNW rule describes that the respective maximum classification probability is:
max(CScsnum)
where CScsnum is the csnum-th base classifier and csnum∈[1,CSNum];
The maximum classification probability set of the MaxNW rule is:
MaxPro={max(CS1),max(CS2),...,max(CSCSNum)}
where max(CScsnum) is the maximum classification probability of the csnum-th base classifier;
the maximum classification probability in the maximum classification probability set of the MaxNW rule is:
max(MaxPro)
the maximum classification probability obtained by the search according to the MaxNW rule is as follows:
Figure FDA0003533751220000051
wherein prob_{csnum,a_1} is the probability that the csnum-th base classifier classifies the sample S to be classified into the a_1-th class, a_1 ∈ [1, ClaNum];
the classification result under the MaxNW rule is:
Figure FDA0003533751220000052
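A compact sketch of the MaxNW selection, assuming cs_vectors stacks one density-weighted probability vector per base classifier (CSNum rows, ClaNum columns); class indices are 0-based here purely for illustration:

import numpy as np

def maxnw(cs_vectors):
    cs_vectors = np.asarray(cs_vectors)
    max_pro = cs_vectors.max(axis=1)          # MaxPro = {max(CS_1), ..., max(CS_CSNum)}
    best = int(max_pro.argmax())              # base classifier holding max(MaxPro)
    return int(cs_vectors[best].argmax())     # class behind that maximum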
step 4, the MinNW rule is:
obtaining a classification probability set generated by each base classifier of the multiple base classifiers, obtaining the respective minimum classification probability of each base classifier from the generated classification probability vectors, splicing the minimum classification probabilities into a minimum classification probability set, selecting the minimum classification probability from the minimum classification probability set, and searching to obtain a class corresponding to the minimum classification probability, namely a classification result under a MinNW rule;
the MinNW rule states that the respective minimum classification probability is:
min(CS_csnum)
wherein CS_csnum is the csnum-th base classifier, csnum ∈ [1, CSNum];
the minimum classification probability set of the MinNW rule is:
MinPro = {min(CS_1), min(CS_2), ..., min(CS_CSNum)}
wherein min(CS_csnum) is the minimum classification probability of the csnum-th base classifier;
the MinNW rule specifies the minimum classification probability in the minimum classification probability set as:
min(MinPro)
the minimum classification probability obtained by the search according to the MinNW rule is:
Figure FDA0003533751220000061
wherein prob_{csnum,a_2} is the probability that the csnum-th base classifier classifies the sample S to be classified into the a_2-th class, a_2 ∈ [1, ClaNum];
the classification result under the MinNW rule is:
Figure FDA0003533751220000062
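Mirroring the previous sketch, a possible MinNW selection under the same cs_vectors layout (again an illustrative, 0-based reading of the rule):

import numpy as np

def minnw(cs_vectors):
    cs_vectors = np.asarray(cs_vectors)
    min_pro = cs_vectors.min(axis=1)          # MinPro = {min(CS_1), ..., min(CS_CSNum)}
    worst = int(min_pro.argmin())             # base classifier holding min(MinPro)
    return int(cs_vectors[worst].argmin())    # class behind that minimum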
step 4, the MajNW rule is:
obtaining a classification probability vector generated by each base classifier of the multiple base classifiers, obtaining the respective maximum classification probability of each base classifier from the generated classification probability vectors, splicing the maximum classification probabilities into a maximum classification probability set, counting the frequency of occurrence of the class corresponding to each maximum classification probability in the maximum classification probability set to obtain a frequency set, and searching the frequency set again to obtain the class corresponding to the maximum frequency, namely the classification result under the MajNW rule;
the frequency set of the MajNW rule is:
Count = {count_1, count_2, ..., count_ClaNum}
wherein count_clanum is the frequency with which the clanum-th class appears in the maximum classification probability set, clanum ∈ [1, ClaNum];
the maximum frequency of the MajNW rule is:
max(Count)
the maximum frequency obtained by the search according to the MajNW rule is:
Figure FDA0003533751220000063
the classification result under the MajNW rule is:
Figure FDA0003533751220000064
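A sketch of the MajNW vote, assuming each base classifier contributes the class of its own maximum weighted probability; ties in the frequency count are resolved by Counter's insertion order, an arbitrary choice made only for illustration:

import numpy as np
from collections import Counter

def majnw(cs_vectors):
    votes = [int(np.argmax(v)) for v in np.asarray(cs_vectors)]   # one class per classifier
    return Counter(votes).most_common(1)[0][0]                    # most frequent class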
step 4, the ProdNW rule is as follows:
obtaining a classification probability vector generated by each base classifier of the multi-base classifiers, obtaining a classification probability set of each class according to the classification probability vector, obtaining a whole classification probability set according to the classification probability set, sequentially multiplying the classification probabilities of the classification probability sets of each class to obtain a product of the classification probability of each class, obtaining a set of products of the classification probabilities according to the product, obtaining a maximum product of the classification probability from the set, and searching the class corresponding to the maximum product of the classification probability, namely a classification result under ProdNW rules;
the entire classification probability set of the ProdNW rule is:
ClassProbability = {CP_1, CP_2, ..., CP_ClaNum}
wherein CP_clanum is the classification probability set of the clanum-th class, clanum ∈ [1, ClaNum];
the classification probability set of each class under the ProdNW rule is:
CP_clanum = {prob_{1,clanum}, prob_{2,clanum}, ..., prob_{CSNum,clanum}}
wherein prob_{csnum,clanum} is the probability that the csnum-th base classifier classifies the sample S to be classified into the clanum-th class;
the product of the classification probabilities of each class under the ProdNW rule is:
produce_clanum = prob_{1,clanum} * prob_{2,clanum} * ... * prob_{CSNum,clanum}
the set of products of the classification probabilities of the ProdNW rule is:
ProdPro = {produce_1, produce_2, ..., produce_ClaNum}
the maximum product of the classification probabilities under the ProdNW rule is:
max(ProdPro)
the maximum product of the classification probabilities obtained by the search according to the ProdNW rule is:
Figure FDA0003533751220000072
the classification result under the ProdNW rule is:
Figure FDA0003533751220000073
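A sketch of the ProdNW computation, assuming probs holds the unweighted probabilities prob_{csnum,clanum} (CSNum rows, ClaNum columns), matching the CP_clanum sets above:

import numpy as np

def prodnw(probs):
    prod_pro = np.prod(np.asarray(probs), axis=0)   # ProdPro = {produce_1, ..., produce_ClaNum}
    return int(np.argmax(prod_pro))                 # class behind max(ProdPro)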
step 4, the SumNW rule is:
obtaining a classification probability vector generated by each base classifier of a multi-base classifier, obtaining a classification probability set of each class according to the classification probability vector, obtaining a whole classification probability set according to the classification probability set, sequentially adding the classification probabilities of the classification probability sets of each class to obtain a sum of the classification probabilities of each class, obtaining a set of the sums of the classification probabilities according to the sum of the classification probabilities, obtaining the maximum sum of the classification probabilities from the set of the classification probabilities, and searching the class corresponding to the maximum sum of the classification probabilities, wherein the class is a classification result under the SumNW rule;
the sum of the classification probabilities of each class under the SumNW rule is:
sum_clanum = prob_{1,clanum} + prob_{2,clanum} + ... + prob_{CSNum,clanum}
the set of sums of the classification probabilities of the SumNW rule is:
SumPro = {sum_1, sum_2, ..., sum_ClaNum}
the maximum sum of the classification probabilities under the SumNW rule is:
max(SumPro)
the maximum sum of the classification probabilities obtained by the search according to the SumNW rule is:
Figure FDA0003533751220000082
the classification result under the SumNW rule is:
Figure FDA0003533751220000083
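The SumNW counterpart under the same assumed probs layout:

import numpy as np

def sumnw(probs):
    sum_pro = np.sum(np.asarray(probs), axis=0)     # SumPro = {sum_1, ..., sum_ClaNum}
    return int(np.argmax(sum_pro))                  # class behind max(SumPro)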
step 4, the classification results generated by the five integration rules are as follows:
Results = {Result_1, Result_2, ..., Result_5}
wherein Results is the classification result set, Result_r is the classification result obtained by the r-th integration rule, r ∈ [1, 5], and the integration rules are ordered as: MaxNW, MinNW, MajNW, ProdNW, SumNW;
step 4, the frequencies of the classification results are as follows:
ResNums = {ResNum_1, ResNum_2, ..., ResNum_ClaNum}
wherein ResNums is the set of counts of the classification results generated by the integration rules, and ResNum_clanum is the frequency with which the clanum-th class appears in the classification result set Results;
step 4, the classification results with the maximum frequency are as follows:
MaxResults = {MaxResult_1, MaxResult_2, ..., MaxResult_MR}
wherein MaxResults is the set of classification results with the largest frequency, MaxResult_mr is the mr-th classification result with the maximum frequency, mr ∈ [1, MR], and 1 ≤ MR ≤ 5; if MR = 1, the final classification result is MaxResult_1; if MR ≠ 1, a class corresponding to the maximum count is randomly selected from the set as the final classification result;
step 4, the final classification result is as follows:
random = rand(1, MR)
FinalResult = Result_random
wherein random = rand(1, MR) randomly selects an integer from the interval [1, MR], and FinalResult = Result_random takes the randomly selected classification result as the final classification result.
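Putting the five rules together, a sketch of the final fusion using the illustrative functions above; the tie-break here draws uniformly among the classes that reach the maximum frequency, which is one reading of the random selection in the claim:

import random
from collections import Counter

def final_result(cs_vectors, probs):
    results = [maxnw(cs_vectors), minnw(cs_vectors), majnw(cs_vectors),
               prodnw(probs), sumnw(probs)]                    # Results = {Result_1, ..., Result_5}
    counts = Counter(results)                                  # ResNums
    top = max(counts.values())
    max_results = [c for c, n in counts.items() if n == top]   # MaxResults
    return random.choice(max_results)                          # single class, random pick if MR != 1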
2. The log anomaly detection method based on the density-weighted integration rule according to claim 1, wherein:
step 1, each software log is as follows:
Log_i
i ∈ [1, M]
wherein Log_i is the i-th software log, and M is the number of software logs;
step 1, the software word data set is as follows:
Data_i = {Word_{i,1}, Word_{i,2}, ..., Word_{i,N_i}}
wherein Data_i is the software word data set of the i-th software log, Word_{i,j} is the j-th software word in the software word data set of the i-th software log, N_i is the number of software words in the software word data set of the i-th software log, and j ∈ [1, N_i];
step 1, the union of the plurality of software word data sets is taken over {Data_1, Data_2, ..., Data_M};
step 1, the word set obtained by the word de-duplication process is:
WordSet = {Word_1, Word_2, ..., Word_L}
wherein Word_k is the k-th word in the word set, L is the number of words in the word set, and k ∈ [1, L];
step 1, counting the frequency of each word of the word set in each log gives:
Freq_k = {F_{k,1}, F_{k,2}, ..., F_{k,M}}
wherein Freq_k is the frequency of occurrence of the k-th word of the word set in each software word data set, F_{k,i} is the frequency of occurrence of the k-th word of the word set in the software word data set of the i-th software log, L is the number of words in the word set, and k ∈ [1, L];
step 1, the word frequency vector of each software log is constructed as:
Vector_i = {F_{1,i}, F_{2,i}, ..., F_{L,i}}
i ∈ [1, M]
wherein Vector_i is the word frequency vector of the i-th software log, and M is the number of software logs.
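For claim 2's preprocessing, a minimal sketch of building the de-duplicated word set and the word frequency vectors, assuming each software log has already been tokenised into a list of words (the tokenisation itself is not shown):

from collections import Counter

def word_frequency_vectors(tokenised_logs):
    # tokenised_logs: M logs, each a list of software words (Data_i)
    word_set = sorted(set(w for log in tokenised_logs for w in log))   # WordSet after de-duplication
    vectors = []
    for log in tokenised_logs:
        counts = Counter(log)
        vectors.append([counts[w] for w in word_set])   # Vector_i = {F_{1,i}, ..., F_{L,i}}
    return word_set, vectors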
CN202110063328.8A 2021-01-18 2021-01-18 Log anomaly detection method based on density weighted integration rule Active CN112711665B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110063328.8A CN112711665B (en) 2021-01-18 2021-01-18 Log anomaly detection method based on density weighted integration rule

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110063328.8A CN112711665B (en) 2021-01-18 2021-01-18 Log anomaly detection method based on density weighted integration rule

Publications (2)

Publication Number Publication Date
CN112711665A CN112711665A (en) 2021-04-27
CN112711665B true CN112711665B (en) 2022-04-15

Family

ID=75549241

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110063328.8A Active CN112711665B (en) 2021-01-18 2021-01-18 Log anomaly detection method based on density weighted integration rule

Country Status (1)

Country Link
CN (1) CN112711665B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109388551A (en) * 2017-08-07 2019-02-26 北京京东尚科信息技术有限公司 There are the method for loophole probability, leak detection method, relevant apparatus for prediction code
CN110647446A (en) * 2018-06-26 2020-01-03 中兴通讯股份有限公司 Log fault association and prediction method, device, equipment and storage medium
JP2020140423A (en) * 2019-02-28 2020-09-03 Kddi株式会社 Clustering apparatus, clustering method, and clustering program
CN111178537A (en) * 2019-12-09 2020-05-19 华为技术有限公司 Feature extraction model training method and device
CN111611218A (en) * 2020-04-24 2020-09-01 武汉大学 Distributed abnormal log automatic identification method based on deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on log anomaly detection techniques; Yang Ruipeng, Qu Dan, Zhu Shaowei, Huang Hao; Journal of Information Engineering University; 2019-10-31; pages [0610]-[0615] *

Also Published As

Publication number Publication date
CN112711665A (en) 2021-04-27

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant