CN111159508A - Anomaly detection algorithm integration method and system based on algorithm diversity - Google Patents

Anomaly detection algorithm integration method and system based on algorithm diversity

Info

Publication number
CN111159508A
Authority
CN
China
Prior art keywords
algorithm
correlation coefficient
anomaly detection
classification
anomaly
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911406458.6A
Other languages
Chinese (zh)
Inventor
梁淑云
刘胜
马影
陶景龙
王启凡
魏国富
徐�明
殷钱安
余贤喆
周晓勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Information and Data Security Solutions Co Ltd
Original Assignee
Information and Data Security Solutions Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Information and Data Security Solutions Co Ltd
Priority to CN201911406458.6A
Publication of CN111159508A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/906 Clustering; Classification

Abstract

The invention provides an anomaly detection algorithm integration method based on algorithm diversity, comprising the following steps: S01, establishing a plurality of basic trainers using a plurality of anomaly detection algorithms, predicting a sample set with each of them, and processing the prediction results to generate a pseudo label; S02, calculating, for each basic trainer, the correlation coefficient between its prediction result and the pseudo label; S03, classifying all the anomaly detection algorithms; S04, for each classification, selecting the top-N algorithms that have the highest correlation coefficients and whose correlation coefficients are higher than a set threshold, and establishing an algorithm combination; and S05, performing anomaly detection using the algorithm combination and outputting the anomaly points. The method introduces the diversity-based model integration idea from supervised learning into anomaly detection: it classifies anomaly detection algorithms according to their implementation mechanisms, selects algorithms belonging to different classifications for integration, and thereby improves the prediction precision of the integration scheme on anomaly points with different local distributions.

Description

Anomaly detection algorithm integration method and system based on algorithm diversity
Technical Field
The invention relates to the technical field of data anomaly detection, in particular to an anomaly detection algorithm integration method and system based on algorithm diversity.
Background
In unsupervised anomaly detection, many algorithms are currently available, but each is built around a single assumption about the data distribution: it performs well when the data conform to that distribution and poorly otherwise. The algorithms perform differently on different data sets, and no single algorithm is best everywhere. Figures 1 and 2 show the ROC performance and Precision @ n performance of anomaly detection algorithms selected from the PyOD library on different data sets (source: https://pyod.readthedocs.io/en/latest/benchmark.html). As figure 1 and figure 2 show, the bold values differ considerably from the non-bold values, and their scattered pattern illustrates that no algorithm predicts well on every data set.
In the prior art, each anomaly detection task requires choosing among dozens of anomaly detection algorithms. However, the integration of anomaly detection algorithms is still under study, and mature algorithm selection rules are lacking.
Disclosure of Invention
The technical problem to be solved by the invention is that, in anomaly detection, a selected algorithm or algorithm combination does not predict well on every data set.
The invention solves the technical problems through the following technical means:
An anomaly detection algorithm integration method based on algorithm diversity comprises the following steps:
S01, establishing a plurality of basic trainers using a plurality of anomaly detection algorithms, predicting a sample set with each of them, and processing the prediction results to generate a pseudo label;
S02, calculating, for each basic trainer, the correlation coefficient between its prediction result and the pseudo label;
S03, classifying all the anomaly detection algorithms;
S04, for each classification, selecting the top-N algorithms that have the highest correlation coefficients and whose correlation coefficients are higher than a set threshold, and establishing an algorithm combination;
S05, performing anomaly detection using the algorithm combination, and outputting the anomaly points.
The method introduces the diversity-based model integration idea from supervised learning into anomaly detection: it classifies anomaly detection algorithms according to their implementation mechanisms, selects algorithms belonging to different classifications for integration, and thereby improves the prediction precision of the integration scheme on anomaly points with different local distributions.
Preferably, in step S01, a summary function of the prediction results of the multiple anomaly detection algorithms is used as the pseudo label; the summary function is an average, a maximum, or a maximum of averages.
Preferably, in step S03, all the anomaly detection algorithms are classified according to the implementation mechanism of each algorithm.
Preferably, in step S04, the top-N algorithms that have the highest correlation coefficients and whose correlation coefficients are higher than the set threshold are selected as follows:
1) determining a correlation coefficient threshold and a correlation coefficient ranking threshold;
2) initializing the algorithm combination list as an empty list;
3) constructing an algorithm dictionary that contains all the algorithm classifications;
4) traversing the algorithm classifications in the algorithm dictionary and, within each classification, traversing its algorithms; an algorithm is added to the algorithm combination list if its correlation coefficient is greater than or equal to the correlation coefficient threshold and its correlation coefficient rank is less than the correlation coefficient ranking threshold.
The invention also provides an anomaly detection algorithm integration system based on algorithm diversity, which comprises:
a pseudo label generation module, used for establishing a plurality of basic trainers using a plurality of anomaly detection algorithms, predicting the sample set with each of them, and processing the prediction results to generate a pseudo label;
a correlation coefficient calculation module, used for calculating, for each basic trainer, the correlation coefficient between its prediction result and the pseudo label;
an algorithm classification module, which provides a human-computer interface and is used for classifying the algorithms used by the basic trainers;
an algorithm selection module, used for selecting, for each classification, the top-N algorithms that have the highest correlation coefficients and whose correlation coefficients are higher than a set threshold, and establishing an algorithm combination;
an anomaly prediction module, used for performing anomaly detection using the algorithm combination and outputting the anomaly points.
Preferably, in the pseudo label generation module, a summary function of the prediction results of the multiple anomaly detection algorithms is used as the pseudo label; the summary function is an average, a maximum, or a maximum of averages.
Preferably, in the algorithm classification module, all the anomaly detection algorithms are classified according to the implementation mechanism of each algorithm.
Preferably, in the algorithm selection module, the top-N algorithms that have the highest correlation coefficients and whose correlation coefficients are higher than the set threshold are selected as follows:
1) determining a correlation coefficient threshold and a correlation coefficient ranking threshold;
2) initializing the algorithm combination list as an empty list;
3) constructing an algorithm dictionary that contains all the algorithm classifications;
4) traversing the algorithm classifications in the algorithm dictionary and, within each classification, traversing its algorithms; an algorithm is added to the algorithm combination list if its correlation coefficient is greater than or equal to the correlation coefficient threshold and its correlation coefficient rank is less than the correlation coefficient ranking threshold.
The invention has the advantages that:
The invention introduces the diversity-based model integration idea from supervised learning into anomaly detection: it classifies anomaly detection algorithms according to their implementation mechanisms, selects algorithms belonging to different classifications for integration, and thereby improves the prediction precision of the integration scheme on anomaly points with different local distributions.
Drawings
FIG. 1 is a diagram of the ROC performance of anomaly detection algorithms on different data sets, as discussed in the background of the present invention;
FIG. 2 is a diagram of the Precision @ n performance of anomaly detection algorithms on different data sets, as discussed in the background of the present invention;
FIG. 3 is a flow chart of the anomaly detection algorithm integration method based on algorithm diversity according to embodiment 1 of the present invention;
FIG. 4 is a structural block diagram of the anomaly detection algorithm integration system based on algorithm diversity according to embodiment 2 of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the embodiments of the present invention, and it is obvious that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1
As shown in fig. 3, the present embodiment provides an anomaly detection algorithm integration method based on algorithm diversity, which includes the following steps:
S01, establishing basic trainers using a plurality of anomaly detection algorithms, and generating a pseudo label.
The core of algorithm integration is algorithm selection: the algorithm combination that participates in the integration is chosen by evaluating how well each algorithm performs. However, anomaly detection is unsupervised learning and has no sample labels of the kind available in supervised learning, so in this embodiment a summary function of the prediction results of multiple anomaly detection algorithms is used as the pseudo label.
The prediction result of a single anomaly detection algorithm is a vector of the predicted anomaly probability of each data point (a number between 0 and 1; the closer to 1, the higher the degree of anomaly). The pseudo label is generated by applying a summary function to the components at the same position (representing the same data point) of all prediction vectors. Commonly used summary functions include Average, Maximum, AOM (Average of Maximum), and MOA (Maximum of Average).
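As an illustration of how the pseudo label in S01 could be computed, the following is a minimal Python sketch of the four summary functions. The function names, the `n_buckets` parameter, and the rule of splitting detectors into contiguous groups for AOM and MOA are assumptions made for this example; the patent text does not specify how the groups are formed.

```python
from statistics import mean


def average(scores):
    """Per-data-point mean over all detectors (scores: one vector per detector)."""
    return [mean(column) for column in zip(*scores)]


def maximum(scores):
    """Per-data-point maximum over all detectors."""
    return [max(column) for column in zip(*scores)]


def _buckets(scores, n_buckets):
    """Split the detectors into contiguous groups (assumed grouping rule)."""
    size = -(-len(scores) // n_buckets)  # ceiling division
    return [scores[i:i + size] for i in range(0, len(scores), size)]


def aom(scores, n_buckets=2):
    """Average of Maximum: maximum inside each group, then mean across groups."""
    return average([maximum(group) for group in _buckets(scores, n_buckets)])


def moa(scores, n_buckets=2):
    """Maximum of Average: mean inside each group, then maximum across groups."""
    return maximum([average(group) for group in _buckets(scores, n_buckets)])


# Four hypothetical detectors scoring the same three data points.
scores = [
    [0.1, 0.9, 0.4],
    [0.3, 0.7, 0.2],
    [0.2, 0.8, 0.6],
    [0.0, 1.0, 0.4],
]
pseudo_label = average(scores)
```

Any of the four functions can stand in for `average` when building `pseudo_label`; which one suits a given data set is left open by the text.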
S02, calculating, for each basic trainer, the correlation coefficient between its prediction result and the pseudo label.
The core idea of the pseudo label is as follows: although no real label is available to measure the absolute performance of each algorithm, the pseudo label can be used to measure the relative merits of the algorithms, under the assumption that the pseudo label is highly correlated with the real labels.
Specifically, the correlation coefficient between each algorithm's prediction vector and the pseudo label is calculated; the larger the correlation coefficient, the better the algorithm's prediction is taken to be.
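The measurement in S02 can be sketched as follows. The text does not say which correlation coefficient is meant; this example assumes the Pearson coefficient, and the detector names and score values are made up for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A constant vector has no defined correlation; treated as 0 here.
        return 0.0
    return cov / sqrt(var_x * var_y)


def score_detectors(predictions, pseudo_label):
    """Map each detector name to its correlation with the pseudo label."""
    return {name: pearson(vector, pseudo_label)
            for name, vector in predictions.items()}


# Hypothetical prediction vectors for three data points.
predictions = {
    "knn": [0.1, 0.9, 0.4],
    "lof": [0.3, 0.7, 0.2],
    "iforest": [0.8, 0.2, 0.5],
}
pseudo_label = [0.15, 0.85, 0.4]
correlations = score_detectors(predictions, pseudo_label)
```

In this toy data the `iforest` vector moves against the pseudo label, so it receives a negative coefficient and would rank below the other two detectors.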
S03, classifying the algorithms according to their implementation mechanisms.
A classification of anomaly detection algorithms that can be used for reference is as follows:
distance-based algorithms: KNN (K Nearest Neighbors), HBOS (Histogram-based Outlier Score);
relative-density-based algorithms: LOF (Local Outlier Factor), COF (Connectivity-based Outlier Factor);
tree-based algorithms: iForest (Isolation Forest);
linear-based algorithms: OCSVM (One-Class Support Vector Machine), PCA (Principal Component Analysis);
probability-based algorithms: ABOD (Angle-Based Outlier Detection), SOS (Stochastic Outlier Selection).
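The reference classification above can be written down as a plain mapping from classification to algorithms. The key names and the two helper functions below are hypothetical and only illustrate one way to store and query such a classification.

```python
# Classification taken from the list above; the short key names are assumptions.
ALGORITHM_CATEGORIES = {
    "distance": ["KNN", "HBOS"],
    "density": ["LOF", "COF"],
    "tree": ["iForest"],
    "linear": ["OCSVM", "PCA"],
    "probability": ["ABOD", "SOS"],
}


def category_of(algorithm, categories=ALGORITHM_CATEGORIES):
    """Return the classification that contains the given algorithm."""
    for category, members in categories.items():
        if algorithm in members:
            return category
    raise KeyError(f"unclassified algorithm: {algorithm}")


def is_partition(categories):
    """True when no algorithm is listed under more than one classification."""
    flat = [name for members in categories.values() for name in members]
    return len(flat) == len(set(flat))
```

Checking `is_partition` before selection guards against an algorithm being counted in two classifications, which would weaken the diversity the method relies on.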
S04, for each classification, selecting the top-N algorithms that have the highest correlation coefficients and whose correlation coefficients are higher than a set threshold, and establishing an algorithm combination.
The process is as follows:
correlation coefficient threshold = 0.8
correlation coefficient ranking threshold = 3
algorithm combination list = empty list
algorithm dictionary = {
    'distance-based algorithms': [K nearest neighbors, histogram-based outlier score],
    'density-based algorithms': [local outlier factor, connectivity-based outlier factor],
    'tree-based algorithms': [isolation forest],
    'linear-based algorithms': [one-class support vector machine, principal component analysis],
    'probability-based algorithms': [angle-based outlier detection, stochastic outlier selection] }
for each algorithm classification in the algorithm dictionary:
    for each algorithm in the classification:
        if (correlation coefficient of the algorithm >= correlation coefficient threshold) and (correlation coefficient rank of the algorithm < correlation coefficient ranking threshold):
            add the algorithm to the algorithm combination list
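A runnable Python sketch of this selection loop is given below. It assumes that the rank is computed inside each classification and counted from 0, so that a ranking threshold of 3 keeps at most three algorithms per classification; the text does not state either point. The function name and the correlation values are made up for illustration.

```python
def select_algorithms(categories, correlations,
                      corr_threshold=0.8, rank_threshold=3):
    """Pick, per classification, the best-correlated algorithms above a threshold."""
    combination = []
    for category, members in categories.items():
        # Rank the algorithms of this classification, best correlation first.
        ranked = sorted(members, key=lambda name: correlations[name], reverse=True)
        for rank, name in enumerate(ranked):  # rank 0 is the best (assumption)
            if correlations[name] >= corr_threshold and rank < rank_threshold:
                combination.append(name)
    return combination


categories = {
    "distance": ["KNN", "HBOS"],
    "density": ["LOF", "COF"],
    "tree": ["iForest"],
    "linear": ["OCSVM", "PCA"],
    "probability": ["ABOD", "SOS"],
}
# Hypothetical correlation coefficients from S02.
correlations = {
    "KNN": 0.92, "HBOS": 0.75,
    "LOF": 0.88, "COF": 0.81,
    "iForest": 0.95,
    "OCSVM": 0.60, "PCA": 0.83,
    "ABOD": 0.79, "SOS": 0.70,
}
combination = select_algorithms(categories, correlations)
```

With these numbers no probability-based algorithm reaches the 0.8 threshold, so that classification contributes nothing to the combination; lowering `rank_threshold` to 1 would additionally drop COF.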
S05, performing anomaly detection using the algorithm combination and outputting the anomaly points.
This step proceeds as in ordinary ensemble anomaly detection, except that only the algorithm combination selected in S04 is used: each selected algorithm predicts the samples on its own, the predictions are summarized with a function such as Average, Maximum, AOM, or MOA, and the output states whether each sample data point is anomalous together with its predicted anomaly probability.
The method introduces the diversity-based model integration idea from supervised learning into anomaly detection: it classifies anomaly detection algorithms according to their implementation mechanisms, selects algorithms belonging to different classifications for integration, and thereby improves the prediction precision of the integration scheme on anomaly points with different local distributions.
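The final step can be sketched as below, using the Average summary function. The decision threshold of 0.5 for flagging a point as anomalous is an assumption for this example, since the text does not say how the probability is turned into a yes/no output; the function name and data are likewise hypothetical.

```python
def detect(predictions, selected, threshold=0.5):
    """Average the selected detectors' scores and flag points at or above threshold."""
    vectors = [predictions[name] for name in selected]
    combined = [sum(column) / len(column) for column in zip(*vectors)]
    flags = [score >= threshold for score in combined]
    return combined, flags


# Hypothetical per-detector anomaly probabilities for three data points.
predictions = {
    "KNN": [0.1, 0.9, 0.4],
    "LOF": [0.3, 0.7, 0.2],
    "HBOS": [0.9, 0.1, 0.9],  # not selected in S04, so it is ignored
}
probabilities, is_anomaly = detect(predictions, selected=["KNN", "LOF"])
```

Only the second data point is flagged here, because the unselected HBOS scores never enter the average.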
Example 2
As shown in fig. 4, the present invention also discloses an anomaly detection algorithm integration system based on algorithm diversity, which includes:
a pseudo label generation module, used for establishing basic trainers using a plurality of anomaly detection algorithms and generating a pseudo label; in the pseudo label generation module, a summary function of the prediction results of the multiple anomaly detection algorithms is used as the pseudo label; the summary function is an average, a maximum, or a maximum of averages;
a correlation coefficient calculation module, used for calculating, for each basic trainer, the correlation coefficient between its prediction result and the pseudo label;
an algorithm classification module, which provides a human-computer interface and is used for classifying the algorithms used by the basic trainers; in the algorithm classification module, all the anomaly detection algorithms are classified according to the implementation mechanism of each algorithm;
an algorithm selection module, used for selecting, for each classification, the top-N algorithms that have the highest correlation coefficients and whose correlation coefficients are higher than a set threshold, and establishing an algorithm combination; in the algorithm selection module, the selection proceeds as follows:
1) determining a correlation coefficient threshold and a correlation coefficient ranking threshold;
2) initializing the algorithm combination list as an empty list;
3) constructing an algorithm dictionary that contains all the algorithm classifications;
4) traversing the algorithm classifications in the algorithm dictionary and, within each classification, traversing its algorithms; an algorithm is added to the algorithm combination list if its correlation coefficient is greater than or equal to the correlation coefficient threshold and its correlation coefficient rank is less than the correlation coefficient ranking threshold;
an anomaly prediction module, used for performing anomaly detection using the algorithm combination and outputting the anomaly points.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (8)

1. An anomaly detection algorithm integration method based on algorithm diversity, characterized by comprising the following steps:
S01, establishing a plurality of basic trainers using a plurality of anomaly detection algorithms, predicting a sample set with each of them, and processing the prediction results to generate a pseudo label;
S02, calculating, for each basic trainer, the correlation coefficient between its prediction result and the pseudo label;
S03, classifying all the anomaly detection algorithms;
S04, for each classification, selecting the top-N algorithms that have the highest correlation coefficients and whose correlation coefficients are higher than a set threshold, and establishing an algorithm combination;
S05, performing anomaly detection using the algorithm combination, and outputting the anomaly points.
2. The anomaly detection algorithm integration method based on algorithm diversity according to claim 1, wherein in step S01 a summary function of the prediction results of the multiple anomaly detection algorithms is used as the pseudo label; the summary function is an average, a maximum, or a maximum of averages.
3. The anomaly detection algorithm integration method based on algorithm diversity according to claim 1, wherein in step S03 all the anomaly detection algorithms are classified according to the implementation mechanism of each algorithm.
4. The anomaly detection algorithm integration method based on algorithm diversity according to claim 1, wherein in step S04 the top-N algorithms that have the highest correlation coefficients and whose correlation coefficients are higher than the set threshold are selected as follows:
1) determining a correlation coefficient threshold and a correlation coefficient ranking threshold;
2) initializing the algorithm combination list as an empty list;
3) constructing an algorithm dictionary that contains all the algorithm classifications;
4) traversing the algorithm classifications in the algorithm dictionary and, within each classification, traversing its algorithms; an algorithm is added to the algorithm combination list if its correlation coefficient is greater than or equal to the correlation coefficient threshold and its correlation coefficient rank is less than the correlation coefficient ranking threshold.
5. An anomaly detection algorithm integration system based on algorithm diversity, characterized by comprising:
a pseudo label generation module, used for establishing a plurality of basic trainers using a plurality of anomaly detection algorithms, predicting the sample set with each of them, and processing the prediction results to generate a pseudo label;
a correlation coefficient calculation module, used for calculating, for each basic trainer, the correlation coefficient between its prediction result and the pseudo label;
an algorithm classification module, which provides a human-computer interface and is used for classifying the algorithms used by the basic trainers;
an algorithm selection module, used for selecting, for each classification, the top-N algorithms that have the highest correlation coefficients and whose correlation coefficients are higher than a set threshold, and establishing an algorithm combination;
an anomaly prediction module, used for performing anomaly detection using the algorithm combination and outputting the anomaly points.
6. The anomaly detection algorithm integration system based on algorithm diversity according to claim 5, wherein the pseudo label generation module uses a summary function of the prediction results of the multiple anomaly detection algorithms as the pseudo label; the summary function is an average, a maximum, or a maximum of averages.
7. The anomaly detection algorithm integration system based on algorithm diversity according to claim 5, wherein in the algorithm classification module all the anomaly detection algorithms are classified according to the implementation mechanism of each algorithm.
8. The anomaly detection algorithm integration system based on algorithm diversity according to claim 5, wherein the algorithm selection module selects the top-N algorithms that have the highest correlation coefficients and whose correlation coefficients are higher than the set threshold as follows:
1) determining a correlation coefficient threshold and a correlation coefficient ranking threshold;
2) initializing the algorithm combination list as an empty list;
3) constructing an algorithm dictionary that contains all the algorithm classifications;
4) traversing the algorithm classifications in the algorithm dictionary and, within each classification, traversing its algorithms; an algorithm is added to the algorithm combination list if its correlation coefficient is greater than or equal to the correlation coefficient threshold and its correlation coefficient rank is less than the correlation coefficient ranking threshold.
CN201911406458.6A 2019-12-31 2019-12-31 Anomaly detection algorithm integration method and system based on algorithm diversity Pending CN111159508A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911406458.6A CN111159508A (en) 2019-12-31 2019-12-31 Anomaly detection algorithm integration method and system based on algorithm diversity

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911406458.6A CN111159508A (en) 2019-12-31 2019-12-31 Anomaly detection algorithm integration method and system based on algorithm diversity

Publications (1)

Publication Number Publication Date
CN111159508A true CN111159508A (en) 2020-05-15

Family

ID=70559715

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911406458.6A Pending CN111159508A (en) 2019-12-31 2019-12-31 Anomaly detection algorithm integration method and system based on algorithm diversity

Country Status (1)

Country Link
CN (1) CN111159508A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113377568A (en) * 2021-06-29 2021-09-10 北京同创永益科技发展有限公司 Abnormity detection method and device, electronic equipment and storage medium
CN113515678A (en) * 2021-05-13 2021-10-19 上海梯之星信息科技有限公司 Abnormal data screening method

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113515678A (en) * 2021-05-13 2021-10-19 上海梯之星信息科技有限公司 Abnormal data screening method
CN113377568A (en) * 2021-06-29 2021-09-10 北京同创永益科技发展有限公司 Abnormity detection method and device, electronic equipment and storage medium
CN113377568B (en) * 2021-06-29 2023-10-20 北京同创永益科技发展有限公司 Abnormality detection method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN107067025B (en) Text data automatic labeling method based on active learning
CN113098723B (en) Fault root cause positioning method and device, storage medium and equipment
CN110046634B (en) Interpretation method and device of clustering result
JPH0636038A (en) Feature classification using supervisory statistic pattern recognition
CN102117411B (en) Method and system for constructing multi-level classification model
CN114124482B (en) Access flow anomaly detection method and equipment based on LOF and isolated forest
CN116451139B (en) Live broadcast data rapid analysis method based on artificial intelligence
CN114861788A (en) Load abnormity detection method and system based on DBSCAN clustering
CN111159508A (en) Anomaly detection algorithm integration method and system based on algorithm diversity
EP4053757A1 (en) Degradation suppression program, degradation suppression method, and information processing device
CN109993391B (en) Method, device, equipment and medium for dispatching network operation and maintenance task work order
CN116662817B (en) Asset identification method and system of Internet of things equipment
CN114399321A (en) Business system stability analysis method, device and equipment
CN109902731B (en) Performance fault detection method and device based on support vector machine
WO2022111284A1 (en) Data labeling processing method and apparatus, and storage medium and electronic apparatus
WO2017188048A1 (en) Preparation apparatus, preparation program, and preparation method
CN117156442B (en) Cloud data security protection method and system based on 5G network
CN115705279A (en) Intelligent fault early warning method and device based on index data
CN112817954A (en) Missing value interpolation method based on multi-method ensemble learning
CN114492569B (en) Typhoon path classification method based on width learning system
CN114511905A (en) Face clustering method based on graph convolution neural network
CN114528906A (en) Fault diagnosis method, device, equipment and medium for rotary machine
Gias et al. SampleHST: Efficient On-the-Fly Selection of Distributed Traces
Burmeister et al. Exploration of production data for predictive maintenance of industrial equipment: A case study
CN112990425A (en) Automatic classification method of 5G network slices, device thereof, electronic equipment and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination