CN111159508A - Anomaly detection algorithm integration method and system based on algorithm diversity - Google Patents
Anomaly detection algorithm integration method and system based on algorithm diversity
- Publication number
- CN111159508A
- Authority
- CN
- China
- Prior art keywords
- algorithm
- correlation coefficient
- anomaly detection
- classification
- anomaly
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/906—Clustering; Classification
Abstract
The invention provides an anomaly detection algorithm integration method based on algorithm diversity, comprising the following steps: S01, building a plurality of basic trainers from a plurality of anomaly detection algorithms, predicting a sample set with each of them, and processing the prediction results to generate pseudo labels; S02, calculating, for each basic trainer, the correlation coefficient between its prediction result and the pseudo labels; S03, classifying all the anomaly detection algorithms; S04, for each class, selecting the top-N algorithms that have the highest correlation coefficients and are higher than a set threshold, and building an algorithm combination; and S05, performing anomaly detection with the algorithm combination and outputting the anomalous points. The method brings the diversity-based model integration idea of supervised learning into anomaly detection: it classifies anomaly detection algorithms by their implementation mechanism, selects algorithms belonging to different classes for integration, and improves the prediction precision of the integrated scheme on anomalous points with different local distributions.
Description
Technical Field
The invention relates to the technical field of data anomaly detection, in particular to an anomaly detection algorithm integration method and system based on algorithm diversity.
Background
In anomaly detection within unsupervised learning, many algorithms have been implemented, but each is built on the assumption of a single data distribution: it performs well when the data conform to that distribution and poorly otherwise. The algorithms therefore perform well on some data sets and badly on others, and no algorithm is best in absolute terms. Figures 1 and 2 show the ROC performance and the Precision @ n performance on different data sets for anomaly detection algorithms selected from the PyOD library (https://pyod.readthedocs.io/en/latest/benchmark.html). As Figures 1 and 2 show, the bold values differ greatly from the non-bold values. The scattered pattern of the data in the figures illustrates that no algorithm predicts well on every data set.
In the prior art, each anomaly detection task requires choosing among dozens of anomaly detection algorithms. However, the integration of anomaly detection algorithms is still being researched, and mature rules for algorithm selection are lacking.
Disclosure of Invention
The technical problem to be solved by the invention is that, in anomaly detection, a selected algorithm or algorithm combination does not predict well on every data set.
The invention solves the technical problems through the following technical means:
An anomaly detection algorithm integration method based on algorithm diversity comprises the following steps:
S01, building a plurality of basic trainers from a plurality of anomaly detection algorithms, predicting a sample set with each of them, and processing the prediction results to generate pseudo labels;
S02, calculating, for each basic trainer, the correlation coefficient between its prediction result and the pseudo labels;
S03, classifying all the anomaly detection algorithms;
S04, for each class, selecting the top-N algorithms that have the highest correlation coefficients and are higher than a set threshold, and building an algorithm combination;
S05, performing anomaly detection with the algorithm combination and outputting the anomalous points.
The method brings the diversity-based model integration idea of supervised learning into anomaly detection: it classifies anomaly detection algorithms by their implementation mechanism, selects algorithms belonging to different classes for integration, and improves the prediction precision of the integrated scheme on anomalous points with different local distributions.
Preferably, in step S01, the prediction results of the multiple anomaly detection algorithms are aggregated by a summary function, and the aggregate is used as the pseudo labels; the summary function is the average, the maximum, or the maximum of averages.
Preferably, in step S03, all the anomaly detection algorithms are classified according to the implementation mechanism of each algorithm.
Preferably, in step S04, the specific method for selecting the top-N algorithms that have the highest correlation coefficients and are higher than the set threshold is as follows:
1) determining a correlation coefficient threshold and a correlation coefficient ranking threshold;
2) initializing the algorithm combination list as an empty list;
3) constructing an algorithm dictionary that contains all the algorithm classes;
4) looping over the algorithm classes in the algorithm dictionary and, within each class, looping over its algorithms; an algorithm is added to the algorithm combination list if its correlation coefficient is greater than or equal to the correlation coefficient threshold and its correlation coefficient rank is less than the correlation coefficient ranking threshold.
The invention also provides an anomaly detection algorithm integration system based on algorithm diversity, which comprises:
a pseudo label generation module, used for building a plurality of basic trainers from a plurality of anomaly detection algorithms, predicting the sample set with each of them, and processing the prediction results to generate pseudo labels;
a correlation coefficient calculation module, used for calculating, for each basic trainer, the correlation coefficient between its prediction result and the pseudo labels;
an algorithm classification module, which provides a human-computer interface and is used for classifying the algorithms used by the basic trainers;
an algorithm selection module, used for selecting, for each class, the top-N algorithms that have the highest correlation coefficients and are higher than a set threshold, and for building an algorithm combination;
an anomaly prediction module, used for performing anomaly detection with the algorithm combination and outputting the anomalous points.
Preferably, in the pseudo label generation module, the prediction results of the multiple anomaly detection algorithms are aggregated by a summary function, and the aggregate is used as the pseudo labels; the summary function is the average, the maximum, or the maximum of averages.
Preferably, in the algorithm classification module, all the anomaly detection algorithms are classified according to the implementation mechanism of each algorithm.
Preferably, in the algorithm selection module, the specific method for selecting the top-N algorithms that have the highest correlation coefficients and are higher than the set threshold is as follows:
1) determining a correlation coefficient threshold and a correlation coefficient ranking threshold;
2) initializing the algorithm combination list as an empty list;
3) constructing an algorithm dictionary that contains all the algorithm classes;
4) looping over the algorithm classes in the algorithm dictionary and, within each class, looping over its algorithms; an algorithm is added to the algorithm combination list if its correlation coefficient is greater than or equal to the correlation coefficient threshold and its correlation coefficient rank is less than the correlation coefficient ranking threshold.
The invention has the advantages that:
The invention brings the diversity-based model integration idea of supervised learning into anomaly detection, proposes classifying anomaly detection algorithms by their implementation mechanism, selects algorithms belonging to different classes for integration, and improves the prediction precision of the integrated scheme on anomalous points with different local distributions.
Drawings
FIG. 1 is a diagram illustrating ROC behavior of anomaly detection algorithms in different datasets in the background of the present invention;
FIG. 2 is a diagram illustrating the Precision @ n behavior of an anomaly detection algorithm in different datasets according to the background of the present invention;
FIG. 3 is a block diagram of a flow chart of an algorithm diversity-based anomaly detection algorithm integration method according to embodiment 1 of the present invention;
FIG. 4 is a structural block diagram of an anomaly detection algorithm integration system based on algorithm diversity in embodiment 2 of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the embodiments of the present invention, and it is obvious that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in FIG. 3, the present embodiment provides an anomaly detection algorithm integration method based on algorithm diversity, which includes the following steps:
S01, building basic trainers from a plurality of anomaly detection algorithms and generating pseudo labels.
The core function of algorithm integration is algorithm selection, that is, evaluating the effect of each algorithm to decide which algorithms take part in the integrated combination. However, anomaly detection belongs to unsupervised learning and has no sample labels of the kind available in supervised learning. In this embodiment, the prediction results of the multiple anomaly detection algorithms are therefore aggregated by a summary function, and the aggregate serves as the pseudo label.
The prediction result of a single anomaly detection algorithm is a vector made up of the predicted anomaly probability of each data point (a number between 0 and 1; the closer to 1, the higher the degree of anomaly). The pseudo label is generated by applying a summary function to the components at the same position (representing the same data point) of all the prediction vectors. Commonly used summary functions include Average, Maximum, AOM (Average of Maximum), MOA (Maximum of Average), and the like.
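The pseudo-label generation of S01 can be illustrated with a minimal sketch. It is not the patent's implementation: the function name `combine_scores`, the `n_buckets` parameter, and the way algorithms are assigned to buckets for AOM and MOA are assumptions made for illustration only.

```python
from statistics import mean


def combine_scores(score_vectors, method="average", n_buckets=2):
    """Combine per-algorithm anomaly scores into one pseudo-label vector.

    score_vectors: list of equal-length lists, one per algorithm,
    each value being an anomaly probability in [0, 1].
    """
    n_points = len(score_vectors[0])
    # Column j collects every algorithm's score for data point j.
    columns = [[vec[j] for vec in score_vectors] for j in range(n_points)]

    if method == "average":
        return [mean(col) for col in columns]
    if method == "maximum":
        return [max(col) for col in columns]

    # AOM / MOA need groups of algorithms. Assumption: algorithm i goes
    # to bucket i % n_buckets; the patent does not specify the grouping.
    buckets = [score_vectors[i::n_buckets] for i in range(n_buckets)]
    buckets = [b for b in buckets if b]
    if method == "aom":  # average of the per-bucket maxima
        per_bucket = [combine_scores(b, "maximum") for b in buckets]
        return combine_scores(per_bucket, "average")
    if method == "moa":  # maximum of the per-bucket averages
        per_bucket = [combine_scores(b, "average") for b in buckets]
        return combine_scores(per_bucket, "maximum")
    raise ValueError(f"unknown method: {method}")
```

Each input vector would come from one basic trainer; the returned vector has one pseudo-label value per data point.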
S02, calculating, for each basic trainer, the correlation coefficient between its prediction result and the pseudo label.
The core idea of the pseudo label is as follows: although no real label is available to measure the absolute effect of each algorithm, the pseudo label can be used to measure the relative merits of the algorithms, under the assumption that the pseudo label is highly correlated with the real label.
Specifically, the correlation coefficient between the prediction vector of each algorithm and the pseudo label is calculated; the larger the correlation coefficient, the better the prediction effect of the algorithm is considered to be.
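The correlation step can be sketched as follows. The text does not say which correlation coefficient is meant, so the Pearson coefficient used here is an assumption, as are the function names `pearson` and `correlations_with_pseudo_label`.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if dev_x == 0 or dev_y == 0:
        # A constant vector has no defined correlation; this sketch
        # treats it as 0 so that such an algorithm is never preferred.
        return 0.0
    return cov / (dev_x * dev_y)


def correlations_with_pseudo_label(scores_by_algorithm, pseudo_label):
    """Map each algorithm name to its correlation with the pseudo label."""
    return {
        name: pearson(vector, pseudo_label)
        for name, vector in scores_by_algorithm.items()
    }
```

A rank-based coefficient could be swapped in for `pearson` if only the ordering of the anomaly scores is of interest.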
S03, classifying the algorithms according to their implementation mechanism.
A classification of anomaly detection algorithms that can be used for reference is as follows:
distance-metric-based algorithms: KNN (K Nearest Neighbors), HBOS (Histogram-Based Outlier Score);
relative-density-based algorithms: LOF (Local Outlier Factor), COF (Connectivity-Based Outlier Factor);
tree-based algorithms: iForest (Isolation Forest);
linear-model-based algorithms: OCSVM (One-Class Support Vector Machine), PCA (Principal Component Analysis);
probability-based algorithms: ABOD (Angle-Based Outlier Detection), SOS (Stochastic Outlier Selection).
S04, for each class, selecting the top-N algorithms that have the highest correlation coefficients and are higher than a set threshold, and building an algorithm combination.
The process is as follows:
correlation coefficient threshold = 0.8
correlation coefficient ranking threshold = 3
initialize the algorithm combination list as an empty list
algorithm dictionary = {'distance-based algorithms': [K nearest neighbors, histogram-based outlier score],
    'density-based algorithms': [local outlier factor, connectivity-based outlier factor],
    'tree-model-based algorithms': [isolation forest],
    'linear-based algorithms': [one-class support vector machine, principal component analysis],
    'probability-based algorithms': [angle-based outlier detection, stochastic outlier selection]}
for each algorithm class in the algorithm dictionary:
    for each algorithm in that class:
        if (correlation coefficient of the algorithm >= correlation coefficient threshold) and (correlation coefficient rank of the algorithm < correlation coefficient ranking threshold):
            add the algorithm to the algorithm combination list
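The selection loop above might be written in Python roughly as shown here. Two points are assumptions rather than statements from the text: ranks are computed within each class, and they start at 0 (so a ranking threshold of 3 admits at most three algorithms per class). The correlation values in the example are made up for illustration.

```python
def select_algorithms(categories, correlations,
                      corr_threshold=0.8, rank_threshold=3):
    """Pick, per class, the best-correlated algorithms above a threshold.

    categories: dict mapping a class name to a list of algorithm names.
    correlations: dict mapping an algorithm name to its correlation
    coefficient with the pseudo label.
    """
    selected = []
    for algorithms in categories.values():
        # Rank the algorithms of this class, highest correlation first.
        ranked = sorted(algorithms, key=lambda name: correlations[name],
                        reverse=True)
        for rank, name in enumerate(ranked):  # rank is 0-based (assumption)
            if correlations[name] >= corr_threshold and rank < rank_threshold:
                selected.append(name)
    return selected


# Illustrative input; the correlation values are hypothetical.
categories = {
    "distance": ["KNN", "HBOS"],
    "density": ["LOF", "COF"],
    "tree": ["IForest"],
    "linear": ["OCSVM", "PCA"],
    "probability": ["ABOD", "SOS"],
}
correlations = {
    "KNN": 0.91, "HBOS": 0.85, "LOF": 0.79, "COF": 0.88, "IForest": 0.95,
    "OCSVM": 0.60, "PCA": 0.82, "ABOD": 0.70, "SOS": 0.65,
}
combination = select_algorithms(categories, correlations)
# combination == ["KNN", "HBOS", "COF", "IForest", "PCA"]
```

In this example no probability-based algorithm reaches the 0.8 threshold, so that class contributes nothing to the combination; the threshold therefore trades diversity against the quality of the selected algorithms.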
S05, performing anomaly detection with the algorithm combination and outputting the anomalous points.
This step runs like ordinary ensemble anomaly detection, except that only the algorithms in the combination selected in S04 are used: each makes its single-model prediction on the samples, the predictions are aggregated with a function such as Average, Maximum, AOM, or MOA, and the output states whether each sample data point is anomalous together with its predicted anomaly probability.
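As a rough sketch of this final step, the snippet below averages the scores of the selected algorithms and flags points whose combined probability exceeds a cut-off. The function name `detect_anomalies` and the cut-off of 0.5 are assumptions; the text does not state how the anomalous / normal decision is derived from the probability.

```python
def detect_anomalies(scores_by_algorithm, selected, threshold=0.5):
    """Average the selected algorithms' scores and flag anomalous points.

    scores_by_algorithm: dict mapping an algorithm name to its vector of
    per-point anomaly probabilities.
    selected: names of the algorithms chosen in S04.
    Returns (combined probabilities, list of booleans marking anomalies).
    """
    vectors = [scores_by_algorithm[name] for name in selected]
    if not vectors:
        raise ValueError("no algorithm was selected")
    n_points = len(vectors[0])
    # Average aggregation; Maximum, AOM, or MOA could be used instead.
    combined = [
        sum(vec[j] for vec in vectors) / len(vectors)
        for j in range(n_points)
    ]
    # The 0.5 cut-off is a hypothetical choice for this sketch.
    flags = [probability > threshold for probability in combined]
    return combined, flags
```

Algorithms that were not selected are simply ignored, even if their scores are present in `scores_by_algorithm`.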
The method brings the diversity-based model integration idea of supervised learning into anomaly detection: it classifies anomaly detection algorithms by their implementation mechanism, selects algorithms belonging to different classes for integration, and improves the prediction precision of the integrated scheme on anomalous points with different local distributions.
Example 2
As shown in FIG. 4, the present invention also discloses an anomaly detection algorithm integration system based on algorithm diversity, which includes:
a pseudo label generation module, used for building basic trainers from a plurality of anomaly detection algorithms and generating pseudo labels; in the pseudo label generation module, the prediction results of the multiple anomaly detection algorithms are aggregated by a summary function, and the aggregate is used as the pseudo labels; the summary function is the average, the maximum, or the maximum of averages;
a correlation coefficient calculation module, used for calculating, for each basic trainer, the correlation coefficient between its prediction result and the pseudo labels;
an algorithm classification module, which provides a human-computer interface and is used for classifying the algorithms used by the basic trainers; in the algorithm classification module, all the anomaly detection algorithms are classified according to the implementation mechanism of each algorithm;
an algorithm selection module, used for selecting, for each class, the top-N algorithms that have the highest correlation coefficients and are higher than a set threshold, and for building an algorithm combination; in the algorithm selection module, the specific method for selecting the top-N algorithms that have the highest correlation coefficients and are higher than the set threshold is as follows:
1) determining a correlation coefficient threshold and a correlation coefficient ranking threshold;
2) initializing the algorithm combination list as an empty list;
3) constructing an algorithm dictionary that contains all the algorithm classes;
4) looping over the algorithm classes in the algorithm dictionary and, within each class, looping over its algorithms; an algorithm is added to the algorithm combination list if its correlation coefficient is greater than or equal to the correlation coefficient threshold and its correlation coefficient rank is less than the correlation coefficient ranking threshold;
an anomaly prediction module, used for performing anomaly detection with the algorithm combination and outputting the anomalous points.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (8)
1. An anomaly detection algorithm integration method based on algorithm diversity, characterized by comprising the following steps:
S01, building a plurality of basic trainers from a plurality of anomaly detection algorithms, predicting a sample set with each of them, and processing the prediction results to generate pseudo labels;
S02, calculating, for each basic trainer, the correlation coefficient between its prediction result and the pseudo labels;
S03, classifying all the anomaly detection algorithms;
S04, for each class, selecting the top-N algorithms that have the highest correlation coefficients and are higher than a set threshold, and building an algorithm combination;
S05, performing anomaly detection with the algorithm combination and outputting the anomalous points.
2. The anomaly detection algorithm integration method based on algorithm diversity according to claim 1, wherein in step S01 the prediction results of the multiple anomaly detection algorithms are aggregated by a summary function, and the aggregate is used as the pseudo labels; the summary function is the average, the maximum, or the maximum of averages.
3. The anomaly detection algorithm integration method based on algorithm diversity according to claim 1, wherein in step S03 all the anomaly detection algorithms are classified according to the implementation mechanism of each algorithm.
4. The anomaly detection algorithm integration method based on algorithm diversity according to claim 1, wherein in step S04 the specific method for selecting the top-N algorithms that have the highest correlation coefficients and are higher than the set threshold is as follows:
1) determining a correlation coefficient threshold and a correlation coefficient ranking threshold;
2) initializing the algorithm combination list as an empty list;
3) constructing an algorithm dictionary that contains all the algorithm classes;
4) looping over the algorithm classes in the algorithm dictionary and, within each class, looping over its algorithms; an algorithm is added to the algorithm combination list if its correlation coefficient is greater than or equal to the correlation coefficient threshold and its correlation coefficient rank is less than the correlation coefficient ranking threshold.
5. An anomaly detection algorithm integration system based on algorithm diversity, characterized by comprising:
a pseudo label generation module, used for building a plurality of basic trainers from a plurality of anomaly detection algorithms, predicting the sample set with each of them, and processing the prediction results to generate pseudo labels;
a correlation coefficient calculation module, used for calculating, for each basic trainer, the correlation coefficient between its prediction result and the pseudo labels;
an algorithm classification module, which provides a human-computer interface and is used for classifying the algorithms used by the basic trainers;
an algorithm selection module, used for selecting, for each class, the top-N algorithms that have the highest correlation coefficients and are higher than a set threshold, and for building an algorithm combination;
an anomaly prediction module, used for performing anomaly detection with the algorithm combination and outputting the anomalous points.
6. The anomaly detection algorithm integration system based on algorithm diversity according to claim 5, wherein in the pseudo label generation module the prediction results of the multiple anomaly detection algorithms are aggregated by a summary function, and the aggregate is used as the pseudo labels; the summary function is the average, the maximum, or the maximum of averages.
7. The anomaly detection algorithm integration system based on algorithm diversity according to claim 5, wherein in the algorithm classification module all the anomaly detection algorithms are classified according to the implementation mechanism of each algorithm.
8. The anomaly detection algorithm integration system based on algorithm diversity according to claim 5, wherein in the algorithm selection module the specific method for selecting the top-N algorithms that have the highest correlation coefficients and are higher than the set threshold is as follows:
1) determining a correlation coefficient threshold and a correlation coefficient ranking threshold;
2) initializing the algorithm combination list as an empty list;
3) constructing an algorithm dictionary that contains all the algorithm classes;
4) looping over the algorithm classes in the algorithm dictionary and, within each class, looping over its algorithms; an algorithm is added to the algorithm combination list if its correlation coefficient is greater than or equal to the correlation coefficient threshold and its correlation coefficient rank is less than the correlation coefficient ranking threshold.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911406458.6A CN111159508A (en) | 2019-12-31 | 2019-12-31 | Anomaly detection algorithm integration method and system based on algorithm diversity |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911406458.6A CN111159508A (en) | 2019-12-31 | 2019-12-31 | Anomaly detection algorithm integration method and system based on algorithm diversity |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111159508A (en) | 2020-05-15 |
Family
ID=70559715
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911406458.6A (Pending, CN111159508A (en)) | Anomaly detection algorithm integration method and system based on algorithm diversity | 2019-12-31 | 2019-12-31 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111159508A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113377568A (en) * | 2021-06-29 | 2021-09-10 | 北京同创永益科技发展有限公司 | Abnormity detection method and device, electronic equipment and storage medium |
CN113515678A (en) * | 2021-05-13 | 2021-10-19 | 上海梯之星信息科技有限公司 | Abnormal data screening method |
-
2019
- 2019-12-31 CN CN201911406458.6A patent/CN111159508A/en active Pending
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113515678A (en) * | 2021-05-13 | 2021-10-19 | 上海梯之星信息科技有限公司 | Abnormal data screening method |
CN113377568A (en) * | 2021-06-29 | 2021-09-10 | 北京同创永益科技发展有限公司 | Abnormity detection method and device, electronic equipment and storage medium |
CN113377568B (en) * | 2021-06-29 | 2023-10-20 | 北京同创永益科技发展有限公司 | Abnormality detection method and device, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107067025B (en) | Text data automatic labeling method based on active learning | |
CN113098723B (en) | Fault root cause positioning method and device, storage medium and equipment | |
CN110046634B (en) | Interpretation method and device of clustering result | |
JPH0636038A (en) | Feature classification using supervisory statistic pattern recognition | |
CN102117411B (en) | Method and system for constructing multi-level classification model | |
CN114124482B (en) | Access flow anomaly detection method and equipment based on LOF and isolated forest | |
CN116451139B (en) | Live broadcast data rapid analysis method based on artificial intelligence | |
CN114861788A (en) | Load abnormity detection method and system based on DBSCAN clustering | |
CN111159508A (en) | Anomaly detection algorithm integration method and system based on algorithm diversity | |
EP4053757A1 (en) | Degradation suppression program, degradation suppression method, and information processing device | |
CN109993391B (en) | Method, device, equipment and medium for dispatching network operation and maintenance task work order | |
CN116662817B (en) | Asset identification method and system of Internet of things equipment | |
CN114399321A (en) | Business system stability analysis method, device and equipment | |
CN109902731B (en) | Performance fault detection method and device based on support vector machine | |
WO2022111284A1 (en) | Data labeling processing method and apparatus, and storage medium and electronic apparatus | |
WO2017188048A1 (en) | Preparation apparatus, preparation program, and preparation method | |
CN117156442B (en) | Cloud data security protection method and system based on 5G network | |
CN115705279A (en) | Intelligent fault early warning method and device based on index data | |
CN112817954A (en) | Missing value interpolation method based on multi-method ensemble learning | |
CN114492569B (en) | Typhoon path classification method based on width learning system | |
CN114511905A (en) | Face clustering method based on graph convolution neural network | |
CN114528906A (en) | Fault diagnosis method, device, equipment and medium for rotary machine | |
Gias et al. | SampleHST: Efficient On-the-Fly Selection of Distributed Traces | |
Burmeister et al. | Exploration of production data for predictive maintenance of industrial equipment: A case study | |
CN112990425A (en) | Automatic classification method of 5G network slices, device thereof, electronic equipment and computer storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||