CN111666169A - Improved isolated forest algorithm and Gaussian distribution-based combined data anomaly detection method - Google Patents
- Publication number
- CN111666169A (application number CN202010402904.2A)
- Authority
- CN
- China
- Prior art keywords
- data
- abnormal
- gaussian distribution
- isolated forest
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0751—Error or fault detection not based on redundancy
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
Abstract
The invention relates to a combined data anomaly detection method based on an improved isolated forest algorithm and Gaussian distribution, and belongs to the technical field of data anomaly detection. The invention analyzes platform data and detects data anomalies with an improved joint detection model that combines an isolated forest with a Gaussian distribution. Aimed at the two kinds of features, continuous and discrete, found in industrial internet platform data, the combined Gaussian distribution and isolated forest model accurately detects anomalies of the industrial control system on the platform in real time, so that the abnormal state of the system is discovered promptly. The detection performance of the trained combined isolated forest and Gaussian distribution model is compared with that of a single Gaussian distribution model or a single isolated forest model. In detecting server platform data anomalies, the method improves detection efficiency, finds anomalies more quickly, saves time cost, and has broad application prospects.
Description
Technical Field
The invention relates to a combined data anomaly detection method based on an improved isolated forest algorithm and Gaussian distribution, and belongs to the technical field of data anomaly detection of a service monitoring server platform.
Background
A server is a computer that runs faster, carries heavier loads, and costs more than an ordinary computer. A server provides computing or application services to other clients in the network (terminals such as PCs, smartphones, and ATMs, and even large equipment such as train systems). Server platform data are acquired by a data collector and comprise data from the field device layer of the industrial control system, the control system, and the network layer. These data carry much important information; most importantly, the abnormal state of the server system can be analyzed from them, and an abnormal event of the server system is likely to cause huge economic losses in production and may even seriously endanger people's lives and health. In the prior art, most methods for judging server data anomalies require the server system platform to analyze the collected control system and network layer data to find whether the system is abnormal. Most platform data are high-dimensional and noisy; in the training of an anomaly detection model, high-dimensional data reduce the accuracy of the model, the computational complexity is high, and anomalies cannot be detected well, so dimensionality reduction must be performed on the high-dimensional data.
Disclosure of Invention
The invention aims to provide a combined data anomaly detection method based on an improved isolated forest algorithm and Gaussian distribution to solve the above problems.
The technical scheme of the invention is as follows: a joint data anomaly detection method based on an improved isolated forest algorithm and Gaussian distribution comprises the following specific steps:
step 1: is connected withEntering server platform data, selecting two linear data sets A ═ x1,x2....xi...xnY and B ═ y1,y2.....yj.....ynIn which xiAnd yiRespectively representing linear data of the two selected data sets, finding noise points by using a DBSCAN clustering algorithm, and deleting the noise points to obtain two new data sets A1={x1,x2....xi...xoAnd B1={y1,y2.....yj.....ypAnd denoising the test data.
Step 2: data set A using LDA linear decision method1={x1,x2....xi...xoAnd B1={y1,y2.....yj.....ypPerforming further dimensionality reduction, and inputting a data set A obtained by preliminarily removing noise points in Step11={x1,x2....xi...xoAnd B1={y1,y2.....yj.....ypTo use LDA linear decision method conveniently, set A is put1={x1,x2....xi...xoAnd B1={y1,y2.....yj.....ypMerge into a new data set D {(s)1,d1),(s2,d2)...(sh,dh)...(sw,dw) In which s ishRepresenting A in two data sets1={x1,x2....xi...xoAnd B1={y1,y2.....yj.....ypLinear data x of } linear data xiAnd yj,dhRepresenting linear data xiAnd yjW ═ o + p, dh∈ {0, 1}, when s ish∈ A, dhWhen s is equal to 0h∈ B, d h1 and then by the formulaAndrespectively calculating the inter-class dispersion matrix S of the data set DBAnd an intra-class dispersion matrix SWThen, a projection matrix W of the data set D is solved, and finally, each feature data x in the data set D is subjected toiReducing dimension and converting into a new sample data set D {(s)1,d1),(s2,d2)...(sh,dh)...(sw,dw) And obtaining the test data with more accurate dimension classification.
Step 3: sample data set after dimension reduction processing according to step2According to the method, a discrete variable refers to data of which the numerical value can only be described by natural numbers, integers, counting units and the like, a continuous variable refers to data which can be taken as values at will in a certain interval, two adjacent numerical values can be subjected to infinite segmentation (namely, infinite values can be taken) as the basis, data of a data set D are divided into data of continuous characteristics and data of discrete characteristics, an isolated forest model is used for training the continuous characteristic data, and whether the continuous characteristic data are normal or not is judged according to abnormal scores S (x, n) of the isolated forest; and training the discrete characteristic data by using a Gaussian distribution model, and judging whether the test data is normal or not according to the value of the abnormal score T (r) of Gaussian distribution.
Step 4: and (3) performing weighted calculation on the abnormal score S (x, n) of the isolated forest and the Gaussian distribution abnormal score T (r) obtained in Step3, comprehensively calculating abnormal scores, and judging test data according to the abnormal scores.
The specific steps of Step1 are as follows:
step1.1: accessing server platform data, selecting two linear data sets A ═ x1,x2....xi...xnY and B ═ y1,y2.....yj.....ynIs given by the formulaSeparately evaluating each data xiAnd yiDensity value of rho (x)i) Or ρ (y)j)。
If ρ (x)i) If not less than 0, it is the core point of the data set A, if ρ (y)j) If the number of the core points is more than or equal to 0, the core points are core points of the data set B, all the core points are recorded and collected, and a set A is formedcAnd BcInstead of the set A of core pointsncAnd BncIs shown asAnd
step1.2: finding a boundary set AbdAnd BbdIf xi∈AncAnd is andsatisfy q ∈ A and q ∈ Nx∩AcI.e. xiHas a core point, then xiThe boundary points are A, and the set formed by all the boundary points in A is collected and recorded as AbdSimilarly, B can be obtainedbd。
Step1.3: the core point set A obtained from Step1.1 and Step1.2c,BcAnd a set of boundary points Abd,BbdCalculatingAndif xi∈AnoiseThen call xiDeleting the noise points as the noise points; if yi∈BnoiseThen call yiIt is deleted as a noise point.
In Step 2, the specific steps for obtaining S_B and S_W are as follows:
Step2.1: S_B and S_W are respectively defined as follows:
S_B = Σ_{i=1}^{c} n_i (m_i − m)(m_i − m)^T
S_W = Σ_{i=1}^{c} Σ_{j=1}^{n_i} (x_j^(i) − m_i)(x_j^(i) − m_i)^T
In the formulas, any sample datum is an n-dimensional vector, n_i denotes the number of class-i samples, x_j^(i) denotes the jth sample of class i, m_i denotes the mean of the class-i samples, m denotes the overall sample mean, and c is the number of classes.
According to the data set D = {(s_1, d_1), (s_2, d_2), ..., (s_h, d_h), ..., (s_w, d_w)} of Step 2, the number of classes c is 2, so only two classes of sample data need to be considered and s_h is a one-dimensional vector; the formulas of Step2.1 therefore simplify to S_W = S_1 + S_2 (the sum of the within-class scatter matrices of the two classes) and S_B = (m_1 − m_2)(m_1 − m_2)^T, which reduces the amount of computation and improves computational efficiency.
The specific steps of training the continuous characteristic data by using the isolated forest model in Step3 are as follows:
step3.1: there are many trees in an soliton forest, and each tree is a binary tree called iTree. The nodes in the iTree either have two children, a left child and a right child, or are themselves leaf nodes. Training a binary tree iTree in an isolated forest model, wherein the training steps are as follows:
step3.1.1: and randomly selecting 30% of data of the number of the original data sets from the original data sets as training data sets of the iTrees, and repeating the step for each tree to be trained so as to ensure that each tree is different.
Step3.1.2: randomly selecting data shAnd the value G of the feature.
Step3.1.3: and classifying all data according to the selected characteristic G, placing the sample with the value of the characteristic G smaller than G in the left subnode, and placing the sample with the value of the characteristic G larger than G in the right subnode.
Step3.1.4: repeat the above steps 3.1.2 and 3.1.3 for the left and right child nodes until the following conditions are met:
only one data or a plurality of same data exist in the nodes; alternatively, the height of the tree has reached the average height L of the tree in the isolated forest model.
Step3.2: and constructing each tree in the isolated forest as Step3.1 to form a forest, and after training the isolated forest model, further calculating the distance of a sample in each tree falling on a leaf node, and taking the distance as a basis for judging an abnormal point.
Step3.3: firstly, leaf nodes and root nodes are distinguished (the root node is the topmost node of the tree, and is a leaf node when no child nodes exist after continuous subdivision), the path length from the leaf node to the root node in each tree is recorded as h (x), the average path length from the leaf node to the root node of a plurality of trees in the forest is recorded as c (x), and E (h (x)) is the expectation of all paths of the sample in the forest.
the value range of S (x, n) is [0,1], and abnormal points are judged according to the value of S (x, n).
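The iTree construction of Steps 3.1.1 to 3.1.4 and the score of Step 3.3 can be sketched for one-dimensional data as follows (with a single feature, Step 3.1.2 reduces to drawing only the split value g). The 30% subsample, the forest of 100 trees, and the toy data are illustrative, and the normaliser c(n) = 2H(n−1) − 2(n−1)/n is the standard isolation-forest choice assumed here.

```python
# Compact isolated-forest sketch: build iTrees, then score S(x, n) = 2^(-E(h(x))/c(n)).
import math
import random

def c(n):
    """Average path length of an unsuccessful BST search; normalises E(h(x))."""
    if n <= 1:
        return 0.0
    harmonic = math.log(n - 1) + 0.5772156649  # H(n-1) via Euler-Mascheroni constant
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def build_itree(data, height, limit):
    # Step 3.1.4 stop conditions: singleton/identical node, or height limit reached.
    if height >= limit or len(data) <= 1 or len(set(data)) == 1:
        return ('leaf', len(data))
    g = random.uniform(min(data), max(data))       # Step 3.1.2: random split value g
    left = [x for x in data if x < g]              # Step 3.1.3: partition by g
    right = [x for x in data if x >= g]
    return ('node', g,
            build_itree(left, height + 1, limit),
            build_itree(right, height + 1, limit))

def path_length(x, tree, height=0):
    if tree[0] == 'leaf':
        return height + c(tree[1])                 # correct for the unbuilt subtree
    _, g, left, right = tree
    return path_length(x, left if x < g else right, height + 1)

def anomaly_score(x, forest, sub):
    e_h = sum(path_length(x, t) for t in forest) / len(forest)
    return 2.0 ** (-e_h / c(sub))                  # S(x, n) = 2^(-E(h(x)) / c(n))

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(256)] + [10.0]  # one clear outlier
sub = int(0.3 * len(data))                                    # Step 3.1.1: 30% subsample
limit = math.ceil(math.log2(sub))
forest = [build_itree(random.sample(data, sub), 0, limit) for _ in range(100)]

print(anomaly_score(10.0, forest, sub) > anomaly_score(0.0, forest, sub))  # True
```

The outlier is isolated after few splits, so its expected path length is short and its score approaches 1, while points near the bulk of the data need many splits and score well below 0.5.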
The specific steps of training discrete feature data by using a Gaussian model in Step3 are as follows:
step3.4: applying normal distribution in a Gaussian model (abnormal detection model), training the probability distribution model by using a training data set, and calculating to obtain a parameter mu in a probability density functioni,σiThereby obtaining a probability distribution of the data.
If a discrete variablesiFollowing a gaussian distribution, the probability density function is then:
for mutually independent random variables { R) not subject to Gaussian distribution1,R2...Ri...RnIn which R isi~(μi~σi 2) Then the joint distribution of the n random variables is a multidimensional normal distribution, and the probability density function is:
step3.5: when calculating the probability of the test set data using the probability density function p (r), if the probability density function p (r) is greater than 80%, the data is considered to follow a normal distribution.
If the probability density function P (r) < then, the data is considered abnormal data.
The gaussian distribution anomaly score is calculated as: t (r) 1-p (r).
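The Gaussian branch of Steps 3.4 and 3.5 can be sketched as follows: fit μ_i, σ_i per feature, form the joint density p(r) of the independent features, and score T(r) = 1 − p(r). So that p(r) stays in [0, 1] and T(r) is a usable score, this sketch normalises the density by its peak value at the mean; that normalisation, like the toy columns, is an assumption of the sketch rather than part of the patent.

```python
# Gaussian anomaly-score sketch: per-feature (mu, sigma), independent joint density,
# T(r) = 1 - p(r) with p normalised so that p(mean vector) = 1.
import math

def fit_gaussian(columns):
    """Return (mu, sigma) per feature from training data given as columns."""
    params = []
    for col in columns:
        mu = sum(col) / len(col)
        var = sum((v - mu) ** 2 for v in col) / len(col)
        params.append((mu, math.sqrt(var)))
    return params

def p_relative(r, params):
    """Joint density of r relative to the density at the mean (so max = 1)."""
    out = 1.0
    for v, (mu, sigma) in zip(r, params):
        out *= math.exp(-((v - mu) ** 2) / (2 * sigma ** 2))
    return out

def t_score(r, params):
    return 1.0 - p_relative(r, params)

cols = [[1, 2, 2, 3, 2, 1, 3, 2], [10, 11, 9, 10, 10, 11, 9, 10]]
params = fit_gaussian(cols)
print(t_score([2, 10], params) < t_score([8, 25], params))  # True
```

A point at the feature means scores T(r) = 0, while a point far from both means scores close to 1, matching the rule that T(r) near 1 indicates an anomaly.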
The specific steps of Step4 are as follows:
Perform the comprehensive calculation from the isolated forest anomaly score S(x, n) and the Gaussian model anomaly score T(r) obtained by the two models of Step 3; the comprehensive anomaly score M(x) is a weighted sum of the two scores.
If the comprehensive anomaly score M(x) is greater than or equal to the threshold E, the data are considered anomalous data.
If the comprehensive anomaly score M(x) is less than the threshold E, the data are considered normal.
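Step 4 can be sketched as a weighted sum of the two scores compared against the threshold E. The equal weights α = 0.5 are an illustrative assumption (the text leaves the exact weighting open), and E = 0.45 is the value the detailed description later reports for its test set.

```python
# Step 4 sketch: M(x) = alpha * S(x, n) + (1 - alpha) * T(r), compared to E.
def combined_score(s_iforest, t_gaussian, alpha=0.5):
    """Weighted sum of the isolated-forest and Gaussian anomaly scores."""
    return alpha * s_iforest + (1.0 - alpha) * t_gaussian

def is_anomalous(s_iforest, t_gaussian, threshold=0.45, alpha=0.5):
    return combined_score(s_iforest, t_gaussian, alpha) >= threshold

print(is_anomalous(0.8, 0.9))  # True: both models flag the point
print(is_anomalous(0.2, 0.1))  # False: both scores are low
```

Since S(x, n) and T(r) both live in [0, 1], any convex combination keeps M(x) in [0, 1], which makes a single fixed threshold E meaningful across data sets.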
The invention provides a combined data anomaly detection method based on an improved isolated forest algorithm and Gaussian distribution, which performs anomaly detection on the data collected on a server platform so as to accurately detect anomalies of the control system on the server platform in real time and discover the abnormal state of the platform data promptly. In detecting server platform data anomalies, the model first uses the DBSCAN clustering algorithm for preliminary denoising and then the LDA linear discriminant method for further dimensionality reduction, so that the data of the original data space are better classified during dimensionality reduction and the low-dimensional data obtained by the mapping are easier to distinguish, which allows the abnormal state of the data to be analyzed better afterwards. The model improves the efficiency of data anomaly detection, finds anomalies more quickly, saves time cost, and has broad application prospects.
The invention has the beneficial effects that:
1. In the training of a model for detecting anomalies, high-dimensional data reduce the accuracy of the model, the computational complexity is high, and anomalies cannot be detected well; therefore the high-dimensional data are subjected to dimensionality reduction by a random forest model, and by removing the noise the abnormal state of the data can be analyzed better, so that an anomaly detection model with high accuracy and recall is trained.
2. After the noise is removed, feature extraction is performed on the denoised data by principal component analysis, converting the data into uncorrelated low-dimensional new data; the new features are relatively independent and contain the main information of the original data, after which an anomaly detection model with high accuracy and recall is trained.
3. The two models trained respectively on the continuous features and the discrete features of the data are considered together: the anomaly scores produced by the two models are combined by a weighted calculation, and if the result is greater than a certain threshold the data are considered anomalous. Experimental results show that the detection performance of the combined isolated forest and Gaussian distribution anomaly detection model trained by the invention is better than that of a single isolated forest or Gaussian distribution model.
4. The invention can improve the efficiency of data anomaly detection, find anomalies more quickly, save time cost and have wide application prospect.
Drawings
FIG. 1 is a flow chart of the steps of the present invention;
FIG. 2 is a diagram illustrating distribution of characteristic values of LDA obtained after random forest denoising according to an embodiment of the present invention;
fig. 3 is a diagram illustrating the distribution of characteristic values of LDA obtained without performing random forest denoising in the embodiment of the present invention.
Detailed Description
The invention is further described with reference to the following drawings and detailed description.
As shown in FIG. 1, a combined data anomaly detection method based on an improved isolated forest algorithm and Gaussian distribution comprises the following specific steps:
step 1: accessing server platform data, selecting two linear data sets A ═ x1,x2....xi...xnY and B ═ y1,y2.....yj.....ynIn which xiAnd yiRespectively representing linear data of the two selected data sets, finding noise points by using a DBSCAN clustering algorithm, and deleting the noise points to obtain two new data sets A1={x1,x2....xi...xoAnd B1={y1,y2.....yj.....ypAnd denoising the test data.
Step 2: data set A using LDA linear decision method1={x1,x2....xi...xoAnd B1={y1,y2.....yj.....ypPerforming further dimension reduction treatment to the set A1={x1,x2....xi...xoAnd B1={y1,y2.....yj.....ypMerge into a new data set D {(s)1,d1),(s2,d2)...(sh,dh)...(sw,dw) In which s ishRepresenting A in two data sets1={x1,x2....xi...xoAnd B1={y1,y2.....yj.....ypLinear data x of } linear data xiAnd yj,dhRepresenting linear numbersAccording to xiAnd yjW ═ o + p, dh∈ {0, 1}, when s ish∈ A, dhWhen s is equal to 0h∈ B, dh1 and then by the formulaAndrespectively calculating the inter-class dispersion matrix S of the data set DBAnd an intra-class dispersion matrix SWThen, a projection matrix W of the data set D is solved, and finally, each feature data x in the data set D is subjected toiReducing dimension and converting into a new sample data set D {(s)1,d1),(s2,d2)...(sh,dh)...(sw,dw) And obtaining the test data with more accurate dimension classification.
Step 3: sample data set after dimension reduction processing according to step2According to the method, a discrete variable refers to data of which the numerical value can only be described by natural numbers, integers, counting units and the like, a continuous variable refers to data which can be taken as values at will in a certain interval, two adjacent numerical values can be subjected to infinite segmentation (namely, infinite values can be taken) as the basis, data of a data set D are divided into data of continuous characteristics and data of discrete characteristics, an isolated forest model is used for training the continuous characteristic data, and whether the continuous characteristic data are normal or not is judged according to abnormal scores S (x, n) of the isolated forest; and training the discrete characteristic data by using a Gaussian distribution model, and judging whether the test data is normal or not according to the value of the abnormal score T (r) of Gaussian distribution.
Step 4: and (3) performing weighted calculation on the abnormal score S (x, n) of the isolated forest and the Gaussian distribution abnormal score T (r) obtained in Step3, comprehensively calculating abnormal scores, and judging test data according to the abnormal scores.
The specific steps of Step1 are as follows:
step1.1: accessing server platform data, selecting two linear data sets A ═ x1,x2....xi...xnY and B ═ y1,y2.....yj.....ynIs given by the formulaSeparately evaluating each data xiAnd yiDensity value of rho (x)i) Or ρ (y)j)。
If ρ (x)i) If not less than 0, it is the core point of the data set A, if ρ (y)j) If the number of the core points is more than or equal to 0, the core points are core points of the data set B, all the core points are recorded and collected, and a set A is formedcAnd BcInstead of the set A of core pointsncAnd BncIs shown asAnd
step1.2: finding a boundary set AbdAnd BbdIf xi∈AncAnd is andsatisfy q ∈ A and q ∈ Nx∩AcI.e. xiHas a core point, then xiThe boundary points are A, and the set formed by all the boundary points in A is collected and recorded as AbdSimilarly, B can be obtainedbd。
Step1.3: the core point set A obtained from Step1.1 and Step1.2c,BcAnd a set of boundary points Abd,BbdCalculatingAndif xi∈AnoiseThen call xiDeleting the noise points as the noise points; if yi∈BnoiseThen call yiIt is deleted as a noise point.
In Step 2, the specific steps for obtaining S_B and S_W are as follows:
Step2.1: S_B and S_W are respectively defined (for the two-class case) as follows:
S_B = (m_1 − m_2)(m_1 − m_2)^T
S_W = Σ_{i=1}^{2} Σ_{j=1}^{n_i} (x_j^(i) − m_i)(x_j^(i) − m_i)^T
In the formulas, any sample datum is an n-dimensional vector, n_i denotes the number of class-i samples, x_j^(i) denotes the jth sample of class i, m_i denotes the mean of the class-i samples, m denotes the overall sample mean, and c is the number of classes (here c = 2).
The specific steps of training the continuous characteristic data by using the isolated forest model in Step3 are as follows:
step3.1: there are many trees in an soliton forest, and each tree is a binary tree called iTree. The nodes in the iTree either have two children, a left child and a right child, or are themselves leaf nodes. Training a binary tree iTree in an isolated forest model, wherein the training steps are as follows:
step3.1.1: and randomly selecting 30% of data of the number of the original data sets from the original data sets as training data sets of the iTrees, and repeating the step for each tree to be trained so as to ensure that each tree is different.
Step3.1.2: randomly selecting data shAnd the value G of the feature.
Step3.1.3: and classifying all data according to the selected characteristic G, placing the sample with the value of the characteristic G smaller than G in the left subnode, and placing the sample with the value of the characteristic G larger than G in the right subnode.
Step3.1.4: repeat the above steps 3.1.2 and 3.1.3 for the left and right child nodes until the following conditions are met:
only one data or a plurality of same data exist in the nodes; alternatively, the height of the tree has reached the average height L of the tree in the isolated forest model.
Step3.2: and constructing each tree in the isolated forest as Step3.1 to form a forest, and after training the isolated forest model, further calculating the distance of a sample in each tree falling on a leaf node, and taking the distance as a basis for judging an abnormal point.
Step3.3: firstly, leaf nodes and root nodes are distinguished (the root node is the topmost node of the tree, and is a leaf node when no child nodes exist after continuous subdivision), the path length from the leaf node to the root node in each tree is recorded as h (x), the average path length from the leaf node to the root node of a plurality of trees in the forest is recorded as c (x), and E (h (x)) is the expectation of all paths of the sample in the forest.
the value range of S (x, n) is [0,1], and the abnormal point is judged according to the value of S (x, n), and the closer to 1, the more likely the abnormal point is.
The specific steps of training discrete feature data by using a Gaussian model in Step3 are as follows:
step3.4: applying normal distribution in a Gaussian model (abnormal detection model), training the probability distribution model by using a training data set, and calculating to obtain a parameter mu in a probability density functioni,σiThereby obtaining a probability distribution of the data.
If a discrete variable siFollowing a gaussian distribution, the probability density function is then:
for mutually independent randoms not obeying Gaussian distributionVariable { R1,R2...Ri...RnIn which R isi~(μi~σi 2) Then the joint distribution of the n random variables is a multidimensional normal distribution, and the probability density function is:
step3.5: when calculating the probability of the test set data using the probability density function p (r), if the probability density function p (r) is greater than 80%, the data is considered to follow a normal distribution.
If the probability density function P (r) is less than the threshold, the data is considered as abnormal data; and for the value, the tolerance of the abnormal data is selected according to the requirement according to the abnormal detection, and the value is 0.05.
The gaussian distribution anomaly score is calculated as: t (r) 1-p (r).
The abnormal condition is determined according to the Gaussian distribution abnormal score, and the closer to 1, the T (r) is, the more abnormal points are.
The specific steps of Step4 are as follows:
Perform the comprehensive calculation from the isolated forest anomaly score S(x, n) and the Gaussian model anomaly score T(r) obtained by the two models of Step 3; the comprehensive anomaly score M(x) is a weighted sum of the two scores.
The threshold E of the anomaly score was determined to be 0.45 by experiments on the test set.
If the comprehensive anomaly score M(x) is greater than or equal to the threshold E, the data are considered anomalous data.
If the comprehensive anomaly score M(x) is less than the threshold E, the data are considered normal.
The invention uses KDDCUP99 as the network layer data set and TEP as the data set of the controller and field device layer to simulate the data set collected by a system server platform. First, the DBSCAN clustering algorithm is used to preliminarily denoise the two data sets, finding the noise points and removing them. The two data sets are then merged into one data set simulating the data collected by the system server platform, and the data are labeled by class. Finally, the LDA linear discriminant method is used to classify the data and extract features; the data features can be transformed into 11 dimensions, and the 11-dimensional features are mutually uncorrelated and represent the main information of the original data.
The specific results are shown in FIG. 2 and FIG. 3. Compared with applying the LDA linear discriminant method directly, after the noise is filtered out by the DBSCAN clustering algorithm the eigenvalue distribution of the sample covariance matrix is more distinct and the dimensionality reduction of the data works better. In general, the main method for reducing the dimensionality of server platform data is principal component analysis (PCA), but PCA does not consider class information, so although the information loss after dimensionality reduction is low, classification may become more difficult. Using the LDA linear discriminant method, the data of the original data space are better classified during dimensionality reduction, and the low-dimensional data obtained by the mapping are easier to distinguish.
For the dimension-reduced data, the attack type PROBE in the KDDCUP99 data set and normal data are selected to train the isolated forest model. The training results on the continuous features and the discrete features of the data are considered together, and the accuracy, recall, and F1 value of the experimental results are analyzed comprehensively. By varying the number of samples per iTree in the isolated forest, the experiments compare the accuracy, recall, and F1 value of the combined improved isolated forest and Gaussian distribution anomaly detection model with those of the isolated forest model alone, and likewise compare the performance of the combined model with that of the single Gaussian distribution model.
When the number of samples per tree is 250, the accuracy, recall, and F1 value of the combined improved isolated forest and Gaussian distribution anomaly detection model on the TEP data set are [0.9178743961352657, 0.95, 0.9336609336609336], while the accuracy, recall, and F1 value of the Gaussian distribution alone on the TEP data set are [0.8115942028985508, 0.84, 0.8255528255528256]. From the experimental results on the KDDCUP99 and TEP data sets, the accuracy, recall, and F1 value of the combined improved isolated forest and Gaussian distribution anomaly detection model are better than the detection results of the single isolated forest model or the single Gaussian distribution model, and the experiments show that the model performs best when the number of samples per iTree in the isolated forest is 250.
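The accuracy, recall, and F1 comparisons above are computed from predicted versus true anomaly labels. A minimal sketch of the three metrics (with toy label vectors, not the KDDCUP99/TEP results) is:

```python
# Evaluation-metric sketch: accuracy, recall, and F1 from binary labels (1 = anomaly).
def accuracy_recall_f1(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, recall, f1

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
acc, rec, f1 = accuracy_recall_f1(y_true, y_pred)
print(acc, rec, f1)  # 0.8 0.75 0.75
```

Recall matters most here because a missed anomaly (false negative) on an industrial control system is far costlier than a false alarm, which is why the experiments report recall alongside accuracy and F1.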
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit and scope of the present invention.
Claims (6)
1. A joint data anomaly detection method based on an improved isolated forest algorithm and Gaussian distribution is characterized in that:
step 1: accessing server platform data, selecting two linear data sets A ═ x1,x2....xi...xnY and B ═ y1,y2.....yj.....ynIn which xiAnd yiRespectively representing linear data of the two selected data sets, finding noise points by using a DBSCAN clustering algorithm, and deleting the noise points to obtain two new data sets A1={x1,x2....xi...xoAnd B1={y1,y2.....yj.....ypDenoising the test data;
step 2: data set A using LDA linear decision method1={x1,x2....xi...xoAnd B1={y1,y2.....yj.....ypPerforming dimension reduction treatment to the set A1={x1,x2....xi...xoAnd B1={y1,y2.....yj.....ypMerge into a new data set D {(s)1,d1),(s2,d2)...(sh,dh)...(sw,dw) In which s ishRepresenting A in two data sets1={x1,x2....xi...xoAnd B1={y1,y2.....yj.....ypLinear data x of } linear data xiAnd yj,dhRepresenting linear data xiAnd yjW ═ o + p, dh∈ {0, 1}, when s ish∈ A, dhWhen s is equal to 0h∈ B, dh1 and then by the formulaAndrespectively calculating the inter-class dispersion matrix S of the data set DBAnd an intra-class dispersion matrix SWThen, a projection matrix W of the data set D is solved, and finally, each feature data x in the data set D is subjected toiReducing dimension and converting into a new sample data set D {(s)1,d1),(s2,d2)...(sh,dh)...(sw,dw) Obtaining more accurate test data of dimension classification;
step 3: and D' {(s) of the sample data set subjected to dimensionality reduction processing according to step21,d1),(s2,d2)...(sh,dh)...(sw,dw) Dividing data of the data set D into data of continuous features and data of discrete features, training the continuous feature data by using an isolated forest model, and judging whether the continuous feature data are normal or not according to abnormal scores S (x, n) of the isolated forest; training discrete characteristic data by using a Gaussian distribution model, and judging whether the test data is normal or not according to the value of the Gaussian distribution abnormal score T (r);
step 4: and (3) performing weighted calculation on the abnormal score S (x, n) of the isolated forest and the Gaussian distribution abnormal score T (r) obtained in Step3, comprehensively calculating abnormal scores, and judging test data according to the abnormal scores.
2. The method for detecting abnormal data based on the improved isolated forest algorithm and Gaussian distribution as claimed in claim 1, wherein step 1 comprises the following specific steps:
step 1.1: accessing server platform data, selecting two linear data sets A = {x1, x2, ..., xi, ..., xn} and B = {y1, y2, ..., yj, ..., yn}, and separately evaluating the density value ρ(xi) or ρ(yj) of each datum xi and yj by the density formula (rendered as an image in the original; in DBSCAN this is typically the number of points falling within the ε-neighbourhood of the point);
If ρ (x)i) If not less than 0, it is the core point of the data set A, if ρ (y)j) If the number of the core points is more than or equal to 0, the core points are core points of the data set B, all the core points are recorded and collected, and a set A is formedcAnd BcInstead of the set A of core pointsncAnd BncIs shown asAnd
step 1.2: finding the boundary sets Abd and Bbd: if xi ∈ Anc and there exists a point q satisfying q ∈ A and q ∈ Nx ∩ Ac, i.e. the neighbourhood of xi contains a core point, then xi is a boundary point of A; the set of all boundary points in A is recorded as Abd, and Bbd is obtained similarly;
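The core/boundary/noise partition of steps 1.1–1.2 can be sketched with a plain-NumPy neighbourhood count standing in for the density formula ρ; the eps and min_samples values are illustrative assumptions:

```python
import numpy as np

def dbscan_partition(X, eps=0.6, min_samples=4):
    """Partition X into core, boundary and noise masks as in steps 1.1-1.2."""
    # rho(x_i): number of points in the eps-neighbourhood of x_i (itself included)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbours = dist <= eps
    rho = neighbours.sum(axis=1)
    core = rho >= min_samples                             # A_c: core points
    # boundary: a non-core point whose neighbourhood contains a core point
    boundary = ~core & (neighbours & core[None, :]).any(axis=1)
    noise = ~core & ~boundary                             # deleted in step 1
    return core, boundary, noise

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 0.2, size=(40, 2)),  # dense cluster
               [[5.0, 5.0]]])                       # one isolated point
core, boundary, noise = dbscan_partition(X)
X_clean = X[~noise]   # the denoised data set A_1
```

The isolated point has a near-empty neighbourhood and no core neighbour, so it ends up in the noise mask and is dropped.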
3. The improved isolated forest algorithm and Gaussian distribution based joint data anomaly detection method according to claim 1, wherein obtaining SB and SW in step 2 comprises the following specific steps:
step 2.1: SB and SW are respectively defined as follows, with m1 and m2 the mean vectors of the two classes:
SB = (m1 − m2)(m1 − m2)^T
SW = Σ_{s∈A1} (s − m1)(s − m1)^T + Σ_{s∈B1} (s − m2)(s − m2)^T
(the SW formula is rendered as an image in the original; the standard within-class scatter definition is shown here.)
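Under these definitions the two scatter matrices, and the Fisher projection direction w = SW⁻¹(m1 − m2) used to build the projection matrix W of step 2, can be computed directly; a NumPy sketch with made-up sample points:

```python
import numpy as np

def scatter_matrices(XA, XB):
    """Between-class S_B and within-class S_W scatter for two classes."""
    m1, m2 = XA.mean(axis=0), XB.mean(axis=0)
    diff = (m1 - m2).reshape(-1, 1)
    S_B = diff @ diff.T                                  # (m1-m2)(m1-m2)^T
    S_W = (XA - m1).T @ (XA - m1) + (XB - m2).T @ (XB - m2)
    return S_B, S_W

XA = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]])      # class A_1
XB = np.array([[6.0, 5.0], [7.0, 7.0], [8.0, 9.0]])      # class B_1
S_B, S_W = scatter_matrices(XA, XB)

# Fisher projection direction: w = S_W^{-1} (m1 - m2)
w = np.linalg.solve(S_W, XA.mean(axis=0) - XB.mean(axis=0))
```

Projecting every sample onto w gives the one-dimensional data set D' used in step 3.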
4. The method for detecting abnormal data based on the improved isolated forest algorithm and Gaussian distribution as claimed in claim 1, wherein training the continuous feature data with the isolated forest model in step 3 comprises the following specific steps:
step 3.1: training the binary trees (iTrees) of the isolated forest model, the training steps being as follows:
step 3.1.1: randomly selecting 30% of the data in the original data set as the training data set of an iTree, and repeating this step for each tree so that each tree is different;
step 3.1.2: randomly selecting a characteristic G of the data sh and a split value g of that characteristic;
step 3.1.3: classifying all data according to the selected characteristic G: samples whose value of G is smaller than g are placed in the left child node, and the remaining samples in the right child node;
step 3.1.4: repeating steps 3.1.2 and 3.1.3 for the left and right child nodes until one of the following conditions is met:
the node contains only one datum or several identical data, or the tree height reaches the average height L of the trees in the isolated forest model;
step 3.2: constructing each tree of the isolated forest as in step 3.1 to form the forest; after the isolated forest model has been trained, the path length at which a sample falls on a leaf node in each tree is further calculated and used as the basis for judging anomalous points;
step 3.3: distinguishing leaf nodes from the root node: the path length from a leaf node to the root node of each tree is recorded as h(x), the average path length from leaf node to root node over the trees of the forest is recorded as c(x), and E(h(x)) is the expectation of all paths of a sample in the forest; the anomaly score is then S(x, n) = 2^(−E(h(x))/c(x)) (the score formula is rendered as an image in the original; the standard isolation-forest form is shown here);
The value range of S(x, n) is [0, 1], and anomalous points are judged according to the value of S(x, n): the closer the score is to 1, the more likely the point is anomalous.
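The score of step 3.3 is the standard isolation-forest anomaly score S(x, n) = 2^(−E(h(x))/c), with c the normalising average path length. A sketch using scikit-learn's IsolationForest, whose score_samples returns the negated score; the training data and seeds are illustrative, and max_samples=0.3 mirrors the 30% subsampling of step 3.1.1:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
X_train = rng.normal(0.0, 1.0, size=(300, 2))     # continuous feature data

# 100 iTrees, each grown on a random 30% subsample (step 3.1.1)
model = IsolationForest(n_estimators=100, max_samples=0.3,
                        random_state=1).fit(X_train)

X_test = np.array([[0.0, 0.0],    # central point: low anomaly score
                   [8.0, 8.0]])   # isolated point: score approaching 1
S = -model.score_samples(X_test)  # recover S(x, n) in (0, 1]
```

The isolated point has a much shorter expected path length E(h(x)) and therefore the larger S(x, n) of the two.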
5. The method for detecting abnormal data based on the improved isolated forest algorithm and Gaussian distribution as claimed in claim 1, wherein training the discrete feature data with the Gaussian model in step 3 comprises the following specific steps:
step 3.4: applying the normal distribution in the Gaussian model: training the probability distribution model with the training data set and calculating the parameters μi and σi of the probability density function, thereby obtaining the probability distribution of the data;
If a discrete variable si follows a Gaussian distribution, its probability density function is:
p(si) = (1 / (√(2π) σi)) exp(−(si − μi)² / (2σi²))
For mutually independent random variables {R1, R2, ..., Ri, ..., Rn} each following a Gaussian distribution, with Ri ~ N(μi, σi²), the joint distribution of the n random variables is a multidimensional normal distribution with probability density function:
P(r) = ∏_{i=1}^{n} (1 / (√(2π) σi)) exp(−(ri − μi)² / (2σi²))
(both density formulas are rendered as images in the original; the standard forms are shown here.)
step 3.5: calculating the probability of the test-set data with the probability density function P(r): if P(r) is greater than 80%, the data are considered to follow the normal distribution; if it is smaller, the data are considered abnormal;
The Gaussian distribution anomaly score is calculated as: T(r) = 1 − P(r).
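The Gaussian branch can be sketched with only the standard library. Note one assumption: the patent compares the density P(r) against 80%, whereas the sketch below interprets P(r) as the two-sided tail probability of N(μ, σ²), which keeps T(r) = 1 − P(r) in [0, 1] regardless of the data's scale; the training values are made up:

```python
import math

def fit_gaussian(train):
    """Estimate mu and sigma from the training data (step 3.4)."""
    mu = sum(train) / len(train)
    var = sum((x - mu) ** 2 for x in train) / len(train)
    return mu, math.sqrt(var)

def anomaly_score(r, mu, sigma):
    """T(r) = 1 - P(r), with P(r) taken as the two-sided tail probability."""
    p = math.erfc(abs(r - mu) / (sigma * math.sqrt(2)))  # P(|R - mu| >= |r - mu|)
    return 1.0 - p

mu, sigma = fit_gaussian([4.8, 5.0, 5.2, 4.9, 5.1, 5.0])
t_normal = anomaly_score(5.0, mu, sigma)    # near 0: judged normal
t_outlier = anomaly_score(9.0, mu, sigma)   # near 1: flagged as abnormal
```

Under this reading, the "P(r) > 80%" rule of step 3.5 becomes T(r) < 0.2.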
6. The method for detecting abnormal data based on the improved isolated forest algorithm and Gaussian distribution as claimed in claim 1, wherein step 4 comprises the following specific steps:
Performing a comprehensive calculation from the isolated forest anomaly score S(x, n) and the Gaussian model anomaly score T(r) obtained by the step-3 models, the comprehensive anomaly score M(x) being a weighted combination of the two (the exact weighting formula is rendered as an image in the original);
If the comprehensive anomaly score M(x) is greater than or equal to the threshold E, the data are considered abnormal;
if M(x) is less than the threshold E, the data are considered normal.
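The decision rule of claim 6 reduces to a thresholded weighted sum. A sketch with hypothetical equal weights α = β = 0.5 and a hypothetical threshold E = 0.6 (neither the weights nor E is given in this extract):

```python
def combined_score(s, t, alpha=0.5, beta=0.5):
    """M(x) = alpha * S(x, n) + beta * T(r); the weights are illustrative."""
    return alpha * s + beta * t

E = 0.6  # hypothetical decision threshold

def is_abnormal(s, t, threshold=E):
    """Claim 6 decision rule: abnormal iff M(x) >= E."""
    return combined_score(s, t) >= threshold

# e.g. a sample scored S(x, n) = 0.85 by the forest and T(r) = 0.70 by the
# Gaussian branch gives M(x) = 0.775 >= E, so it is judged abnormal
verdict = is_abnormal(0.85, 0.70)
```

Tuning α against β trades off the continuous-feature evidence against the discrete-feature evidence.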
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010402904.2A CN111666169B (en) | 2020-05-13 | 2020-05-13 | Improved isolated forest algorithm and Gaussian distribution-based combined data anomaly detection method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111666169A true CN111666169A (en) | 2020-09-15 |
CN111666169B CN111666169B (en) | 2023-03-28 |
Family
ID=72383483
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108777873A (en) * | 2018-06-04 | 2018-11-09 | 江南大学 | The wireless sensor network abnormal deviation data examination method of forest is isolated based on weighted blend |
CN109345137A (en) * | 2018-10-22 | 2019-02-15 | 广东精点数据科技股份有限公司 | A kind of rejecting outliers method based on agriculture big data |
CN110046665A (en) * | 2019-04-17 | 2019-07-23 | 成都信息工程大学 | Based on isolated two abnormal classification point detecting method of forest, information data processing terminal |
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112185484A (en) * | 2020-10-13 | 2021-01-05 | 华北科技学院 | AdaBoost model-based water quality characteristic mineral water classification method |
CN112711577A (en) * | 2020-12-17 | 2021-04-27 | 航天信息股份有限公司 | Abnormal transaction enterprise identification method and device |
CN112990314B (en) * | 2021-03-16 | 2023-04-18 | 中国科学院长春光学精密机械与物理研究所 | Hyperspectral image anomaly detection method and device based on improved isolated forest algorithm |
CN112990314A (en) * | 2021-03-16 | 2021-06-18 | 中国科学院长春光学精密机械与物理研究所 | Hyperspectral image anomaly detection method and device based on improved isolated forest algorithm |
CN112905583A (en) * | 2021-04-01 | 2021-06-04 | 辽宁工程技术大学 | High-dimensional big data outlier detection method |
CN113723477A (en) * | 2021-08-16 | 2021-11-30 | 同盾科技有限公司 | Cross-feature federal abnormal data detection method based on isolated forest |
CN113723477B (en) * | 2021-08-16 | 2024-04-30 | 同盾科技有限公司 | Cross-feature federal abnormal data detection method based on isolated forest |
CN114925731A (en) * | 2022-06-06 | 2022-08-19 | 华电金沙江上游水电开发有限公司叶巴滩分公司 | Method for detecting abnormal value of monitoring data of flexible inclinometer |
CN114925731B (en) * | 2022-06-06 | 2024-05-31 | 华电金沙江上游水电开发有限公司叶巴滩分公司 | Method for detecting abnormal value of monitoring data of flexible inclinometer |
CN114826971A (en) * | 2022-06-28 | 2022-07-29 | 苏州浪潮智能科技有限公司 | Server abnormity detection method, device, equipment and readable storage medium |
WO2024001254A1 (en) * | 2022-06-28 | 2024-01-04 | 苏州元脑智能科技有限公司 | Server anomaly detection method and apparatus, device, and readable storage medium |
CN115021679B (en) * | 2022-08-09 | 2022-11-04 | 国网山西省电力公司大同供电公司 | Photovoltaic equipment fault detection method based on multi-dimensional outlier detection |
CN115021679A (en) * | 2022-08-09 | 2022-09-06 | 国网山西省电力公司大同供电公司 | Photovoltaic equipment fault detection method based on multi-dimensional outlier detection |
CN115457414B (en) * | 2022-09-15 | 2023-05-05 | 西华大学 | Unmanned aerial vehicle abnormal behavior identification method based on improved residual error network |
CN115760484A (en) * | 2022-12-07 | 2023-03-07 | 湖北华中电力科技开发有限责任公司 | Method, device and system for improving potential danger identification capability of power distribution station area and storage medium |
CN117114677A (en) * | 2023-07-12 | 2023-11-24 | 北京中盛九橙企业管理咨询有限公司 | Digital wallet management system and method |
CN117238058A (en) * | 2023-11-10 | 2023-12-15 | 无锡明诚汽车部件有限公司 | Starter monitoring method for automobile based on data analysis |
CN117238058B (en) * | 2023-11-10 | 2024-01-26 | 无锡明诚汽车部件有限公司 | Starter monitoring method for automobile based on data analysis |
CN117874653A (en) * | 2024-03-11 | 2024-04-12 | 武汉佳华创新电气有限公司 | Power system safety monitoring method and system based on multi-source data |
CN117874653B (en) * | 2024-03-11 | 2024-05-31 | 武汉佳华创新电气有限公司 | Power system safety monitoring method and system based on multi-source data |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111666169B (en) | Improved isolated forest algorithm and Gaussian distribution-based combined data anomaly detection method | |
CN108737406B (en) | Method and system for detecting abnormal flow data | |
CN110263846B (en) | Fault diagnosis method based on fault data deep mining and learning | |
CN107493277B (en) | Large data platform online anomaly detection method based on maximum information coefficient | |
CN111798312A (en) | Financial transaction system abnormity identification method based on isolated forest algorithm | |
CN111556016B (en) | Network flow abnormal behavior identification method based on automatic encoder | |
CN102176698A (en) | Method for detecting abnormal behaviors of user based on transfer learning | |
CN112039903B (en) | Network security situation assessment method based on deep self-coding neural network model | |
CN111915418A (en) | Internet financial fraud online detection method and device | |
CN117421684A (en) | Abnormal data monitoring and analyzing method based on data mining and neural network | |
CN112580749B (en) | Intelligent fire detection method based on machine olfaction technology | |
CN113556319B (en) | Intrusion detection method based on long-short term memory self-coding classifier under internet of things | |
CN114124482A (en) | Access flow abnormity detection method and device based on LOF and isolated forest | |
CN112529638B (en) | Service demand dynamic prediction method and system based on user classification and deep learning | |
CN113489685A (en) | Secondary feature extraction and malicious attack identification method based on kernel principal component analysis | |
CN112199670A (en) | Log monitoring method for improving IFOREST (entry face detection sequence) to conduct abnormity detection based on deep learning | |
CN111600878A (en) | Low-rate denial of service attack detection method based on MAF-ADM | |
CN115438102A (en) | Space-time data anomaly identification method and device and electronic equipment | |
CN115659143A (en) | Fault real-time diagnosis method based on experimental design | |
CN110189799B (en) | Metagenome feature selection method based on variable importance score and Neyman Pearson test | |
CN118194186A (en) | Snail meat feed production management method and system based on Internet of things | |
CN116318925A (en) | Multi-CNN fusion intrusion detection method, system, medium, equipment and terminal | |
CN113722230B (en) | Integrated evaluation method and device for vulnerability mining capability of fuzzy test tool | |
CN113780084B (en) | Face data amplification method based on generation type countermeasure network, electronic equipment and storage medium | |
CN113010673A (en) | Vulnerability automatic classification method based on entropy optimization support vector machine |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||