CN111666169B - Improved isolated forest algorithm and Gaussian distribution-based combined data anomaly detection method - Google Patents

Publication number
CN111666169B
Authority
CN
China
Prior art keywords: data, Gaussian distribution, abnormal, isolated forest, model
Prior art date
Legal status
Active
Application number
CN202010402904.2A
Other languages
Chinese (zh)
Other versions
CN111666169A
Inventor
田园
原野
马文
黄祖源
郭一翰
Current Assignee
Information Center of Yunnan Power Grid Co Ltd
Original Assignee
Information Center of Yunnan Power Grid Co Ltd
Priority date
Filing date
Publication date
Application filed by Information Center of Yunnan Power Grid Co Ltd
Priority to CN202010402904.2A
Publication of CN111666169A
Application granted
Publication of CN111666169B

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 — Error detection; Error correction; Monitoring
    • G06F 11/07 — Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/0703 — Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F 11/0751 — Error or fault detection not based on redundancy
    • G06F 18/00 — Pattern recognition
    • G06F 18/20 — Analysing
    • G06F 18/23 — Clustering techniques


Abstract

The invention relates to a combined data anomaly detection method based on an improved isolated forest algorithm and Gaussian distribution, and belongs to the technical field of data anomaly detection. The invention analyzes platform data and detects data anomalies with an improved joint detection model of an isolated forest and a Gaussian distribution: addressing the two kinds of characteristics of the industrial internet platform, continuous and discrete, the combined Gaussian distribution and isolated forest detection model accurately detects anomalies of the industrial control system on the industrial internet platform in real time, so that the abnormal state of the system is found in time. The detection effect of the trained model combining the isolated forest and the Gaussian distribution is compared with that of a single Gaussian distribution or isolated forest model. In the detection of server platform data anomalies, the method improves the efficiency of data anomaly detection, finds anomalies more quickly, saves time cost, and has a wide application prospect.

Description

Improved isolated forest algorithm and Gaussian distribution-based combined data anomaly detection method
Technical Field
The invention relates to a combined data anomaly detection method based on an improved isolated forest algorithm and Gaussian distribution, and belongs to the technical field of data anomaly detection for service-monitoring server platforms.
Background
A server is a computer that runs faster, carries a heavier load, and costs more than a regular computer. A server provides computing or application services to other clients in the network (terminals such as PCs, smartphones, and ATMs, and even large equipment such as train systems). Server platform data are acquired by a data collector and comprise data from the field device layer, the control system, and the network layer of an industrial control system. These data contain much important information; most importantly, the abnormal state of the server system can be analyzed from them, and an abnormal event of the server system is likely to cause huge economic losses in production and may even seriously endanger people's life, health, and safety. In the prior art, most methods for judging server data anomalies require the server system platform to analyze the data acquired from the control system and the network layer to find whether the system is abnormal. Most platform data are high-dimensional and noisy; in the model training process for anomaly detection, high-dimensional data reduce the accuracy of the model, raise the computational complexity, and prevent anomalies from being detected well, so the high-dimensional data must undergo dimensionality reduction.
Disclosure of Invention
The invention aims to provide a combined data anomaly detection method based on an improved isolated forest algorithm and Gaussian distribution to solve the above problems.
The technical scheme of the invention is as follows: a joint data anomaly detection method based on an improved isolated forest algorithm and Gaussian distribution comprises the following specific steps:
Step 1: Access the server platform data and select two linear data sets A = {x_1, x_2, …, x_i, …, x_n} and B = {y_1, y_2, …, y_j, …, y_n}, where x_i and y_j denote the linear data of the two selected data sets. Find the noise points with the DBSCAN clustering algorithm and delete them, obtaining two new, denoised data sets A_1 = {x_1, x_2, …, x_i, …, x_o} and B_1 = {y_1, y_2, …, y_j, …, y_p}.
Step 2: Use the LDA linear discriminant method to further reduce the dimensionality of the data sets A_1 = {x_1, …, x_o} and B_1 = {y_1, …, y_p} obtained in Step 1 after the preliminary noise removal. To apply the method conveniently, merge A_1 and B_1 into a new data set D = {(s_1, d_1), (s_2, d_2), …, (s_h, d_h), …, (s_w, d_w)}, where s_h denotes a linear datum x_i or y_j from A_1 or B_1, d_h denotes its class label, w = o + p, and d_h ∈ {0, 1}: d_h = 0 when s_h ∈ A_1 and d_h = 1 when s_h ∈ B_1. Then, by the formulas

S_B = Σ_{i=1}^{c} n_i (m_i − m)(m_i − m)^T and S_W = Σ_{i=1}^{c} Σ_{j=1}^{n_i} (x_j^{(i)} − m_i)(x_j^{(i)} − m_i)^T,

respectively compute the between-class scatter matrix S_B and the within-class scatter matrix S_W of the data set D, then solve for the projection matrix W of D, and finally reduce the dimension of every feature datum x_i in D, converting it into a new sample data set D′ = {(s_1, d_1), (s_2, d_2), …, (s_h, d_h), …, (s_w, d_w)} whose dimensions are classified more accurately.
Step 3: Divide the dimension-reduced sample data set D′ from Step 2 into continuous-feature data and discrete-feature data, on the basis that a discrete variable takes values that can only be described by natural numbers, integers, counting units and the like, while a continuous variable can take any value within some interval (any two adjacent values can be subdivided without limit, i.e. infinitely many values can be taken). Train the isolated forest model on the continuous-feature data and judge whether a continuous-feature datum is normal from the isolated forest anomaly score S(x, n); train the Gaussian distribution model on the discrete-feature data and judge whether a test datum is normal from the value of the Gaussian anomaly score T(r).
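As an illustrative sketch only (not part of the claimed method), the continuous/discrete split of Step 3 might be implemented as follows, assuming the platform samples sit in a pandas DataFrame; the dtype-based heuristic is an assumption, since the patent defines the split only in terms of variable type:

```python
import pandas as pd

def split_features(df: pd.DataFrame):
    """Split the columns of a sample table into continuous-feature and
    discrete-feature data, as required by Step 3. Heuristic (an
    assumption, not specified in the patent): float-typed columns are
    treated as continuous, everything else as discrete."""
    continuous = [c for c in df.columns if pd.api.types.is_float_dtype(df[c])]
    discrete = [c for c in df.columns if c not in continuous]
    return df[continuous], df[discrete]

# Toy platform sample: a continuous sensor reading and a discrete counter.
df = pd.DataFrame({"temp": [20.1, 20.5, 21.3, 35.9],
                   "alarms": [0, 1, 0, 7]})
cont, disc = split_features(df)
```

The continuous columns would then feed the isolated forest model and the discrete columns the Gaussian model.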
Step 4: Weight the isolated forest anomaly score S(x, n) and the Gaussian anomaly score T(r) obtained in Step 3 to compute a comprehensive anomaly score, and judge the test data according to it.
The specific steps of Step1 are as follows:
Step 1.1: Access the server platform data and select two linear data sets A = {x_1, x_2, …, x_i, …, x_n} and B = {y_1, y_2, …, y_j, …, y_n}. By the formula

ρ(x_i) = |N_ε(x_i)| − MinPts,

where N_ε(·) denotes the ε-neighborhood of a point and MinPts the minimum number of neighbors, separately evaluate the density value ρ(x_i) or ρ(y_j) of each datum x_i and y_j. If ρ(x_i) ≥ 0, x_i is a core point of data set A; if ρ(y_j) ≥ 0, y_j is a core point of data set B. Record and collect all core points into the sets A_c and B_c; the sets of non-core points A_nc and B_nc are given by A_nc = A \ A_c and B_nc = B \ B_c.
Step 1.2: Find the boundary sets A_bd and B_bd. If x_i ∈ A_nc and there exists a point q with q ∈ A and q ∈ N_{x_i} ∩ A_c, i.e. the neighborhood of x_i contains a core point, then x_i is a boundary point of A. Collect all boundary points of A into the set A_bd; B_bd is obtained in the same way.
Step 1.3: From the core point sets A_c, B_c obtained in Step 1.1 and the boundary point sets A_bd, B_bd obtained in Step 1.2, compute

A_noise = A \ (A_c ∪ A_bd) and B_noise = B \ (B_c ∪ B_bd).

If x_i ∈ A_noise, x_i is called a noise point and is deleted; if y_j ∈ B_noise, y_j is called a noise point and is deleted.
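The noise removal of Steps 1.1–1.3 can be sketched in a few lines of NumPy; the point coordinates, ε, and MinPts values below are illustrative assumptions:

```python
import numpy as np

def dbscan_denoise(X: np.ndarray, eps: float, min_pts: int):
    """Steps 1.1-1.3: a point is a core point when its eps-neighborhood
    holds at least min_pts points, i.e. rho(x) = |N_eps(x)| - min_pts >= 0;
    a non-core point whose neighborhood contains a core point is a
    boundary point; every remaining point is noise and is deleted."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neigh = dist <= eps                                   # eps-neighborhoods (incl. self)
    core = neigh.sum(axis=1) >= min_pts                   # Step 1.1: core points
    border = ~core & (neigh & core[None, :]).any(axis=1)  # Step 1.2: boundary points
    keep = core | border                                  # Step 1.3: the rest is noise
    return X[keep], keep

rng = np.random.default_rng(0)
cluster = rng.normal(0.0, 0.2, size=(60, 2))        # one dense cluster
outliers = np.array([[5.0, 5.0], [-6.0, 4.0]])      # two isolated noise points
X = np.vstack([cluster, outliers])
X_clean, keep = dbscan_denoise(X, eps=0.5, min_pts=5)
```

The two isolated points fall in neither the core nor the boundary set and are dropped, while the dense cluster survives the denoising.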
The specific steps for obtaining S_B and S_W in Step 2 are as follows:
Step 2.1: S_B and S_W are defined respectively as:

S_B = Σ_{i=1}^{c} n_i (m_i − m)(m_i − m)^T
S_W = Σ_{i=1}^{c} Σ_{j=1}^{n_i} (x_j^{(i)} − m_i)(x_j^{(i)} − m_i)^T

where every sample datum is an n-dimensional vector, n_i denotes the number of class-i samples, x_j^{(i)} denotes the j-th sample of class i, m_i denotes the mean of the class-i samples, m denotes the overall sample mean, and c is the number of classes.
For the data set D = {(s_1, d_1), (s_2, d_2), …, (s_h, d_h), …, (s_w, d_w)} described in Step 2 the number of classes c is 2, so only two classes of sample data need to be considered, and s_h is a one-dimensional vector. The formulas of Step 2.1 can therefore be written as

S_W = Σ_{d_h = 0} (s_h − m_1)(s_h − m_1)^T + Σ_{d_h = 1} (s_h − m_2)(s_h − m_2)^T and S_B = (m_1 − m_2)(m_1 − m_2)^T,

which reduces the amount of computation and improves the computational efficiency.
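A minimal sketch of the two-class computation of Step 2.1, assuming (as LDA for c = 2 implies) a single discriminant direction and synthetic sample data:

```python
import numpy as np

def lda_direction(S: np.ndarray, d: np.ndarray) -> np.ndarray:
    """Two-class LDA of Step 2.1: with class means m1 and m2,
    S_B = (m1 - m2)(m1 - m2)^T and S_W is the sum of the per-class
    scatters; the optimal projection direction is w = S_W^{-1} (m1 - m2)."""
    X0, X1 = S[d == 0], S[d == 1]
    m1, m2 = X0.mean(axis=0), X1.mean(axis=0)
    Sw = (X0 - m1).T @ (X0 - m1) + (X1 - m2).T @ (X1 - m2)
    w = np.linalg.solve(Sw, m1 - m2)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(1)
S = np.vstack([rng.normal([0.0, 0.0], 0.5, size=(50, 2)),
               rng.normal([3.0, 3.0], 0.5, size=(50, 2))])
d = np.repeat([0, 1], 50)
w = lda_direction(S, d)
z = S @ w   # one-dimensional data after the dimensionality reduction
```

Projecting onto w keeps the two classes well separated in the reduced space, which is the point of using LDA rather than an unsupervised method here.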
The specific steps for training the isolated forest model on the continuous-feature data in Step 3 are as follows:
Step 3.1: There are many trees in the isolated forest, and each tree is a binary tree called an iTree. A node in an iTree either has two children, a left child and a right child, or is itself a leaf node. A binary tree iTree in the isolated forest model is trained as follows:
Step 3.1.1: Randomly select 30% of the data of the original data set as the training data set of the iTree, and repeat this step for every tree so that each tree is different.
Step 3.1.2: Randomly select a feature of the data s_h and a split value G of that feature.
Step 3.1.3: Partition all the data according to the selected feature: samples whose value of the feature is smaller than G are placed in the left child node, and samples whose value is larger than G are placed in the right child node.
Step 3.1.4: Repeat Steps 3.1.2 and 3.1.3 on the left and right child nodes until one of the following conditions is met: the node contains only one datum or several identical data; or the height of the tree has reached the average tree height L of the isolated forest model.
Step 3.2: Construct every tree in the isolated forest as in Step 3.1 to form the forest. After the isolated forest model is trained, the path length at which a sample falls onto a leaf node in each tree is further computed and used as the basis for judging anomalous points.
Step 3.3: First distinguish leaf nodes from the root node (the root node is the topmost node of a tree; a node that can no longer be subdivided into child nodes is a leaf node). Record the path length from the leaf node reached by a sample to the root node of a tree as h(x), record the average path length of the trees in the forest as c(n), and let E(h(x)) be the expectation of the sample's path lengths over all trees in the forest.
The anomaly score is:

S(x, n) = 2^(−E(h(x)) / c(n))

The value of S(x, n) lies in [0, 1], and anomalous points are judged from the value of S(x, n).
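The score formula can be checked numerically. The normalization c(n) below uses the standard isolation-forest form c(n) = 2H(n−1) − 2(n−1)/n with the harmonic number approximated by ln(n−1) plus Euler's constant; this exact form is an assumption, since the description only names c as the average path length:

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant

def c_factor(n: int) -> float:
    """Average path length c(n) of an iTree built from n samples:
    c(n) = 2*H(n-1) - 2*(n-1)/n, with H(i) ~ ln(i) + Euler's constant."""
    if n <= 1:
        return 0.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def anomaly_score(expected_path_length: float, n: int) -> float:
    """S(x, n) = 2 ** (-E(h(x)) / c(n)); values near 1 mark anomalies."""
    return 2.0 ** (-expected_path_length / c_factor(n))

# A point isolated after very few splits scores close to 1, while a point
# whose expected path length equals the average c(n) scores exactly 0.5.
quick_isolation = anomaly_score(1.0, 256)
average_depth = anomaly_score(c_factor(256), 256)
```

This makes the decision rule concrete: scores well above 0.5 mark candidates for anomalies, scores near or below 0.5 mark ordinary points.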
The specific steps for training the Gaussian model on the discrete-feature data in Step 3 are as follows:
Step 3.4: Apply the normal distribution in the Gaussian model (the anomaly detection model): train the probability distribution model with the training data set and compute the parameters μ_i and σ_i of the probability density function, thereby obtaining the probability distribution of the data.
If a discrete variable s_i follows a Gaussian distribution, its probability density function is:

p(s_i) = (1 / (√(2π) σ_i)) exp(−(s_i − μ_i)² / (2σ_i²))

For mutually independent random variables {R_1, R_2, …, R_i, …, R_n}, each with R_i ~ N(μ_i, σ_i²), the joint distribution of the n random variables is a multidimensional normal distribution with probability density function:

P(r) = Π_{i=1}^{n} (1 / (√(2π) σ_i)) exp(−(r_i − μ_i)² / (2σ_i²))

Step 3.5: Compute the probability of the test-set data with the probability density function P(r): if P(r) is greater than 80%, the data are considered to follow the normal distribution; if P(r) < ε, the data are considered abnormal.
The Gaussian distribution anomaly score is computed as T(r) = 1 − P(r).
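A sketch of Steps 3.4–3.5, assuming independent features so that P(r) factorizes; note that P(r) is a probability density, so T(r) = 1 − P(r) here follows the formula as written rather than a calibrated probability, and the training data below are synthetic:

```python
import numpy as np

def fit_gaussian(train: np.ndarray):
    """Estimate the per-feature parameters mu_i and sigma_i (Step 3.4)."""
    return train.mean(axis=0), train.std(axis=0)

def density(r: np.ndarray, mu: np.ndarray, sigma: np.ndarray) -> float:
    """P(r) = prod_i N(r_i; mu_i, sigma_i^2) for independent features."""
    p = np.exp(-((r - mu) ** 2) / (2.0 * sigma ** 2)) / (np.sqrt(2.0 * np.pi) * sigma)
    return float(np.prod(p))

def gaussian_anomaly_score(r, mu, sigma) -> float:
    """T(r) = 1 - P(r): the closer T(r) is to 1, the more anomalous r is."""
    return 1.0 - density(r, mu, sigma)

rng = np.random.default_rng(2)
train = rng.normal(loc=[0.0, 5.0], scale=[1.0, 2.0], size=(1000, 2))
mu, sigma = fit_gaussian(train)
typical = np.array([0.0, 5.0])
outlier = np.array([6.0, -4.0])
```

A point near the fitted means keeps a noticeable density and a moderate score, while a point several standard deviations out scores essentially 1.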
The specific steps of Step 4 are as follows:
Perform a comprehensive calculation on the isolated forest anomaly score S(x, n) and the Gaussian model anomaly score T(r) obtained from the two models of Step 3; the comprehensive anomaly score M(x) is the weighted sum

M(x) = w_1 · S(x, n) + w_2 · T(r),

where w_1 and w_2 are the weights assigned to the two scores. If the comprehensive anomaly score M(x) is greater than or equal to the threshold E, the data are considered abnormal; if M(x) is less than the threshold E, the data are considered normal.
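A minimal sketch of Step 4. Equal weights w_1 = w_2 = 0.5 are an assumption (the patent does not fix them), while the default threshold of 0.45 follows the value the embodiment reports for E:

```python
def combined_score(s_iforest: float, t_gauss: float,
                   w1: float = 0.5, w2: float = 0.5) -> float:
    """Composite anomaly score M(x) as a weighted sum of the isolated
    forest score S(x, n) and the Gaussian score T(r). The equal weights
    are an illustrative assumption."""
    return w1 * s_iforest + w2 * t_gauss

def is_anomalous(s_iforest: float, t_gauss: float,
                 threshold: float = 0.45) -> bool:
    """Flag data whose composite score M(x) reaches the threshold E."""
    return combined_score(s_iforest, t_gauss) >= threshold
```

For example, scores S = 0.9 and T = 0.8 combine to M = 0.85 and are flagged, while S = 0.3 and T = 0.2 combine to M = 0.25 and pass as normal.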
The invention provides a combined data anomaly detection method based on an improved isolated forest algorithm and Gaussian distribution, which performs anomaly detection on the data acquired on a server platform so as to accurately detect anomalies of the control system on the server platform in real time and find the abnormal state of the platform data in time. In detecting server platform data anomalies, the model first performs preliminary denoising with the DBSCAN clustering algorithm and then applies the LDA linear discriminant method for further dimensionality reduction, so that the data in the original data space are better classified during dimensionality reduction and the low-dimensional data obtained by the mapping are easier to distinguish, which allows the abnormal state of the data to be analyzed better afterwards. The method improves the efficiency of data anomaly detection, finds anomalies more quickly, saves time cost, and has a wide application prospect.
The invention has the beneficial effects that:
1. In the model training process for anomaly detection, high-dimensional data reduce the accuracy of the model, raise the computational complexity, and prevent anomalies from being detected well; therefore the high-dimensional data undergo dimensionality reduction with a random forest model, the noise is removed so that the abnormal state of the data can be analyzed better, and an anomaly detection model with high accuracy and recall is trained.
2. After the noise is removed, features are extracted from the denoised data by principal component analysis and converted into uncorrelated low-dimensional new data; the new features are relatively independent, contain the main information of the original data, and an anomaly detection model with high accuracy and recall is then trained.
3. The two models trained respectively on the continuous and discrete features of the data are considered together: the anomaly scores produced by the two models are weighted, and if the result exceeds a certain threshold the data are regarded as abnormal. Experimental results show that the detection effect of the combined anomaly detection model based on the isolated forest and the Gaussian distribution trained by the invention is better than that of a single isolated forest or Gaussian distribution model.
4. The invention improves the efficiency of data anomaly detection, finds anomalies more quickly, saves time cost, and has a wide application prospect.
Drawings
FIG. 1 is a flow chart of the steps of the present invention;
FIG. 2 is a diagram illustrating distribution of characteristic values of LDA obtained after random forest denoising according to an embodiment of the present invention;
fig. 3 is a diagram illustrating the distribution of characteristic values of LDA obtained without performing random forest denoising in the embodiment of the present invention.
Detailed Description
The invention is further described with reference to the following drawings and detailed description.
As shown in fig. 1, a joint data anomaly detection method based on an improved isolated forest algorithm and Gaussian distribution comprises the following specific steps:
Step 1: Access the server platform data and select two linear data sets A = {x_1, x_2, …, x_i, …, x_n} and B = {y_1, y_2, …, y_j, …, y_n}, where x_i and y_j denote the linear data of the two selected data sets. Find the noise points with the DBSCAN clustering algorithm and delete them, obtaining two new, denoised data sets A_1 = {x_1, x_2, …, x_i, …, x_o} and B_1 = {y_1, y_2, …, y_j, …, y_p}.
Step 2: Use the LDA linear discriminant method to further reduce the dimensionality of the data sets A_1 = {x_1, …, x_o} and B_1 = {y_1, …, y_p}: merge A_1 and B_1 into a new data set D = {(s_1, d_1), (s_2, d_2), …, (s_h, d_h), …, (s_w, d_w)}, where s_h denotes a linear datum x_i or y_j from A_1 or B_1, d_h denotes its class label, w = o + p, and d_h ∈ {0, 1}: d_h = 0 when s_h ∈ A_1 and d_h = 1 when s_h ∈ B_1. Then, by the formulas

S_B = Σ_{i=1}^{c} n_i (m_i − m)(m_i − m)^T and S_W = Σ_{i=1}^{c} Σ_{j=1}^{n_i} (x_j^{(i)} − m_i)(x_j^{(i)} − m_i)^T,

respectively compute the between-class scatter matrix S_B and the within-class scatter matrix S_W of the data set D, then solve for the projection matrix W of D, and finally reduce the dimension of every feature datum x_i in D, converting it into a new sample data set D′ = {(s_1, d_1), (s_2, d_2), …, (s_h, d_h), …, (s_w, d_w)} whose dimensions are classified more accurately.
Step 3: Divide the dimension-reduced sample data set D′ from Step 2 into continuous-feature data and discrete-feature data, on the basis that a discrete variable takes values that can only be described by natural numbers, integers, counting units and the like, while a continuous variable can take any value within some interval (any two adjacent values can be subdivided without limit, i.e. infinitely many values can be taken). Train the isolated forest model on the continuous-feature data and judge whether a continuous-feature datum is normal from the isolated forest anomaly score S(x, n); train the Gaussian distribution model on the discrete-feature data and judge whether a test datum is normal from the value of the Gaussian anomaly score T(r).
Step 4: Weight the isolated forest anomaly score S(x, n) and the Gaussian anomaly score T(r) obtained in Step 3 to compute a comprehensive anomaly score, and judge the test data according to it.
The specific steps of Step1 are as follows:
Step 1.1: Access the server platform data and select two linear data sets A = {x_1, x_2, …, x_i, …, x_n} and B = {y_1, y_2, …, y_j, …, y_n}. By the formula

ρ(x_i) = |N_ε(x_i)| − MinPts,

where N_ε(·) denotes the ε-neighborhood of a point and MinPts the minimum number of neighbors, separately evaluate the density value ρ(x_i) or ρ(y_j) of each datum x_i and y_j. If ρ(x_i) ≥ 0, x_i is a core point of data set A; if ρ(y_j) ≥ 0, y_j is a core point of data set B. Record and collect all core points into the sets A_c and B_c; the sets of non-core points A_nc and B_nc are given by A_nc = A \ A_c and B_nc = B \ B_c.
Step 1.2: Find the boundary sets A_bd and B_bd. If x_i ∈ A_nc and there exists a point q with q ∈ A and q ∈ N_{x_i} ∩ A_c, i.e. the neighborhood of x_i contains a core point, then x_i is a boundary point of A. Collect all boundary points of A into the set A_bd; B_bd is obtained in the same way.
Step 1.3: From the core point sets A_c, B_c obtained in Step 1.1 and the boundary point sets A_bd, B_bd obtained in Step 1.2, compute

A_noise = A \ (A_c ∪ A_bd) and B_noise = B \ (B_c ∪ B_bd).

If x_i ∈ A_noise, x_i is called a noise point and is deleted; if y_j ∈ B_noise, y_j is called a noise point and is deleted.
The specific steps for obtaining S_B and S_W in Step 2 are as follows:
Step 2.1: S_B and S_W are defined respectively as:

S_W = Σ_{i=1}^{c} Σ_{j=1}^{n_i} (x_j^{(i)} − m_i)(x_j^{(i)} − m_i)^T
S_B = Σ_{i=1}^{c} n_i (m_i − m)(m_i − m)^T, which for the two classes of D reduces to S_B = (m_1 − m_2)(m_1 − m_2)^T

where every sample datum is an n-dimensional vector, n_i denotes the number of class-i samples, x_j^{(i)} denotes the j-th sample of class i, m_i denotes the mean of the class-i samples, m denotes the overall sample mean, and c is the number of classes.
The specific steps for training the isolated forest model on the continuous-feature data in Step 3 are as follows:
Step 3.1: There are many trees in the isolated forest, and each tree is a binary tree called an iTree. A node in an iTree either has two children, a left child and a right child, or is itself a leaf node. A binary tree iTree in the isolated forest model is trained as follows:
Step 3.1.1: Randomly select 30% of the data of the original data set as the training data set of the iTree, and repeat this step for every tree to be trained so that each tree is different.
Step 3.1.2: Randomly select a feature of the data s_h and a split value G of that feature.
Step 3.1.3: Partition all the data according to the selected feature: samples whose value of the feature is smaller than G are placed in the left child node, and samples whose value is larger than G are placed in the right child node.
Step 3.1.4: Repeat Steps 3.1.2 and 3.1.3 on the left and right child nodes until one of the following conditions is met: the node contains only one datum or several identical data; or the height of the tree has reached the average tree height L of the isolated forest model.
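Steps 3.1.2–3.1.4 can be sketched for one-dimensional data as a recursive path-length computation; the height limit of 8 and the toy data are illustrative assumptions:

```python
import random

def itree_path_length(data, x, height=0, limit=8):
    """Path length h(x) of the point x in one randomly grown iTree
    (Steps 3.1.2-3.1.4): pick a random split value G, send values smaller
    than G to the left child and the rest to the right child, and stop
    when the node holds only identical data or the height limit L is
    reached. For one-dimensional data the 'feature' is the value itself."""
    values = sorted(set(data))
    if height >= limit or len(values) <= 1:
        return height
    g = random.uniform(values[0], values[-1])  # split value G
    side = [v for v in data if v < g] if x < g else [v for v in data if v >= g]
    return itree_path_length(side, x, height + 1, limit)

random.seed(42)
data = [i / 10 for i in range(1, 21)] + [10.0]   # a cluster plus one outlier
outlier_avg = sum(itree_path_length(data, 10.0) for _ in range(200)) / 200
inlier_avg = sum(itree_path_length(data, 1.0) for _ in range(200)) / 200
```

Averaged over many random trees, the outlier is isolated after far fewer splits than a point inside the cluster, which is exactly what the anomaly score of Step 3.3 exploits.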
Step 3.2: Construct every tree in the isolated forest as in Step 3.1 to form the forest. After the isolated forest model is trained, the path length at which a sample falls onto a leaf node in each tree is further computed and used as the basis for judging anomalous points.
Step 3.3: First distinguish leaf nodes from the root node (the root node is the topmost node of a tree; a node that can no longer be subdivided into child nodes is a leaf node). Record the path length from the leaf node reached by a sample to the root node of a tree as h(x), record the average path length of the trees in the forest as c(n), and let E(h(x)) be the expectation of the sample's path lengths over all trees in the forest.
The anomaly score is:

S(x, n) = 2^(−E(h(x)) / c(n))

The value of S(x, n) lies in [0, 1], and anomalous points are judged from the value of S(x, n): the closer the value is to 1, the more likely the point is anomalous.
The specific steps for training the Gaussian model on the discrete-feature data in Step 3 are as follows:
Step 3.4: Apply the normal distribution in the Gaussian model (the anomaly detection model): train the probability distribution model with the training data set and compute the parameters μ_i and σ_i of the probability density function, thereby obtaining the probability distribution of the data.
If a discrete variable s_i follows a Gaussian distribution, its probability density function is:

p(s_i) = (1 / (√(2π) σ_i)) exp(−(s_i − μ_i)² / (2σ_i²))

For mutually independent random variables {R_1, R_2, …, R_i, …, R_n}, each with R_i ~ N(μ_i, σ_i²), the joint distribution of the n random variables is a multidimensional normal distribution with probability density function:

P(r) = Π_{i=1}^{n} (1 / (√(2π) σ_i)) exp(−(r_i − μ_i)² / (2σ_i²))

Step 3.5: Compute the probability of the test-set data with the probability density function P(r): if P(r) is greater than 80%, the data are considered to follow the normal distribution; if P(r) < ε, the data are considered abnormal. The value of ε is selected according to the tolerance of the anomaly detection for abnormal data and the requirements; here ε is taken as 0.05.
The Gaussian distribution anomaly score is computed as T(r) = 1 − P(r), and the abnormal condition is judged from it: the closer T(r) is to 1, the more likely the point is anomalous.
The specific steps of Step 4 are as follows:
Perform a comprehensive calculation on the isolated forest anomaly score S(x, n) and the Gaussian model anomaly score T(r) obtained from the two models of Step 3; the comprehensive anomaly score M(x) is the weighted sum

M(x) = w_1 · S(x, n) + w_2 · T(r),

where w_1 and w_2 are the weights assigned to the two scores. The threshold E of the anomaly score was determined as 0.45 by experiments on the test set. If the comprehensive anomaly score M(x) is greater than or equal to the threshold E, the data are considered abnormal; if M(x) is less than the threshold E, the data are considered normal.
The invention simulates the data set collected by the system server platform with KDDCUP99 as the network-layer data set and TEP as the data set of the controller and field device layer. First, the DBSCAN clustering algorithm performs preliminary denoising on the two data sets, finding and removing the noise points; the two sets are then merged into one data set and the data are classified. After classification and feature extraction with the LDA linear discriminant method, the data features are converted into 11 dimensions; these 11-dimensional features are mutually uncorrelated and represent the main information of the original data.
The specific results are shown in fig. 2 and fig. 3. Compared with applying the LDA linear discriminant method directly, after the noise is filtered out by the DBSCAN clustering algorithm the eigenvalue distribution of the sample covariance matrix is more pronounced and the dimensionality reduction of the data works better. The usual method for reducing the dimensionality of server platform data is principal component analysis (PCA), but it does not take classification information into account, so after dimensionality reduction the information loss is kept low yet classification may become harder. The LDA linear discriminant method, by contrast, classifies the data in the original data space better during dimensionality reduction, and the low-dimensional data obtained by the mapping are easier to distinguish.
For the dimension-reduced data, the attack type PROBE in the KDDCUP99 data set and normal data are selected to train the isolated forest model. The training results on the continuous and discrete features of the data are considered together, and the accuracy, recall, and F1 value of the experimental results are analyzed comprehensively. By varying the number of samples per iTree in the isolated forest, the experiments compare the accuracy, recall, and F1 value of the combined data anomaly detection model (improved isolated forest algorithm plus Gaussian distribution) with those of the single isolated forest model, and likewise compare the combined model with the single Gaussian distribution model.
When the number of samples per tree is 250, the accuracy, recall, and F1 value of the combined data anomaly detection model (improved isolated forest algorithm plus Gaussian distribution) on the TEP data set are [0.9178743961352657, 0.95, 0.9336609336609336], while the accuracy, recall, and F1 value of the Gaussian distribution on the TEP data set are [0.8115942028908, 0.84, 0.825582555282528256]. From the experimental results on the KDDCUP99 and TEP data sets, the accuracy, recall, and F1 value of the combined data anomaly detection model are better than the results of detection with the isolated forest model or the Gaussian distribution model alone, and the experiments show that the model performs best when the number of samples per iTree in the isolated forest is 250.
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit and scope of the present invention.

Claims (6)

1. A joint data anomaly detection method based on an improved isolated forest algorithm and Gaussian distribution is characterized in that:
step1: accessing server platform data, selecting two linear data sets A = {x_1, x_2, ..., x_i, ..., x_n} and B = {y_1, y_2, ..., y_j, ..., y_n}, wherein x_i and y_j respectively represent the linear data of the two selected data sets; then finding noise points by using the DBSCAN clustering algorithm and deleting them to obtain two new data sets A_1 = {x_1, x_2, ..., x_i, ..., x_o} and B_1 = {y_1, y_2, ..., y_j, ..., y_p}, thereby denoising the test data;
step2: using the LDA linear discriminant method to perform dimension reduction on the data sets A_1 = {x_1, x_2, ..., x_i, ..., x_o} and B_1 = {y_1, y_2, ..., y_j, ..., y_p}: merging A_1 and B_1 into a new data set D = {(s_1, d_1), (s_2, d_2), ..., (s_h, d_h), ..., (s_w, d_w)}, wherein s_h represents the linear data x_i and y_j of the two data sets A_1 and B_1, d_h is the class label of s_h, w = o + p, and d_h ∈ {0, 1}: d_h = 0 when s_h ∈ A and d_h = 1 when s_h ∈ B; then, by the formulas
S_B = (m_1 − m_2)(m_1 − m_2)^T
and
S_W = Σ_{i=1}^{c} Σ_{j=1}^{n_i} (x_j^(i) − m_i)(x_j^(i) − m_i)^T,
respectively calculating the inter-class dispersion matrix S_B and the intra-class dispersion matrix S_W of the data set D, then solving the projection matrix W of the data set D, and finally performing dimension reduction on each feature datum x_i in D, converting it into the new sample data set D' = {(s_1, d_1), (s_2, d_2), ..., (s_h, d_h), ..., (s_w, d_w)} to obtain test data with more accurate dimension classification;
step3: according to the sample data set D' = {(s_1, d_1), (s_2, d_2), ..., (s_h, d_h), ..., (s_w, d_w)} obtained after the dimension reduction of step2, dividing the data of D' into data of continuous features and data of discrete features; training the continuous-feature data by using the isolated forest model and judging whether they are normal according to the isolated forest anomaly score S(x, n); training the discrete-feature data by using the Gaussian distribution model and judging whether the test data are normal according to the value of the Gaussian distribution anomaly score T(r);
step4: performing a weighted calculation on the isolated forest anomaly score S(x, n) and the Gaussian distribution anomaly score T(r) obtained in step3 to compute a comprehensive anomaly score, and judging the test data according to the comprehensive anomaly score.
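The four steps of claim 1 can be sketched end to end with off-the-shelf components. This is a hypothetical illustration using scikit-learn's stock DBSCAN, LDA and IsolationForest in place of the patent's improved variants; the synthetic data, eps, min_samples and estimator counts are invented for the example:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Two labelled clusters (labels d_h in {0, 1}) plus two far-away noise points.
A = rng.normal(0.0, 1.0, size=(100, 4))
B = rng.normal(5.0, 1.0, size=(100, 4))
X = np.vstack([A, B, [[50.0] * 4], [[-50.0] * 4]])
y = np.array([0] * 100 + [1] * 100 + [0, 1])

# step1: drop DBSCAN noise points (cluster label -1).
labels = DBSCAN(eps=2.0, min_samples=5).fit_predict(X)
keep = labels != -1
X1, y1 = X[keep], y[keep]

# step2: LDA projection (binary labels allow at most 1 component).
X_low = LinearDiscriminantAnalysis(n_components=1).fit_transform(X1, y1)

# step3: isolation forest on the reduced data; 250 samples per tree echoes
# the sample size the description reports as best.
forest = IsolationForest(n_estimators=100,
                         max_samples=min(250, len(X_low)),
                         random_state=0).fit(X_low)
scores = -forest.score_samples(X_low)  # higher = more anomalous
```

The Gaussian branch for discrete features and the step4 weighting are omitted here; they are sketched separately under claims 5 and 6.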
2. The method for detecting the abnormal data based on the improved isolated forest algorithm and the Gaussian distribution as claimed in claim 1, wherein the Step1 comprises the following specific steps:
step1.1: accessing server platform data, selecting two linear data sets A = {x_1, x_2, ..., x_i, ..., x_n} and B = {y_1, y_2, ..., y_j, ..., y_n}, and separately evaluating the density value ρ(x_i) or ρ(y_j) of each datum x_i and y_j by the density formula (rendered only as an image, FDA0002490178750000013, in the original filing);
if ρ(x_i) ≥ 0, x_i is a core point of the data set A; if ρ(y_j) ≥ 0, y_j is a core point of the data set B; all core points are recorded and collected to form the sets A_c and B_c, and the sets of non-core points are A_nc = A − A_c and B_nc = B − B_c;
step1.2: finding the boundary sets A_bd and B_bd: if x_i ∈ A_nc and there exists a point q satisfying q ∈ A and q ∈ N_x ∩ A_c, i.e. the neighbourhood N_x of x_i contains a core point, then x_i is a boundary point of A; the set formed by all boundary points in A is collected and recorded as A_bd, and B_bd is obtained in the same way;
step1.3: from the core point sets A_c, B_c and the boundary point sets A_bd, B_bd obtained in step1.1 and step1.2, calculating A_noise = A − (A_c ∪ A_bd) and B_noise = B − (B_c ∪ B_bd); if x_i ∈ A_noise, x_i is called a noise point and is deleted; if y_j ∈ B_noise, y_j is called a noise point and is deleted.
3. The improved isolated forest algorithm and Gaussian distribution based joint data anomaly detection method according to claim 1, wherein the specific steps of obtaining S_B and S_W in step2 are as follows:
step2.1: S_B and S_W are respectively defined as follows:
S_W = Σ_{i=1}^{c} Σ_{j=1}^{n_i} (x_j^(i) − m_i)(x_j^(i) − m_i)^T
S_B = (m_1 − m_2)(m_1 − m_2)^T
in the formulas, any sample datum x_i is an n-dimensional vector, n_i indicates the number of samples of the i-th class, x_j^(i) represents the j-th sample of the i-th class, m_i represents the mean of the i-th class samples, m represents the overall sample mean, and c is the number of classes.
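The dispersion matrices of claim 3 can be computed directly. A sketch for the two-class case of claim 1 (NumPy assumed; the projection matrix W would then come from the leading eigenvector of S_W⁻¹ · S_B, which is standard LDA rather than text taken from the claim):

```python
import numpy as np

def scatter_matrices(X, y):
    """Between-class S_B and within-class S_W for two classes (step2.1)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    m1 = X[y == 0].mean(axis=0)          # class-0 mean m_1
    m2 = X[y == 1].mean(axis=0)          # class-1 mean m_2
    S_B = np.outer(m1 - m2, m1 - m2)
    S_W = sum(np.outer(x - m1, x - m1) for x in X[y == 0]) \
        + sum(np.outer(x - m2, x - m2) for x in X[y == 1])
    return S_B, S_W

# Tiny illustrative data: class means (1, 0) and (11, 0).
S_B, S_W = scatter_matrices([[0, 0], [2, 0], [10, 0], [12, 0]], [0, 0, 1, 1])
```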
4. The improved isolated forest algorithm and Gaussian distribution based joint data anomaly detection method according to claim 1, wherein the specific steps of training the continuous feature data by using the isolated forest model in Step3 are as follows:
step3.1: training a binary tree iTree in an isolated forest model, wherein the training steps are as follows:
step3.1.1: randomly selecting 30% of the original data set as the training data set of an iTree, and repeating this step for each tree to ensure that each tree is different;
step3.1.2: randomly selecting a feature G of the data s_h and a split value g of that feature;
step3.1.3: classifying all data according to the selected feature G, placing the samples whose value of feature G is smaller than g in the left child node and the samples whose value of feature G is greater than g in the right child node;
step3.1.4: repeating steps 3.1.2 and 3.1.3 for the left and right child nodes until one of the following conditions is met:
the node contains only one datum or several identical data, or the tree height reaches the average height L of the trees in the isolated forest model;
step3.2: constructing each tree in the isolated forest as in step3.1 to form the forest; after the isolated forest model is trained, further calculating the path length at which a sample falls on a leaf node in each tree and taking it as the basis for judging abnormal points;
step3.3: firstly distinguishing leaf nodes from the root node; the path length from a leaf node to the root node in each tree is recorded as h(x), the average path length of a tree built on n samples is recorded as c(n), and E(h(x)) is the expectation of h(x) over all trees of the forest;
the anomaly score is:
S(x, n) = 2^(−E(h(x)) / c(n))
the value range of S(x, n) is [0, 1], and abnormal points are judged according to the value of S(x, n).
5. The improved isolated forest algorithm and Gaussian distribution based joint data anomaly detection method according to claim 1, wherein the specific steps of training discrete feature data by using a Gaussian model in Step3 are as follows:
step3.4: applying the normal distribution in the Gaussian model, training the probability distribution model by using the training data set, and calculating the parameters μ_i and σ_i of the probability density function, thereby obtaining the probability distribution of the data;
if a discrete variable s_i follows a Gaussian distribution, its probability density function is:
p(s_i) = (1 / (√(2π)·σ_i)) · exp(−(s_i − μ_i)² / (2σ_i²))
for mutually independent random variables {R_1, R_2, ..., R_i, ..., R_n} each subject to a Gaussian distribution, wherein R_i ~ N(μ_i, σ_i²), the joint distribution of the n random variables is a multidimensional normal distribution, and its probability density function is:
P(r) = Π_{i=1}^{n} (1 / (√(2π)·σ_i)) · exp(−(r_i − μ_i)² / (2σ_i²))
step3.5: when the probability of the test set data is calculated with the probability density function P(r), if P(r) is greater than 80%, the data are considered to follow the normal distribution, and if P(r) < ε, the data are considered to be abnormal data;
the Gaussian distribution anomaly score is calculated as: T(r) = 1 − P(r).
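Steps 3.4 and 3.5 can be sketched for a single discrete feature. Maximum-likelihood fitting of μ and σ is assumed; note that a Gaussian density can exceed 1 for small σ, so treating p(r) as a probability in T(r) = 1 − p(r), as the claim does, implicitly assumes σ is large enough:

```python
import math

def fit_gaussian(samples):
    """Maximum-likelihood mu and sigma for one feature (step3.4)."""
    n = len(samples)
    mu = sum(samples) / n
    sigma = math.sqrt(sum((s - mu) ** 2 for s in samples) / n)
    return mu, sigma

def gaussian_anomaly_score(r, mu, sigma):
    """T(r) = 1 - p(r), with p(r) the Gaussian density at r (step3.5).

    The multidimensional case of the claim multiplies per-feature densities."""
    p = math.exp(-((r - mu) ** 2) / (2 * sigma ** 2)) \
        / (sigma * math.sqrt(2 * math.pi))
    return 1.0 - p

mu, sigma = fit_gaussian([1, 2, 3, 4, 5])   # illustrative training values
```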
6. The method for detecting the abnormal data based on the improved isolated forest algorithm and the Gaussian distribution as claimed in claim 1, wherein the Step4 comprises the following specific steps:
comprehensively calculating according to the isolated forest anomaly score S(x, n) and the Gaussian model anomaly score T(r) obtained by the step3 models, the comprehensive anomaly score M(x) being a weighted combination of the two scores (the exact weighting formula is rendered only as an image, FDA0002490178750000041, in the original filing);
if the comprehensive abnormal score M (x) is larger than or equal to the threshold value E, the data is considered to be abnormal data;
if the composite anomaly score M (x) is less than the threshold E, the data is considered normal.
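The comprehensive score of claim 6 weights S(x, n) against T(r). Since the exact weighting formula is rendered only as an image in the filing, a convex combination with a hypothetical weight alpha sketches the idea:

```python
def combined_score(s_iforest, t_gauss, alpha=0.5):
    """M(x) as a convex combination of the two scores (step4 / claim 6).

    alpha is a hypothetical weight, not taken from the patent text."""
    return alpha * s_iforest + (1.0 - alpha) * t_gauss

def is_anomaly(s_iforest, t_gauss, threshold=0.6, alpha=0.5):
    """Flag data whose comprehensive score meets the threshold E (claim 6).

    The threshold value 0.6 is illustrative only."""
    return combined_score(s_iforest, t_gauss, alpha) >= threshold
```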
CN202010402904.2A 2020-05-13 2020-05-13 Improved isolated forest algorithm and Gaussian distribution-based combined data anomaly detection method Active CN111666169B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010402904.2A CN111666169B (en) 2020-05-13 2020-05-13 Improved isolated forest algorithm and Gaussian distribution-based combined data anomaly detection method

Publications (2)

Publication Number Publication Date
CN111666169A CN111666169A (en) 2020-09-15
CN111666169B true CN111666169B (en) 2023-03-28

Family

ID=72383483


Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112185484A (en) * 2020-10-13 2021-01-05 华北科技学院 AdaBoost model-based water quality characteristic mineral water classification method
CN112711577A (en) * 2020-12-17 2021-04-27 航天信息股份有限公司 Abnormal transaction enterprise identification method and device
CN112990314B (en) * 2021-03-16 2023-04-18 中国科学院长春光学精密机械与物理研究所 Hyperspectral image anomaly detection method and device based on improved isolated forest algorithm
CN112905583A (en) * 2021-04-01 2021-06-04 辽宁工程技术大学 High-dimensional big data outlier detection method
CN113723477B (en) * 2021-08-16 2024-04-30 同盾科技有限公司 Cross-feature federal abnormal data detection method based on isolated forest
CN114826971B (en) * 2022-06-28 2022-12-27 苏州浪潮智能科技有限公司 Server abnormity detection method, device, equipment and readable storage medium
CN115021679B (en) * 2022-08-09 2022-11-04 国网山西省电力公司大同供电公司 Photovoltaic equipment fault detection method based on multi-dimensional outlier detection
CN115457414B (en) * 2022-09-15 2023-05-05 西华大学 Unmanned aerial vehicle abnormal behavior identification method based on improved residual error network
CN115760484A (en) * 2022-12-07 2023-03-07 湖北华中电力科技开发有限责任公司 Method, device and system for improving potential danger identification capability of power distribution station area and storage medium
CN117114677A (en) * 2023-07-12 2023-11-24 北京中盛九橙企业管理咨询有限公司 Digital wallet management system and method
CN117238058B (en) * 2023-11-10 2024-01-26 无锡明诚汽车部件有限公司 Starter monitoring method for automobile based on data analysis
CN117874653A (en) * 2024-03-11 2024-04-12 武汉佳华创新电气有限公司 Power system safety monitoring method and system based on multi-source data

Citations (3)

Publication number Priority date Publication date Assignee Title
CN108777873A (en) * 2018-06-04 2018-11-09 江南大学 The wireless sensor network abnormal deviation data examination method of forest is isolated based on weighted blend
CN109345137A (en) * 2018-10-22 2019-02-15 广东精点数据科技股份有限公司 A kind of rejecting outliers method based on agriculture big data
CN110046665A (en) * 2019-04-17 2019-07-23 成都信息工程大学 Based on isolated two abnormal classification point detecting method of forest, information data processing terminal



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant