CN111666169A - Improved isolated forest algorithm and Gaussian distribution-based combined data anomaly detection method - Google Patents


Info

Publication number: CN111666169A (application CN202010402904.2A); granted as CN111666169B
Authority: CN (China)
Prior art keywords: data, abnormal, Gaussian distribution, isolated forest, model
Legal status: Granted; Active
Other languages: Chinese (zh)
Inventors: 田园, 原野, 马文, 黄祖源, 郭一翰
Assignee (original and current): Information Center of Yunnan Power Grid Co Ltd
Application filed by Information Center of Yunnan Power Grid Co Ltd, priority to CN202010402904.2A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00: Error detection; Error correction; Monitoring
    • G06F 11/07: Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/0703: Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F 11/0751: Error or fault detection not based on redundancy
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/23: Clustering techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Quality & Reliability (AREA)
  • Complex Calculations (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to a combined data anomaly detection method based on an improved isolated forest algorithm and Gaussian distribution, and belongs to the technical field of data anomaly detection. The invention analyzes platform data and detects data anomalies with an improved joint detection model of isolated forest and Gaussian distribution. Aimed at the two kinds of features of the industrial internet platform, continuous and discrete, the combined Gaussian distribution and isolated forest detection model accurately detects anomalies of an industrial control system on the industrial internet platform in real time, so that abnormal system states are found promptly. The detection effect of the trained model based on the combination of isolated forest and Gaussian distribution is compared with that of a single Gaussian distribution or isolated forest model. In detecting server platform data anomalies the method improves detection efficiency, finds anomalies more quickly, saves time cost, and has wide application prospects.

Description

Improved isolated forest algorithm and Gaussian distribution-based combined data anomaly detection method
Technical Field
The invention relates to a combined data anomaly detection method based on an improved isolated forest algorithm and Gaussian distribution, and belongs to the technical field of data anomaly detection of a service monitoring server platform.
Background
A server is a computer that runs faster, carries heavier loads, and is more expensive than a regular computer. A server provides computing or application services to other clients in the network (terminals such as PCs, smartphones and ATMs, and even large equipment such as train systems). The server platform data are acquired by a data collector and comprise data from the field device layer, the control system, and the network layer of an industrial control system. These data contain much important information, the most important being that the abnormal state of the server system can be analyzed from them; an abnormal event of the server system is likely to cause huge economic losses in production and may even seriously endanger people's lives and health. In the prior art, most methods for judging server data anomalies require the server system platform to analyze the acquired control system and network layer data to find whether the system is abnormal. Most platform data are high-dimensional and noisy; when training a model to detect anomalies, high-dimensional data reduce the accuracy of the model, raise the computational complexity, and prevent anomalies from being detected well, so dimension reduction must be performed on the high-dimensional data.
Disclosure of Invention
The invention aims to provide a combined data anomaly detection method based on an improved isolated forest algorithm and Gaussian distribution to solve the above problems.
The technical scheme of the invention is as follows: a joint data anomaly detection method based on an improved isolated forest algorithm and Gaussian distribution comprises the following specific steps:
step 1: is connected withEntering server platform data, selecting two linear data sets A ═ x1,x2....xi...xnY and B ═ y1,y2.....yj.....ynIn which xiAnd yiRespectively representing linear data of the two selected data sets, finding noise points by using a DBSCAN clustering algorithm, and deleting the noise points to obtain two new data sets A1={x1,x2....xi...xoAnd B1={y1,y2.....yj.....ypAnd denoising the test data.
Step 2: data set A using LDA linear decision method1={x1,x2....xi...xoAnd B1={y1,y2.....yj.....ypPerforming further dimensionality reduction, and inputting a data set A obtained by preliminarily removing noise points in Step11={x1,x2....xi...xoAnd B1={y1,y2.....yj.....ypTo use LDA linear decision method conveniently, set A is put1={x1,x2....xi...xoAnd B1={y1,y2.....yj.....ypMerge into a new data set D {(s)1,d1),(s2,d2)...(sh,dh)...(sw,dw) In which s ishRepresenting A in two data sets1={x1,x2....xi...xoAnd B1={y1,y2.....yj.....ypLinear data x of } linear data xiAnd yj,dhRepresenting linear data xiAnd yjW ═ o + p, dh∈ {0, 1}, when s ish∈ A, dhWhen s is equal to 0h∈ B, d h1 and then by the formula
Figure BDA0002490178760000021
And
Figure BDA0002490178760000022
respectively calculating the inter-class dispersion matrix S of the data set DBAnd an intra-class dispersion matrix SWThen, a projection matrix W of the data set D is solved, and finally, each feature data x in the data set D is subjected toiReducing dimension and converting into a new sample data set D {(s)1,d1),(s2,d2)...(sh,dh)...(sw,dw) And obtaining the test data with more accurate dimension classification.
Step 3: sample data set after dimension reduction processing according to step2
Figure BDA0002490178760000027
According to the method, a discrete variable refers to data of which the numerical value can only be described by natural numbers, integers, counting units and the like, a continuous variable refers to data which can be taken as values at will in a certain interval, two adjacent numerical values can be subjected to infinite segmentation (namely, infinite values can be taken) as the basis, data of a data set D are divided into data of continuous characteristics and data of discrete characteristics, an isolated forest model is used for training the continuous characteristic data, and whether the continuous characteristic data are normal or not is judged according to abnormal scores S (x, n) of the isolated forest; and training the discrete characteristic data by using a Gaussian distribution model, and judging whether the test data is normal or not according to the value of the abnormal score T (r) of Gaussian distribution.
Step 4: and (3) performing weighted calculation on the abnormal score S (x, n) of the isolated forest and the Gaussian distribution abnormal score T (r) obtained in Step3, comprehensively calculating abnormal scores, and judging test data according to the abnormal scores.
The specific steps of Step 1 are as follows:

Step 1.1: access the server platform data and select two linear data sets A = {x_1, x_2, ..., x_i, ..., x_n} and B = {y_1, y_2, ..., y_j, ..., y_n}. Using the density formula

ρ(x) = |N_ε(x)|, the number of points of the data set lying within distance ε of x,

evaluate the density value ρ(x_i) or ρ(y_j) of each datum x_i and y_j.

If ρ(x_i) ≥ MinPts, x_i is a core point of data set A; if ρ(y_j) ≥ MinPts, y_j is a core point of data set B. Record and collect all core points to form the sets A_c and B_c; the sets of non-core points A_nc and B_nc are given by

A_nc = A − A_c

and

B_nc = B − B_c.

Step 1.2: find the boundary sets A_bd and B_bd. If x_i ∈ A_nc and there exists a point q satisfying q ∈ A and q ∈ N_x ∩ A_c, i.e. the neighborhood of x_i contains a core point, then x_i is a boundary point of A. Collect the set formed by all boundary points of A and record it as A_bd; B_bd is obtained similarly.

Step 1.3: from the core point sets A_c, B_c and the boundary point sets A_bd, B_bd obtained in Steps 1.1 and 1.2, calculate

A_noise = A − (A_c ∪ A_bd)

and

B_noise = B − (B_c ∪ B_bd).

If x_i ∈ A_noise, x_i is called a noise point and is deleted; if y_j ∈ B_noise, y_j is called a noise point and is deleted.
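The noise-removal logic of Steps 1.1 to 1.3 can be sketched in a few lines of Python. This is a minimal illustration of the ε-neighborhood density rule only, not the patent's implementation; the one-dimensional data set and the eps and min_pts values are made up for the example:

```python
import numpy as np

def dbscan_denoise(points, eps=1.0, min_pts=3):
    """Split a 1-D data set into kept points and noise (Step 1 sketch).

    rho(x) counts neighbours within eps (including x itself); a point is a
    core point when rho(x) >= min_pts, a border point when it is not core
    but lies within eps of some core point, and a noise point otherwise.
    """
    pts = np.asarray(points, dtype=float)
    dist = np.abs(pts[:, None] - pts[None, :])   # pairwise distance matrix
    rho = (dist <= eps).sum(axis=1)              # density value rho(x_i)
    core = rho >= min_pts
    if core.any():
        near_core = (dist[:, core] <= eps).any(axis=1)
    else:
        near_core = np.zeros(len(pts), dtype=bool)
    border = ~core & near_core
    noise = ~core & ~border
    return pts[~noise], pts[noise]               # denoised set, removed noise

A = [1.0, 1.1, 1.2, 1.3, 9.0]   # 9.0 is an isolated noise point
kept, removed = dbscan_denoise(A, eps=0.5, min_pts=3)
```

With eps = 0.5 the four clustered points are core points of each other's neighborhoods, while 9.0 has density 1 and no nearby core point, so it is removed as noise.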
In Step 2, the specific steps for obtaining S_B and S_W are as follows:

Step 2.1: S_B and S_W are respectively defined as follows:

S_B = Σ_{i=1}^{c} n_i (m_i − m)(m_i − m)^T

S_W = Σ_{i=1}^{c} Σ_{j=1}^{n_i} (x_j^(i) − m_i)(x_j^(i) − m_i)^T

In the formulas, any sample datum x_i is an n-dimensional vector, n_i denotes the number of samples of class i, x_j^(i) denotes the j-th sample of class i, m_i denotes the mean of the class-i samples, m denotes the overall sample mean, and c is the number of classes.

For the data set D = {(s_1, d_1), (s_2, d_2), ..., (s_h, d_h), ..., (s_w, d_w)} described in Step 2 the number of classes is c = 2, so only two classes of sample data need to be considered, and s_h is a one-dimensional vector; the formulas of Step 2.1 can therefore be expressed as

S_W = Σ_{s∈class 1} (s − m_1)² + Σ_{s∈class 2} (s − m_2)²

and S_B = (m_1 − m_2)(m_1 − m_2)^T, which reduces the amount of calculation and improves calculation efficiency.
The specific steps for training the continuous feature data with the isolated forest model in Step 3 are as follows:

Step 3.1: an isolated forest contains many trees, each of which is a binary tree called an iTree. A node in an iTree either has two children, a left child and a right child, or is itself a leaf node. Train each binary tree iTree in the isolated forest model as follows:

Step 3.1.1: randomly select 30% of the original data set as the training data set of the iTree, and repeat this step for every tree to be trained so that each tree differs.

Step 3.1.2: randomly select a feature of the data s_h and a split value G of that feature.

Step 3.1.3: classify all data according to the selected feature: samples whose value of the feature is smaller than G are placed in the left child node, and samples whose value is larger than G are placed in the right child node.

Step 3.1.4: repeat Steps 3.1.2 and 3.1.3 on the left and right child nodes until one of the following conditions is met:

the node contains only one datum or several identical data; or the height of the tree has reached the average height L of trees in the isolated forest model.

Step 3.2: construct each tree in the isolated forest as in Step 3.1 to form a forest. After the isolated forest model is trained, further calculate the depth at which a sample falls on a leaf node in each tree, and use this depth as the basis for judging anomalous points.
Step 3.3: first distinguish leaf nodes from the root node (the root node is the topmost node of the tree; a node that has no children after continued subdivision is a leaf node). Record the path length from a leaf node to the root node in each tree as h(x), record the average path length over the trees of a forest built from n samples as c(n), and let E(h(x)) be the expectation of all path lengths of the sample over the forest.

The anomaly score is:

S(x, n) = 2^(−E(h(x)) / c(n))

The value range of S(x, n) is [0, 1], and abnormal points are judged according to the value of S(x, n).
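The score S(x, n) can be evaluated directly once the expected path length is known. A minimal sketch using the standard harmonic-number approximation for the normalizer c(n); the sample path lengths below are illustrative, not taken from the patent:

```python
import math

def c(n):
    """Normalizer c(n): average path length of an unsuccessful BST search."""
    if n <= 1:
        return 0.0
    harmonic = math.log(n - 1) + 0.5772156649   # H_{n-1} via Euler-Mascheroni
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(avg_path, n):
    """S(x, n) = 2 ** (-E(h(x)) / c(n)); values near 1 indicate anomalies."""
    return 2.0 ** (-avg_path / c(n))

s_short = anomaly_score(2.0, 256)    # isolated after few splits: high score
s_long = anomaly_score(12.0, 256)    # deep inside the trees: low score
```

A point isolated near the root (short average path) scores close to 1, while a point buried deep in the trees scores well below 0.5, which is exactly the judgment rule of Step 3.3.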
The specific steps for training the discrete feature data with the Gaussian model in Step 3 are as follows:

Step 3.4: apply the normal distribution in the Gaussian model (the anomaly detection model), train the probability distribution model with the training data set, and calculate the parameters μ_i and σ_i of the probability density function, thereby obtaining the probability distribution of the data.

If a discrete variable s_i follows a Gaussian distribution, its probability density function is:

p(s_i) = (1 / (√(2π) σ_i)) exp(−(s_i − μ_i)² / (2σ_i²))

For mutually independent random variables {R_1, R_2, ..., R_i, ..., R_n}, each following a Gaussian distribution R_i ~ N(μ_i, σ_i²), the joint distribution of the n random variables is a multidimensional normal distribution whose probability density function is:

p(r) = Π_{i=1}^{n} (1 / (√(2π) σ_i)) exp(−(r_i − μ_i)² / (2σ_i²))

Step 3.5: when calculating the probability of the test set data with the probability density function p(r), if p(r) is greater than 80%, the data are considered to follow the normal distribution.

If p(r) is less than the threshold ε, the data are considered abnormal data.

The Gaussian distribution anomaly score is calculated as T(r) = 1 − p(r).
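The score T(r) = 1 − p(r) follows directly from the product-form density. A minimal sketch with made-up parameters μ_i and σ_i; note that a product of densities can exceed 1 for small σ, so this treats p(r) as a score rather than a true probability:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Univariate normal density from Step 3.4."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def joint_density(r, mus, sigmas):
    """p(r): product of independent univariate Gaussian densities."""
    p = 1.0
    for x, mu, s in zip(r, mus, sigmas):
        p *= gaussian_pdf(x, mu, s)
    return p

def t_score(r, mus, sigmas):
    """Gaussian anomaly score T(r) = 1 - p(r)."""
    return 1.0 - joint_density(r, mus, sigmas)

mus, sigmas = [0.0, 0.0], [1.0, 1.0]        # illustrative fitted parameters
t_center = t_score([0.0, 0.0], mus, sigmas)  # typical point: lower T(r)
t_tail = t_score([4.0, 4.0], mus, sigmas)    # tail point: T(r) close to 1
```

A point far out in the distribution's tail drives p(r) toward zero and T(r) toward 1, matching the judgment of Step 3.5.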
The specific steps of Step 4 are as follows:

Perform a comprehensive calculation from the isolated forest anomaly score S(x, n) and the Gaussian model anomaly score T(r) obtained by the two models of Step 3; the comprehensive anomaly score M(x) is the weighted sum

M(x) = w_1 · S(x, n) + w_2 · T(r)

where w_1 and w_2 are the weights used in the weighted calculation.

If the comprehensive anomaly score M(x) is greater than or equal to the threshold E, the data are considered abnormal data; if M(x) is less than the threshold E, the data are considered normal.
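The weighted combination of Step 4 can be sketched as follows. The weights w_1 = w_2 = 0.5 and the two input scores are illustrative assumptions; only the threshold rule (E = 0.45, found experimentally in the embodiment) mirrors the text:

```python
def combined_score(s_iforest, t_gauss, w1=0.5, w2=0.5):
    """M(x) as a weighted sum of the two anomaly scores.

    The weights here are illustrative placeholders, not values fixed by
    the patent; the document only states that a weighted calculation is used.
    """
    return w1 * s_iforest + w2 * t_gauss

E = 0.45                         # anomaly threshold from the embodiment
m = combined_score(0.62, 0.40)   # example scores from the two models
is_abnormal = m >= E             # threshold rule of Step 4
```

Here M(x) = 0.51 ≥ 0.45, so the example point would be flagged as abnormal, while a point scoring 0.3 and 0.2 would not.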
The invention provides a combined data anomaly detection method based on an improved isolated forest algorithm and Gaussian distribution, which performs anomaly detection on the data collected on a server platform so as to accurately detect anomalies of the control system on the server platform in real time and find abnormal states of the platform data promptly. In detecting server platform data anomalies, the model first performs preliminary denoising with the DBSCAN clustering algorithm and then applies the LDA linear discriminant method for further dimension reduction, so that the data in the original data space are better classified during dimension reduction and the low-dimensional data obtained by the mapping are easier to distinguish; the abnormal state of the data can thus be analyzed better afterwards. The method improves the efficiency of data anomaly detection, finds anomalies more quickly, saves time cost, and has wide application prospects.
The invention has the beneficial effects that:
1. In training a model to detect anomalies, high-dimensional data reduce the accuracy of the model, raise the computational complexity, and prevent anomalies from being detected well; therefore the high-dimensional data are preliminarily denoised with the DBSCAN clustering algorithm before dimension reduction. With the noise removed, the abnormal state of the data can be analyzed better, and an anomaly detection model with high accuracy and recall is trained.
2. After the noise is removed, feature extraction is performed on the denoised data by the LDA linear discriminant method, converting the data into uncorrelated low-dimensional new data; the new features are relatively independent and contain the main information of the original data, and an anomaly detection model with high accuracy and recall is then trained.
3. The two models trained respectively on the continuous and discrete features of the data are considered together: the anomaly scores produced by the two models are combined by weighting, and if the result exceeds a certain threshold the data are considered abnormal. Experimental results show that the detection effect of the combined anomaly detection model based on isolated forest and Gaussian distribution trained by the invention is better than that of a single isolated forest or Gaussian distribution model.
4. The invention can improve the efficiency of data anomaly detection, find anomalies more quickly, save time cost and have wide application prospect.
Drawings
FIG. 1 is a flow chart of the steps of the present invention;
FIG. 2 is a diagram illustrating the distribution of LDA eigenvalues obtained after DBSCAN denoising in an embodiment of the present invention;
FIG. 3 is a diagram illustrating the distribution of LDA eigenvalues obtained without DBSCAN denoising in an embodiment of the present invention.
Detailed Description
The invention is further described with reference to the following drawings and detailed description.
As shown in fig. 1, a joint data anomaly detection method based on an improved isolated forest algorithm and Gaussian distribution includes the following steps:
Step 1: access the server platform data and select two linear data sets A = {x_1, x_2, ..., x_i, ..., x_n} and B = {y_1, y_2, ..., y_j, ..., y_n}, where x_i and y_j denote the linear data of the two selected data sets. Find the noise points with the DBSCAN clustering algorithm and delete them to obtain two new data sets A_1 = {x_1, x_2, ..., x_i, ..., x_o} and B_1 = {y_1, y_2, ..., y_j, ..., y_p}, i.e. the denoised test data.
Step 2: data set A using LDA linear decision method1={x1,x2....xi...xoAnd B1={y1,y2.....yj.....ypPerforming further dimension reduction treatment to the set A1={x1,x2....xi...xoAnd B1={y1,y2.....yj.....ypMerge into a new data set D {(s)1,d1),(s2,d2)...(sh,dh)...(sw,dw) In which s ishRepresenting A in two data sets1={x1,x2....xi...xoAnd B1={y1,y2.....yj.....ypLinear data x of } linear data xiAnd yj,dhRepresenting linear numbersAccording to xiAnd yjW ═ o + p, dh∈ {0, 1}, when s ish∈ A, dhWhen s is equal to 0h∈ B, dh1 and then by the formula
Figure BDA0002490178760000061
And
Figure BDA0002490178760000062
respectively calculating the inter-class dispersion matrix S of the data set DBAnd an intra-class dispersion matrix SWThen, a projection matrix W of the data set D is solved, and finally, each feature data x in the data set D is subjected toiReducing dimension and converting into a new sample data set D {(s)1,d1),(s2,d2)...(sh,dh)...(sw,dw) And obtaining the test data with more accurate dimension classification.
Step 3: sample data set after dimension reduction processing according to step2
Figure BDA0002490178760000063
According to the method, a discrete variable refers to data of which the numerical value can only be described by natural numbers, integers, counting units and the like, a continuous variable refers to data which can be taken as values at will in a certain interval, two adjacent numerical values can be subjected to infinite segmentation (namely, infinite values can be taken) as the basis, data of a data set D are divided into data of continuous characteristics and data of discrete characteristics, an isolated forest model is used for training the continuous characteristic data, and whether the continuous characteristic data are normal or not is judged according to abnormal scores S (x, n) of the isolated forest; and training the discrete characteristic data by using a Gaussian distribution model, and judging whether the test data is normal or not according to the value of the abnormal score T (r) of Gaussian distribution.
Step 4: and (3) performing weighted calculation on the abnormal score S (x, n) of the isolated forest and the Gaussian distribution abnormal score T (r) obtained in Step3, comprehensively calculating abnormal scores, and judging test data according to the abnormal scores.
The specific steps of Step 1 are as follows:

Step 1.1: access the server platform data and select two linear data sets A = {x_1, x_2, ..., x_i, ..., x_n} and B = {y_1, y_2, ..., y_j, ..., y_n}. Using the density formula

ρ(x) = |N_ε(x)|, the number of points of the data set lying within distance ε of x,

evaluate the density value ρ(x_i) or ρ(y_j) of each datum x_i and y_j.

If ρ(x_i) ≥ MinPts, x_i is a core point of data set A; if ρ(y_j) ≥ MinPts, y_j is a core point of data set B. Record and collect all core points to form the sets A_c and B_c; the sets of non-core points A_nc and B_nc are given by

A_nc = A − A_c

and

B_nc = B − B_c.

Step 1.2: find the boundary sets A_bd and B_bd. If x_i ∈ A_nc and there exists a point q satisfying q ∈ A and q ∈ N_x ∩ A_c, i.e. the neighborhood of x_i contains a core point, then x_i is a boundary point of A. Collect the set formed by all boundary points of A and record it as A_bd; B_bd is obtained similarly.

Step 1.3: from the core point sets A_c, B_c and the boundary point sets A_bd, B_bd obtained in Steps 1.1 and 1.2, calculate

A_noise = A − (A_c ∪ A_bd)

and

B_noise = B − (B_c ∪ B_bd).

If x_i ∈ A_noise, x_i is called a noise point and is deleted; if y_j ∈ B_noise, y_j is called a noise point and is deleted.
In Step 2, the specific steps for obtaining S_B and S_W are as follows:

Step 2.1: S_B and S_W are respectively defined as follows:

S_W = Σ_{i=1}^{c} Σ_{j=1}^{n_i} (x_j^(i) − m_i)(x_j^(i) − m_i)^T

S_B = (m_1 − m_2)(m_1 − m_2)^T

In the formulas, any sample datum x_i is an n-dimensional vector, n_i denotes the number of samples of class i, x_j^(i) denotes the j-th sample of class i, m_i denotes the mean of the class-i samples, m denotes the overall sample mean, and c is the number of classes; here c = 2.
The specific steps for training the continuous feature data with the isolated forest model in Step 3 are as follows:

Step 3.1: an isolated forest contains many trees, each of which is a binary tree called an iTree. A node in an iTree either has two children, a left child and a right child, or is itself a leaf node. Train each binary tree iTree in the isolated forest model as follows:

Step 3.1.1: randomly select 30% of the original data set as the training data set of the iTree, and repeat this step for every tree to be trained so that each tree differs.

Step 3.1.2: randomly select a feature of the data s_h and a split value G of that feature.

Step 3.1.3: classify all data according to the selected feature: samples whose value of the feature is smaller than G are placed in the left child node, and samples whose value is larger than G are placed in the right child node.

Step 3.1.4: repeat Steps 3.1.2 and 3.1.3 on the left and right child nodes until one of the following conditions is met:

the node contains only one datum or several identical data; or the height of the tree has reached the average height L of trees in the isolated forest model.

Step 3.2: construct each tree in the isolated forest as in Step 3.1 to form a forest. After the isolated forest model is trained, further calculate the depth at which a sample falls on a leaf node in each tree, and use this depth as the basis for judging anomalous points.
Step 3.3: first distinguish leaf nodes from the root node (the root node is the topmost node of the tree; a node that has no children after continued subdivision is a leaf node). Record the path length from a leaf node to the root node in each tree as h(x), record the average path length over the trees of a forest built from n samples as c(n), and let E(h(x)) be the expectation of all path lengths of the sample over the forest.

The anomaly score is:

S(x, n) = 2^(−E(h(x)) / c(n))

The value range of S(x, n) is [0, 1], and abnormal points are judged according to the value of S(x, n): the closer the score is to 1, the more likely the point is an abnormal point.
The specific steps for training the discrete feature data with the Gaussian model in Step 3 are as follows:

Step 3.4: apply the normal distribution in the Gaussian model (the anomaly detection model), train the probability distribution model with the training data set, and calculate the parameters μ_i and σ_i of the probability density function, thereby obtaining the probability distribution of the data.

If a discrete variable s_i follows a Gaussian distribution, its probability density function is:

p(s_i) = (1 / (√(2π) σ_i)) exp(−(s_i − μ_i)² / (2σ_i²))

For mutually independent random variables {R_1, R_2, ..., R_i, ..., R_n}, each following a Gaussian distribution R_i ~ N(μ_i, σ_i²), the joint distribution of the n random variables is a multidimensional normal distribution whose probability density function is:

p(r) = Π_{i=1}^{n} (1 / (√(2π) σ_i)) exp(−(r_i − μ_i)² / (2σ_i²))

Step 3.5: when calculating the probability of the test set data with the probability density function p(r), if p(r) is greater than 80%, the data are considered to follow the normal distribution.

If p(r) is less than the threshold ε, the data are considered abnormal data; the value of ε is chosen according to the tolerance for abnormal data required by the anomaly detection, and here it is taken as 0.05.

The Gaussian distribution anomaly score is calculated as T(r) = 1 − p(r).

Abnormality is judged according to the Gaussian distribution anomaly score: the closer T(r) is to 1, the more likely the point is an abnormal point.
The specific steps of Step 4 are as follows:

Perform a comprehensive calculation from the isolated forest anomaly score S(x, n) and the Gaussian model anomaly score T(r) obtained by the two models of Step 3; the comprehensive anomaly score M(x) is the weighted sum

M(x) = w_1 · S(x, n) + w_2 · T(r)

where w_1 and w_2 are the weights used in the weighted calculation. The threshold E of the anomaly score was found to be 0.45 by experiment on the test set.

If the comprehensive anomaly score M(x) is greater than or equal to the threshold E, the data are considered abnormal data; if M(x) is less than the threshold E, the data are considered normal.
The invention uses KDDCUP99 as the network layer data set and TEP as the controller and field device layer data set to simulate the data collected by the system server platform. First the DBSCAN clustering algorithm is used to preliminarily denoise the two data sets, finding and removing the noise points; the two data sets are then integrated into one data set simulating the data collected by the system server platform, and the data are classified. The LDA linear discriminant method then classifies and extracts the data, transforming the data features into 11 dimensions; the 11-dimensional features are mutually uncorrelated and represent the main information of the original data.
The specific results are shown in fig. 2 and fig. 3: compared with applying the LDA linear discriminant directly, after the features are filtered by the DBSCAN clustering algorithm the eigenvalue distribution of the sample covariance matrix is more distinct and the dimension reduction effect on the data is better. Generally, the main method for reducing the dimensionality of server platform data is principal component analysis (PCA), but it does not consider classification information, so after dimension reduction, although the information loss is kept low, classification may become more difficult. With the LDA linear discriminant method, the data in the original data space are better classified during dimension reduction, and the low-dimensional data obtained by the mapping are easier to distinguish.
For the dimension-reduced data, the attack type PROBE in the KDDCUP99 data set and normal data are selected to train the isolated forest model. The training results on the continuous and discrete features of the data are considered together, and the accuracy, recall rate, and F1 value of the experimental results are analyzed comprehensively. The experiments compare the accuracy, recall, and F1 values of the joint data anomaly detection model (improved isolated forest algorithm plus Gaussian distribution) with those of the isolated forest model alone, for different numbers of iTree samples in the isolated forest, and likewise compare the joint model with the single Gaussian distribution model.
When the number of samples per tree is 250, the accuracy, recall, and F1 value of the joint data anomaly detection model (improved isolated forest algorithm plus Gaussian distribution) on the TEP data set are [0.9178743961352657, 0.95, 0.9336609336609336], while those of the Gaussian distribution on the TEP data set are [0.8115942028985508, 0.84, 0.8255528255528256]. From the experimental results on the KDDCUP99 and TEP data sets, the accuracy, recall, and F1 value of the joint model are better than the detection results of the single isolated forest model or Gaussian distribution model, and the experiments show that the model works best when the number of samples per iTree in the isolated forest is 250.
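The internal consistency of the reported metrics can be checked: with the standard definition F1 = 2PR/(P + R), the precision and recall values quoted above reproduce the reported F1 values. A minimal sketch:

```python
def f1_score(precision, recall):
    """F1 value as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall reported for the joint model on the TEP data set;
# the result matches the reported F1 of 0.9336609336609336.
p, r = 0.9178743961352657, 0.95
f1_joint = f1_score(p, r)

# Same check for the single Gaussian distribution model on the TEP data set.
f1_gauss = f1_score(0.8115942028985508, 0.84)
```

Both computed values agree with the F1 figures listed in the experiment, confirming the reported triples are internally consistent.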
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit and scope of the present invention.

Claims (6)

1. A joint data anomaly detection method based on an improved isolated forest algorithm and Gaussian distribution, characterized by the following steps:
Step 1: access the server platform data and select two linear data sets A = {x1, x2, ..., xi, ..., xn} and B = {y1, y2, ..., yj, ..., yn}, where xi and yj denote the linear data of the two selected data sets; find the noise points with the DBSCAN clustering algorithm and delete them to obtain two new data sets A1 = {x1, x2, ..., xi, ..., xo} and B1 = {y1, y2, ..., yj, ..., yp}, thereby denoising the test data;
Step 2: use the LDA linear discriminant method to reduce the dimensionality of the data sets A1 = {x1, x2, ..., xo} and B1 = {y1, y2, ..., yp} and merge them into a new data set D = {(s1, d1), (s2, d2), ..., (sh, dh), ..., (sw, dw)}, where sh denotes a linear datum xi of A1 or yj of B1, dh denotes the class label of sh, w = o + p and dh ∈ {0, 1}: dh = 0 when sh ∈ A1, and dh = 1 when sh ∈ B1; then compute the between-class scatter matrix SB and the within-class scatter matrix SW of the data set D by the formulas
SB = (m1 - m2)(m1 - m2)^T
and
SW = Σ_{i=1}^{c} Σ_{j=1}^{ni} (xj^(i) - mi)(xj^(i) - mi)^T,
solve the projection matrix W of the data set D, and finally reduce the dimension of every feature datum in D to obtain a new sample data set D' = {(s1, d1), (s2, d2), ..., (sh, dh), ..., (sw, dw)}, yielding test data with a more accurate dimensional classification;
Step 3: divide the dimensionality-reduced sample data set D' = {(s1, d1), (s2, d2), ..., (sw, dw)} obtained in Step 2 into data with continuous features and data with discrete features; train the isolated forest model on the continuous feature data and judge whether they are normal according to the isolated forest anomaly score S(x, n); train the Gaussian distribution model on the discrete feature data and judge whether the test data are normal according to the value of the Gaussian distribution anomaly score T(r);
Step 4: perform a weighted calculation on the isolated forest anomaly score S(x, n) and the Gaussian distribution anomaly score T(r) obtained in Step 3 to compute a comprehensive anomaly score, and judge the test data according to that score.
2. The joint data anomaly detection method based on the improved isolated forest algorithm and Gaussian distribution according to claim 1, characterized in that the specific steps of Step 1 are as follows:
Step1.1: access the server platform data, select the two linear data sets A = {x1, x2, ..., xi, ..., xn} and B = {y1, y2, ..., yj, ..., yn}, and evaluate the density value ρ(xi) or ρ(yj) of every datum xi and yj by the density formula (given only as an image in the original publication);
if ρ(xi) ≥ 0, xi is a core point of data set A; if ρ(yj) ≥ 0, yj is a core point of data set B; record and collect all core points into the sets Ac and Bc, and denote the sets of non-core points by Anc = A - Ac and Bnc = B - Bc;
Step1.2: find the boundary sets Abd and Bbd: if xi ∈ Anc and there exists a point q satisfying q ∈ A and q ∈ Nx ∩ Ac, i.e. the neighbourhood of xi contains a core point, then xi is a boundary point of A; the set of all boundary points of A is collected and recorded as Abd, and Bbd is obtained in the same way;
Step1.3: the core point set A obtained from Step1.1 and Step1.2c,BcAnd a set of boundary points Abd,BbdCalculating
Figure FDA0002490178750000024
And
Figure FDA0002490178750000025
if xi∈AnoiseThen call xiDeleting the noise points as the noise points; if yi∈BnoiseThen call yiIt is deleted as a noise point.
3. The joint data anomaly detection method based on the improved isolated forest algorithm and Gaussian distribution according to claim 1, characterized in that the specific steps of obtaining SB and SW in Step 2 are as follows:
Step2.1: SW and SB are respectively defined as:
SW = Σ_{i=1}^{c} Σ_{j=1}^{ni} (xj^(i) - mi)(xj^(i) - mi)^T
SB = (m1 - m2)(m1 - m2)^T
where each sample datum xi is an n-dimensional vector, ni denotes the number of samples of class i, xj^(i) denotes the jth sample of class i, mi denotes the mean of the class-i samples, m denotes the overall sample mean, and c is the number of classes.
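A small numeric check of the claim-3 definitions, computing SW and SB in pure Python for two toy two-dimensional classes (the sample values are invented for illustration):

```python
# Within-class scatter S_W = sum over classes and samples of
# (x - m_i)(x - m_i)^T, and between-class scatter
# S_B = (m1 - m2)(m1 - m2)^T, for two classes of 2-D points.
def mean(vs):
    n = len(vs)
    return [sum(v[k] for v in vs) / n for k in range(len(vs[0]))]

def outer(u, v):
    return [[ui * vj for vj in v] for ui in u]

def mat_add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def scatter(classes):
    d = len(classes[0][0])
    sw = [[0.0] * d for _ in range(d)]
    means = [mean(c) for c in classes]
    for c, m in zip(classes, means):
        for x in c:
            diff = [xi - mi for xi, mi in zip(x, m)]
            sw = mat_add(sw, outer(diff, diff))   # accumulate S_W
    dm = [a - b for a, b in zip(means[0], means[1])]
    sb = outer(dm, dm)                            # S_B = (m1 - m2)(m1 - m2)^T
    return sw, sb

sw, sb = scatter([[(0.0, 0.0), (2.0, 0.0)], [(5.0, 1.0), (7.0, 1.0)]])
```

Both matrices are symmetric by construction; LDA then maximises the ratio of between-class to within-class scatter along the projection W.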
4. The joint data anomaly detection method based on the improved isolated forest algorithm and Gaussian distribution according to claim 1, characterized in that the specific steps of training the continuous feature data with the isolated forest model in Step 3 are as follows:
Step3.1: train the binary trees (iTrees) of the isolated forest model as follows:
Step3.1.1: randomly select 30% of the original data set as the training data set of an iTree, and repeat this step for every tree so that each tree is different;
Step3.1.2: randomly select a feature G of the data sh and a split value g of that feature;
Step3.1.3: partition all data according to the selected feature G: samples whose value of feature G is smaller than g are placed in the left child node, and samples whose value is larger than g are placed in the right child node;
Step3.1.4: repeat Step3.1.2 and Step3.1.3 on the left and right child nodes until one of the following conditions is met:
a node contains only one datum or several identical data, or the tree height reaches the average tree height L of the isolated forest model;
Step3.2: construct every tree of the isolated forest as in Step3.1 to form the forest; after the isolated forest model is trained, further compute the depth at which a sample falls on a leaf node in each tree, and use it as the basis for judging abnormal points;
Step3.3: first distinguish leaf nodes from the root node: the path length from a leaf node to the root node of a tree is recorded as h(x), the average leaf-to-root path length over the trees of the forest is recorded as c(n), and E(h(x)) is the expectation of all path lengths of a sample over the forest;
the anomaly score is:
S(x, n) = 2^(-E(h(x)) / c(n))
the value range of S(x, n) is [0, 1], and abnormal points are judged according to the value of S(x, n).
5. The joint data anomaly detection method based on the improved isolated forest algorithm and Gaussian distribution according to claim 1, characterized in that the specific steps of training the discrete feature data with the Gaussian model in Step 3 are as follows:
Step3.4: apply the normal distribution of the Gaussian model, train the probability distribution model with the training data set, and compute the parameters μi and σi of the probability density function, thereby obtaining the probability distribution of the data;
if a discrete variable si follows a Gaussian distribution, its probability density function is:
p(si) = 1 / (√(2π) σi) · exp(-(si - μi)² / (2σi²))
for mutually independent random variables {R1, R2, ..., Ri, ..., Rn}, each following a Gaussian distribution Ri ~ N(μi, σi²), the joint distribution of the n random variables is a multidimensional normal distribution with probability density function:
P(r) = ∏_{i=1}^{n} 1 / (√(2π) σi) · exp(-(ri - μi)² / (2σi²))
Step3.5: when computing the probability of the test set data with the probability density function P(r), data with P(r) greater than 80% are considered to follow the normal distribution; otherwise the data are considered abnormal;
the Gaussian distribution anomaly score is calculated as: T(r) = 1 - P(r).
6. The joint data anomaly detection method based on the improved isolated forest algorithm and Gaussian distribution according to claim 1, characterized in that the specific steps of Step 4 are as follows:
perform a comprehensive calculation on the isolated forest anomaly score S(x, n) and the Gaussian model anomaly score T(r) obtained from the two models of Step 3; the comprehensive anomaly score M(x) is:
M(x) = (weighting formula given only as an image in the original publication)
if the comprehensive anomaly score M(x) is greater than or equal to the threshold E, the data are considered abnormal;
if the comprehensive anomaly score M(x) is smaller than the threshold E, the data are considered normal.
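The exact weighting inside M(x) appears only as an image in the original, so this sketch assumes a simple convex combination with a hypothetical weight `alpha` and threshold E; the scores and values are illustrative, not the patent's formula:

```python
# Combine the isolated forest score S(x, n) and the Gaussian score T(r)
# into a composite M(x), then threshold against E (Step 4 / claim 6).
# alpha and threshold are assumed values, not taken from the patent.
def composite_score(s, t, alpha=0.5):
    return alpha * s + (1 - alpha) * t

def is_abnormal(s, t, threshold=0.6, alpha=0.5):
    return composite_score(s, t, alpha) >= threshold

m = composite_score(0.9, 0.7)    # both detectors lean abnormal
flag = is_abnormal(0.9, 0.7)     # composite exceeds the threshold
```

Equal weighting treats both detectors as equally reliable; in practice alpha could be tuned on validation data to favour whichever model performs better on the feature split at hand.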
CN202010402904.2A 2020-05-13 2020-05-13 Improved isolated forest algorithm and Gaussian distribution-based combined data anomaly detection method Active CN111666169B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010402904.2A CN111666169B (en) 2020-05-13 2020-05-13 Improved isolated forest algorithm and Gaussian distribution-based combined data anomaly detection method


Publications (2)

Publication Number Publication Date
CN111666169A true CN111666169A (en) 2020-09-15
CN111666169B CN111666169B (en) 2023-03-28

Family

ID=72383483



Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112185484A (en) * 2020-10-13 2021-01-05 华北科技学院 AdaBoost model-based water quality characteristic mineral water classification method
CN112711577A (en) * 2020-12-17 2021-04-27 航天信息股份有限公司 Abnormal transaction enterprise identification method and device
CN112905583A (en) * 2021-04-01 2021-06-04 辽宁工程技术大学 High-dimensional big data outlier detection method
CN112990314A (en) * 2021-03-16 2021-06-18 中国科学院长春光学精密机械与物理研究所 Hyperspectral image anomaly detection method and device based on improved isolated forest algorithm
CN113723477A (en) * 2021-08-16 2021-11-30 同盾科技有限公司 Cross-feature federal abnormal data detection method based on isolated forest
CN114826971A (en) * 2022-06-28 2022-07-29 苏州浪潮智能科技有限公司 Server abnormity detection method, device, equipment and readable storage medium
CN114925731A (en) * 2022-06-06 2022-08-19 华电金沙江上游水电开发有限公司叶巴滩分公司 Method for detecting abnormal value of monitoring data of flexible inclinometer
CN115021679A (en) * 2022-08-09 2022-09-06 国网山西省电力公司大同供电公司 Photovoltaic equipment fault detection method based on multi-dimensional outlier detection
CN115760484A (en) * 2022-12-07 2023-03-07 湖北华中电力科技开发有限责任公司 Method, device and system for improving potential danger identification capability of power distribution station area and storage medium
CN115457414B (en) * 2022-09-15 2023-05-05 西华大学 Unmanned aerial vehicle abnormal behavior identification method based on improved residual error network
CN117114677A (en) * 2023-07-12 2023-11-24 北京中盛九橙企业管理咨询有限公司 Digital wallet management system and method
CN117238058A (en) * 2023-11-10 2023-12-15 无锡明诚汽车部件有限公司 Starter monitoring method for automobile based on data analysis
CN117874653A (en) * 2024-03-11 2024-04-12 武汉佳华创新电气有限公司 Power system safety monitoring method and system based on multi-source data

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108777873A (en) * 2018-06-04 2018-11-09 江南大学 The wireless sensor network abnormal deviation data examination method of forest is isolated based on weighted blend
CN109345137A (en) * 2018-10-22 2019-02-15 广东精点数据科技股份有限公司 A kind of rejecting outliers method based on agriculture big data
CN110046665A (en) * 2019-04-17 2019-07-23 成都信息工程大学 Based on isolated two abnormal classification point detecting method of forest, information data processing terminal





Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant