CN111666169B - Improved isolated forest algorithm and Gaussian distribution-based combined data anomaly detection method - Google Patents

Publication number
CN111666169B
Authority
CN
China
Prior art keywords: data, Gaussian distribution, abnormal, isolated forest, model
Prior art date
Legal status
Active
Application number
CN202010402904.2A
Other languages
Chinese (zh)
Other versions
CN111666169A
Inventor
田园
原野
马文
黄祖源
郭一翰
Current Assignee
Information Center of Yunnan Power Grid Co Ltd
Original Assignee
Information Center of Yunnan Power Grid Co Ltd
Priority date
Filing date
Publication date
Application filed by Information Center of Yunnan Power Grid Co Ltd
Priority to CN202010402904.2A
Publication of CN111666169A
Application granted
Publication of CN111666169B

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 — Error detection; Error correction; Monitoring
    • G06F 11/07 — Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/0703 — Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F 11/0751 — Error or fault detection not based on redundancy
    • G06F 18/00 — Pattern recognition
    • G06F 18/20 — Analysing
    • G06F 18/23 — Clustering techniques


Abstract

The invention relates to a combined data anomaly detection method based on an improved isolated forest algorithm and Gaussian distribution, and belongs to the technical field of data anomaly detection. The invention analyzes platform data and detects data anomalies with an improved joint detection model of an isolated forest and a Gaussian distribution: addressing the two kinds of characteristics of the industrial internet platform, continuous and discrete, the combined Gaussian distribution and isolated forest detection model accurately detects anomalies of the industrial control system on the industrial internet platform in real time, so that the abnormal state of the system is found in time. The detection effect of the trained model combining the isolated forest and the Gaussian distribution is compared with that of a single Gaussian distribution or isolated forest model. In the detection of server platform data anomalies, the method improves the efficiency of data anomaly detection, finds anomalies more quickly, saves time cost, and has a wide application prospect.

Description

Improved isolated forest algorithm and Gaussian distribution-based combined data anomaly detection method
Technical Field
The invention relates to a combined data anomaly detection method based on an improved isolated forest algorithm and Gaussian distribution, and belongs to the technical field of data anomaly detection for service-monitoring server platforms.
Background
A server is a computer that runs faster, carries a heavier load, and costs more than a regular computer. A server provides computing or application services to other clients in the network (terminals such as PCs, smartphones, and ATMs, and even large equipment such as train systems). Server platform data are acquired by a data collector and comprise data from the field device layer, the control system, and the network layer of an industrial control system. These data contain much important information; most importantly, the abnormal state of the server system can be analyzed from them, and an abnormal event of the server system is likely to cause huge economic losses in production and may even seriously endanger people's life, health, and safety. In the prior art, most methods for judging server data anomalies require the server system platform to analyze the data acquired from the control system and the network layer to find whether the system is abnormal. Most platform data are high-dimensional and noisy; in the model training process for anomaly detection, high-dimensional data reduce the accuracy of the model, raise the computational complexity, and prevent anomalies from being detected well, so the high-dimensional data must undergo dimensionality reduction.
Disclosure of Invention
The invention aims to provide a combined data anomaly detection method based on an improved isolated forest algorithm and Gaussian distribution to solve the above problems.
The technical scheme of the invention is as follows: a joint data anomaly detection method based on an improved isolated forest algorithm and Gaussian distribution comprises the following specific steps:
Step 1: Access the server platform data and select two linear data sets A = {x_1, x_2, …, x_i, …, x_n} and B = {y_1, y_2, …, y_j, …, y_n}, where x_i and y_j denote the linear data of the two selected data sets. Find the noise points with the DBSCAN clustering algorithm and delete them, obtaining two new, denoised data sets A_1 = {x_1, x_2, …, x_i, …, x_o} and B_1 = {y_1, y_2, …, y_j, …, y_p}.
Step 2: Use the LDA linear discriminant method to further reduce the dimensionality of the data sets A_1 = {x_1, …, x_o} and B_1 = {y_1, …, y_p} obtained in Step 1 after the preliminary noise removal. To apply the method conveniently, merge A_1 and B_1 into a new data set D = {(s_1, d_1), (s_2, d_2), …, (s_h, d_h), …, (s_w, d_w)}, where s_h denotes a linear datum x_i or y_j from A_1 or B_1, d_h denotes its class label, w = o + p, and d_h ∈ {0, 1}: d_h = 0 when s_h ∈ A_1 and d_h = 1 when s_h ∈ B_1. Then, by the formulas

S_B = Σ_{i=1}^{c} n_i (m_i − m)(m_i − m)^T and S_W = Σ_{i=1}^{c} Σ_{j=1}^{n_i} (x_j^{(i)} − m_i)(x_j^{(i)} − m_i)^T,

respectively compute the between-class scatter matrix S_B and the within-class scatter matrix S_W of the data set D, then solve for the projection matrix W of D, and finally reduce the dimension of every feature datum x_i in D, converting it into a new sample data set D′ = {(s_1, d_1), (s_2, d_2), …, (s_h, d_h), …, (s_w, d_w)} whose dimensions are classified more accurately.
Step 3: Divide the dimension-reduced sample data set D′ from Step 2 into continuous-feature data and discrete-feature data, on the basis that a discrete variable takes values that can only be described by natural numbers, integers, counting units and the like, while a continuous variable can take any value within some interval (any two adjacent values can be subdivided without limit, i.e. infinitely many values can be taken). Train the isolated forest model on the continuous-feature data and judge whether a continuous-feature datum is normal from the isolated forest anomaly score S(x, n); train the Gaussian distribution model on the discrete-feature data and judge whether a test datum is normal from the value of the Gaussian anomaly score T(r).
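As an illustrative sketch only (not part of the claimed method), the continuous/discrete split of Step 3 might be implemented as follows, assuming the platform samples sit in a pandas DataFrame; the dtype-based heuristic is an assumption, since the patent defines the split only in terms of variable type:

```python
import pandas as pd

def split_features(df: pd.DataFrame):
    """Split the columns of a sample table into continuous-feature and
    discrete-feature data, as required by Step 3. Heuristic (an
    assumption, not specified in the patent): float-typed columns are
    treated as continuous, everything else as discrete."""
    continuous = [c for c in df.columns if pd.api.types.is_float_dtype(df[c])]
    discrete = [c for c in df.columns if c not in continuous]
    return df[continuous], df[discrete]

# Toy platform sample: a continuous sensor reading and a discrete counter.
df = pd.DataFrame({"temp": [20.1, 20.5, 21.3, 35.9],
                   "alarms": [0, 1, 0, 7]})
cont, disc = split_features(df)
```

The continuous columns would then feed the isolated forest model and the discrete columns the Gaussian model.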
Step 4: Weight the isolated forest anomaly score S(x, n) and the Gaussian anomaly score T(r) obtained in Step 3 to compute a comprehensive anomaly score, and judge the test data according to it.
The specific steps of Step1 are as follows:
Step 1.1: Access the server platform data and select two linear data sets A = {x_1, x_2, …, x_i, …, x_n} and B = {y_1, y_2, …, y_j, …, y_n}. By the formula

ρ(x_i) = |N_ε(x_i)| − MinPts,

where N_ε(·) denotes the ε-neighborhood of a point and MinPts the minimum number of neighbors, separately evaluate the density value ρ(x_i) or ρ(y_j) of each datum x_i and y_j. If ρ(x_i) ≥ 0, x_i is a core point of data set A; if ρ(y_j) ≥ 0, y_j is a core point of data set B. Record and collect all core points into the sets A_c and B_c; the sets of non-core points A_nc and B_nc are given by A_nc = A \ A_c and B_nc = B \ B_c.
Step 1.2: Find the boundary sets A_bd and B_bd. If x_i ∈ A_nc and there exists a point q with q ∈ A and q ∈ N_{x_i} ∩ A_c, i.e. the neighborhood of x_i contains a core point, then x_i is a boundary point of A. Collect all boundary points of A into the set A_bd; B_bd is obtained in the same way.
Step 1.3: From the core point sets A_c, B_c obtained in Step 1.1 and the boundary point sets A_bd, B_bd obtained in Step 1.2, compute

A_noise = A \ (A_c ∪ A_bd) and B_noise = B \ (B_c ∪ B_bd).

If x_i ∈ A_noise, x_i is called a noise point and is deleted; if y_j ∈ B_noise, y_j is called a noise point and is deleted.
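The noise removal of Steps 1.1–1.3 can be sketched in a few lines of NumPy; the point coordinates, ε, and MinPts values below are illustrative assumptions:

```python
import numpy as np

def dbscan_denoise(X: np.ndarray, eps: float, min_pts: int):
    """Steps 1.1-1.3: a point is a core point when its eps-neighborhood
    holds at least min_pts points, i.e. rho(x) = |N_eps(x)| - min_pts >= 0;
    a non-core point whose neighborhood contains a core point is a
    boundary point; every remaining point is noise and is deleted."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neigh = dist <= eps                                   # eps-neighborhoods (incl. self)
    core = neigh.sum(axis=1) >= min_pts                   # Step 1.1: core points
    border = ~core & (neigh & core[None, :]).any(axis=1)  # Step 1.2: boundary points
    keep = core | border                                  # Step 1.3: the rest is noise
    return X[keep], keep

rng = np.random.default_rng(0)
cluster = rng.normal(0.0, 0.2, size=(60, 2))        # one dense cluster
outliers = np.array([[5.0, 5.0], [-6.0, 4.0]])      # two isolated noise points
X = np.vstack([cluster, outliers])
X_clean, keep = dbscan_denoise(X, eps=0.5, min_pts=5)
```

The two isolated points fall in neither the core nor the boundary set and are dropped, while the dense cluster survives the denoising.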
The specific steps for obtaining S_B and S_W in Step 2 are as follows:
Step 2.1: S_B and S_W are defined respectively as:

S_B = Σ_{i=1}^{c} n_i (m_i − m)(m_i − m)^T
S_W = Σ_{i=1}^{c} Σ_{j=1}^{n_i} (x_j^{(i)} − m_i)(x_j^{(i)} − m_i)^T

where every sample datum is an n-dimensional vector, n_i denotes the number of class-i samples, x_j^{(i)} denotes the j-th sample of class i, m_i denotes the mean of the class-i samples, m denotes the overall sample mean, and c is the number of classes.
For the data set D = {(s_1, d_1), (s_2, d_2), …, (s_h, d_h), …, (s_w, d_w)} described in Step 2 the number of classes c is 2, so only two classes of sample data need to be considered, and s_h is a one-dimensional vector. The formulas of Step 2.1 can therefore be written as

S_W = Σ_{d_h = 0} (s_h − m_1)(s_h − m_1)^T + Σ_{d_h = 1} (s_h − m_2)(s_h − m_2)^T and S_B = (m_1 − m_2)(m_1 − m_2)^T,

which reduces the amount of computation and improves the computational efficiency.
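A minimal sketch of the two-class computation of Step 2.1, assuming (as LDA for c = 2 implies) a single discriminant direction and synthetic sample data:

```python
import numpy as np

def lda_direction(S: np.ndarray, d: np.ndarray) -> np.ndarray:
    """Two-class LDA of Step 2.1: with class means m1 and m2,
    S_B = (m1 - m2)(m1 - m2)^T and S_W is the sum of the per-class
    scatters; the optimal projection direction is w = S_W^{-1} (m1 - m2)."""
    X0, X1 = S[d == 0], S[d == 1]
    m1, m2 = X0.mean(axis=0), X1.mean(axis=0)
    Sw = (X0 - m1).T @ (X0 - m1) + (X1 - m2).T @ (X1 - m2)
    w = np.linalg.solve(Sw, m1 - m2)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(1)
S = np.vstack([rng.normal([0.0, 0.0], 0.5, size=(50, 2)),
               rng.normal([3.0, 3.0], 0.5, size=(50, 2))])
d = np.repeat([0, 1], 50)
w = lda_direction(S, d)
z = S @ w   # one-dimensional data after the dimensionality reduction
```

Projecting onto w keeps the two classes well separated in the reduced space, which is the point of using LDA rather than an unsupervised method here.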
The specific steps for training the isolated forest model on the continuous-feature data in Step 3 are as follows:
Step 3.1: There are many trees in the isolated forest, and each tree is a binary tree called an iTree. A node in an iTree either has two children, a left child and a right child, or is itself a leaf node. A binary tree iTree in the isolated forest model is trained as follows:
Step 3.1.1: Randomly select 30% of the data of the original data set as the training data set of the iTree, and repeat this step for every tree so that each tree is different.
Step 3.1.2: Randomly select a feature of the data s_h and a split value G of that feature.
Step 3.1.3: Partition all the data according to the selected feature: samples whose value of the feature is smaller than G are placed in the left child node, and samples whose value is larger than G are placed in the right child node.
Step 3.1.4: Repeat Steps 3.1.2 and 3.1.3 on the left and right child nodes until one of the following conditions is met: the node contains only one datum or several identical data; or the height of the tree has reached the average tree height L of the isolated forest model.
Step 3.2: Construct every tree in the isolated forest as in Step 3.1 to form the forest. After the isolated forest model is trained, the path length at which a sample falls onto a leaf node in each tree is further computed and used as the basis for judging anomalous points.
Step 3.3: First distinguish leaf nodes from the root node (the root node is the topmost node of a tree; a node that can no longer be subdivided into child nodes is a leaf node). Record the path length from the leaf node reached by a sample to the root node of a tree as h(x), record the average path length of the trees in the forest as c(n), and let E(h(x)) be the expectation of the sample's path lengths over all trees in the forest.
The anomaly score is:

S(x, n) = 2^(−E(h(x)) / c(n))

The value of S(x, n) lies in [0, 1], and anomalous points are judged from the value of S(x, n).
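The score formula can be checked numerically. The normalization c(n) below uses the standard isolation-forest form c(n) = 2H(n−1) − 2(n−1)/n with the harmonic number approximated by ln(n−1) plus Euler's constant; this exact form is an assumption, since the description only names c as the average path length:

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant

def c_factor(n: int) -> float:
    """Average path length c(n) of an iTree built from n samples:
    c(n) = 2*H(n-1) - 2*(n-1)/n, with H(i) ~ ln(i) + Euler's constant."""
    if n <= 1:
        return 0.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def anomaly_score(expected_path_length: float, n: int) -> float:
    """S(x, n) = 2 ** (-E(h(x)) / c(n)); values near 1 mark anomalies."""
    return 2.0 ** (-expected_path_length / c_factor(n))

# A point isolated after very few splits scores close to 1, while a point
# whose expected path length equals the average c(n) scores exactly 0.5.
quick_isolation = anomaly_score(1.0, 256)
average_depth = anomaly_score(c_factor(256), 256)
```

This makes the decision rule concrete: scores well above 0.5 mark candidates for anomalies, scores near or below 0.5 mark ordinary points.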
The specific steps for training the Gaussian model on the discrete-feature data in Step 3 are as follows:
Step 3.4: Apply the normal distribution in the Gaussian model (the anomaly detection model): train the probability distribution model with the training data set and compute the parameters μ_i and σ_i of the probability density function, thereby obtaining the probability distribution of the data.
If a discrete variable s_i follows a Gaussian distribution, its probability density function is:

p(s_i) = (1 / (√(2π) σ_i)) exp(−(s_i − μ_i)² / (2σ_i²))

For mutually independent random variables {R_1, R_2, …, R_i, …, R_n}, each with R_i ~ N(μ_i, σ_i²), the joint distribution of the n random variables is a multidimensional normal distribution with probability density function:

P(r) = Π_{i=1}^{n} (1 / (√(2π) σ_i)) exp(−(r_i − μ_i)² / (2σ_i²))

Step 3.5: Compute the probability of the test-set data with the probability density function P(r): if P(r) is greater than 80%, the data are considered to follow the normal distribution; if P(r) < ε, the data are considered abnormal.
The Gaussian distribution anomaly score is computed as T(r) = 1 − P(r).
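A sketch of Steps 3.4–3.5, assuming independent features so that P(r) factorizes; note that P(r) is a probability density, so T(r) = 1 − P(r) here follows the formula as written rather than a calibrated probability, and the training data below are synthetic:

```python
import numpy as np

def fit_gaussian(train: np.ndarray):
    """Estimate the per-feature parameters mu_i and sigma_i (Step 3.4)."""
    return train.mean(axis=0), train.std(axis=0)

def density(r: np.ndarray, mu: np.ndarray, sigma: np.ndarray) -> float:
    """P(r) = prod_i N(r_i; mu_i, sigma_i^2) for independent features."""
    p = np.exp(-((r - mu) ** 2) / (2.0 * sigma ** 2)) / (np.sqrt(2.0 * np.pi) * sigma)
    return float(np.prod(p))

def gaussian_anomaly_score(r, mu, sigma) -> float:
    """T(r) = 1 - P(r): the closer T(r) is to 1, the more anomalous r is."""
    return 1.0 - density(r, mu, sigma)

rng = np.random.default_rng(2)
train = rng.normal(loc=[0.0, 5.0], scale=[1.0, 2.0], size=(1000, 2))
mu, sigma = fit_gaussian(train)
typical = np.array([0.0, 5.0])
outlier = np.array([6.0, -4.0])
```

A point near the fitted means keeps a noticeable density and a moderate score, while a point several standard deviations out scores essentially 1.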
The specific steps of Step 4 are as follows:
Perform a comprehensive calculation on the isolated forest anomaly score S(x, n) and the Gaussian model anomaly score T(r) obtained from the two models of Step 3; the comprehensive anomaly score M(x) is the weighted sum

M(x) = w_1 · S(x, n) + w_2 · T(r),

where w_1 and w_2 are the weights assigned to the two scores. If the comprehensive anomaly score M(x) is greater than or equal to the threshold E, the data are considered abnormal; if M(x) is less than the threshold E, the data are considered normal.
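A minimal sketch of Step 4. Equal weights w_1 = w_2 = 0.5 are an assumption (the patent does not fix them), while the default threshold of 0.45 follows the value the embodiment reports for E:

```python
def combined_score(s_iforest: float, t_gauss: float,
                   w1: float = 0.5, w2: float = 0.5) -> float:
    """Composite anomaly score M(x) as a weighted sum of the isolated
    forest score S(x, n) and the Gaussian score T(r). The equal weights
    are an illustrative assumption."""
    return w1 * s_iforest + w2 * t_gauss

def is_anomalous(s_iforest: float, t_gauss: float,
                 threshold: float = 0.45) -> bool:
    """Flag data whose composite score M(x) reaches the threshold E."""
    return combined_score(s_iforest, t_gauss) >= threshold
```

For example, scores S = 0.9 and T = 0.8 combine to M = 0.85 and are flagged, while S = 0.3 and T = 0.2 combine to M = 0.25 and pass as normal.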
The invention provides a combined data anomaly detection method based on an improved isolated forest algorithm and Gaussian distribution, which performs anomaly detection on the data acquired on a server platform so as to accurately detect anomalies of the control system on the server platform in real time and find the abnormal state of the platform data in time. In detecting server platform data anomalies, the model first performs preliminary denoising with the DBSCAN clustering algorithm and then applies the LDA linear discriminant method for further dimensionality reduction, so that the data in the original data space are better classified during dimensionality reduction and the low-dimensional data obtained by the mapping are easier to distinguish, which allows the abnormal state of the data to be analyzed better afterwards. The method improves the efficiency of data anomaly detection, finds anomalies more quickly, saves time cost, and has a wide application prospect.
The invention has the beneficial effects that:
1. In the model training process for anomaly detection, high-dimensional data reduce the accuracy of the model, raise the computational complexity, and prevent anomalies from being detected well; therefore the high-dimensional data undergo dimensionality reduction with a random forest model, the noise is removed so that the abnormal state of the data can be analyzed better, and an anomaly detection model with high accuracy and recall is trained.
2. After the noise is removed, features are extracted from the denoised data by principal component analysis and converted into uncorrelated low-dimensional new data; the new features are relatively independent, contain the main information of the original data, and an anomaly detection model with high accuracy and recall is then trained.
3. The two models trained respectively on the continuous and discrete features of the data are considered together: the anomaly scores produced by the two models are weighted, and if the result exceeds a certain threshold the data are regarded as abnormal. Experimental results show that the detection effect of the combined anomaly detection model based on the isolated forest and the Gaussian distribution trained by the invention is better than that of a single isolated forest or Gaussian distribution model.
4. The invention improves the efficiency of data anomaly detection, finds anomalies more quickly, saves time cost, and has a wide application prospect.
Drawings
FIG. 1 is a flow chart of the steps of the present invention;
FIG. 2 is a diagram illustrating distribution of characteristic values of LDA obtained after random forest denoising according to an embodiment of the present invention;
fig. 3 is a diagram illustrating the distribution of characteristic values of LDA obtained without performing random forest denoising in the embodiment of the present invention.
Detailed Description
The invention is further described with reference to the following drawings and detailed description.
As shown in fig. 1, a joint data anomaly detection method based on an improved isolated forest algorithm and Gaussian distribution comprises the following specific steps:
Step 1: Access the server platform data and select two linear data sets A = {x_1, x_2, …, x_i, …, x_n} and B = {y_1, y_2, …, y_j, …, y_n}, where x_i and y_j denote the linear data of the two selected data sets. Find the noise points with the DBSCAN clustering algorithm and delete them, obtaining two new, denoised data sets A_1 = {x_1, x_2, …, x_i, …, x_o} and B_1 = {y_1, y_2, …, y_j, …, y_p}.
Step 2: Use the LDA linear discriminant method to further reduce the dimensionality of the data sets A_1 = {x_1, …, x_o} and B_1 = {y_1, …, y_p}: merge A_1 and B_1 into a new data set D = {(s_1, d_1), (s_2, d_2), …, (s_h, d_h), …, (s_w, d_w)}, where s_h denotes a linear datum x_i or y_j from A_1 or B_1, d_h denotes its class label, w = o + p, and d_h ∈ {0, 1}: d_h = 0 when s_h ∈ A_1 and d_h = 1 when s_h ∈ B_1. Then, by the formulas

S_B = Σ_{i=1}^{c} n_i (m_i − m)(m_i − m)^T and S_W = Σ_{i=1}^{c} Σ_{j=1}^{n_i} (x_j^{(i)} − m_i)(x_j^{(i)} − m_i)^T,

respectively compute the between-class scatter matrix S_B and the within-class scatter matrix S_W of the data set D, then solve for the projection matrix W of D, and finally reduce the dimension of every feature datum x_i in D, converting it into a new sample data set D′ = {(s_1, d_1), (s_2, d_2), …, (s_h, d_h), …, (s_w, d_w)} whose dimensions are classified more accurately.
Step 3: Divide the dimension-reduced sample data set D′ from Step 2 into continuous-feature data and discrete-feature data, on the basis that a discrete variable takes values that can only be described by natural numbers, integers, counting units and the like, while a continuous variable can take any value within some interval (any two adjacent values can be subdivided without limit, i.e. infinitely many values can be taken). Train the isolated forest model on the continuous-feature data and judge whether a continuous-feature datum is normal from the isolated forest anomaly score S(x, n); train the Gaussian distribution model on the discrete-feature data and judge whether a test datum is normal from the value of the Gaussian anomaly score T(r).
Step 4: Weight the isolated forest anomaly score S(x, n) and the Gaussian anomaly score T(r) obtained in Step 3 to compute a comprehensive anomaly score, and judge the test data according to it.
The specific steps of Step1 are as follows:
Step 1.1: Access the server platform data and select two linear data sets A = {x_1, x_2, …, x_i, …, x_n} and B = {y_1, y_2, …, y_j, …, y_n}. By the formula

ρ(x_i) = |N_ε(x_i)| − MinPts,

where N_ε(·) denotes the ε-neighborhood of a point and MinPts the minimum number of neighbors, separately evaluate the density value ρ(x_i) or ρ(y_j) of each datum x_i and y_j. If ρ(x_i) ≥ 0, x_i is a core point of data set A; if ρ(y_j) ≥ 0, y_j is a core point of data set B. Record and collect all core points into the sets A_c and B_c; the sets of non-core points A_nc and B_nc are given by A_nc = A \ A_c and B_nc = B \ B_c.
Step 1.2: Find the boundary sets A_bd and B_bd. If x_i ∈ A_nc and there exists a point q with q ∈ A and q ∈ N_{x_i} ∩ A_c, i.e. the neighborhood of x_i contains a core point, then x_i is a boundary point of A. Collect all boundary points of A into the set A_bd; B_bd is obtained in the same way.
Step 1.3: From the core point sets A_c, B_c obtained in Step 1.1 and the boundary point sets A_bd, B_bd obtained in Step 1.2, compute

A_noise = A \ (A_c ∪ A_bd) and B_noise = B \ (B_c ∪ B_bd).

If x_i ∈ A_noise, x_i is called a noise point and is deleted; if y_j ∈ B_noise, y_j is called a noise point and is deleted.
The specific steps for obtaining S_B and S_W in Step 2 are as follows:
Step 2.1: S_B and S_W are defined respectively as:

S_W = Σ_{i=1}^{c} Σ_{j=1}^{n_i} (x_j^{(i)} − m_i)(x_j^{(i)} − m_i)^T
S_B = Σ_{i=1}^{c} n_i (m_i − m)(m_i − m)^T, which for the two classes of D reduces to S_B = (m_1 − m_2)(m_1 − m_2)^T

where every sample datum is an n-dimensional vector, n_i denotes the number of class-i samples, x_j^{(i)} denotes the j-th sample of class i, m_i denotes the mean of the class-i samples, m denotes the overall sample mean, and c is the number of classes.
The specific steps for training the isolated forest model on the continuous-feature data in Step 3 are as follows:
Step 3.1: There are many trees in the isolated forest, and each tree is a binary tree called an iTree. A node in an iTree either has two children, a left child and a right child, or is itself a leaf node. A binary tree iTree in the isolated forest model is trained as follows:
Step 3.1.1: Randomly select 30% of the data of the original data set as the training data set of the iTree, and repeat this step for every tree to be trained so that each tree is different.
Step 3.1.2: Randomly select a feature of the data s_h and a split value G of that feature.
Step 3.1.3: Partition all the data according to the selected feature: samples whose value of the feature is smaller than G are placed in the left child node, and samples whose value is larger than G are placed in the right child node.
Step 3.1.4: Repeat Steps 3.1.2 and 3.1.3 on the left and right child nodes until one of the following conditions is met: the node contains only one datum or several identical data; or the height of the tree has reached the average tree height L of the isolated forest model.
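Steps 3.1.2–3.1.4 can be sketched for one-dimensional data as a recursive path-length computation; the height limit of 8 and the toy data are illustrative assumptions:

```python
import random

def itree_path_length(data, x, height=0, limit=8):
    """Path length h(x) of the point x in one randomly grown iTree
    (Steps 3.1.2-3.1.4): pick a random split value G, send values smaller
    than G to the left child and the rest to the right child, and stop
    when the node holds only identical data or the height limit L is
    reached. For one-dimensional data the 'feature' is the value itself."""
    values = sorted(set(data))
    if height >= limit or len(values) <= 1:
        return height
    g = random.uniform(values[0], values[-1])  # split value G
    side = [v for v in data if v < g] if x < g else [v for v in data if v >= g]
    return itree_path_length(side, x, height + 1, limit)

random.seed(42)
data = [i / 10 for i in range(1, 21)] + [10.0]   # a cluster plus one outlier
outlier_avg = sum(itree_path_length(data, 10.0) for _ in range(200)) / 200
inlier_avg = sum(itree_path_length(data, 1.0) for _ in range(200)) / 200
```

Averaged over many random trees, the outlier is isolated after far fewer splits than a point inside the cluster, which is exactly what the anomaly score of Step 3.3 exploits.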
Step 3.2: Construct every tree in the isolated forest as in Step 3.1 to form the forest. After the isolated forest model is trained, the path length at which a sample falls onto a leaf node in each tree is further computed and used as the basis for judging anomalous points.
Step 3.3: First distinguish leaf nodes from the root node (the root node is the topmost node of a tree; a node that can no longer be subdivided into child nodes is a leaf node). Record the path length from the leaf node reached by a sample to the root node of a tree as h(x), record the average path length of the trees in the forest as c(n), and let E(h(x)) be the expectation of the sample's path lengths over all trees in the forest.
The anomaly score is:

S(x, n) = 2^(−E(h(x)) / c(n))

The value of S(x, n) lies in [0, 1], and anomalous points are judged from the value of S(x, n): the closer the value is to 1, the more likely the point is anomalous.
The specific steps for training the Gaussian model on the discrete-feature data in Step 3 are as follows:
Step 3.4: Apply the normal distribution in the Gaussian model (the anomaly detection model): train the probability distribution model with the training data set and compute the parameters μ_i and σ_i of the probability density function, thereby obtaining the probability distribution of the data.
If a discrete variable s_i follows a Gaussian distribution, its probability density function is:

p(s_i) = (1 / (√(2π) σ_i)) exp(−(s_i − μ_i)² / (2σ_i²))

For mutually independent random variables {R_1, R_2, …, R_i, …, R_n}, each with R_i ~ N(μ_i, σ_i²), the joint distribution of the n random variables is a multidimensional normal distribution with probability density function:

P(r) = Π_{i=1}^{n} (1 / (√(2π) σ_i)) exp(−(r_i − μ_i)² / (2σ_i²))

Step 3.5: Compute the probability of the test-set data with the probability density function P(r): if P(r) is greater than 80%, the data are considered to follow the normal distribution; if P(r) < ε, the data are considered abnormal. The value of ε is selected according to the tolerance of the anomaly detection for abnormal data and the requirements; here ε is taken as 0.05.
The Gaussian distribution anomaly score is computed as T(r) = 1 − P(r), and the abnormal condition is judged from it: the closer T(r) is to 1, the more likely the point is anomalous.
The specific steps of Step 4 are as follows:
Perform a comprehensive calculation on the isolated forest anomaly score S(x, n) and the Gaussian model anomaly score T(r) obtained from the two models of Step 3; the comprehensive anomaly score M(x) is the weighted sum

M(x) = w_1 · S(x, n) + w_2 · T(r),

where w_1 and w_2 are the weights assigned to the two scores. The threshold E of the anomaly score was determined as 0.45 by experiments on the test set. If the comprehensive anomaly score M(x) is greater than or equal to the threshold E, the data are considered abnormal; if M(x) is less than the threshold E, the data are considered normal.
The invention simulates the data set collected by the system server platform with KDDCUP99 as the network-layer data set and TEP as the data set of the controller and field device layer. First, the DBSCAN clustering algorithm performs preliminary denoising on the two data sets, finding and removing the noise points; the two sets are then merged into one data set and the data are classified. After classification and feature extraction with the LDA linear discriminant method, the data features are converted into 11 dimensions; these 11-dimensional features are mutually uncorrelated and represent the main information of the original data.
The specific results are shown in fig. 2 and fig. 3. Compared with applying the LDA linear discriminant method directly, after the noise is filtered out by the DBSCAN clustering algorithm the eigenvalue distribution of the sample covariance matrix is more pronounced and the dimensionality reduction of the data works better. The usual method for reducing the dimensionality of server platform data is principal component analysis (PCA), but it does not take classification information into account, so after dimensionality reduction the information loss is kept low yet classification may become harder. The LDA linear discriminant method, by contrast, classifies the data in the original data space better during dimensionality reduction, and the low-dimensional data obtained by the mapping are easier to distinguish.
For the dimension-reduced data, the attack type PROBE in the KDDCUP99 data set and normal data are selected to train the isolated forest model. The training results on the continuous and discrete features of the data are considered together, and the accuracy, recall, and F1 value of the experimental results are analyzed comprehensively. By varying the number of samples per iTree in the isolated forest, the experiments compare the accuracy, recall, and F1 value of the combined data anomaly detection model (improved isolated forest algorithm plus Gaussian distribution) with those of the single isolated forest model, and likewise compare the combined model with the single Gaussian distribution model.
When the number of samples per tree is 250, the accuracy, recall, and F1 value of the combined data anomaly detection model (improved isolated forest algorithm plus Gaussian distribution) on the TEP data set are [0.9178743961352657, 0.95, 0.9336609336609336], while the accuracy, recall, and F1 value of the Gaussian distribution on the TEP data set are [0.8115942028908, 0.84, 0.825582555282528256]. From the experimental results on the KDDCUP99 and TEP data sets, the accuracy, recall, and F1 value of the combined data anomaly detection model are better than the results of detection with the isolated forest model or the Gaussian distribution model alone, and the experiments show that the model performs best when the number of samples per iTree in the isolated forest is 250.
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit and scope of the present invention.

Claims (6)

1. A joint data anomaly detection method based on an improved isolated forest algorithm and Gaussian distribution is characterized in that:
step1: accessing server platform data, selecting two linear data sets A = {x_1, x_2, ..., x_i, ..., x_n} and B = {y_1, y_2, ..., y_j, ..., y_n}, wherein x_i and y_j respectively represent the linear data of the two selected data sets; then finding noise points by using the DBSCAN clustering algorithm and deleting them to obtain two new data sets A_1 = {x_1, x_2, ..., x_i, ..., x_o} and B_1 = {y_1, y_2, ..., y_j, ..., y_p}, thereby denoising the test data;
step2: using the LDA linear discriminant method to perform dimension reduction on the data sets A_1 = {x_1, x_2, ..., x_i, ..., x_o} and B_1 = {y_1, y_2, ..., y_j, ..., y_p}: merging A_1 and B_1 into a new data set D = {(s_1, d_1), (s_2, d_2), ..., (s_h, d_h), ..., (s_w, d_w)}, wherein s_h represents the linear data x_i and y_j of the two data sets A_1 and B_1, d_h is the class label of s_h, w = o + p, and d_h ∈ {0, 1}: d_h = 0 when s_h ∈ A and d_h = 1 when s_h ∈ B; then, by the formulas
S_B = (m_1 − m_2)(m_1 − m_2)^T
and
S_W = Σ_{i=1}^{c} Σ_{j=1}^{n_i} (x_j^(i) − m_i)(x_j^(i) − m_i)^T,
respectively calculating the inter-class dispersion matrix S_B and the intra-class dispersion matrix S_W of the data set D, then solving the projection matrix W of the data set D, and finally performing dimension reduction on each feature datum x_i in D, converting it into the new sample data set D' = {(s_1, d_1), (s_2, d_2), ..., (s_h, d_h), ..., (s_w, d_w)} to obtain test data with more accurate dimension classification;
step3: according to the sample data set D' = {(s_1, d_1), (s_2, d_2), ..., (s_h, d_h), ..., (s_w, d_w)} obtained after the dimension reduction of step2, dividing the data of D' into data of continuous features and data of discrete features; training the continuous-feature data by using the isolated forest model and judging whether they are normal according to the isolated forest anomaly score S(x, n); training the discrete-feature data by using the Gaussian distribution model and judging whether the test data are normal according to the value of the Gaussian distribution anomaly score T(r);
step4: performing a weighted calculation on the isolated forest anomaly score S(x, n) and the Gaussian distribution anomaly score T(r) obtained in step3 to compute a comprehensive anomaly score, and judging the test data according to the comprehensive anomaly score.
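The four steps of claim 1 can be sketched end to end with off-the-shelf components. This is a hypothetical illustration using scikit-learn's stock DBSCAN, LDA and IsolationForest in place of the patent's improved variants; the synthetic data, eps, min_samples and estimator counts are invented for the example:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Two labelled clusters (labels d_h in {0, 1}) plus two far-away noise points.
A = rng.normal(0.0, 1.0, size=(100, 4))
B = rng.normal(5.0, 1.0, size=(100, 4))
X = np.vstack([A, B, [[50.0] * 4], [[-50.0] * 4]])
y = np.array([0] * 100 + [1] * 100 + [0, 1])

# step1: drop DBSCAN noise points (cluster label -1).
labels = DBSCAN(eps=2.0, min_samples=5).fit_predict(X)
keep = labels != -1
X1, y1 = X[keep], y[keep]

# step2: LDA projection (binary labels allow at most 1 component).
X_low = LinearDiscriminantAnalysis(n_components=1).fit_transform(X1, y1)

# step3: isolation forest on the reduced data; 250 samples per tree echoes
# the sample size the description reports as best.
forest = IsolationForest(n_estimators=100,
                         max_samples=min(250, len(X_low)),
                         random_state=0).fit(X_low)
scores = -forest.score_samples(X_low)  # higher = more anomalous
```

The Gaussian branch for discrete features and the step4 weighting are omitted here; they are sketched separately under claims 5 and 6.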
2. The method for detecting the abnormal data based on the improved isolated forest algorithm and the Gaussian distribution as claimed in claim 1, wherein the Step1 comprises the following specific steps:
step1.1: accessing server platform data, selecting two linear data sets A = {x_1, x_2, ..., x_i, ..., x_n} and B = {y_1, y_2, ..., y_j, ..., y_n}, and separately evaluating the density value ρ(x_i) or ρ(y_j) of each datum x_i and y_j by the density formula (rendered only as an image, FDA0002490178750000013, in the original filing);
if ρ(x_i) ≥ 0, x_i is a core point of the data set A; if ρ(y_j) ≥ 0, y_j is a core point of the data set B; all core points are recorded and collected to form the sets A_c and B_c, and the sets of non-core points are A_nc = A − A_c and B_nc = B − B_c;
step1.2: finding the boundary sets A_bd and B_bd: if x_i ∈ A_nc and there exists a point q satisfying q ∈ A and q ∈ N_x ∩ A_c, i.e. the neighbourhood N_x of x_i contains a core point, then x_i is a boundary point of A; the set formed by all boundary points in A is collected and recorded as A_bd, and B_bd is obtained in the same way;
step1.3: from the core point sets A_c, B_c and the boundary point sets A_bd, B_bd obtained in step1.1 and step1.2, calculating A_noise = A − (A_c ∪ A_bd) and B_noise = B − (B_c ∪ B_bd); if x_i ∈ A_noise, x_i is called a noise point and is deleted; if y_j ∈ B_noise, y_j is called a noise point and is deleted.
3. The improved isolated forest algorithm and Gaussian distribution based joint data anomaly detection method according to claim 1, wherein the specific steps of obtaining S_B and S_W in step2 are as follows:
step2.1: S_B and S_W are respectively defined as follows:
S_W = Σ_{i=1}^{c} Σ_{j=1}^{n_i} (x_j^(i) − m_i)(x_j^(i) − m_i)^T
S_B = (m_1 − m_2)(m_1 − m_2)^T
in the formulas, any sample datum x_i is an n-dimensional vector, n_i indicates the number of samples of the i-th class, x_j^(i) represents the j-th sample of the i-th class, m_i represents the mean of the i-th class samples, m represents the overall sample mean, and c is the number of classes.
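The dispersion matrices of claim 3 can be computed directly. A sketch for the two-class case of claim 1 (NumPy assumed; the projection matrix W would then come from the leading eigenvector of S_W⁻¹ · S_B, which is standard LDA rather than text taken from the claim):

```python
import numpy as np

def scatter_matrices(X, y):
    """Between-class S_B and within-class S_W for two classes (step2.1)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    m1 = X[y == 0].mean(axis=0)          # class-0 mean m_1
    m2 = X[y == 1].mean(axis=0)          # class-1 mean m_2
    S_B = np.outer(m1 - m2, m1 - m2)
    S_W = sum(np.outer(x - m1, x - m1) for x in X[y == 0]) \
        + sum(np.outer(x - m2, x - m2) for x in X[y == 1])
    return S_B, S_W

# Tiny illustrative data: class means (1, 0) and (11, 0).
S_B, S_W = scatter_matrices([[0, 0], [2, 0], [10, 0], [12, 0]], [0, 0, 1, 1])
```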
4. The improved isolated forest algorithm and Gaussian distribution based joint data anomaly detection method according to claim 1, wherein the specific steps of training the continuous feature data by using the isolated forest model in Step3 are as follows:
step3.1: training a binary tree iTree in an isolated forest model, wherein the training steps are as follows:
step3.1.1: randomly selecting 30% of the original data set as the training data set of an iTree, and repeating this step for each tree to ensure that each tree is different;
step3.1.2: randomly selecting a feature G of the data s_h and a split value g of that feature;
step3.1.3: classifying all data according to the selected feature G, placing the samples whose value of feature G is smaller than g in the left child node and the samples whose value of feature G is greater than g in the right child node;
step3.1.4: repeating steps 3.1.2 and 3.1.3 for the left and right child nodes until one of the following conditions is met:
the node contains only one datum or several identical data, or the tree height reaches the average height L of the trees in the isolated forest model;
step3.2: constructing each tree in the isolated forest as in step3.1 to form the forest; after the isolated forest model is trained, further calculating the path length at which a sample falls on a leaf node in each tree and taking it as the basis for judging abnormal points;
step3.3: firstly distinguishing leaf nodes from the root node; the path length from a leaf node to the root node in each tree is recorded as h(x), the average path length of a tree built on n samples is recorded as c(n), and E(h(x)) is the expectation of h(x) over all trees of the forest;
the anomaly score is:
S(x, n) = 2^(−E(h(x)) / c(n))
the value range of S(x, n) is [0, 1], and abnormal points are judged according to the value of S(x, n).
5. The improved isolated forest algorithm and Gaussian distribution based joint data anomaly detection method according to claim 1, wherein the specific steps of training discrete feature data by using a Gaussian model in Step3 are as follows:
step3.4: applying the normal distribution in the Gaussian model, training the probability distribution model by using the training data set, and calculating the parameters μ_i and σ_i of the probability density function, thereby obtaining the probability distribution of the data;
if a discrete variable s_i follows a Gaussian distribution, its probability density function is:
p(s_i) = (1 / (√(2π)·σ_i)) · exp(−(s_i − μ_i)² / (2σ_i²))
for mutually independent random variables {R_1, R_2, ..., R_i, ..., R_n} each subject to a Gaussian distribution, wherein R_i ~ N(μ_i, σ_i²), the joint distribution of the n random variables is a multidimensional normal distribution, and its probability density function is:
P(r) = Π_{i=1}^{n} (1 / (√(2π)·σ_i)) · exp(−(r_i − μ_i)² / (2σ_i²))
step3.5: when the probability of the test set data is calculated with the probability density function P(r), if P(r) is greater than 80%, the data are considered to follow the normal distribution, and if P(r) < ε, the data are considered to be abnormal data;
the Gaussian distribution anomaly score is calculated as: T(r) = 1 − P(r).
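Steps 3.4 and 3.5 can be sketched for a single discrete feature. Maximum-likelihood fitting of μ and σ is assumed; note that a Gaussian density can exceed 1 for small σ, so treating p(r) as a probability in T(r) = 1 − p(r), as the claim does, implicitly assumes σ is large enough:

```python
import math

def fit_gaussian(samples):
    """Maximum-likelihood mu and sigma for one feature (step3.4)."""
    n = len(samples)
    mu = sum(samples) / n
    sigma = math.sqrt(sum((s - mu) ** 2 for s in samples) / n)
    return mu, sigma

def gaussian_anomaly_score(r, mu, sigma):
    """T(r) = 1 - p(r), with p(r) the Gaussian density at r (step3.5).

    The multidimensional case of the claim multiplies per-feature densities."""
    p = math.exp(-((r - mu) ** 2) / (2 * sigma ** 2)) \
        / (sigma * math.sqrt(2 * math.pi))
    return 1.0 - p

mu, sigma = fit_gaussian([1, 2, 3, 4, 5])   # illustrative training values
```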
6. The method for detecting the abnormal data based on the improved isolated forest algorithm and the Gaussian distribution as claimed in claim 1, wherein the Step4 comprises the following specific steps:
comprehensively calculating according to the isolated forest anomaly score S(x, n) and the Gaussian model anomaly score T(r) obtained by the step3 models, the comprehensive anomaly score M(x) being a weighted combination of the two scores (the exact weighting formula is rendered only as an image, FDA0002490178750000041, in the original filing);
if the comprehensive abnormal score M (x) is larger than or equal to the threshold value E, the data is considered to be abnormal data;
if the composite anomaly score M (x) is less than the threshold E, the data is considered normal.
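The comprehensive score of claim 6 weights S(x, n) against T(r). Since the exact weighting formula is rendered only as an image in the filing, a convex combination with a hypothetical weight alpha sketches the idea:

```python
def combined_score(s_iforest, t_gauss, alpha=0.5):
    """M(x) as a convex combination of the two scores (step4 / claim 6).

    alpha is a hypothetical weight, not taken from the patent text."""
    return alpha * s_iforest + (1.0 - alpha) * t_gauss

def is_anomaly(s_iforest, t_gauss, threshold=0.6, alpha=0.5):
    """Flag data whose comprehensive score meets the threshold E (claim 6).

    The threshold value 0.6 is illustrative only."""
    return combined_score(s_iforest, t_gauss, alpha) >= threshold
```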
CN202010402904.2A 2020-05-13 2020-05-13 Improved isolated forest algorithm and Gaussian distribution-based combined data anomaly detection method Active CN111666169B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010402904.2A CN111666169B (en) 2020-05-13 2020-05-13 Improved isolated forest algorithm and Gaussian distribution-based combined data anomaly detection method

Publications (2)

Publication Number Publication Date
CN111666169A CN111666169A (en) 2020-09-15
CN111666169B true CN111666169B (en) 2023-03-28

Family

ID=72383483


Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112185484A (en) * 2020-10-13 2021-01-05 华北科技学院 AdaBoost model-based water quality characteristic mineral water classification method
CN112711577A (en) * 2020-12-17 2021-04-27 航天信息股份有限公司 Abnormal transaction enterprise identification method and device
CN112990314B (en) * 2021-03-16 2023-04-18 中国科学院长春光学精密机械与物理研究所 Hyperspectral image anomaly detection method and device based on improved isolated forest algorithm
CN112905583A (en) * 2021-04-01 2021-06-04 辽宁工程技术大学 High-dimensional big data outlier detection method
CN113723477B (en) * 2021-08-16 2024-04-30 同盾科技有限公司 Cross-feature federal abnormal data detection method based on isolated forest
CN114826971B (en) * 2022-06-28 2022-12-27 苏州浪潮智能科技有限公司 Server abnormity detection method, device, equipment and readable storage medium
CN115021679B (en) * 2022-08-09 2022-11-04 国网山西省电力公司大同供电公司 Photovoltaic equipment fault detection method based on multi-dimensional outlier detection
CN115457414B (en) * 2022-09-15 2023-05-05 西华大学 Unmanned aerial vehicle abnormal behavior identification method based on improved residual error network
CN115760484A (en) * 2022-12-07 2023-03-07 湖北华中电力科技开发有限责任公司 Method, device and system for improving potential danger identification capability of power distribution station area and storage medium
CN117114677A (en) * 2023-07-12 2023-11-24 北京中盛九橙企业管理咨询有限公司 Digital wallet management system and method
CN117238058B (en) * 2023-11-10 2024-01-26 无锡明诚汽车部件有限公司 Starter monitoring method for automobile based on data analysis
CN117874653A (en) * 2024-03-11 2024-04-12 武汉佳华创新电气有限公司 Power system safety monitoring method and system based on multi-source data

Citations (3)

Publication number Priority date Publication date Assignee Title
CN108777873A (en) * 2018-06-04 2018-11-09 江南大学 The wireless sensor network abnormal deviation data examination method of forest is isolated based on weighted blend
CN109345137A (en) * 2018-10-22 2019-02-15 广东精点数据科技股份有限公司 A kind of rejecting outliers method based on agriculture big data
CN110046665A (en) * 2019-04-17 2019-07-23 成都信息工程大学 Based on isolated two abnormal classification point detecting method of forest, information data processing terminal



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant