CN111666169A - Improved isolated forest algorithm and Gaussian distribution-based combined data anomaly detection method - Google Patents
- Publication number
- CN111666169A (application number CN202010402904.2A)
- Authority
- CN
- China
- Prior art keywords
- data
- abnormal
- gaussian distribution
- isolated forest
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0751—Error or fault detection not based on redundancy
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
Abstract
The invention relates to a combined data anomaly detection method based on an improved isolated forest algorithm and Gaussian distribution, and belongs to the technical field of data anomaly detection. The invention analyzes platform data and detects data anomalies with an improved joint detection model that combines an isolated forest with a Gaussian distribution. Aimed at the two kinds of features, continuous and discrete, found in industrial internet platform data, the combined Gaussian distribution and isolated forest model accurately detects anomalies of the industrial control system on the platform in real time, so that the abnormal state of the system is discovered promptly. The detection performance of the trained combined isolated forest and Gaussian distribution model is compared with that of a single Gaussian distribution model or a single isolated forest model. In detecting server platform data anomalies, the method improves detection efficiency, finds anomalies more quickly, saves time cost, and has broad application prospects.
Description
Technical Field
The invention relates to a combined data anomaly detection method based on an improved isolated forest algorithm and Gaussian distribution, and belongs to the technical field of data anomaly detection of a service monitoring server platform.
Background
A server is a computer that runs faster, carries heavier loads, and costs more than an ordinary computer. A server provides computing or application services to other clients in the network (terminals such as PCs, smartphones, and ATMs, and even large equipment such as train systems). Server platform data are acquired by a data collector and comprise data from the field device layer of the industrial control system, the control system, and the network layer. These data carry much important information; most importantly, the abnormal state of the server system can be analyzed from them, and an abnormal event of the server system is likely to cause huge economic losses in production and may even seriously endanger people's lives and health. In the prior art, most methods for judging server data anomalies require the server system platform to analyze the collected control system and network layer data to find whether the system is abnormal. Most platform data are high-dimensional and noisy; in the training of an anomaly detection model, high-dimensional data reduce the accuracy of the model, the computational complexity is high, and anomalies cannot be detected well, so dimensionality reduction must be performed on the high-dimensional data.
Disclosure of Invention
The invention aims to provide a combined data anomaly detection method based on an improved isolated forest algorithm and Gaussian distribution to solve the above problems.
The technical scheme of the invention is as follows: a joint data anomaly detection method based on an improved isolated forest algorithm and Gaussian distribution comprises the following specific steps:
step 1: is connected withEntering server platform data, selecting two linear data sets A ═ x1,x2....xi...xnY and B ═ y1,y2.....yj.....ynIn which xiAnd yiRespectively representing linear data of the two selected data sets, finding noise points by using a DBSCAN clustering algorithm, and deleting the noise points to obtain two new data sets A1={x1,x2....xi...xoAnd B1={y1,y2.....yj.....ypAnd denoising the test data.
Step 2: data set A using LDA linear decision method1={x1,x2....xi...xoAnd B1={y1,y2.....yj.....ypPerforming further dimensionality reduction, and inputting a data set A obtained by preliminarily removing noise points in Step11={x1,x2....xi...xoAnd B1={y1,y2.....yj.....ypTo use LDA linear decision method conveniently, set A is put1={x1,x2....xi...xoAnd B1={y1,y2.....yj.....ypMerge into a new data set D {(s)1,d1),(s2,d2)...(sh,dh)...(sw,dw) In which s ishRepresenting A in two data sets1={x1,x2....xi...xoAnd B1={y1,y2.....yj.....ypLinear data x of } linear data xiAnd yj,dhRepresenting linear data xiAnd yjW ═ o + p, dh∈ {0, 1}, when s ish∈ A, dhWhen s is equal to 0h∈ B, d h1 and then by the formulaAndrespectively calculating the inter-class dispersion matrix S of the data set DBAnd an intra-class dispersion matrix SWThen, a projection matrix W of the data set D is solved, and finally, each feature data x in the data set D is subjected toiReducing dimension and converting into a new sample data set D {(s)1,d1),(s2,d2)...(sh,dh)...(sw,dw) And obtaining the test data with more accurate dimension classification.
Step 3: sample data set after dimension reduction processing according to step2According to the method, a discrete variable refers to data of which the numerical value can only be described by natural numbers, integers, counting units and the like, a continuous variable refers to data which can be taken as values at will in a certain interval, two adjacent numerical values can be subjected to infinite segmentation (namely, infinite values can be taken) as the basis, data of a data set D are divided into data of continuous characteristics and data of discrete characteristics, an isolated forest model is used for training the continuous characteristic data, and whether the continuous characteristic data are normal or not is judged according to abnormal scores S (x, n) of the isolated forest; and training the discrete characteristic data by using a Gaussian distribution model, and judging whether the test data is normal or not according to the value of the abnormal score T (r) of Gaussian distribution.
Step 4: and (3) performing weighted calculation on the abnormal score S (x, n) of the isolated forest and the Gaussian distribution abnormal score T (r) obtained in Step3, comprehensively calculating abnormal scores, and judging test data according to the abnormal scores.
The specific steps of Step1 are as follows:
step1.1: accessing server platform data, selecting two linear data sets A ═ x1,x2....xi...xnY and B ═ y1,y2.....yj.....ynIs given by the formulaSeparately evaluating each data xiAnd yiDensity value of rho (x)i) Or ρ (y)j)。
If ρ (x)i) If not less than 0, it is the core point of the data set A, if ρ (y)j) If the number of the core points is more than or equal to 0, the core points are core points of the data set B, all the core points are recorded and collected, and a set A is formedcAnd BcInstead of the set A of core pointsncAnd BncIs shown asAnd
step1.2: finding a boundary set AbdAnd BbdIf xi∈AncAnd is andsatisfy q ∈ A and q ∈ Nx∩AcI.e. xiHas a core point, then xiThe boundary points are A, and the set formed by all the boundary points in A is collected and recorded as AbdSimilarly, B can be obtainedbd。
Step1.3: the core point set A obtained from Step1.1 and Step1.2c,BcAnd a set of boundary points Abd,BbdCalculatingAndif xi∈AnoiseThen call xiDeleting the noise points as the noise points; if yi∈BnoiseThen call yiIt is deleted as a noise point.
In Step 2, the specific steps for obtaining S_B and S_W are as follows:
Step2.1: S_B and S_W are respectively defined as follows:
S_B = Σ_{i=1}^{c} n_i (m_i − m)(m_i − m)^T
S_W = Σ_{i=1}^{c} Σ_{j=1}^{n_i} (x_j^(i) − m_i)(x_j^(i) − m_i)^T
In the formulas, any sample datum is an n-dimensional vector, n_i denotes the number of class-i samples, x_j^(i) denotes the jth sample of class i, m_i denotes the mean of the class-i samples, m denotes the overall sample mean, and c is the number of classes.
According to the data set D = {(s_1, d_1), (s_2, d_2), ..., (s_h, d_h), ..., (s_w, d_w)} of Step 2, the number of classes c is 2, so only two classes of sample data need to be considered and s_h is a one-dimensional vector; the formulas of Step2.1 therefore simplify to S_W = S_1 + S_2 (the sum of the within-class scatter matrices of the two classes) and S_B = (m_1 − m_2)(m_1 − m_2)^T, which reduces the amount of computation and improves computational efficiency.
The specific steps of training the continuous characteristic data by using the isolated forest model in Step3 are as follows:
step3.1: there are many trees in an soliton forest, and each tree is a binary tree called iTree. The nodes in the iTree either have two children, a left child and a right child, or are themselves leaf nodes. Training a binary tree iTree in an isolated forest model, wherein the training steps are as follows:
step3.1.1: and randomly selecting 30% of data of the number of the original data sets from the original data sets as training data sets of the iTrees, and repeating the step for each tree to be trained so as to ensure that each tree is different.
Step3.1.2: randomly selecting data shAnd the value G of the feature.
Step3.1.3: and classifying all data according to the selected characteristic G, placing the sample with the value of the characteristic G smaller than G in the left subnode, and placing the sample with the value of the characteristic G larger than G in the right subnode.
Step3.1.4: repeat the above steps 3.1.2 and 3.1.3 for the left and right child nodes until the following conditions are met:
only one data or a plurality of same data exist in the nodes; alternatively, the height of the tree has reached the average height L of the tree in the isolated forest model.
Step3.2: and constructing each tree in the isolated forest as Step3.1 to form a forest, and after training the isolated forest model, further calculating the distance of a sample in each tree falling on a leaf node, and taking the distance as a basis for judging an abnormal point.
Step3.3: firstly, leaf nodes and root nodes are distinguished (the root node is the topmost node of the tree, and is a leaf node when no child nodes exist after continuous subdivision), the path length from the leaf node to the root node in each tree is recorded as h (x), the average path length from the leaf node to the root node of a plurality of trees in the forest is recorded as c (x), and E (h (x)) is the expectation of all paths of the sample in the forest.
the value range of S (x, n) is [0,1], and abnormal points are judged according to the value of S (x, n).
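The iTree construction of Steps 3.1.1 to 3.1.4 and the score of Step 3.3 can be sketched for one-dimensional data as follows (with a single feature, Step 3.1.2 reduces to drawing only the split value g). The 30% subsample, the forest of 100 trees, and the toy data are illustrative, and the normaliser c(n) = 2H(n−1) − 2(n−1)/n is the standard isolation-forest choice assumed here.

```python
# Compact isolated-forest sketch: build iTrees, then score S(x, n) = 2^(-E(h(x))/c(n)).
import math
import random

def c(n):
    """Average path length of an unsuccessful BST search; normalises E(h(x))."""
    if n <= 1:
        return 0.0
    harmonic = math.log(n - 1) + 0.5772156649  # H(n-1) via Euler-Mascheroni constant
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def build_itree(data, height, limit):
    # Step 3.1.4 stop conditions: singleton/identical node, or height limit reached.
    if height >= limit or len(data) <= 1 or len(set(data)) == 1:
        return ('leaf', len(data))
    g = random.uniform(min(data), max(data))       # Step 3.1.2: random split value g
    left = [x for x in data if x < g]              # Step 3.1.3: partition by g
    right = [x for x in data if x >= g]
    return ('node', g,
            build_itree(left, height + 1, limit),
            build_itree(right, height + 1, limit))

def path_length(x, tree, height=0):
    if tree[0] == 'leaf':
        return height + c(tree[1])                 # correct for the unbuilt subtree
    _, g, left, right = tree
    return path_length(x, left if x < g else right, height + 1)

def anomaly_score(x, forest, sub):
    e_h = sum(path_length(x, t) for t in forest) / len(forest)
    return 2.0 ** (-e_h / c(sub))                  # S(x, n) = 2^(-E(h(x)) / c(n))

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(256)] + [10.0]  # one clear outlier
sub = int(0.3 * len(data))                                    # Step 3.1.1: 30% subsample
limit = math.ceil(math.log2(sub))
forest = [build_itree(random.sample(data, sub), 0, limit) for _ in range(100)]

print(anomaly_score(10.0, forest, sub) > anomaly_score(0.0, forest, sub))  # True
```

The outlier is isolated after few splits, so its expected path length is short and its score approaches 1, while points near the bulk of the data need many splits and score well below 0.5.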
The specific steps of training discrete feature data by using a Gaussian model in Step3 are as follows:
step3.4: applying normal distribution in a Gaussian model (abnormal detection model), training the probability distribution model by using a training data set, and calculating to obtain a parameter mu in a probability density functioni,σiThereby obtaining a probability distribution of the data.
If a discrete variablesiFollowing a gaussian distribution, the probability density function is then:
for mutually independent random variables { R) not subject to Gaussian distribution1,R2...Ri...RnIn which R isi~(μi~σi 2) Then the joint distribution of the n random variables is a multidimensional normal distribution, and the probability density function is:
step3.5: when calculating the probability of the test set data using the probability density function p (r), if the probability density function p (r) is greater than 80%, the data is considered to follow a normal distribution.
If the probability density function P (r) < then, the data is considered abnormal data.
The gaussian distribution anomaly score is calculated as: t (r) 1-p (r).
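The Gaussian branch of Steps 3.4 and 3.5 can be sketched as follows: fit μ_i, σ_i per feature, form the joint density p(r) of the independent features, and score T(r) = 1 − p(r). So that p(r) stays in [0, 1] and T(r) is a usable score, this sketch normalises the density by its peak value at the mean; that normalisation, like the toy columns, is an assumption of the sketch rather than part of the patent.

```python
# Gaussian anomaly-score sketch: per-feature (mu, sigma), independent joint density,
# T(r) = 1 - p(r) with p normalised so that p(mean vector) = 1.
import math

def fit_gaussian(columns):
    """Return (mu, sigma) per feature from training data given as columns."""
    params = []
    for col in columns:
        mu = sum(col) / len(col)
        var = sum((v - mu) ** 2 for v in col) / len(col)
        params.append((mu, math.sqrt(var)))
    return params

def p_relative(r, params):
    """Joint density of r relative to the density at the mean (so max = 1)."""
    out = 1.0
    for v, (mu, sigma) in zip(r, params):
        out *= math.exp(-((v - mu) ** 2) / (2 * sigma ** 2))
    return out

def t_score(r, params):
    return 1.0 - p_relative(r, params)

cols = [[1, 2, 2, 3, 2, 1, 3, 2], [10, 11, 9, 10, 10, 11, 9, 10]]
params = fit_gaussian(cols)
print(t_score([2, 10], params) < t_score([8, 25], params))  # True
```

A point at the feature means scores T(r) = 0, while a point far from both means scores close to 1, matching the rule that T(r) near 1 indicates an anomaly.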
The specific steps of Step4 are as follows:
Perform the comprehensive calculation from the isolated forest anomaly score S(x, n) and the Gaussian model anomaly score T(r) obtained by the two models of Step 3; the comprehensive anomaly score M(x) is a weighted sum of the two scores.
If the comprehensive anomaly score M(x) is greater than or equal to the threshold E, the data are considered anomalous data.
If the comprehensive anomaly score M(x) is less than the threshold E, the data are considered normal.
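Step 4 can be sketched as a weighted sum of the two scores compared against the threshold E. The equal weights α = 0.5 are an illustrative assumption (the text leaves the exact weighting open), and E = 0.45 is the value the detailed description later reports for its test set.

```python
# Step 4 sketch: M(x) = alpha * S(x, n) + (1 - alpha) * T(r), compared to E.
def combined_score(s_iforest, t_gaussian, alpha=0.5):
    """Weighted sum of the isolated-forest and Gaussian anomaly scores."""
    return alpha * s_iforest + (1.0 - alpha) * t_gaussian

def is_anomalous(s_iforest, t_gaussian, threshold=0.45, alpha=0.5):
    return combined_score(s_iforest, t_gaussian, alpha) >= threshold

print(is_anomalous(0.8, 0.9))  # True: both models flag the point
print(is_anomalous(0.2, 0.1))  # False: both scores are low
```

Since S(x, n) and T(r) both live in [0, 1], any convex combination keeps M(x) in [0, 1], which makes a single fixed threshold E meaningful across data sets.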
The invention provides a combined data anomaly detection method based on an improved isolated forest algorithm and Gaussian distribution, which performs anomaly detection on the data collected on a server platform so as to accurately detect anomalies of the control system on the server platform in real time and discover the abnormal state of the platform data promptly. In detecting server platform data anomalies, the model first uses the DBSCAN clustering algorithm for preliminary denoising and then the LDA linear discriminant method for further dimensionality reduction, so that the data of the original data space are better classified during dimensionality reduction and the low-dimensional data obtained by the mapping are easier to distinguish, which allows the abnormal state of the data to be analyzed better afterwards. The model improves the efficiency of data anomaly detection, finds anomalies more quickly, saves time cost, and has broad application prospects.
The invention has the beneficial effects that:
1. In the training of a model for detecting anomalies, high-dimensional data reduce the accuracy of the model, the computational complexity is high, and anomalies cannot be detected well; therefore the high-dimensional data are subjected to dimensionality reduction by a random forest model, and by removing the noise the abnormal state of the data can be analyzed better, so that an anomaly detection model with high accuracy and recall is trained.
2. After the noise is removed, feature extraction is performed on the denoised data by principal component analysis, converting the data into uncorrelated low-dimensional new data; the new features are relatively independent and contain the main information of the original data, after which an anomaly detection model with high accuracy and recall is trained.
3. The two models trained respectively on the continuous features and the discrete features of the data are considered together: the anomaly scores produced by the two models are combined by a weighted calculation, and if the result is greater than a certain threshold the data are considered anomalous. Experimental results show that the detection performance of the combined isolated forest and Gaussian distribution anomaly detection model trained by the invention is better than that of a single isolated forest or Gaussian distribution model.
4. The invention can improve the efficiency of data anomaly detection, find anomalies more quickly, save time cost and have wide application prospect.
Drawings
FIG. 1 is a flow chart of the steps of the present invention;
FIG. 2 is a diagram illustrating distribution of characteristic values of LDA obtained after random forest denoising according to an embodiment of the present invention;
fig. 3 is a diagram illustrating the distribution of characteristic values of LDA obtained without performing random forest denoising in the embodiment of the present invention.
Detailed Description
The invention is further described with reference to the following drawings and detailed description.
As shown in FIG. 1, a combined data anomaly detection method based on an improved isolated forest algorithm and Gaussian distribution comprises the following specific steps:
step 1: accessing server platform data, selecting two linear data sets A ═ x1,x2....xi...xnY and B ═ y1,y2.....yj.....ynIn which xiAnd yiRespectively representing linear data of the two selected data sets, finding noise points by using a DBSCAN clustering algorithm, and deleting the noise points to obtain two new data sets A1={x1,x2....xi...xoAnd B1={y1,y2.....yj.....ypAnd denoising the test data.
Step 2: data set A using LDA linear decision method1={x1,x2....xi...xoAnd B1={y1,y2.....yj.....ypPerforming further dimension reduction treatment to the set A1={x1,x2....xi...xoAnd B1={y1,y2.....yj.....ypMerge into a new data set D {(s)1,d1),(s2,d2)...(sh,dh)...(sw,dw) In which s ishRepresenting A in two data sets1={x1,x2....xi...xoAnd B1={y1,y2.....yj.....ypLinear data x of } linear data xiAnd yj,dhRepresenting linear numbersAccording to xiAnd yjW ═ o + p, dh∈ {0, 1}, when s ish∈ A, dhWhen s is equal to 0h∈ B, dh1 and then by the formulaAndrespectively calculating the inter-class dispersion matrix S of the data set DBAnd an intra-class dispersion matrix SWThen, a projection matrix W of the data set D is solved, and finally, each feature data x in the data set D is subjected toiReducing dimension and converting into a new sample data set D {(s)1,d1),(s2,d2)...(sh,dh)...(sw,dw) And obtaining the test data with more accurate dimension classification.
Step 3: sample data set after dimension reduction processing according to step2According to the method, a discrete variable refers to data of which the numerical value can only be described by natural numbers, integers, counting units and the like, a continuous variable refers to data which can be taken as values at will in a certain interval, two adjacent numerical values can be subjected to infinite segmentation (namely, infinite values can be taken) as the basis, data of a data set D are divided into data of continuous characteristics and data of discrete characteristics, an isolated forest model is used for training the continuous characteristic data, and whether the continuous characteristic data are normal or not is judged according to abnormal scores S (x, n) of the isolated forest; and training the discrete characteristic data by using a Gaussian distribution model, and judging whether the test data is normal or not according to the value of the abnormal score T (r) of Gaussian distribution.
Step 4: and (3) performing weighted calculation on the abnormal score S (x, n) of the isolated forest and the Gaussian distribution abnormal score T (r) obtained in Step3, comprehensively calculating abnormal scores, and judging test data according to the abnormal scores.
The specific steps of Step1 are as follows:
step1.1: accessing server platform data, selecting two linear data sets A ═ x1,x2....xi...xnY and B ═ y1,y2.....yj.....ynIs given by the formulaSeparately evaluating each data xiAnd yiDensity value of rho (x)i) Or ρ (y)j)。
If ρ (x)i) If not less than 0, it is the core point of the data set A, if ρ (y)j) If the number of the core points is more than or equal to 0, the core points are core points of the data set B, all the core points are recorded and collected, and a set A is formedcAnd BcInstead of the set A of core pointsncAnd BncIs shown asAnd
step1.2: finding a boundary set AbdAnd BbdIf xi∈AncAnd is andsatisfy q ∈ A and q ∈ Nx∩AcI.e. xiHas a core point, then xiThe boundary points are A, and the set formed by all the boundary points in A is collected and recorded as AbdSimilarly, B can be obtainedbd。
Step1.3: the core point set A obtained from Step1.1 and Step1.2c,BcAnd a set of boundary points Abd,BbdCalculatingAndif xi∈AnoiseThen call xiDeleting the noise points as the noise points; if yi∈BnoiseThen call yiIt is deleted as a noise point.
In Step 2, the specific steps for obtaining S_B and S_W are as follows:
Step2.1: S_B and S_W are respectively defined (for the two-class case) as follows:
S_B = (m_1 − m_2)(m_1 − m_2)^T
S_W = Σ_{i=1}^{2} Σ_{j=1}^{n_i} (x_j^(i) − m_i)(x_j^(i) − m_i)^T
In the formulas, any sample datum is an n-dimensional vector, n_i denotes the number of class-i samples, x_j^(i) denotes the jth sample of class i, m_i denotes the mean of the class-i samples, m denotes the overall sample mean, and c is the number of classes (here c = 2).
The specific steps of training the continuous characteristic data by using the isolated forest model in Step3 are as follows:
step3.1: there are many trees in an soliton forest, and each tree is a binary tree called iTree. The nodes in the iTree either have two children, a left child and a right child, or are themselves leaf nodes. Training a binary tree iTree in an isolated forest model, wherein the training steps are as follows:
step3.1.1: and randomly selecting 30% of data of the number of the original data sets from the original data sets as training data sets of the iTrees, and repeating the step for each tree to be trained so as to ensure that each tree is different.
Step3.1.2: randomly selecting data shAnd the value G of the feature.
Step3.1.3: and classifying all data according to the selected characteristic G, placing the sample with the value of the characteristic G smaller than G in the left subnode, and placing the sample with the value of the characteristic G larger than G in the right subnode.
Step3.1.4: repeat the above steps 3.1.2 and 3.1.3 for the left and right child nodes until the following conditions are met:
only one data or a plurality of same data exist in the nodes; alternatively, the height of the tree has reached the average height L of the tree in the isolated forest model.
Step3.2: and constructing each tree in the isolated forest as Step3.1 to form a forest, and after training the isolated forest model, further calculating the distance of a sample in each tree falling on a leaf node, and taking the distance as a basis for judging an abnormal point.
Step3.3: firstly, leaf nodes and root nodes are distinguished (the root node is the topmost node of the tree, and is a leaf node when no child nodes exist after continuous subdivision), the path length from the leaf node to the root node in each tree is recorded as h (x), the average path length from the leaf node to the root node of a plurality of trees in the forest is recorded as c (x), and E (h (x)) is the expectation of all paths of the sample in the forest.
the value range of S (x, n) is [0,1], and the abnormal point is judged according to the value of S (x, n), and the closer to 1, the more likely the abnormal point is.
The specific steps of training discrete feature data by using a Gaussian model in Step3 are as follows:
step3.4: applying normal distribution in a Gaussian model (abnormal detection model), training the probability distribution model by using a training data set, and calculating to obtain a parameter mu in a probability density functioni,σiThereby obtaining a probability distribution of the data.
If a discrete variable siFollowing a gaussian distribution, the probability density function is then:
for mutually independent randoms not obeying Gaussian distributionVariable { R1,R2...Ri...RnIn which R isi~(μi~σi 2) Then the joint distribution of the n random variables is a multidimensional normal distribution, and the probability density function is:
step3.5: when calculating the probability of the test set data using the probability density function p (r), if the probability density function p (r) is greater than 80%, the data is considered to follow a normal distribution.
If the probability density function P (r) is less than the threshold, the data is considered as abnormal data; and for the value, the tolerance of the abnormal data is selected according to the requirement according to the abnormal detection, and the value is 0.05.
The gaussian distribution anomaly score is calculated as: t (r) 1-p (r).
The abnormal condition is determined according to the Gaussian distribution abnormal score, and the closer to 1, the T (r) is, the more abnormal points are.
The specific steps of Step4 are as follows:
Perform the comprehensive calculation from the isolated forest anomaly score S(x, n) and the Gaussian model anomaly score T(r) obtained by the two models of Step 3; the comprehensive anomaly score M(x) is a weighted sum of the two scores.
The threshold E of the anomaly score was determined to be 0.45 by experiments on the test set.
If the comprehensive anomaly score M(x) is greater than or equal to the threshold E, the data are considered anomalous data.
If the comprehensive anomaly score M(x) is less than the threshold E, the data are considered normal.
The invention uses KDDCUP99 as the network layer data set and TEP as the data set of the controller and field device layer to simulate the data set collected by a system server platform. First, the DBSCAN clustering algorithm is used to preliminarily denoise the two data sets, finding the noise points and removing them. The two data sets are then merged into one data set simulating the data collected by the system server platform, and the data are labeled by class. Finally, the LDA linear discriminant method is used to classify the data and extract features; the data features can be transformed into 11 dimensions, and the 11-dimensional features are mutually uncorrelated and represent the main information of the original data.
The specific results are shown in FIG. 2 and FIG. 3. Compared with applying the LDA linear discriminant method directly, after the noise is filtered out by the DBSCAN clustering algorithm the eigenvalue distribution of the sample covariance matrix is more distinct and the dimensionality reduction of the data works better. In general, the main method for reducing the dimensionality of server platform data is principal component analysis (PCA), but PCA does not consider class information, so although the information loss after dimensionality reduction is low, classification may become more difficult. Using the LDA linear discriminant method, the data of the original data space are better classified during dimensionality reduction, and the low-dimensional data obtained by the mapping are easier to distinguish.
For the dimension-reduced data, the attack type PROBE in the KDDCUP99 data set and normal data are selected to train the isolated forest model. The training results on the continuous features and the discrete features of the data are considered together, and the accuracy, recall, and F1 value of the experimental results are analyzed comprehensively. By varying the number of samples per iTree in the isolated forest, the experiments compare the accuracy, recall, and F1 value of the combined improved isolated forest and Gaussian distribution anomaly detection model with those of the isolated forest model alone, and likewise compare the performance of the combined model with that of the single Gaussian distribution model.
When the number of samples per tree is 250, the accuracy, recall, and F1 value of the combined improved isolated forest and Gaussian distribution anomaly detection model on the TEP data set are [0.9178743961352657, 0.95, 0.9336609336609336], while the accuracy, recall, and F1 value of the Gaussian distribution alone on the TEP data set are [0.8115942028985508, 0.84, 0.8255528255528256]. From the experimental results on the KDDCUP99 and TEP data sets, the accuracy, recall, and F1 value of the combined improved isolated forest and Gaussian distribution anomaly detection model are better than the detection results of the single isolated forest model or the single Gaussian distribution model, and the experiments show that the model performs best when the number of samples per iTree in the isolated forest is 250.
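The accuracy, recall, and F1 comparisons above are computed from predicted versus true anomaly labels. A minimal sketch of the three metrics (with toy label vectors, not the KDDCUP99/TEP results) is:

```python
# Evaluation-metric sketch: accuracy, recall, and F1 from binary labels (1 = anomaly).
def accuracy_recall_f1(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, recall, f1

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
acc, rec, f1 = accuracy_recall_f1(y_true, y_pred)
print(acc, rec, f1)  # 0.8 0.75 0.75
```

Recall matters most here because a missed anomaly (false negative) on an industrial control system is far costlier than a false alarm, which is why the experiments report recall alongside accuracy and F1.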
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit and scope of the present invention.
Claims (6)
1. A joint data anomaly detection method based on an improved isolated forest algorithm and Gaussian distribution is characterized in that:
step 1: accessing server platform data, selecting two linear data sets A ═ x1,x2....xi...xnY and B ═ y1,y2.....yj.....ynIn which xiAnd yiRespectively representing linear data of the two selected data sets, finding noise points by using a DBSCAN clustering algorithm, and deleting the noise points to obtain two new data sets A1={x1,x2....xi...xoAnd B1={y1,y2.....yj.....ypDenoising the test data;
step 2: data set A using LDA linear decision method1={x1,x2....xi...xoAnd B1={y1,y2.....yj.....ypPerforming dimension reduction treatment to the set A1={x1,x2....xi...xoAnd B1={y1,y2.....yj.....ypMerge into a new data set D {(s)1,d1),(s2,d2)...(sh,dh)...(sw,dw) In which s ishRepresenting A in two data sets1={x1,x2....xi...xoAnd B1={y1,y2.....yj.....ypLinear data x of } linear data xiAnd yj,dhRepresenting linear data xiAnd yjW ═ o + p, dh∈ {0, 1}, when s ish∈ A, dhWhen s is equal to 0h∈ B, dh1 and then by the formulaAndrespectively calculating the inter-class dispersion matrix S of the data set DBAnd an intra-class dispersion matrix SWThen, a projection matrix W of the data set D is solved, and finally, each feature data x in the data set D is subjected toiReducing dimension and converting into a new sample data set D {(s)1,d1),(s2,d2)...(sh,dh)...(sw,dw) Obtaining more accurate test data of dimension classification;
step 3: and D' {(s) of the sample data set subjected to dimensionality reduction processing according to step21,d1),(s2,d2)...(sh,dh)...(sw,dw) Dividing data of the data set D into data of continuous features and data of discrete features, training the continuous feature data by using an isolated forest model, and judging whether the continuous feature data are normal or not according to abnormal scores S (x, n) of the isolated forest; training discrete characteristic data by using a Gaussian distribution model, and judging whether the test data is normal or not according to the value of the Gaussian distribution abnormal score T (r);
step 4: and (3) performing weighted calculation on the abnormal score S (x, n) of the isolated forest and the Gaussian distribution abnormal score T (r) obtained in Step3, comprehensively calculating abnormal scores, and judging test data according to the abnormal scores.
2. The method for detecting abnormal data based on the improved isolated forest algorithm and Gaussian distribution as claimed in claim 1, wherein step 1 comprises the following specific steps:
step 1.1: accessing server platform data, selecting two linear data sets A = {x1, x2, ..., xi, ..., xn} and B = {y1, y2, ..., yj, ..., yn}, and separately evaluating the density value ρ(xi) or ρ(yj) of each datum xi and yj by the density formula (rendered as an image in the original; in DBSCAN this is typically the number of points falling within the ε-neighbourhood of the point);
If ρ (x)i) If not less than 0, it is the core point of the data set A, if ρ (y)j) If the number of the core points is more than or equal to 0, the core points are core points of the data set B, all the core points are recorded and collected, and a set A is formedcAnd BcInstead of the set A of core pointsncAnd BncIs shown asAnd
step 1.2: finding the boundary sets Abd and Bbd: if xi ∈ Anc and there exists a point q satisfying q ∈ A and q ∈ Nx ∩ Ac, i.e. the neighbourhood of xi contains a core point, then xi is a boundary point of A; the set of all boundary points in A is recorded as Abd, and Bbd is obtained similarly;
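The core/boundary/noise partition of steps 1.1–1.2 can be sketched with a plain-NumPy neighbourhood count standing in for the density formula ρ; the eps and min_samples values are illustrative assumptions:

```python
import numpy as np

def dbscan_partition(X, eps=0.6, min_samples=4):
    """Partition X into core, boundary and noise masks as in steps 1.1-1.2."""
    # rho(x_i): number of points in the eps-neighbourhood of x_i (itself included)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbours = dist <= eps
    rho = neighbours.sum(axis=1)
    core = rho >= min_samples                             # A_c: core points
    # boundary: a non-core point whose neighbourhood contains a core point
    boundary = ~core & (neighbours & core[None, :]).any(axis=1)
    noise = ~core & ~boundary                             # deleted in step 1
    return core, boundary, noise

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 0.2, size=(40, 2)),  # dense cluster
               [[5.0, 5.0]]])                       # one isolated point
core, boundary, noise = dbscan_partition(X)
X_clean = X[~noise]   # the denoised data set A_1
```

The isolated point has a near-empty neighbourhood and no core neighbour, so it ends up in the noise mask and is dropped.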
3. The improved isolated forest algorithm and Gaussian distribution based joint data anomaly detection method according to claim 1, wherein obtaining SB and SW in step 2 comprises the following specific steps:
step 2.1: SB and SW are respectively defined as follows, with m1 and m2 the mean vectors of the two classes:
SB = (m1 − m2)(m1 − m2)^T
SW = Σ_{s∈A1} (s − m1)(s − m1)^T + Σ_{s∈B1} (s − m2)(s − m2)^T
(the SW formula is rendered as an image in the original; the standard within-class scatter definition is shown here.)
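Under these definitions the two scatter matrices, and the Fisher projection direction w = SW⁻¹(m1 − m2) used to build the projection matrix W of step 2, can be computed directly; a NumPy sketch with made-up sample points:

```python
import numpy as np

def scatter_matrices(XA, XB):
    """Between-class S_B and within-class S_W scatter for two classes."""
    m1, m2 = XA.mean(axis=0), XB.mean(axis=0)
    diff = (m1 - m2).reshape(-1, 1)
    S_B = diff @ diff.T                                  # (m1-m2)(m1-m2)^T
    S_W = (XA - m1).T @ (XA - m1) + (XB - m2).T @ (XB - m2)
    return S_B, S_W

XA = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]])      # class A_1
XB = np.array([[6.0, 5.0], [7.0, 7.0], [8.0, 9.0]])      # class B_1
S_B, S_W = scatter_matrices(XA, XB)

# Fisher projection direction: w = S_W^{-1} (m1 - m2)
w = np.linalg.solve(S_W, XA.mean(axis=0) - XB.mean(axis=0))
```

Projecting every sample onto w gives the one-dimensional data set D' used in step 3.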
4. The method for detecting abnormal data based on the improved isolated forest algorithm and Gaussian distribution as claimed in claim 1, wherein training the continuous feature data with the isolated forest model in step 3 comprises the following specific steps:
step 3.1: training the binary trees (iTrees) of the isolated forest model, the training steps being as follows:
step 3.1.1: randomly selecting 30% of the data in the original data set as the training data set of an iTree, and repeating this step for each tree so that each tree is different;
step 3.1.2: randomly selecting a characteristic G of the data sh and a split value g of that characteristic;
step 3.1.3: classifying all data according to the selected characteristic G: samples whose value of G is smaller than g are placed in the left child node, and the remaining samples in the right child node;
step 3.1.4: repeating steps 3.1.2 and 3.1.3 for the left and right child nodes until one of the following conditions is met:
the node contains only one datum or several identical data, or the tree height reaches the average height L of the trees in the isolated forest model;
step 3.2: constructing each tree of the isolated forest as in step 3.1 to form the forest; after the isolated forest model has been trained, the path length at which a sample falls on a leaf node in each tree is further calculated and used as the basis for judging anomalous points;
step 3.3: distinguishing leaf nodes from the root node: the path length from a leaf node to the root node of each tree is recorded as h(x), the average path length from leaf node to root node over the trees of the forest is recorded as c(x), and E(h(x)) is the expectation of all paths of a sample in the forest; the anomaly score is then S(x, n) = 2^(−E(h(x))/c(x)) (the score formula is rendered as an image in the original; the standard isolation-forest form is shown here);
The value range of S(x, n) is [0, 1], and anomalous points are judged according to the value of S(x, n): the closer the score is to 1, the more likely the point is anomalous.
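The score of step 3.3 is the standard isolation-forest anomaly score S(x, n) = 2^(−E(h(x))/c), with c the normalising average path length. A sketch using scikit-learn's IsolationForest, whose score_samples returns the negated score; the training data and seeds are illustrative, and max_samples=0.3 mirrors the 30% subsampling of step 3.1.1:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
X_train = rng.normal(0.0, 1.0, size=(300, 2))     # continuous feature data

# 100 iTrees, each grown on a random 30% subsample (step 3.1.1)
model = IsolationForest(n_estimators=100, max_samples=0.3,
                        random_state=1).fit(X_train)

X_test = np.array([[0.0, 0.0],    # central point: low anomaly score
                   [8.0, 8.0]])   # isolated point: score approaching 1
S = -model.score_samples(X_test)  # recover S(x, n) in (0, 1]
```

The isolated point has a much shorter expected path length E(h(x)) and therefore the larger S(x, n) of the two.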
5. The method for detecting abnormal data based on the improved isolated forest algorithm and Gaussian distribution as claimed in claim 1, wherein training the discrete feature data with the Gaussian model in step 3 comprises the following specific steps:
step 3.4: applying the normal distribution in the Gaussian model: training the probability distribution model with the training data set and calculating the parameters μi and σi of the probability density function, thereby obtaining the probability distribution of the data;
If a discrete variable si follows a Gaussian distribution, its probability density function is:
p(si) = (1 / (√(2π) σi)) exp(−(si − μi)² / (2σi²))
For mutually independent random variables {R1, R2, ..., Ri, ..., Rn} each following a Gaussian distribution, with Ri ~ N(μi, σi²), the joint distribution of the n random variables is a multidimensional normal distribution with probability density function:
P(r) = ∏_{i=1}^{n} (1 / (√(2π) σi)) exp(−(ri − μi)² / (2σi²))
(both density formulas are rendered as images in the original; the standard forms are shown here.)
step 3.5: calculating the probability of the test-set data with the probability density function P(r): if P(r) is greater than 80%, the data are considered to follow the normal distribution; if it is smaller, the data are considered abnormal;
The Gaussian distribution anomaly score is calculated as: T(r) = 1 − P(r).
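The Gaussian branch can be sketched with only the standard library. Note one assumption: the patent compares the density P(r) against 80%, whereas the sketch below interprets P(r) as the two-sided tail probability of N(μ, σ²), which keeps T(r) = 1 − P(r) in [0, 1] regardless of the data's scale; the training values are made up:

```python
import math

def fit_gaussian(train):
    """Estimate mu and sigma from the training data (step 3.4)."""
    mu = sum(train) / len(train)
    var = sum((x - mu) ** 2 for x in train) / len(train)
    return mu, math.sqrt(var)

def anomaly_score(r, mu, sigma):
    """T(r) = 1 - P(r), with P(r) taken as the two-sided tail probability."""
    p = math.erfc(abs(r - mu) / (sigma * math.sqrt(2)))  # P(|R - mu| >= |r - mu|)
    return 1.0 - p

mu, sigma = fit_gaussian([4.8, 5.0, 5.2, 4.9, 5.1, 5.0])
t_normal = anomaly_score(5.0, mu, sigma)    # near 0: judged normal
t_outlier = anomaly_score(9.0, mu, sigma)   # near 1: flagged as abnormal
```

Under this reading, the "P(r) > 80%" rule of step 3.5 becomes T(r) < 0.2.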
6. The method for detecting abnormal data based on the improved isolated forest algorithm and Gaussian distribution as claimed in claim 1, wherein step 4 comprises the following specific steps:
Performing a comprehensive calculation from the isolated forest anomaly score S(x, n) and the Gaussian model anomaly score T(r) obtained by the step-3 models, the comprehensive anomaly score M(x) being a weighted combination of the two (the exact weighting formula is rendered as an image in the original);
If the comprehensive anomaly score M(x) is greater than or equal to the threshold E, the data are considered abnormal;
if M(x) is less than the threshold E, the data are considered normal.
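The decision rule of claim 6 reduces to a thresholded weighted sum. A sketch with hypothetical equal weights α = β = 0.5 and a hypothetical threshold E = 0.6 (neither the weights nor E is given in this extract):

```python
def combined_score(s, t, alpha=0.5, beta=0.5):
    """M(x) = alpha * S(x, n) + beta * T(r); the weights are illustrative."""
    return alpha * s + beta * t

E = 0.6  # hypothetical decision threshold

def is_abnormal(s, t, threshold=E):
    """Claim 6 decision rule: abnormal iff M(x) >= E."""
    return combined_score(s, t) >= threshold

# e.g. a sample scored S(x, n) = 0.85 by the forest and T(r) = 0.70 by the
# Gaussian branch gives M(x) = 0.775 >= E, so it is judged abnormal
verdict = is_abnormal(0.85, 0.70)
```

Tuning α against β trades off the continuous-feature evidence against the discrete-feature evidence.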
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010402904.2A CN111666169B (en) | 2020-05-13 | 2020-05-13 | Improved isolated forest algorithm and Gaussian distribution-based combined data anomaly detection method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111666169A true CN111666169A (en) | 2020-09-15 |
CN111666169B CN111666169B (en) | 2023-03-28 |
Family
ID=72383483
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108777873A (en) * | 2018-06-04 | 2018-11-09 | 江南大学 | The wireless sensor network abnormal deviation data examination method of forest is isolated based on weighted blend |
CN109345137A (en) * | 2018-10-22 | 2019-02-15 | 广东精点数据科技股份有限公司 | A kind of rejecting outliers method based on agriculture big data |
CN110046665A (en) * | 2019-04-17 | 2019-07-23 | 成都信息工程大学 | Based on isolated two abnormal classification point detecting method of forest, information data processing terminal |
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112185484A (en) * | 2020-10-13 | 2021-01-05 | 华北科技学院 | AdaBoost model-based water quality characteristic mineral water classification method |
CN112711577A (en) * | 2020-12-17 | 2021-04-27 | 航天信息股份有限公司 | Abnormal transaction enterprise identification method and device |
CN112990314B (en) * | 2021-03-16 | 2023-04-18 | 中国科学院长春光学精密机械与物理研究所 | Hyperspectral image anomaly detection method and device based on improved isolated forest algorithm |
CN112990314A (en) * | 2021-03-16 | 2021-06-18 | 中国科学院长春光学精密机械与物理研究所 | Hyperspectral image anomaly detection method and device based on improved isolated forest algorithm |
CN112905583A (en) * | 2021-04-01 | 2021-06-04 | 辽宁工程技术大学 | High-dimensional big data outlier detection method |
CN113723477A (en) * | 2021-08-16 | 2021-11-30 | 同盾科技有限公司 | Cross-feature federal abnormal data detection method based on isolated forest |
CN113723477B (en) * | 2021-08-16 | 2024-04-30 | 同盾科技有限公司 | Cross-feature federal abnormal data detection method based on isolated forest |
CN114925731A (en) * | 2022-06-06 | 2022-08-19 | 华电金沙江上游水电开发有限公司叶巴滩分公司 | Method for detecting abnormal value of monitoring data of flexible inclinometer |
CN114925731B (en) * | 2022-06-06 | 2024-05-31 | 华电金沙江上游水电开发有限公司叶巴滩分公司 | Method for detecting abnormal value of monitoring data of flexible inclinometer |
CN114826971A (en) * | 2022-06-28 | 2022-07-29 | 苏州浪潮智能科技有限公司 | Server abnormity detection method, device, equipment and readable storage medium |
WO2024001254A1 (en) * | 2022-06-28 | 2024-01-04 | 苏州元脑智能科技有限公司 | Server anomaly detection method and apparatus, device, and readable storage medium |
CN115021679B (en) * | 2022-08-09 | 2022-11-04 | 国网山西省电力公司大同供电公司 | Photovoltaic equipment fault detection method based on multi-dimensional outlier detection |
CN115021679A (en) * | 2022-08-09 | 2022-09-06 | 国网山西省电力公司大同供电公司 | Photovoltaic equipment fault detection method based on multi-dimensional outlier detection |
CN115457414B (en) * | 2022-09-15 | 2023-05-05 | 西华大学 | Unmanned aerial vehicle abnormal behavior identification method based on improved residual error network |
CN115760484A (en) * | 2022-12-07 | 2023-03-07 | 湖北华中电力科技开发有限责任公司 | Method, device and system for improving potential danger identification capability of power distribution station area and storage medium |
CN117114677A (en) * | 2023-07-12 | 2023-11-24 | 北京中盛九橙企业管理咨询有限公司 | Digital wallet management system and method |
CN117238058A (en) * | 2023-11-10 | 2023-12-15 | 无锡明诚汽车部件有限公司 | Starter monitoring method for automobile based on data analysis |
CN117238058B (en) * | 2023-11-10 | 2024-01-26 | 无锡明诚汽车部件有限公司 | Starter monitoring method for automobile based on data analysis |
CN117874653A (en) * | 2024-03-11 | 2024-04-12 | 武汉佳华创新电气有限公司 | Power system safety monitoring method and system based on multi-source data |
CN117874653B (en) * | 2024-03-11 | 2024-05-31 | 武汉佳华创新电气有限公司 | Power system safety monitoring method and system based on multi-source data |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111666169B (en) | Improved isolated forest algorithm and Gaussian distribution-based combined data anomaly detection method | |
CN108737406B (en) | Method and system for detecting abnormal flow data | |
CN110263846B (en) | Fault diagnosis method based on fault data deep mining and learning | |
CN107493277B (en) | Large data platform online anomaly detection method based on maximum information coefficient | |
CN111798312A (en) | Financial transaction system abnormity identification method based on isolated forest algorithm | |
CN111556016B (en) | Network flow abnormal behavior identification method based on automatic encoder | |
CN102176698A (en) | Method for detecting abnormal behaviors of user based on transfer learning | |
CN112039903B (en) | Network security situation assessment method based on deep self-coding neural network model | |
CN111915418A (en) | Internet financial fraud online detection method and device | |
CN117421684A (en) | Abnormal data monitoring and analyzing method based on data mining and neural network | |
CN112580749B (en) | Intelligent fire detection method based on machine olfaction technology | |
CN113556319B (en) | Intrusion detection method based on long-short term memory self-coding classifier under internet of things | |
CN114124482A (en) | Access flow abnormity detection method and device based on LOF and isolated forest | |
CN112529638B (en) | Service demand dynamic prediction method and system based on user classification and deep learning | |
CN113489685A (en) | Secondary feature extraction and malicious attack identification method based on kernel principal component analysis | |
CN112199670A (en) | Log monitoring method for improving IFOREST (entry face detection sequence) to conduct abnormity detection based on deep learning | |
CN111600878A (en) | Low-rate denial of service attack detection method based on MAF-ADM | |
CN115438102A (en) | Space-time data anomaly identification method and device and electronic equipment | |
CN115659143A (en) | Fault real-time diagnosis method based on experimental design | |
CN110189799B (en) | Metagenome feature selection method based on variable importance score and Neyman Pearson test | |
CN118194186A (en) | Snail meat feed production management method and system based on Internet of things | |
CN116318925A (en) | Multi-CNN fusion intrusion detection method, system, medium, equipment and terminal | |
CN113722230B (en) | Integrated evaluation method and device for vulnerability mining capability of fuzzy test tool | |
CN113780084B (en) | Face data amplification method based on generation type countermeasure network, electronic equipment and storage medium | |
CN113010673A (en) | Vulnerability automatic classification method based on entropy optimization support vector machine |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||