CN117520954A - Abnormal data reconstruction method and system based on isolated forest countermeasure network - Google Patents
- Publication number
- CN117520954A (application CN202311275888.5A)
- Authority
- CN
- China
- Prior art keywords
- data
- model
- generator
- discriminator
- sample
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/24323—Tree-organised classifiers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/20—Ensemble learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/0442—Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0475—Generative networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/094—Adversarial learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/01—Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
Abstract
The invention discloses an abnormal data reconstruction method and system based on an isolated forest countermeasure network, and relates to the technical field of abnormal data reconstruction. By constructing an isolated forest model, the method efficiently identifies abnormal data in load measurements. Exploiting the strong learning capacity of neural networks, the generator is built from a GRU network, and cosine similarity and the Smooth L1 function are introduced to improve the loss function, addressing the slow convergence of the GAN model and the low accuracy of data reconstruction. Experimental results show that, compared with traditional GAN, KNN and other missing-data filling models, the CGAN achieves the best reconstruction under different missing patterns and effectively improves the quality of the load data set, thereby further improving data filling precision. The method fully mines the nonlinear relation between the load and its influencing factors, improves the accuracy of data analysis and modeling, and benefits the operation and management of the power system.
Description
Technical Field
The invention relates to the field of abnormal data reconstruction, in particular to an abnormal data reconstruction method and system based on an isolated forest countermeasure network.
Background
With the maturing application of communication and acquisition technology in power systems, power companies can conveniently obtain real-time load data. However, owing to various unreliability factors, the collected load data suffer from anomalies, missing values and similar problems: abnormal data destroy the original distribution of the data set, and the resulting lack of data redundancy hinders improvement of load forecasting accuracy. The invention provides an abnormal data reconstruction method based on an isolated forest countermeasure network. Abnormal points are eliminated by constructing an isolated-forest-based abnormal data identification model. After the missing data set is obtained, a conditional generative adversarial network (CGAN) missing-data reconstruction model is constructed: the load influencing factors serve as the condition constraint of the conditional generative adversarial network, and a weighted loss function is introduced to improve the convergence speed and data reconstruction accuracy of the model and to fill the missing data points.
Existing data filling methods have shortcomings in abnormal data identification, missing data reconstruction and related aspects. The method provided by the invention uses the isolated forest algorithm to identify abnormal load data with higher accuracy and efficiency: compared with traditional abnormal data detection methods, the isolated forest more effectively identifies data points that deviate far from the overall range of the data set. Meanwhile, filling the missing data with a conditional generative adversarial network fully mines the nonlinear relation between the load and its influencing factors, generates more accurate load samples, and improves the accuracy of data reconstruction; compared with traditional filling methods, it better improves both the precision and the authenticity of the filled data.
Disclosure of Invention
The invention is proposed in view of the problems existing in current data filling methods with respect to abnormal data identification and missing data reconstruction.
Accordingly, the problem to be solved by the invention is how to eliminate abnormal data points by constructing an isolated-forest-based abnormal data identification model, and how to fill the resulting gaps by constructing a missing-data reconstruction model based on a conditional generative adversarial network.
In order to solve the technical problems, the invention provides the following technical scheme:
in a first aspect, an embodiment of the present invention provides an anomaly data reconstruction method based on an isolated forest countermeasure network, which includes acquiring power grid history data, constructing and training an isolated forest model, and performing parameter optimization on an isolated forest by using a grid search algorithm; inputting the load data into an isolated forest model for identification so as to delete abnormal data and obtain a missing data set; normalizing the original load data to eliminate the influence of the data dimension on model training, and dividing the complete abnormal-free load data into a training set; constructing a generator model and a discriminator model based on the CGAN, carrying out weight update on the generator model and the discriminator model, and after the generator model and the discriminator model are in balance in game, storing the CGAN model and updating the loss function; the missing data set is input into the saved generator model, the generated sample is output, and the missing data is filled.
As a preferable scheme of the anomaly data reconstruction method based on the isolated forest countermeasure network, the construction and training of the isolated forest model comprises the following steps: randomly selecting a subset of n samples from the training set data and delivering it to the root of an isolated tree; if the sample subset contains m features, randomly selecting one of the m features and randomly generating a number between the maximum and minimum values of that feature as the division point g; after the feature dimension is selected, the isolated forest randomly forms a plane through the division point g and divides the data space formed by the sample subset into two subspaces along that dimension; repeatedly dividing the two subspaces to continuously form new data spaces until each data space contains only one data point or reaches the set depth, and then computing the abnormality index of each data point y; repeatedly executing the above steps to generate a plurality of isolated trees; the specific formula of the abnormality index of a data point y is as follows:
S(y, m) = 2^( -E(L(y)) / C(m) )
where S(y, m) is the anomaly score of data point y, E(L(y)) is the expected value of the path length L(y) of y over the multiple trees, and C(m) is the average path length of the isolated tree.
Wherein, the specific formula of the average path length C (m) of the isolated tree is as follows:
C(m) = 2·(ln(m - 1) + ζ) - 2·(m - 1)/m
where ζ is the Euler constant and m is the number of data points contained in the data space.
As a preferable scheme of the anomaly data reconstruction method based on the isolated forest countermeasure network, the invention comprises the following steps: the specific formula for carrying out normalization processing on the original load data is as follows:
a_n = (a - a_min) / (a_max - a_min)
where a_n is the normalization result of the data, a_max and a_min respectively represent the maximum and minimum values of the original data, and a is the initial data value.
As a preferable scheme of the anomaly data reconstruction method based on the isolated forest countermeasure network, the construction of the generator model and the discriminator model based on the CGAN comprises the following steps: the generator model consists of three layers of GRU networks and one layer of fully connected network; the training set data are reshaped into a three-dimensional matrix of batch size, time step and data dimension and input into the GRU network; after the three-dimensional input matrix is processed by the 3 GRU layers, matrices G1, G2 and G3 are obtained in turn, with the neuron numbers of the 3 GRU layers set to 128, 64 and 64 respectively; the matrix G3 is input into the fully connected layer, which outputs a generated sample that is sent to the discriminator; the random noise z and the condition c are combined and input into the generator, which outputs the generated sample G(z|c); the discriminator consists of 3 CNN layers and 1 fully connected layer; the first 2 CNN layers use 32 and 64 convolution kernels of size 5×5 respectively with stride 2, and feature extraction through these 2 CNN layers yields convolution layers C1 and C2 in turn, while the 3rd CNN layer uses 16 convolution kernels of size 3×3 with stride 1; a gradient penalty mechanism is introduced as in the WGAN-GP model, LeakyReLU is selected as the activation function, and the fully connected layer outputs the discrimination result for the input matrix; supervised learning is applied to the GAN: the CGAN retains the game structure of the GAN while adding a condition value to the inputs of the generator and the discriminator; the input of the discriminator is the combination of the load data true value t with the condition c and the combination of the generated sample G(z|c) with the condition c, and the discriminator must judge both whether the distributions of the generated and real samples are similar and whether the generated sample satisfies the condition c; the generator and the discriminator update the network parameters, the loss function and the objective function according to the discrimination result.
As a preferable scheme of the anomaly data reconstruction method based on the isolated forest countermeasure network, the invention comprises the following steps: the specific formula of the loss function is as follows:
L_G = -E_(z,c)[D(G(z|c)|c)]
L_D = -E_(t,c)[D(t|c)] + E_(z,c)[D(G(z|c)|c)]
where E represents the expected value of the corresponding distribution, G(·) denotes a sample generated by the generator, D(·) denotes the discriminator's judgment of the authenticity of an input sample, D(G(z|c)|c) is the discrimination result of the discriminator for the generated sample G(z|c) under the condition c, D(t|c) is the discrimination result for the real sample t under the condition c, and E_(t,c) is the expectation over the real sample t and the condition c.
The specific formula of the CGAN objective function is as follows:
L = -E_(t,c)[D(t|c)] + E_(z,c)[D(G(z|c)|c)] + λ·E_x̂[ ( ||∇_x̂ D(x̂|c)||_2 - 1 )^2 ]
where λ is the gradient penalty coefficient, E represents the expected value of the corresponding distribution, ∇_x̂ D(x̂|c) is the gradient of the discriminator function with respect to the sample x̂ interpolated between real and generated samples, E_(t,c) is the expectation over the real sample t and the condition c, and D(G(z|c)|c) is the discrimination result of the discriminator for the generated sample G(z|c) under the condition c.
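For illustration, the generator and discriminator losses above, together with the gradient penalty term, can be sketched in a few lines of numpy. This is an illustrative sketch only: the discriminator outputs and gradient norms are assumed to be precomputed arrays, and `lam` stands for the penalty coefficient λ.

```python
import numpy as np

def generator_loss(d_fake):
    # L_G = -E[D(G(z|c)|c)]: the generator wants the discriminator's
    # score on generated samples to be high
    return -np.mean(d_fake)

def discriminator_loss(d_real, d_fake, grad_norms, lam=10.0):
    # L_D = -E[D(t|c)] + E[D(G(z|c)|c)] + lam * E[(||grad||_2 - 1)^2]
    # grad_norms: norms of the discriminator gradients at interpolated samples
    gp = np.mean((grad_norms - 1.0) ** 2)
    return -np.mean(d_real) + np.mean(d_fake) + lam * gp
```

With gradient norms exactly 1 the penalty term vanishes and the loss reduces to the plain Wasserstein critic loss.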
As a preferable scheme of the anomaly data reconstruction method based on the isolated forest countermeasure network: after the generator model and the discriminator model reach game equilibrium, saving the CGAN model and updating the loss function comprises: selecting the Wasserstein distance as the loss function of the discriminator, and introducing a gradient penalty mechanism so that the Lipschitz constraint is applied in the regions where real and generated samples concentrate and in the transition region between them; the Smooth L1 loss function offers stronger robustness and stability during optimization, effectively avoids gradient explosion, and improves the network convergence speed; the cosine similarity function is introduced to measure the difference between two vectors by computing the cosine of the angle between them, judging vector similarity with high accuracy and identifying the trend of the vector trajectories.
As a preferable scheme of the anomaly data reconstruction method based on the isolated forest countermeasure network, the invention comprises the following steps: the specific formula of the Smooth L1 loss function is as follows:
L_SL1 = 0.5·(t - G(z|c))^2,    if |t - G(z|c)| < 1
L_SL1 = |t - G(z|c)| - 0.5,    otherwise
where t is the true value of the data and G(z|c) is the sample generated by the generator.
The concrete formula of the cosine similarity function is as follows:
L_cos = ( Σ_{i=1..n} G_i·t_i ) / ( sqrt(Σ_{i=1..n} G_i^2) · sqrt(Σ_{i=1..n} t_i^2) )
where G_i is the i-th sample generated by the generator, t_i is the true value of the i-th sample, and n is the total number of samples.
The specific formula of the CGAN loss function is as follows:
L = L_CGAN + λ_1·L_SL1 + λ_2·L_cos
where λ_1 and λ_2 are the weighting coefficients of the corresponding loss functions, L_CGAN is the objective function under which the generator model and the discriminator model reach game equilibrium, L_SL1 is the Smooth L1 loss function, and L_cos is the cosine similarity function.
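A minimal numpy sketch of the Smooth L1 and cosine similarity terms and their weighted combination. This is illustrative only: the function names and the default values of λ_1 and λ_2 are assumptions, and the game-equilibrium objective is taken as a precomputed scalar `l_cgan`.

```python
import numpy as np

def smooth_l1(t, g):
    # Smooth L1: 0.5*d^2 where |d| < 1, |d| - 0.5 elsewhere (averaged)
    d = np.abs(t - g)
    return np.mean(np.where(d < 1.0, 0.5 * d ** 2, d - 0.5))

def cosine_similarity(g, t):
    # cosine of the angle between the generated-sample and true-value vectors
    return np.dot(g, t) / (np.linalg.norm(g) * np.linalg.norm(t))

def combined_loss(l_cgan, t, g, lam1=0.5, lam2=0.5):
    # L = L_CGAN + lam1 * L_SL1 + lam2 * L_cos (weights are placeholders)
    return l_cgan + lam1 * smooth_l1(t, g) + lam2 * cosine_similarity(g, t)
```

The quadratic branch of Smooth L1 keeps gradients small near zero error, which is what the text credits for avoiding gradient explosion.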
In a second aspect, an embodiment of the present invention provides an anomaly data reconstruction system based on an isolated forest countermeasure network, including: an isolated forest model training module for constructing and training the isolated forest model for anomaly detection and data identification; a grid search algorithm module for parameter optimization of the isolated forest model; a data normalization processing module for normalizing the original load data; a generator model module that generates realistic samples from the input missing data; a discriminator model module for judging whether a generated sample is real data or generated data; and a data filling module for inputting the missing data set into the generator model for filling.
In a third aspect, embodiments of the present invention provide a computer apparatus comprising a memory and a processor, the memory storing a computer program, wherein the computer program, when executed by the processor, implements the steps of the abnormal data reconstruction method based on an isolated forest countermeasure network according to the first aspect of the invention.
In a fourth aspect, embodiments of the present invention provide a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the abnormal data reconstruction method based on an isolated forest countermeasure network according to the first aspect of the invention.
The invention has the beneficial effects that: the proposed abnormal data reconstruction method based on an isolated forest countermeasure network efficiently identifies abnormal data in load measurements by constructing an isolated forest model; exploiting the strong learning capacity of neural networks, the generator is built from a GRU network, and cosine similarity and the Smooth L1 function are introduced to improve the loss function, addressing the slow convergence of the GAN model and the low accuracy of data reconstruction; experimental results show that, compared with traditional GAN, KNN and other missing-data filling models, the CGAN achieves the best reconstruction under different missing patterns and effectively improves the quality of the load data set, thereby further improving data filling precision. The method fully mines the nonlinear relation between the load and its influencing factors, improves the accuracy of data analysis and modeling, and benefits the operation and management of the power system.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art. Wherein:
Fig. 1 is a flowchart of the anomaly data reconstruction method based on an isolated forest countermeasure network of embodiment 1.
Fig. 2 is a schematic diagram of the isolated-forest detection results for abnormal load data values in the anomaly data reconstruction method based on an isolated forest countermeasure network of embodiment 1.
Fig. 3 is a schematic structural diagram of the CGAN model in the anomaly data reconstruction method based on an isolated forest countermeasure network of embodiment 1.
Fig. 4 is a schematic diagram of the network structure of the CGAN generator in the anomaly data reconstruction method based on an isolated forest countermeasure network of embodiment 2.
Fig. 5 is a schematic diagram of the network structure of the CGAN discriminator in the anomaly data reconstruction method based on an isolated forest countermeasure network of embodiment 2.
Fig. 6 is a schematic diagram of the isolated forest and CGAN based anomaly data reconstruction framework in the anomaly data reconstruction method based on an isolated forest countermeasure network of embodiment 2.
Fig. 7 is a schematic diagram of the filling result at a 40% missing rate in the continuous missing mode in the anomaly data reconstruction method based on an isolated forest countermeasure network of embodiment 2.
Fig. 8 is a schematic diagram of the filling result at a 40% missing rate in the random missing mode in the anomaly data reconstruction method based on an isolated forest countermeasure network of embodiment 2.
Detailed Description
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may also be practiced in ways other than those described herein; persons skilled in the art will readily appreciate that the present invention is not limited to the specific embodiments disclosed below.
Further, reference herein to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic can be included in at least one implementation of the invention. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments.
Example 1
Referring to figs. 1 to 3, a first embodiment of the present invention provides an anomaly data reconstruction method based on an isolated forest countermeasure network, comprising:
s1: and acquiring historical data of the power grid, constructing and training an isolated forest model, and carrying out parameter optimization on the isolated forest by adopting a grid search algorithm.
S1.1: the power grid historical data are obtained and mainly comprise data such as time factors (year, month, day, time, minute, day of week and holiday), climate factors (wind speed, air temperature, day maximum temperature value and day maximum temperature value), load factors (day average load, day maximum load and day minimum load) and the like.
S1.2: and constructing and training an isolated forest model.
S1.2.1: n sample subsets are randomly selected from the training set data and delivered to the root of the orphan tree.
S1.2.2: if the sample subset contains m features, randomly selecting one feature from the m features, and randomly generating a number between the maximum value and the minimum value of the feature as a division point g.
S1.2.3: after the feature dimension is selected, the isolated forest is formed into a plane randomly through the segmentation point g, and the data space formed by the sample subset is segmented into two subspaces according to a certain dimension direction.
S1.2.4: the two subspaces are iteratively segmented for a plurality of times to continuously form new data spaces until each data space only contains an abnormal index of one data point y or the space reaches a set depth.
Specifically, the specific formula of the anomaly index for data point y is as follows:
S(y, m) = 2^( -E(L(y)) / C(m) )
where S(y, m) is the anomaly score of data point y, E(L(y)) is the expected value of the path length L(y) of y over the multiple trees, and C(m) is the average path length of the isolated tree.
Wherein, the specific formula of the average path length C (m) of the isolated tree is as follows:
C(m) = 2·(ln(m - 1) + ζ) - 2·(m - 1)/m
where ζ is the Euler constant and m is the number of data points contained in the data space.
Specifically, when the anomaly score is far below 0.5, the data point can be regarded as normal; when the anomaly score approaches 1, the data point is regarded as abnormal.
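The two formulas above translate directly into a few lines of Python. This is an illustrative sketch, not part of the claimed method; the constant used for ζ is the standard Euler constant (≈ 0.5772).

```python
import math

EULER = 0.5772156649  # Euler's constant ζ

def avg_path_length(m):
    # C(m) = 2*(ln(m - 1) + ζ) - 2*(m - 1)/m, defined for m > 1
    return 2.0 * (math.log(m - 1) + EULER) - 2.0 * (m - 1) / m

def anomaly_score(expected_path_len, m):
    # S(y, m) = 2^(-E(L(y)) / C(m)); short expected paths push the score
    # toward 1 (anomalous), paths near C(m) give about 0.5 (normal)
    return 2.0 ** (-expected_path_len / avg_path_length(m))
```

When the expected path length equals the average path length C(m), the score is exactly 0.5, matching the interpretation in the paragraph above.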
S1.2.5: the above steps are repeatedly performed to generate a plurality of orphaned trees.
S1.3: and carrying out parameter optimization on the isolated forest by adopting a grid search algorithm.
Further, the number of isolated trees and the contamination (the assumed proportion of anomalies) are the main parameters affecting the performance of the isolated forest; the number of isolated trees is determined to be 100 using the grid search algorithm, the contamination parameter is set according to the estimated proportion of abnormal points, and the isolated forest model is built with these parameters.
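As a rough illustration of this step, the sketch below fits isolated forests over a small hypothetical parameter grid using scikit-learn (assumed available). The selection criterion here, the score gap around the most anomalous point, is a stand-in for illustration, not the patent's actual grid-search procedure.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(99, 2))
data = np.vstack([normal, [[10.0, 10.0]]])  # one planted outlier

best = None
for n_trees in (50, 100):                    # hypothetical grid values
    for contamination in (0.01, 0.05):
        model = IsolationForest(n_estimators=n_trees,
                                contamination=contamination,
                                random_state=0).fit(data)
        # score_samples: lower = more anomalous; use the gap between the
        # most anomalous point and the rest as a crude selection criterion
        scores = model.score_samples(data)
        gap = np.partition(scores, 1)[1] - scores.min()
        if best is None or gap > best[0]:
            best = (gap, model)

labels = best[1].predict(data)  # -1 marks anomalies
```

The planted outlier at (10, 10) ends up labeled -1; in the patent's setting the flagged load samples would then be deleted to form the missing data set.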
S2: and inputting the load data into the isolated forest model for identification so as to delete the abnormal data and obtain a missing data set.
Specifically, an isolated forest is an unsupervised anomaly detection algorithm that identifies outliers by how easily individual samples can be isolated from the rest of the data. By constructing randomly partitioned binary trees, the isolated forest separates normal samples from abnormal ones; abnormal samples tend to be isolated on shorter paths of the trees than normal samples.
Furthermore, the constructed isolated forest model can perform abnormal data identification on the load data set: for each data sample, the degree of abnormality is measured by computing its average path length. The smaller the average path length, the more easily the sample is isolated, i.e., the more likely it is an outlier; data samples judged abnormal against the abnormality threshold can then be deleted.
Further, as shown in fig. 2, circles represent normal load data, triangles represent missing load data, and squares represent abnormal load values detected by the isolated forest. As can be seen from fig. 2, the isolated forest effectively identifies both the abnormal data and the gaps caused by metering and communication faults; after the abnormal data are deleted, a load data set with missing values is obtained.
S3: and carrying out normalization processing on the original load data to eliminate the influence of the data dimension on model training, and dividing the complete and abnormal-free load data into a training set.
Specifically, considering the differences in magnitude among different types of data, the original load data set needs to be normalized. Common methods include Max-Min normalization and Z-Score normalization; the Max-Min method suits non-normally distributed data, and this embodiment uses Max-Min normalization of the original load data with the following formula:
a_n = (a - a_min) / (a_max - a_min)
where a_n is the normalization result of the data, a_max and a_min respectively represent the maximum and minimum values of the original data, and a is the initial data value.
Further, the normalized data is divided into a training set and a test set, wherein 10 months of data in total from 11 months in 2012 to 8 months in 2013 are the training set, and 4 months of data in total from 9 months in 2013 to 12 months in 2013 are the test set.
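The Max-Min normalization used above amounts to a one-line numpy transformation (an illustrative sketch; the function name is our own):

```python
import numpy as np

def max_min_normalize(a):
    # a_n = (a - a_min) / (a_max - a_min): maps the data into [0, 1]
    a = np.asarray(a, dtype=float)
    return (a - a.min()) / (a.max() - a.min())
```

The training/test split described above is then a simple chronological slice of the normalized array.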
S4: and constructing a generator model and a discriminator model based on the CGAN, carrying out weight update on the generator model and the discriminator model, and after the generator model and the discriminator model are in balance in game, storing the CGAN model and updating the loss function.
Specifically, in actual training, the CGAN converges slowly when processing high-dimensional data and is prone to problems such as underfitting and vanishing gradients. Compared with a fully connected network, a convolutional neural network has a simple structure and strong feature-extraction capability and can process high-dimensional data sets efficiently, so this embodiment uses a CNN to construct the discriminator. To better handle time-series information and improve the convergence speed of the generator, the generator is constructed with a GRU network.
S4.1: a generator model based on CGAN is constructed.
S4.1.1: the generator model is composed of three layers of GRU networks and one layer of fully connected network.
S4.1.2: reshaping the training set data into a three-dimensional matrix of (batch size, time step, data dimension) and inputting it into the GRU network.
S4.1.3: after the three-dimensional input matrix is operated by 3 layers of GRU networks, the matrices G1, G2 and G3 are sequentially obtained, and the neuron numbers of the 3 layers of GRU networks are sequentially set to 128, 64 and 64.
S4.1.4: the matrix G3 is input into the full connection layer to output a generated sample, and the generated sample is sent to the discriminator.
S4.2: and constructing a discriminator model based on the CGAN.
S4.2.1: the discriminator consists of a 3-layer CNN and a 1-layer fully connected network.
S4.2.2: the first 2 CNN layers use 32 and 64 5×5 convolution kernels respectively with a stride of 2; after feature extraction by these 2 CNN layers, the input matrix yields the C1 and C2 convolution layers in turn. The 3rd CNN layer uses 16 3×3 convolution kernels with a stride of 1.
S4.3: after the game between the generator model and the discriminator model reaches equilibrium, the CGAN model is saved and the loss function is updated.
Specifically, the CGAN applies supervised learning to the GAN while retaining the game structure of the GAN; as shown in fig. 3, a condition value is added to the inputs of the generator and the discriminator. The random noise z and the condition c are combined and input into the generator, which outputs the generated sample G(z|c). The inputs of the discriminator are the combination of the load data true value t with the condition c and the combination of the generated sample G(z|c) with the condition c; the discriminator needs to judge whether the distributions of the generated sample and the real sample are similar and whether the generated sample satisfies the condition c. The generator and the discriminator update the network parameters, the loss function and the objective function according to the discrimination result.
Further, the specific formulas of the loss functions of the generator and the discriminator in the CGAN are as follows:
L_G = -E_(z,c)[D(G(z|c)|c)]

L_D = -E_(t,c)[D(t|c)] + E_(z,c)[D(G(z|c)|c)]
wherein E denotes the expectation over the corresponding distribution, G(·) denotes a sample generated by the generator, D(·) denotes the discriminator's judgment of the authenticity of an input sample, D(G(z|c)|c) is the discriminator's judgment of the generated sample G(z|c) under the condition c, D(t|c) is the discriminator's judgment of the real sample t under the condition c, G(z|c) is the sample generated by the generator, E_(z,c) is the expectation over the noise z and the condition c, and E_(t,c) is the expectation over the real sample t and the condition c.
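Given the critic's outputs on a batch, the two losses above reduce to simple means. A minimal numpy sketch (function names are ours; d_real and d_fake stand for D(t|c) and D(G(z|c)|c) evaluated on a batch):

```python
import numpy as np

def generator_loss(d_fake):
    # L_G = -E[D(G(z|c)|c)]: the generator wants high critic scores
    # on its generated samples.
    return -np.mean(d_fake)

def discriminator_loss(d_real, d_fake):
    # L_D = -E[D(t|c)] + E[D(G(z|c)|c)]: the discriminator wants to
    # score real samples high and generated samples low.
    return -np.mean(d_real) + np.mean(d_fake)
```

Note these are Wasserstein-style (unbounded critic) losses, matching the formulas above rather than the log-loss of the original GAN.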
Specifically, the generator increases the authenticity of the generated sample G(z|c) through continuous iteration, while the discriminator tries to lower its score for generated data and raise the accuracy of its judgment of real samples, so the CGAN gradually reaches equilibrium in the game between the two. The objective function can be defined as:

min_G max_D V(D, G) = E_(t,c)[D(t|c)] - E_(z,c)[D(G(z|c)|c)]

wherein E_(t,c) is the expectation over the real sample t and the condition c, and the remaining symbols are as defined above.
Furthermore, following the WGAN-GP model, a gradient penalty mechanism is introduced, LeakyReLU is selected as the activation function, and the fully connected layer outputs the discrimination result for the input matrix.
Specifically, the Wasserstein distance is selected as the loss function of the discriminator, and a gradient penalty mechanism is introduced so that the Lipschitz constraint is applied to the regions where real and fake samples concentrate and to their cross-transition region. The objective function of the CGAN can then be redefined as:

min_G max_D V(D, G) = E_(t,c)[D(t|c)] - E_(z,c)[D(G(z|c)|c)] - λ·E_x̂[(||∇_x̂ D(x̂|c)||_2 - 1)^2]

wherein λ is the gradient penalty coefficient, ∇_x̂ D(x̂|c) is the gradient of the discriminator function at the interpolated sample x̂, E_(t,c) is the expectation over the real sample t and the condition c, and D(G(z|c)|c) is the discriminator's judgment of the generated sample G(z|c) under the condition c.
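The penalty term can be sketched in isolation: given the gradient norms ||∇_x̂ D(x̂|c)||_2 measured at interpolated samples, it is λ times the mean squared deviation from 1. A minimal illustration (the default λ = 10 is the common WGAN-GP choice, not a value from this patent):

```python
import numpy as np

def gradient_penalty(grad_norms, lam=10.0):
    # lam * E[(||grad D(x_hat)||_2 - 1)^2]: pushes the critic's gradient
    # norm toward 1 on points between real and generated samples,
    # enforcing the Lipschitz constraint softly.
    grad_norms = np.asarray(grad_norms, dtype=float)
    return lam * np.mean((grad_norms - 1.0) ** 2)
```

When every gradient norm is exactly 1 the penalty vanishes, so the critic is only punished where it violates the 1-Lipschitz condition.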
Furthermore, the Smooth L1 loss function is more robust and stable in the solving process, effectively avoids gradient explosion, and improves the network convergence speed; its specific formula is as follows:

L_SL1 = 0.5·(t - G(z|c))^2, if |t - G(z|c)| < 1; |t - G(z|c)| - 0.5, otherwise

where t is the true value of the data and G(z|c) is the sample generated by the generator.
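A minimal numpy sketch of the Smooth L1 loss averaged over a batch (the function name is ours):

```python
import numpy as np

def smooth_l1(t, g):
    # 0.5 * e^2 when |e| < 1 (smooth, stable gradients near zero);
    # |e| - 0.5 otherwise (bounded gradients, robust to outliers).
    e = np.abs(np.asarray(t, dtype=float) - np.asarray(g, dtype=float))
    return float(np.mean(np.where(e < 1.0, 0.5 * e ** 2, e - 0.5)))
```

The quadratic branch gives fine-grained gradients for small errors while the linear branch caps the gradient magnitude for large errors, which is the gradient-explosion protection mentioned above.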
Specifically, a cosine similarity function is introduced to measure the difference between two vectors by computing the cosine of the angle between them; cosine similarity distinguishes vector similarity with high accuracy and captures the trend of a vector's trajectory. Its specific formula is:

L_cos = (Σ_{i=1}^{n} G_i·t_i) / (sqrt(Σ_{i=1}^{n} G_i^2)·sqrt(Σ_{i=1}^{n} t_i^2))

wherein G_i is the i-th sample generated by the generator, t_i is the true value of the i-th sample, and n is the total number of samples.
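The cosine similarity above can be sketched directly in numpy (the function name is ours):

```python
import numpy as np

def cosine_similarity(g, t):
    # cos(theta) = <g, t> / (||g||_2 * ||t||_2); 1 means the generated
    # and true load curves point in the same direction, so a training
    # loss can penalize e.g. (1 - cosine_similarity).
    g = np.asarray(g, dtype=float)
    t = np.asarray(t, dtype=float)
    return float(np.dot(g, t) / (np.linalg.norm(g) * np.linalg.norm(t)))
```

Because it is scale-invariant, it rewards matching the shape of the load trajectory even when the magnitudes differ slightly.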
Furthermore, to improve the network convergence speed while improving the authenticity of the generated samples, this embodiment introduces cosine similarity and the Smooth L1 function; the specific formula of the CGAN loss function is:

L = V(D, G) + λ_1·L_SL1 + λ_2·L_cos

wherein λ_1 and λ_2 are the weighting coefficients of the corresponding loss functions, V(D, G) is the objective function under which the game between the generator model and the discriminator model reaches equilibrium, L_SL1 is the Smooth L1 loss function, and L_cos is the cosine similarity function.
S5: the missing data set is input into the saved generator model, the generated sample is output, and the missing data is filled.
Specifically, when training the CGAN model, the generative model and the discriminative model are usually trained alternately. When training the generative model, its weights are set subject to three constraints: the deviation between the predicted data generated by the generative model and the real data, the judgment result of the discriminative model, and the feature-vector deviation. When training the discriminative model, the condition data and the predicted data generated by the generative model are input into the discriminative model, which must judge the probability that the input data is the real load data to be predicted; the parameters of the discriminative model are updated according to the judgment deviation. An Adam optimizer and stochastic gradient descent are applied during training; after each update of the discriminative model, the generative model is updated, and the two steps are repeated until the model reaches equilibrium.
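The alternation described above can be sketched as a skeleton loop. This is structure only: the two update callbacks stand in for the real Adam/SGD parameter updates, and a fixed step budget stands in for the equilibrium check.

```python
def train_cgan(steps, update_discriminator, update_generator):
    # Alternate: one discriminator update, then one generator update,
    # repeated for a fixed number of steps (a stand-in for detecting
    # game equilibrium). Returns the per-step loss pairs.
    history = []
    for _ in range(steps):
        d_loss = update_discriminator()
        g_loss = update_generator()
        history.append((d_loss, g_loss))
    return history
```

In the real model each callback would run a forward pass, compute the losses defined earlier, and apply an Adam step; the skeleton only fixes the discriminator-then-generator ordering.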
Further, to evaluate the reconstruction accuracy of different models more concretely, the reconstruction accuracy and the R-square are selected as evaluation indexes of the data reconstruction; the closer the reconstruction accuracy and the R-square are to 1, the more faithful the model's reconstruction. The R-square is computed as:

R^2 = 1 - Σ(z_t - z_g)^2 / Σ(z_t - z_m)^2

wherein R^2 is the R-square index of the data reconstruction, E_acc is the reconstruction accuracy index of the data, z_g represents the generated load value, z_t represents the true load value, and z_m represents the average of the missing data.
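Both indexes can be sketched in numpy. The R-square follows the standard definition with the mean of the true missing values as baseline; the patent does not give the exact formula for E_acc, so the MAPE-based accuracy below is an assumption for illustration only.

```python
import numpy as np

def r_square(z_t, z_g):
    # R^2 = 1 - SS_res / SS_tot over the reconstructed points, with the
    # mean z_m of the true missing values as the baseline predictor.
    z_t = np.asarray(z_t, dtype=float)
    z_g = np.asarray(z_g, dtype=float)
    z_m = z_t.mean()
    return 1.0 - np.sum((z_t - z_g) ** 2) / np.sum((z_t - z_m) ** 2)

def reconstruction_accuracy(z_t, z_g):
    # Assumed form: 1 minus the mean absolute percentage error.
    z_t = np.asarray(z_t, dtype=float)
    z_g = np.asarray(z_g, dtype=float)
    return 1.0 - float(np.mean(np.abs(z_g - z_t) / np.abs(z_t)))
```

Both indexes equal 1 for a perfect reconstruction and fall as the filled values drift from the true loads.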
Further, the embodiment also provides an abnormal data reconstruction system based on the isolated forest countermeasure network, which comprises: the isolated forest model training module is used for constructing and training an isolated forest model for anomaly detection and data identification; the grid search algorithm module is used for carrying out parameter optimization on the isolated forest model; the data normalization processing module is used for performing normalization processing on the original load data; the generator model module generates a sample with authenticity according to the input missing data; the discriminator model module is used for discriminating whether the generated sample is real data or generated data; and the data filling module is used for inputting the missing data set into the generator model for filling.
The embodiment also provides a computer device, which is suitable for the situation of an abnormal data reconstruction method based on an isolated forest countermeasure network, and comprises a memory and a processor; the memory is configured to store computer executable instructions, and the processor is configured to execute the computer executable instructions to implement the anomaly data reconstruction method based on the orphan forest to combat network as set forth in the above embodiment.
The computer device may be a terminal comprising a processor, a memory, a communication interface, a display screen and input means connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless mode can be realized through WIFI, an operator network, NFC (near field communication) or other technologies. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.
The present embodiment also provides a storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of: acquiring historical data of a power grid, constructing and training an isolated forest model, and carrying out parameter optimization on the isolated forest by adopting a grid search algorithm; inputting the load data into an isolated forest model for identification so as to delete abnormal data and obtain a missing data set; normalizing the original load data to eliminate the influence of the data dimension on model training, and dividing the complete abnormal-free load data into a training set; constructing a generator model and a discriminator model based on the CGAN, carrying out weight update on the generator model and the discriminator model, and after the generator model and the discriminator model are in balance in game, storing the CGAN model and updating the loss function; the missing data set is input into the saved generator model, the generated sample is output, and the missing data is filled.
In summary, the invention provides an abnormal data reconstruction method based on an isolated forest countermeasure network, which can efficiently identify abnormal data in a load by constructing an isolated forest model. In view of the strong learning capacity of neural networks, a GRU network is adopted to construct the generator, and cosine similarity and a Smooth L1 function are introduced to improve the loss function, addressing the slow convergence of the GAN model and the low accuracy of data reconstruction. Experimental results show that, compared with the traditional GAN, KNN and other missing-data filling models, the CGAN achieves the best reconstruction under different missing modes and can effectively improve the quality of the load data set, thereby further improving data-filling precision. The method fully mines the nonlinear relation between the load and its influencing factors, improves the accuracy of data analysis and modeling, and benefits the operation and management of the power system.
Example 2
Referring to fig. 4 to 8, in order to verify the beneficial effects of the present invention, a second embodiment of the present invention provides an anomaly data reconstruction method based on an isolated forest countermeasure network, and scientific demonstration is performed through economic benefit calculation and simulation experiments.
Specifically, the generator is composed of a three-layer GRU network and a one-layer fully connected network, as shown in FIG. 4, the training set data is remodelled into a three-dimensional matrix (batch size, time step, data dimension) and input into the GRU network, wherein the batch size is set to 64, the time step is set to 20, and the data dimension is set to 15 because the load information contains the load and 14 characteristic factors. After the three-dimensional input matrix is subjected to 3-layer GRU network operation, matrixes G1, G2 and G3 are sequentially obtained, the number of neurons of the 3-layer GRU network is sequentially set to 128, 64 and 64, finally, the matrix G3 is input into a full-connection layer to output a generated sample, the generated sample is divided, each group contains 15 data, and the generated sample is sent to a discriminator. The remaining parameters for the GRU network training are set as follows: the optimizer was set to Adam, the learning rate was 0.0001, the loss function was set to MSE, and the number of iterations was set to 200, as shown in table 1:
TABLE 1 GRU network training parameters
Further, as shown in fig. 5, in the discriminator structure of the CGAN model, the condition value and the real sample are combined into a first matrix, and the condition value and the generated sample output by the generator are combined into a second matrix; both matrices are of order 15×15 and are input into the discriminator, which must extract and identify the features of the input matrices. The discriminator consists of 3 CNN layers and 1 fully connected layer; the first 2 CNN layers use 32 and 64 5×5 convolution kernels respectively with a stride of 2, after which the input matrix yields the C1 and C2 convolution layers in turn, and the 3rd CNN layer uses 16 3×3 convolution kernels with a stride of 1. Unlike the traditional GAN, since the WGAN-GP model introduces a gradient penalty mechanism, no regularization is performed between the convolution layers in the discriminator; LeakyReLU is selected as the activation function, and the fully connected layer outputs the discrimination result for the input matrix. The specific network parameters are shown in table 2:
Table 2 WGAN-GP model network parameters
Specifically, fig. 6 shows the abnormal-data reconstruction scheme of the isolated forest and the CGAN: the samples generated by the generator and the real loads are each horizontally concatenated with the condition values and input into the discriminator for true/false judgment. During the game between the generator and the discriminator, the nonlinear relation between the load and the influencing factors is mined; after the game reaches equilibrium, the generator is saved, the obtained missing data set is input into the generator, and the missing load data is filled.
Further, to verify the effectiveness of the filling model provided herein, load missing data sets with missing rates of 10%, 20%, 40% and 60% are randomly generated from the test set, fully considering two data-missing modes caused by different factors: random missing (random anomalies of the acquisition equipment causing single sampling points to be missing) and continuous missing (acquisition equipment faults causing multiple consecutive sampling points to be missing). The training set data is fed into the CGAN model for training; after training, the test set missing data is input into the generator model, the generated samples are output, and the missing load data is filled.
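The two missing modes can be sketched as boolean masks over a load series; a minimal numpy illustration (function names and the fixed seed are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_missing_mask(n, rate):
    # Random missing: each sampling point is dropped independently,
    # modeling sporadic acquisition-equipment anomalies.
    return rng.random(n) < rate

def continuous_missing_mask(n, rate):
    # Continuous missing: one contiguous window covering `rate` of the
    # series, modeling an equipment fault over consecutive samples.
    length = int(n * rate)
    start = int(rng.integers(0, n - length + 1))
    mask = np.zeros(n, dtype=bool)
    mask[start:start + length] = True
    return mask
```

Applying either mask to the test set yields the missing data set that is fed to the saved generator for filling.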
Furthermore, filling accuracy and R-square are selected as evaluation indexes of the missing-data filling, and a GAN model, a KNN model and a mean-interpolation method are established to interpolate the missing data; the filling evaluation indexes are compared and analyzed to verify the effectiveness of the method herein. The loss-function weights λ, λ_1 and λ_2 of the CGAN are 40, 10 and 0.6 in turn; the KNN weight is set to distance and the k value to 8. The discriminator parameters of the GAN network are the same as those of the CGAN; its generator is constructed with CNNs, consisting of three CNN layers and one fully connected layer, whose 2D convolution-kernel parameters are the same as those of the discriminator, except that the activation functions of the three CNN layers are all ReLU and each CNN layer is regularized, with regularization parameters 32, 64 and 16 in turn.
Specifically, as shown in fig. 7, the hatched graph in the figure is the load true value, the purple line is the CGAN filling value, the orange line is the GAN filling value, the blue line is the mean value filling value, and the black line is the KNN filling value. According to the graph, aiming at the missing data in the continuous missing mode, the CGAN model can fully mine the nonlinear relation between the load and the influence factors, reconstruct the missing data more accurately, the generated load sample is closest to the true value, and the filling accuracy is superior to that of the traditional model.
Further, as shown in fig. 8, the filling result with the missing rate of 40% in the random missing mode is specifically that the shadow graph is a true load value, and the purple line, the orange line, the blue line and the black line are respectively CGAN, GAN, the mean value method and the filling value of KNN.
It should be noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that the technical solution of the present invention may be modified or substituted without departing from the spirit and scope of the technical solution of the present invention, which is intended to be covered in the scope of the claims of the present invention.
Claims (10)
1. An anomaly data reconstruction method based on an isolated forest countermeasure network is characterized by comprising the following steps of: comprising the steps of (a) a step of,
acquiring historical data of a power grid, constructing and training an isolated forest model, and carrying out parameter optimization on the isolated forest by adopting a grid search algorithm;
inputting the load data into an isolated forest model for identification so as to delete abnormal data and obtain a missing data set;
normalizing the original load data to eliminate the influence of the data dimension on model training, and dividing the complete abnormal-free load data into a training set;
constructing a generator model and a discriminator model based on the CGAN, carrying out weight update on the generator model and the discriminator model, and after the generator model and the discriminator model are in balance in game, storing the CGAN model and updating the loss function;
the missing data set is input into the saved generator model, the generated sample is output, and the missing data is filled.
2. The anomaly data reconstruction method based on the isolated forest countermeasure network of claim 1, wherein: the construction and training of the isolated forest model comprises the following steps:
randomly selecting n sample subsets from the training set data, and conveying the sample subsets to the root of the isolated tree;
if the sample subset contains m features, randomly selecting one feature from the m features, and randomly generating a number between the maximum value and the minimum value of the feature as a division point g;
after the feature dimension is selected, the isolated forest randomly forms a partition plane through the segmentation point g, dividing the data space formed by the sample subset into two subspaces along that dimension;
repeatedly dividing the two subspaces to continuously form new data spaces until each data space contains only one data point y or the space reaches a set depth, and computing the abnormality index of each data point y;
iteratively performing the steps to generate a plurality of orphaned trees;
the specific formula of the abnormality index of the data point y is as follows:
S(y, m) = 2^(-E(L(y))/C(m)), where S(y, m) is the anomaly score of the data point y, E(L(y)) is the expected value of the path length L(y) of y over the multiple trees, and C(m) is the average path length of the isolated tree;
wherein, the specific formula of the average path length C (m) of the isolated tree is as follows:
C(m) = 2(ln(m - 1) + ξ) - 2(m - 1)/m, where ξ is the Euler constant and m is the number of data points contained in the data space.
3. The anomaly data reconstruction method based on the isolated forest countermeasure network of claim 1, wherein: the specific formula for carrying out normalization processing on the original load data is as follows:
a_n = (a - a_min)/(a_max - a_min), wherein a_n is the normalized result of the data, a_max and a_min respectively represent the maximum and minimum values of the data, and a is the initial data value.
4. The anomaly data reconstruction method based on the isolated forest countermeasure network of claim 1, wherein: the construction of the generator model and the discriminator model based on the CGAN comprises the following steps:
the CGAN retains the game structure of the GAN, applies supervised learning to the GAN, and adds a condition value to the inputs of the generator and the discriminator;
the generator model consists of three layers of GRU networks and one layer of fully connected network;
reshaping the training set data into a three-dimensional matrix of (batch size, time step, data dimension) and inputting it into the GRU network;
after the three-dimensional input matrix is operated by 3 layers of GRU networks, matrixes G1, G2 and G3 are sequentially obtained, and the neuron numbers of the 3 layers of GRU networks are sequentially set to 128, 64 and 64;
combining and inputting the random noise z and the condition c into a generator, outputting a generated sample G (z|c) by the generator, and sending the generated sample to a discriminator;
the discriminator consists of a 3-layer CNN and a 1-layer fully connected network;
the first 2 CNN layers use 32 and 64 5×5 convolution kernels with a stride of 2; after feature extraction by these 2 CNN layers, the input matrix yields the C1 and C2 convolution layers in turn; the 3rd CNN layer uses 16 3×3 convolution kernels with a stride of 1, and LeakyReLU is selected as the activation function;
the input of the discriminator is the combination of the load data true value t and the condition c and the combination of the generated sample G (z|c) and the condition c, and the discriminator needs to discriminate whether the distribution between the generated sample and the real sample is similar or not and whether the generated sample meets the condition c or not;
the generator and the arbiter update the network parameters, the loss function and the objective function according to the discrimination result.
5. The abnormal data reconstruction method based on the isolated forest countermeasure network of claim 4, wherein: the specific formula of the loss function is as follows:
L_G = -E_(z,c)[D(G(z|c)|c)]

L_D = -E_(t,c)[D(t|c)] + E_(z,c)[D(G(z|c)|c)]
wherein E denotes the expectation over the corresponding distribution, G(·) denotes a sample generated by the generator, D(·) denotes the discriminator's judgment of the authenticity of an input sample, D(G(z|c)|c) is the discriminator's judgment of the generated sample G(z|c) under the condition c, D(t|c) is the discriminator's judgment of the real sample t under the condition c, G(z|c) is the sample generated by the generator, and E_(t,c) is the expectation over the real sample t and the condition c;
the specific formula of the CGAN objective function is as follows:

min_G max_D V(D, G) = E_(t,c)[D(t|c)] - E_(z,c)[D(G(z|c)|c)] - λ·E_x̂[(||∇_x̂ D(x̂|c)||_2 - 1)^2]

wherein λ is the gradient penalty coefficient, E denotes the expectation over the corresponding distribution, ∇_x̂ D(x̂|c) is the gradient of the discriminator function, E_(t,c) is the expectation over the real sample t and the condition c, and D(G(z|c)|c) is the discriminator's judgment of the generated sample G(z|c) under the condition c.
6. The abnormal data reconstruction method based on the isolated forest countermeasure network of claim 1, wherein: after the game between the generator model and the discriminator model reaches equilibrium, saving the CGAN model and updating the loss function comprises the following steps:
selecting Wasserstein distance as a loss function of the discriminator, and introducing a gradient penalty mechanism to realize that a real and false sample concentration region and a cross transition region thereof apply Lipschitz constraint;
the Smooth L1 loss function has stronger robustness and stability in the solving process, gradient explosion is effectively avoided, and the network convergence speed is improved;
the cosine similarity function is introduced to measure the difference of the two vectors by calculating the cosine value of the included angle between the vectors, the cosine similarity has higher accuracy on judging the similarity of the vectors, and the track change trend of the vectors is identified.
7. The abnormal data reconstruction method based on the isolated forest countermeasure network of claim 6, wherein: the specific formula of the Smooth L1 loss function is as follows:
L_SL1 = 0.5·(t - G(z|c))^2 if |t - G(z|c)| < 1, and |t - G(z|c)| - 0.5 otherwise, where t is the true value of the data and G(z|c) is the sample generated by the generator;
the specific formula of the cosine similarity function is as follows:

L_cos = (Σ_{i=1}^{n} G_i·t_i) / (sqrt(Σ G_i^2)·sqrt(Σ t_i^2))

wherein G_i is the i-th sample generated by the generator, t_i is the true value of the i-th sample, and n is the total number of samples;
the specific formula of the CGAN loss function is as follows:

L = V(D, G) + λ_1·L_SL1 + λ_2·L_cos

wherein λ_1 and λ_2 are the weighting coefficients of the corresponding loss functions, V(D, G) is the objective function under which the game between the generator model and the discriminator model reaches equilibrium, L_SL1 is the Smooth L1 loss function, and L_cos is the cosine similarity function.
8. An anomaly data reconstruction system based on an isolated forest countermeasure network, which is based on the anomaly data reconstruction method based on the isolated forest countermeasure network according to any one of claims 1 to 7, and is characterized in that: comprising the steps of (a) a step of,
the isolated forest model training module is used for constructing and training an isolated forest model, and parameter optimization is carried out on the isolated forest model by using the grid search algorithm module;
the abnormal data detection module is used for detecting the abnormality and identifying the data by using the isolated forest model;
the data normalization processing module is used for performing normalization processing on the original load data;
the generator model module generates a sample with authenticity according to the input missing data;
the discriminator model module is used for discriminating whether the generated sample is real data or generated data;
the CGAN model training module is used for training the CGAN model and storing the generator after the generator and the arbiter game reach balance;
and the data filling module is used for inputting the missing data set into the generator model for filling.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that: the processor, when executing the computer program, implements the steps of the abnormal data reconstruction method based on the isolated forest countermeasure network according to any one of claims 1 to 7.
10. A computer-readable storage medium having stored thereon a computer program, characterized by: the computer program, when executed by a processor, implements the steps of the abnormal data reconstruction method based on the isolated forest countermeasure network according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311275888.5A CN117520954A (en) | 2023-09-28 | 2023-09-28 | Abnormal data reconstruction method and system based on isolated forest countermeasure network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117520954A true CN117520954A (en) | 2024-02-06 |
Family
ID=89746406
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311275888.5A Pending CN117520954A (en) | 2023-09-28 | 2023-09-28 | Abnormal data reconstruction method and system based on isolated forest countermeasure network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117520954A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117854716A (en) * | 2024-03-08 | 2024-04-09 | 长春师凯科技产业有限责任公司 | Method and system for filling heart disease diagnosis missing data based on HF-GAN |
CN118094454A (en) * | 2024-04-29 | 2024-05-28 | 国网山东省电力公司嘉祥县供电公司 | Power distribution network load data anomaly detection method and system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Lobato et al. | Multi-objective genetic algorithm for missing data imputation | |
CN109960737B (en) | Remote sensing image content retrieval method for semi-supervised depth confrontation self-coding Hash learning | |
CN105488528B (en) | Neural network image classification method based on improving expert inquiry method | |
CN112819207B (en) | Geological disaster space prediction method, system and storage medium based on similarity measurement | |
CN113762486B (en) | Method and device for constructing fault diagnosis model of converter valve and computer equipment | |
CN111639783A (en) | Line loss prediction method and system based on LSTM neural network | |
CN117520954A (en) | Abnormal data reconstruction method and system based on isolated forest countermeasure network | |
CN114548591B (en) | Sequential data prediction method and system based on mixed deep learning model and Stacking | |
CN111310918B (en) | Data processing method, device, computer equipment and storage medium | |
CN115496144A (en) | Power distribution network operation scene determining method and device, computer equipment and storage medium | |
CN115033591A (en) | Intelligent detection method and system for electricity charge data abnormity, storage medium and computer equipment | |
Sineva et al. | An integrated approach to the regression problem in forest fires detection | |
CN113095501A (en) | Deep reinforcement learning-based unbalanced classification decision tree generation method | |
CN117010691A (en) | Transmission tower safety evaluation method and device, computer equipment and storage medium | |
Guo et al. | Data mining and application of ship impact spectrum acceleration based on PNN neural network | |
CN116306277A (en) | Landslide displacement prediction method and device and related components | |
CN115577259A (en) | Fault pole selection method and device for high-voltage direct-current transmission system and computer equipment | |
CN113011893B (en) | Data processing method, device, computer equipment and storage medium | |
CN114881158A (en) | Defect value filling method and device based on random forest and computer equipment | |
CN115131646A (en) | Deep network model compression method based on discrete coefficient | |
CN114048837A (en) | Deep neural network model reinforcement method based on distributed brain-like map | |
CN114358186A (en) | Data processing method and device and computer readable storage medium | |
CN111738370A (en) | Image feature fusion and clustering collaborative expression method and system of intrinsic manifold structure | |
CN111563767A (en) | Stock price prediction method and device | |
Nong | Construction and Simulation of Financial Risk Prediction Model Based on LSTM |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||