CN108776683B

CN108776683B - Electric power operation and maintenance data cleaning method based on isolated forest algorithm and neural network

Info

Publication number: CN108776683B
Application number: CN201810559071.3A
Authority: CN
Inventors: 李星南; 曾瑛; 蔡毅; 李伟坚; 施展; 亢中苗
Original assignee: Guangdong Power Grid Co Ltd; Electric Power Dispatch Control Center of Guangdong Power Grid Co Ltd
Current assignee: Guangdong Power Grid Co Ltd; Electric Power Dispatch Control Center of Guangdong Power Grid Co Ltd
Priority date: 2018-06-01
Filing date: 2018-06-01
Publication date: 2022-01-21
Anticipated expiration: 2038-06-01
Also published as: CN108776683A

Abstract

The invention provides a method for cleaning power communication operation and maintenance data, and more particularly relates to a method for cleaning power communication operation and maintenance data based on an isolated forest algorithm and a neural network, which comprises the following steps: firstly, constructing an isolated forest model iForest for solving a target problem by utilizing an improved isolated forest algorithm; then defining an evaluation system of the isolated forest algorithm to the abnormal data; and predicting and correcting the abnormal data attribute detected by the isolated forest by training a BP neural network. The method is optimized for the electric power communication operation and maintenance data cleaning method based on the isolated forest algorithm and the neural network, improves the abnormal detection accuracy and reduces the data correction error, and effectively optimizes the electric power operation and maintenance data cleaning program in the aspects of abnormal data positioning accuracy, data correction accuracy, training time, resource occupation and the like.

Description

Electric power operation and maintenance data cleaning method based on isolated forest algorithm and neural network

Technical Field

The invention provides a method for cleaning power communication operation and maintenance data, and particularly relates to a method for cleaning power communication operation and maintenance data based on an isolated forest algorithm and a neural network.

Background

With the rapid development of power communication networks, the quantity of power operation and maintenance data is larger and larger, and the requirement of a power department on data reliability is higher and higher. In the process of transmitting and storing the electric power operation and maintenance data, the problems of bad data such as noise, data loss, data error and the like can not be avoided under the influence of external interference, transmission error and the like; the power data contains multidimensional attributes and is acquired by different devices respectively, and challenges are provided for abnormal detection of the data. The traditional data correction methods such as mean value calculation, regression analysis and the like cannot accurately learn the characteristics and rules of the whole data set, and particularly under the condition of high data dimensionality, the data correction error is large. At present, data cleaning mainly comprises consistency inspection, error value, missing value, invalid value processing and other mechanisms, and an artificial neural network algorithm can be adopted to improve data quality. Patent 201610370415.7 discloses a data cleansing method for RFID data, which filters data with coding errors by a hardware EPC (Electronic product code) filter, thereby achieving cleansing of the repeated data. However, the method does not correct missing values and invalid values, and is not suitable for processing large-scale electric power operation and maintenance data with complex attributes due to limited hardware processing capacity; the patent 201510129479.3 performs data cleaning based on an ETL mechanism in a data warehouse, and has the advantages of large cleaning range and high algorithm execution efficiency. However, the power operation and maintenance data contain multidimensional attributes, and the data volume and scale are large and the attributes are complex, so that the scheme still has defects in the aspects of cleaning precision, data quality and the like. The efficient data cleaning method is selected to provide important support for analysis and mining of the power operation and maintenance data, and the method has important significance for improving the comprehensive benefits of the power operation and maintenance.

Disclosure of Invention

The invention provides the electric power operation and maintenance data cleaning method based on the isolated forest algorithm and the neural network for overcoming at least one defect in the prior art, improves the branching step of the isolated forest algorithm, improves the efficiency and the accuracy of the isolated forest model, enables the learning rate to be adaptively adjusted along with the change trend of the network, and improves the performance of the BP neural network. The method is effectively optimized in the aspects of abnormal data positioning accuracy, data correction accuracy, training time, resource occupation and the like.

In order to solve the technical problems, the technical scheme of the invention is as follows:

a power operation and maintenance data cleaning method based on an isolated forest algorithm and a neural network is characterized by comprising the following steps:

s1, constructing an isolated forest model iForest for solving a target problem by using an improved isolated forest algorithm;

s2, defining an evaluation system of the isolated forest algorithm to the abnormal data;

and S3, training a learning rate self-adaptive BP neural network to predict and correct the abnormal data attribute detected by the isolated forest.

Preferably, the step S1 specifically includes the following steps:

s11, at the beginning of the method, firstly, grouping the attributes;

s12, randomly selecting psi sample data points from the training data set as a sub-sampling set, constructing an initial iTree, and putting the sub-sampling set into a root node of the tree; psi is the number of randomly selected sample data points;

s13, randomly appointing an attribute group of the data item, and selecting a division cutting point in the current node data;

s14, generating a hyperplane by the cutting point, dividing the data space of the current node into two subspaces, and dividing the data items;

s15, recursively constructing new child nodes until only one data item in the child nodes (unable to continue cutting) or the iTree has reached the initially defined limit height.

Preferably, the step S2 specifically includes:

s21, selecting test data x, and substituting the test data x into each iTree in the forest; x represents test data;

s22, calculating the depth h (x) of each tree and calculating the average value E (h (x)) of all h (x); where h (x) represents the depth at which the test data point falls on each tree; e (h (x)) represents the average of all h (x);

s23, setting the standard average search length c (ψ) according to equation (1):

c (psi) ═ 2H (psi-1) - (2 (psi-1)/psi) formula (1)

Wherein H (i) is calculated according to equation (2):

h (i) ═ ln (i) + Ec formula (2)

Ec is an euler constant with a value of 0.5772; c (ψ) represents the standard average search length of iTree;

s24, defining the abnormal score S (x, psi) of the data to be measured according to the formula (3):

s (x, ψ) represents the abnormality score of the data to be measured, and the closer the abnormality score value is to 1, the greater the possibility that the data is abnormal data is indicated.

Preferably, the step S3 specifically includes:

s31, randomly selecting a small batch of data samples in the data set, namely a combination of the input vector and the output expected value, and substituting the combination into the neural network;

s32, carrying out forward propagation process layer by layer, and calculating the activation value of each layer of the neural network according to the formula (4) and the formula (5):

wherein W represents a weight parameter in the BP neural network,

representing a weight parameter between the jth unit of the ith layer and the ith unit of the (l + 1) th layer; b: a threshold parameter in the BP neural network,

represents the bias of the ith unit of the l +1 th layer; f represents an activation function, an ELU (explicit Linear units) function is adopted here, the advantages are that the calculation is simple and convenient, the problem of gradient disappearance caused by the subsequent error gradient calculation can be prevented, mu is an amplitude parameter of the ELU function, the adjustment can be flexibly carried out in the actual operation, generally (0,1),

the activation value of the ith unit of the ith layer is calculated layer by layer in such a way until the output value h of the neural network is obtained_W，b（x)；

S33, calculating the error between the expected value and the actual output according to the formula (6):

wherein h is_W，b(x) Representing an output value obtained by the neural network through forward propagation, y representing an expected value, W and b representing a weight matrix and a threshold matrix respectively, and J representing an error;

s34, calculating the whole cost function according to the formula (7), ending if the function converges to the global minimum value, otherwise, turning to S35;

wherein, L represents the overall cost function of the neural network, and m represents the number of samples;

s35, performing a back propagation process, wherein the back propagation process is to adjust parameters of each layer of the neural network through a gradient descent algorithm, continuously reduce the cost function, firstly calculate the error of each neuron, and calculate the error gradient according to the formula (8):

wherein,

representing cost function versus weight parameter

The error gradient of (2) is deduced from the output layer by layer forwards through a chain type derivation method,

the derivation relation is given by the formula (4), which is not repeated, and the method for calculating the error gradient of the threshold parameter is the same;

s36, judging the gradient change trend, adaptively adjusting the learning rate of the neural network, if the two adjacent gradients are adjusted to be in the same direction, increasing the learning rate according to the formula (9), and if the two adjacent gradients are adjusted to be in opposite directions, indicating that the gradient change fluctuation is large, and decreasing the learning rate according to the formula (10):

wherein alpha is_k+1Representing the learning rate of the neural network at the time k +1, for controlling the rate of gradient change, alpha, during the back propagation of the neural network_kRepresents the learning rate of the neural network at time k,

and

respectively representing gradient values calculated at k moment and k-1 moment, and introducing a momentum factor eta with the value of (0,1) as a damping term of gradient change to reduce the gradient change difference caused by two adjacent momentsThe self-adaptive change of the learning rate is safer and more stable due to the large oscillation;

s37, the weight parameter and the threshold parameter are updated according to the gradient descent algorithms of equations (11) and (12), and α represents the current learning rate, and then returns to S31.

Compared with the prior art, the beneficial effects are:

(1) in the abnormal data detection stage, considering the correlation among the attributes of the power metadata, the algorithm firstly improves the construction mode of an isolated Tree (Isolation Tree) in an isolated forest model, so that the correlation of the attributes is more sensitive, the branching step of the isolated forest algorithm is improved, and the efficiency and the accuracy of the isolated forest model are improved.

(2) In the stage of predicting and correcting data, the algorithm automatically adjusts the learning rate according to the trend of gradient change, so that the learning rate is continuously adjusted to the most appropriate value, the stability of the gradient change is ensured, the convergence speed is greatly increased, the network overhead is reduced, the problem that the convergence is too slow in the later stage of the traditional BP neural network algorithm training is solved, and the convergence curve of the network is more stable.

The method constructs an isolated forest to extract the characteristics of a training data set, detects abnormal data in the data set, and then uses an improved BP neural network model to predict and modify the abnormal data. The electric power operation and maintenance data cleaning program based on the improved scheme is effectively optimized in the aspects of abnormal data positioning accuracy, data correction accuracy, training time and the like.

Drawings

FIG. 1 is a flow chart of a method for cleaning power operation and maintenance data based on an isolated forest algorithm and a neural network.

Detailed Description

The drawings are for illustrative purposes only and are not to be construed as limiting the patent; for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted. The positional relationships depicted in the drawings are for illustrative purposes only and are not to be construed as limiting the present patent.

As shown in fig. 1, a method for cleaning power operation and maintenance data based on an isolated forest algorithm and a neural network is characterized by comprising the following steps:

The step S1 specifically includes the following steps:

s11, at the beginning of the method, firstly, grouping the attributes;

And substituting the iTree branch method designed according to the algorithm into data for training until all iTrees in iForest are constructed.

The step S2 specifically includes:

c (psi) ═ 2H (psi-1) - (2 (psi-1)/psi) formula (1)

Wherein H (i) is calculated according to equation (2):

h (i) ═ ln (i) + Ec formula (2)

In the step S3, training the training set until the overall cost function of the neural network converges, taking attributes (T, AP, RH, V) as input vectors and electric energy output EP as output values; the method specifically comprises the following steps:

wherein W represents BP nerveThe weight parameters in the network are used to determine,

wherein,

representing cost function versus weight parameter

and

the gradient values calculated at the k moment and the k-1 moment are respectively represented, and in addition, a momentum factor eta is introduced, the value of the momentum factor eta is (0,1), and the momentum factor eta is used as a damping term of gradient change and is used for reducing oscillation caused by overlarge gradient change difference between two adjacent moments, so that the self-adaptive change of the learning rate is safer and more stable;

It should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention, and are not intended to limit the embodiments of the present invention. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. And are neither required nor exhaustive of all embodiments. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.

Claims

1. A power communication operation and maintenance data cleaning method based on an isolated forest algorithm and a neural network is characterized by comprising the following steps:

the specific steps of S1 include the following:

s11, at the beginning of the method, firstly, grouping the attributes;

s15, recursively constructing a new child node until the child node only has one data item, namely the cutting cannot be continued; or the iTree has reached an initially defined height;

s3, training a learning rate self-adaptive BP neural network to predict and correct the abnormal data attribute detected by the isolated forest;

the S3 specifically includes:

wherein W represents a weight parameter in the BP neural network,

represents the bias of the ith unit of the l +1 th layer; f represents an activation function, the value range of mu is (0,1),

the activation value of the ith unit of the ith layer is represented, and the activation value is calculated layer by layer until the output value h of the neural network is obtained_W,b(x)；

wherein h is_W,b(x) Representing an output value obtained by the neural network through forward propagation, y representing an expected value, W and b representing a weight matrix and a threshold matrix respectively, and J representing an error;

wherein,

representing cost function versus weight parameter

The error gradient of (a);

and

respectively representing gradient values calculated at the k moment and the k-1 moment, and introducing a momentum factor eta, wherein the value is (0, 1);

s37, updating the weight parameter and the threshold parameter according to the gradient descent algorithm of the equation (11) and the equation (12), where alpha represents the current learning rate, and then returning to S31,

2. the method for cleaning operation and maintenance data of power communication based on an isolated forest algorithm and a neural network as claimed in claim 1, wherein the S2 specifically comprises:

c (psi) ═ 2H (psi-1) - (2 (psi-1)/psi) formula (1)

Wherein H (i) is calculated according to equation (2):

h (i) ═ ln (i) + Ec formula (2)