CN114692729A

CN114692729A - New energy station bad data identification and correction method based on deep learning

Info

Publication number: CN114692729A
Application number: CN202210230736.2A
Authority: CN
Inventors: 陈文进; 陈水耀; 祁炜雯; 张俊; 朱峰; 茹伟; 范强; 宋美雅; 刘震; 刘皓明
Original assignee: Shaoxing Power Supply Co of State Grid Zhejiang Electric Power Co Ltd
Current assignee: Shaoxing Power Supply Co of State Grid Zhejiang Electric Power Co Ltd
Priority date: 2022-03-10
Filing date: 2022-03-10
Publication date: 2022-07-01

Abstract

The invention provides a new energy station bad data identification and correction method based on deep learning, which comprises the following steps: acquiring historical operating data of an identification object in the new energy station, and marking historical normal data and historical bad data in the historical operating data; establishing an identification model, and performing deep learning training on the identification model according to the historical normal data; establishing a correction model, inputting the historical bad data into the correction model and the trained identification model, and performing deep learning training on the correction model by combining the output of the identification model; acquiring real-time operation data of the identification object, and inputting the real-time operation data into the trained identification model to distinguish real-time normal data and real-time bad data in the real-time operation data; and inputting the real-time bad data into the trained correction model to obtain the corrected value of the real-time bad data. The method can significantly improve the efficiency of identifying and correcting bad data and helps ensure the real-time safe and stable operation of the new energy power station.

Description

New energy station bad data identification and correction method based on deep learning

Technical Field

The invention belongs to the field of energy data management, and particularly relates to a new energy station bad data identification and correction method based on deep learning.

Background

As the construction of new energy stations continues to advance, the data they acquire is becoming high-volume and high-dimensional, and the problem of bad data is increasingly prominent. Bad data such as missing, invalid, repeated and erroneous values often appears in the real-time data collected at a new energy station, and usually has one of two causes: first, faults in the new energy power system, such as the temporary interruption of a data channel in the data acquisition system, which produce unreal data; second, special events such as sudden accidental fluctuations of large industrial loads or sudden adverse environmental conditions, which produce irregular oscillations in the data. Bad data distorts the state estimation result of the new energy station, affects the operation scheduling and stable operation of the power system, and may even cause unforeseen safety consequences.

Each type of new energy station produces a huge volume of data with varied relationships: station, unit, environment and other data are mutually coupled, and coupling also exists within each kind of data. With the continuous development of modern information technology, artificial intelligence is applied in many fields, and deep learning is widely used in new energy pattern recognition, classification and load prediction scenarios. Deep learning adapts well to the time-varying characteristics of time series, has memory and association functions for historical information, and can continuously learn massive, coupled data. Bad data of a new energy station can therefore be identified and corrected with deep learning to help guarantee safe and stable operation of the station.

Disclosure of Invention

In order to solve the defects in the prior art, the invention provides a new energy station bad data identification and correction method based on deep learning, which comprises the following steps:

S100: acquiring historical operating data of an identification object in the new energy station, and marking historical normal data and historical bad data in the historical operating data;

S200: establishing an identification model, and performing deep learning training on the identification model according to the historical normal data;

S300: establishing a correction model, inputting the historical bad data into the correction model and the trained identification model, and performing deep learning training on the correction model by combining the output of the identification model;

S400: acquiring real-time operation data of the identification object, and inputting the real-time operation data into the trained identification model to distinguish real-time normal data and real-time bad data in the real-time operation data;

S500: inputting the real-time bad data into the trained correction model to obtain the corrected value of the real-time bad data.

Optionally, the identification object includes a wind power generation parameter, a photovoltaic power generation parameter, and installed capacity and active power of a unit in the new energy station;

the wind power generation parameters comprise wind speed, temperature, wind direction cosine value, humidity and pressure intensity;

the photovoltaic power generation parameters comprise irradiation intensity, irradiation duration and assembly area.

Optionally, the S200 includes:

S210: dividing the historical normal data into training data and test data according to a time sequence, and initializing the hyper-parameters of the identification model;

S220: inputting the training data into the identification model for training, and calculating, through the identification model, predicted values of the identification object for the time sequence corresponding to the test data;

S230: calculating whether the convergence accuracy of the identification model meets a preset threshold, wherein the calculation formula of the convergence accuracy is as follows:

where a is the convergence accuracy, n is the total number of predicted values, x_fi is the i-th test data value, and x_i is the i-th predicted value;

S240: if the convergence accuracy meets the preset condition, training is finished; otherwise, the hyper-parameters of the identification model are adjusted and S220-S230 are repeated until the convergence accuracy meets the preset condition.
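Steps S210 to S240 can be sketched as a chronological split followed by a tuning loop. This is only an illustrative sketch: the names `chronological_split`, `tune_until_accurate` and `train_and_score` are hypothetical, the 0.8 split fraction is an assumption, and because the convergence accuracy formula is not reproduced in the text the scoring function is supplied by the caller.

```python
from typing import Callable, Iterable, List, Sequence, Tuple, TypeVar

P = TypeVar("P")


def chronological_split(series: Sequence[float],
                        train_fraction: float = 0.8) -> Tuple[List[float], List[float]]:
    """Split a time-ordered series into (training, test) without shuffling."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must lie strictly between 0 and 1")
    cut = int(len(series) * train_fraction)
    # Earlier samples train the model; later samples are held out for testing.
    return list(series[:cut]), list(series[cut:])


def tune_until_accurate(candidates: Iterable[P],
                        train_and_score: Callable[[P], float],
                        threshold: float) -> Tuple[P, float]:
    """Try hyper-parameter settings in order until the score meets the threshold.

    train_and_score is a caller-supplied stand-in for S220-S230: it trains
    with one setting and returns the convergence accuracy on the test data.
    """
    for params in candidates:
        accuracy = train_and_score(params)
        if accuracy >= threshold:  # S240: stop once the preset condition is met
            return params, accuracy
    raise RuntimeError("no candidate setting met the accuracy threshold")
```

Keeping the split chronological means the test data always lies after the training data in time, which matches the prediction task the identification model performs later in S400.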

Optionally, the method further includes: preprocessing historical operating data before S200, including:

identifying missing values in historical operating data, acquiring historical operating data belonging to the same class as the missing values, calculating the same-class mean value of the missing values to obtain interpolation values of the missing values, and replacing the missing values with the interpolation values, wherein the calculation formula of the interpolation values is as follows:

wherein a_i is the mean coefficient, which is 0 when the i-th input historical operating data s_i is missing and 1 otherwise; m is the total amount of historical operating data of the same class; and the quantity given by the formula is the interpolation value.
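The same-class mean interpolation described above can be sketched as follows. The function names are hypothetical, missing values are represented by `None`, and the sketch assumes the sum of valid samples is divided by the number of valid samples (the sum of the a_i), since the formula itself is not reproduced in the text.

```python
from typing import List, Optional, Sequence


def same_class_mean(values: Sequence[Optional[float]]) -> float:
    """Mean of the non-missing values in one class of operating data."""
    # a_i is 1 for a present sample and 0 for a missing one (None here).
    a = [0 if v is None else 1 for v in values]
    present = sum(a)
    if present == 0:
        raise ValueError("no valid samples in this class")
    total = sum(v for v in values if v is not None)
    # Assumption: divide by the count of present samples, not by m.
    return total / present


def fill_missing(values: Sequence[Optional[float]]) -> List[float]:
    """Replace every missing entry with the same-class mean."""
    fill = same_class_mean(values)
    return [fill if v is None else v for v in values]
```

For example, `fill_missing([1.0, None, 3.0])` fills the gap with the mean of the two valid samples.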

Optionally, the S300 includes:

S310: initializing hyper-parameters of the identification model;

S320: inputting the historical bad data into the trained identification model, and taking the predicted value of the historical bad data output by the identification model as an accurate value;

S330: inputting the historical bad data into the correction model for training, analyzing the characteristics of the historical bad data through the correction model, and outputting a correction value for the historical bad data according to the analysis result;

S340: analyzing the error degree and the error dispersion degree of the corrected value relative to the accurate value, and finishing training when the analysis result meets a preset condition; otherwise, adjusting the hyper-parameters of the correction model and repeating S320-S330 until the preset condition is met.

Optionally, the analyzing the degree of error and the degree of error dispersion of the corrected value with respect to the accurate value includes:

analyzing the degree of error by calculating the average absolute error of the corrected value relative to the accurate value;

and analyzing the error dispersion degree by calculating the root mean square error of the corrected value relative to the accurate value.

Optionally, the method further includes:

while executing S500, selecting normal data with a preset proportion and inputting the normal data into the correction model, and calculating the average absolute error and the root mean square error of the output value of the correction model and the normal data;

and when any one of the average absolute error and the root mean square error does not meet the preset condition, taking the real-time operation data input during the execution of the S400 as training data, and re-training the identification model and the correction model.

Optionally, the identification model is provided with a solver on an output layer, and the solver is a softmax function.

Optionally, the solver outputs the probability of real-time bad data by comparing the error between the calculated predicted value and the actually measured real-time operation data, and outputs the identified real-time bad data and the time sequence position where it is located according to the probability.

The technical scheme provided by the invention has the beneficial effects that:

the method is combined with deep learning to establish and train the identification model and the correction model in a combined manner, and the obtained model is used for rapidly identifying and correcting the bad data collected in the new energy station in real time, so that the identification and correction efficiency of the bad data can be remarkably improved, the real-time analysis and application of the new energy power station is supported, and the real-time safe and stable operation of the new energy power station is guaranteed.

Drawings

In order to more clearly illustrate the technical solution of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic flow chart of a new energy station bad data identification and correction method based on deep learning according to an embodiment of the present invention;

Fig. 2 is a line graph showing the correlation between the number of neurons and the convergence accuracy of the neural network.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein.

It should be understood that, in various embodiments of the present invention, the sequence numbers of the processes do not mean the execution sequence, and the execution sequence of the processes should be determined by the functions and the internal logic of the processes, and should not constitute any limitation on the implementation process of the embodiments of the present invention.

It should be understood that in the present application, "comprising" and "having" and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

It should be understood that, in the present invention, "a plurality" means two or more. "And/or" merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "Comprises A, B and C" and "comprises A, B, C" mean that all three of A, B and C are comprised; "comprises A, B or C" means that one of A, B and C is comprised; "comprises A, B and/or C" means that any one, any two, or all three of A, B and C are comprised.

It should be understood that in the present invention, "B corresponding to A", "A corresponds to B", or "B corresponds to A" means that B is associated with A, and B can be determined from A. Determining B from A does not mean determining B from A alone; B may also be determined from A and/or other information. A matching B means that the similarity of A and B is greater than or equal to a preset threshold value.

As used herein, "if" may be interpreted as "at … …" or "when … …" or "in response to a determination" or "in response to a detection", depending on the context.

The technical solution of the present invention will be described in detail below with specific examples. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments.

As shown in fig. 1, the present embodiment provides a method for identifying and correcting bad data of a new energy station based on deep learning, including:

S100: acquiring historical operating data and real-time operating data of an identification object in the new energy station, and marking historical normal data and historical bad data in the historical operating data;

S400: distinguishing real-time normal data and real-time bad data in the real-time operation data by inputting the real-time operation data into the trained identification model;

In this embodiment, a deep neural network algorithm learns from the historical data collected at the new energy power station to obtain a deep neural network identification model that meets the accuracy requirement. The data acquired in real time is then input into the identification model to obtain the predicted value of the deep neural network, which is taken as the accurate value. A deviation threshold is set and the actually acquired data is compared with the accurate value; data that falls outside the threshold range is determined to be bad data, so that bad data is identified in real time as it is acquired. Deep learning adapts well to the time-varying characteristics of time series, has memory and association functions for historical information, and can continuously learn massive, coupled data. The process therefore identifies and corrects bad data rapidly and in real time, significantly improves the efficiency of bad data identification and correction, supports real-time analysis applications of the new energy power station, and helps guarantee its safe and stable real-time operation.

In this embodiment, the identification object includes a wind power generation parameter, a photovoltaic power generation parameter, and an installed capacity and active power of a unit in the new energy station; the wind power generation parameters comprise wind speed, temperature, wind direction cosine value, humidity and pressure intensity; the photovoltaic power generation parameters comprise irradiation intensity, irradiation duration and assembly area.

In this embodiment, both the historical operating data and the real-time operating data are acquired by the SCADA system, which performs data acquisition once every 15 minutes. Because uncontrollable random faults of the SCADA system may introduce obvious errors into the collected data and affect the accuracy of the subsequently trained models, this embodiment preprocesses the historical operating data and the real-time operating data before S200, including:

wherein the quantity given by the formula is the interpolation value; a_i is the mean coefficient, which is 0 when the i-th input historical operating data or real-time operating data s_i is missing and 1 otherwise; and m is the total amount of historical operating data or real-time operating data of the same class.

In this embodiment, data preprocessing yields a data set of 35040 samples spanning 1 year. To preserve temporal consistency, the last 1000 samples are set aside as the real-time identification data set, and the remaining samples are used as historical operating data for constructing and training the identification model and the correction model; historical normal data and historical bad data are marked in the historical operating data in advance according to experience.
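The sample count is consistent with the 15-minute SCADA interval: 96 samples per day over 365 days gives 35040. The sketch below checks that arithmetic and holds out the newest 1000 samples; the 365-day year, the placeholder data and the helper name `split_tail` are assumptions made for illustration.

```python
# Assumption: a 365-day year sampled every 15 minutes.
SAMPLES_PER_DAY = 24 * 60 // 15          # 96 samples per day
TOTAL_SAMPLES = 365 * SAMPLES_PER_DAY    # 35040 samples per year
REALTIME_SAMPLES = 1000                  # newest samples used as the real-time set


def split_tail(samples, tail=REALTIME_SAMPLES):
    """Return (historical, real_time) in time order; the tail is the newest data."""
    if not 0 < tail < len(samples):
        raise ValueError("tail must be positive and smaller than the data set")
    return samples[:-tail], samples[-tail:]


# Placeholder values standing in for preprocessed measurements.
dataset = list(range(TOTAL_SAMPLES))
historical, real_time = split_tail(dataset)
```

With these figures, 34040 samples remain as historical operating data for training.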

First, the identification model is trained, wherein S200 includes:

In this embodiment, the hyper-parameters of the identification model are set first, including the initial network structure, the network thresholds and the weights. Identifying bad data is essentially a classification problem, so the output layer of the identification model is provided with a solver that uses the softmax function, while the activation function is the sigmoid function. The solver outputs the probability of real-time bad data by comparing the error between the calculated predicted value and the actually measured real-time operation data, and outputs the identified real-time bad data and its time sequence position according to that probability.
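For reference, the two functions named here can be written in a few lines. This is a generic sketch of sigmoid and softmax, not the embodiment's network; the input scores in the usage note are arbitrary illustrative numbers.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    """Logistic activation, mapping any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(scores: Sequence[float]) -> List[float]:
    """Turn raw output-layer scores into probabilities that sum to 1."""
    if not scores:
        raise ValueError("scores must not be empty")
    peak = max(scores)
    # Subtracting the maximum keeps exp() from overflowing on large scores.
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

With two output scores, for instance one for "normal" and one for "bad", `softmax([0.0, 0.0])` assigns equal probability to both classes, and a larger score for one class shifts probability toward it.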

Specifically, during training of the identification model, the training data is used to predict the operating data of the identification object over the subsequent time sequence, which coincides with the time sequence of the test data; the identification accuracy of the model is then judged from the error between the predicted values and the test data. When S400 is subsequently executed, the identification model predicts the subsequent time sequence from the real-time operation data of the preceding time sequence and compares the prediction with the real-time operation data of the subsequent time sequence, that is, with the measured values of the identification object. Through the solver of its output layer, the identification model sets a threshold band that extends above and below the output accurate value; the output accurate value is compared with the corresponding measured value using the threshold Δe, and a measured value that falls outside the threshold range is regarded as bad data. The threshold Δe is:

where x_max is the largest measured value and x_min is the smallest measured value.
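The comparison against the threshold band can be sketched as below. Because the formula for Δe is not reproduced in the text, the sketch takes `delta_e` as a caller-supplied parameter; the function name and the sample numbers in the usage note are hypothetical.

```python
from typing import List, Sequence


def flag_bad_data(predicted: Sequence[float],
                  measured: Sequence[float],
                  delta_e: float) -> List[int]:
    """Return the time-sequence positions whose measurement leaves the band.

    A measurement is flagged when it deviates from the predicted (accurate)
    value by more than delta_e in either direction.
    """
    if len(predicted) != len(measured):
        raise ValueError("predicted and measured must have the same length")
    if delta_e < 0:
        raise ValueError("delta_e must be non-negative")
    return [i for i, (p, m) in enumerate(zip(predicted, measured))
            if abs(m - p) > delta_e]
```

For example, with predictions `[10.0, 10.0, 10.0]`, measurements `[10.2, 13.0, 9.9]` and `delta_e = 0.5`, only position 1 is flagged.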

Because the number of samples used in this embodiment is large, two hidden layers are sufficient to meet the requirement when both accuracy and processing speed are considered. Fig. 2 shows the convergence accuracy of bad data identification when the number of neurons is 5, 10, 15, 20, 25 and 30, respectively; the horizontal axis is the number of neurons, the vertical axis is the corresponding convergence accuracy, and each layer has the same number of neurons for ease of comparison. During training, the simulation time grows almost multiplicatively as the number of neurons increases, while the accuracy of the identification model increases gradually. Once the two hidden layers each have 20 nodes, the identification result no longer improves noticeably: the accuracy of the deep model rises steadily and then levels off, which shows that the performance of the identification model is gradually optimized as the hidden layers grow. Weighing model time against performance, this embodiment selects 20 nodes per hidden layer for the identification model, and the resulting bad data identification results are shown in Table 1. As can be seen from Table 1, the model accuracy for each type of data exceeds 97%, which indicates that the identification model identifies different types of bad data with high accuracy and good convergence.

TABLE 1

After the identification model has been trained, the correction model is trained. The correction model in this embodiment is a BP neural network trained with the trainbr algorithm, which has better functional capability and a higher convergence rate than a basic gradient algorithm and is better suited to mutually coupled data sets.

Specifically, the S300 includes:

S310: initializing hyper-parameters of the identification model, namely determining a neural network structure, a network threshold value and a weight value;

S320: normalizing the historical bad data, inputting the normalized historical bad data into the trained identification model, and taking the predicted value of the historical bad data output by the identification model as an accurate value; specifically, the historical bad data is normalized in Matlab with the mapminmax function;

S330: inputting the historical bad data into the correction model, analyzing the characteristics of the historical bad data through the correction model, and outputting a correction value for the historical bad data according to the analysis result;

S340: analyzing the error degree and the error dispersion degree of the corrected value relative to the accurate value, and finishing training when the analysis result meets the preset condition; otherwise, adjusting the hyper-parameters of the correction model, namely the neural network structure, the network threshold value and the weights, and repeating S320-S330 until the preset condition is met.
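The normalization in S320 uses Matlab's mapminmax, which by default rescales each row linearly to the range [-1, 1]. A rough Python equivalent for a single series is sketched below; the function name is hypothetical, and raising an error for a constant series is a choice made for this sketch rather than a description of Matlab's behavior.

```python
from typing import List, Sequence


def min_max_scale(values: Sequence[float],
                  y_min: float = -1.0,
                  y_max: float = 1.0) -> List[float]:
    """Linearly rescale values so the smallest maps to y_min and the largest to y_max."""
    if not values:
        raise ValueError("values must not be empty")
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        # Sketch-specific choice: a constant series has no range to rescale.
        raise ValueError("cannot rescale a constant series")
    scale = (y_max - y_min) / (x_max - x_min)
    return [scale * (x - x_min) + y_min for x in values]
```

For example, `min_max_scale([0.0, 5.0, 10.0])` maps the minimum to -1, the midpoint to 0 and the maximum to 1.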

In this embodiment, the hyper-parameters of the identification model are set first, including the initial network structure, the network thresholds and the weights. In selecting the number of hidden layers and hidden layer nodes of the neural network, too few hidden layer nodes leave the network without the necessary learning and information processing capacity; conversely, too many make the network more complex, slow down processing, and make the network more prone to falling into local minima during learning. Therefore, in this embodiment, when the identification model has multiple outputs, hidden layers of more than 2 layers are adopted for a better bad data identification and correction effect, and the number of nodes in each layer and the other parameters are obtained during training.

Analyzing the degree of error and the degree of error dispersion of the correction value relative to the accurate value includes:

analyzing the error degree by calculating the mean absolute error (MAE) of the corrected value relative to the accurate value, wherein the specific calculation formula is as follows:

$$\mathrm{MAE} = \frac{1}{l}\sum_{i=1}^{l}\left|y_i - y_{ti}\right|$$

where y_i is the i-th correction value, y_ti is the i-th accurate value, and l is the total number of correction values.

Analyzing the error dispersion degree by calculating the root mean square error (RMSE) of the corrected value relative to the accurate value, wherein the specific calculation formula is as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{l}\sum_{i=1}^{l}\left(y_i - y_{ti}\right)^2}$$

Specifically, the error degree and the error dispersion degree of the correction value relative to the accurate value during training of the correction model are shown in Table 2. The values of the correction evaluation indexes RMSE and MAE are small, that is, the corrected values are close to the accurate values, which shows that the method corrects the various types of bad data of the new energy station well. Both indexes meet the preset condition, so the training of the correction model is finished.

TABLE 2

Data type	RMSE	MAE
Active power	0.47706	0.07211
Wind speed	0.51107	0.05084
Wind direction cosine value	0.17631	0.13896
Temperature	0.08850	0.00889
Pressure intensity	0.35447	0.05220
Humidity	0.41174	0.03381
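The two evaluation indexes reported in Table 2 follow their standard definitions and can be computed as sketched below. The function names are hypothetical, and the sketch does not reproduce the table's figures, which depend on data not given in the text.

```python
import math
from typing import Sequence


def mean_absolute_error(corrected: Sequence[float], accurate: Sequence[float]) -> float:
    """MAE: average magnitude of the correction error."""
    if len(corrected) != len(accurate) or not corrected:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(y - t) for y, t in zip(corrected, accurate)) / len(corrected)


def root_mean_square_error(corrected: Sequence[float], accurate: Sequence[float]) -> float:
    """RMSE: square root of the mean squared error; large errors weigh more."""
    if len(corrected) != len(accurate) or not corrected:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((y - t) ** 2 for y, t in zip(corrected, accurate))
    return math.sqrt(squared / len(corrected))
```

RMSE is never smaller than MAE on the same data, and the gap widens when a few errors are much larger than the rest, which is why the two are read together as error degree and error dispersion.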

In order to cope with the random change of the operation condition of the new energy station, the embodiment further includes: while executing S500, selecting normal data with a preset proportion and inputting the normal data into the correction model, and calculating the average absolute error and the root mean square error of the output value of the correction model and the normal data; and when any one of the average absolute error and the root mean square error does not meet the preset condition, taking the real-time operation data input during the execution of S400 as training data, and re-training the identification model and the correction model. Through this process, the hyper-parameters of the identification model and the correction model are adjusted in time to keep the accuracy of both models optimized.

In this embodiment, the real-time normal data itself is left unchanged. While the real-time bad data is being corrected, a portion of the real-time normal data is also input into the trained correction model, and the error indexes of its output are computed for comparison. If both correction evaluation indexes, RMSE and MAE, meet the preset condition, the accuracy of the current models meets the requirement and the hyper-parameters of the identification model and the correction model do not need to be adjusted for the time being.
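The monitoring step can be sketched as a sampling helper plus a retraining check. The names `sample_fraction` and `needs_retraining` are hypothetical, and the sketch assumes the preset condition is an upper limit on each index, which the text does not state explicitly.

```python
import math
import random
from typing import List, Sequence


def sample_fraction(data: Sequence[float], proportion: float, seed: int = 0) -> List[float]:
    """Pick a preset proportion of the normal data (at least one sample)."""
    if not 0.0 < proportion <= 1.0:
        raise ValueError("proportion must be in (0, 1]")
    if not data:
        raise ValueError("data must not be empty")
    count = max(1, round(len(data) * proportion))
    return random.Random(seed).sample(list(data), count)


def needs_retraining(normal: Sequence[float],
                     model_output: Sequence[float],
                     mae_limit: float,
                     rmse_limit: float) -> bool:
    """True when either index exceeds its limit (assumed form of the condition)."""
    if len(normal) != len(model_output) or not normal:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [o - n for o, n in zip(model_output, normal)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mae > mae_limit or rmse > rmse_limit
```

The idea is that a correction model that is still accurate should leave normal data nearly unchanged, so a growing error on normal samples signals that operating conditions have drifted.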

The sequence numbers in the above embodiments are merely for description, and do not represent the sequence of the assembly or the use of the components.

The above description is intended to be illustrative of the present invention and should not be taken as limiting it; the invention is intended to cover any modifications, equivalents and improvements made within its spirit and scope.

Claims

1. A new energy station bad data identification and correction method based on deep learning, characterized by comprising the following steps:

2. The method for identifying and correcting the bad data of the new energy station based on the deep learning as claimed in claim 1, wherein the identification objects comprise wind power generation parameters, photovoltaic power generation parameters and installed capacity and active power of units in the new energy station;

3. The method for identifying and correcting the bad data of the new energy station based on the deep learning as claimed in claim 1, wherein the method further comprises: preprocessing historical operating data and real-time operating data before S200, including:

identifying missing values in historical operating data, acquiring historical operating data which belongs to the same class as the missing values, calculating the same-class mean value of the missing values to obtain interpolation values of the missing values, and replacing the missing values with the interpolation values, wherein the interpolation values have the calculation formula:


4. The method according to claim 1, wherein the S200 includes:

5. The method for identifying and correcting the bad data of the new energy station based on the deep learning of claim 1, wherein the step S300 comprises:

S310: initializing hyper-parameters of the identification model;

S320: normalizing the historical bad data, inputting the normalized historical bad data into the trained identification model, and taking the predicted value of the historical bad data output by the identification model as an accurate value;

6. The method for identifying and correcting the bad data of the new energy station based on the deep learning of claim 5, wherein analyzing the degree of error and the degree of error dispersion of the corrected value with respect to the accurate value comprises:

7. The method for identifying and correcting the bad data of the new energy station based on the deep learning as claimed in claim 1, wherein the method further comprises:

8. The method for identifying and correcting the bad data of the new energy station based on the deep learning as claimed in claim 1, wherein the identification model is provided with a solver at an output layer, and the solver is a softmax function.

9. The method for identifying and correcting the bad data of the new energy station based on the deep learning of claim 8, wherein the solver outputs the probability of real-time bad data by comparing the error between the calculated predicted value and the actually measured real-time operation data, and outputs the identified real-time bad data and the time sequence position where it is located according to the probability.