CN115965135A - New energy prediction error modeling method and system based on naive Bayes classification

Info

Publication number: CN115965135A
Application number: CN202211628344.8A
Authority: CN
Inventors: 段乃欣; 张小奇; 葛鹏江; 陈宇轩; 吕金历; 张耀; 李欣; 张小东; 江国琪; 向异
Original assignee: Northwest Branch Of State Grid Corp Of China; Xian Jiaotong University
Current assignee: Northwest Branch Of State Grid Corp Of China; Xian Jiaotong University
Priority date: 2022-12-17
Filing date: 2022-12-17
Publication date: 2023-04-14

Abstract

The invention discloses a new energy prediction error modeling method and system based on naive Bayes classification. The method comprises the following steps: acquiring actual output data and prediction data of the new energy to obtain new energy prediction error data; performing probability density distribution fitting based on a kernel density estimation method to obtain probability density distribution curves of the three types of data; discretizing the three types of new energy data by using a self-organizing map (SOM) neural network; adopting a model training method based on cross validation and a naive Bayes classifier model to construct a new energy prediction error classification model from the discretized actual output data, prediction data and prediction error data; constructing a mapping relation for error evaluation; and carrying out a case study and probability evaluation by combining new energy data cleaning with the naive Bayes-based error classification model. The modeling method provided by the invention can improve the reliability of new energy prediction and can serve as a basis for power system scheduling decisions.

Description

New energy prediction error modeling method and system based on naive Bayes classification

Technical Field

The invention belongs to the technical field of new energy, and relates to a new energy prediction error modeling method based on naive Bayes classification.

Background

Accurately predicting the generated power of wind farms and photovoltaic power stations is of great significance for realizing safe and effective grid connection of large-scale new energy power generation and for reducing the adverse effects of new energy grid connection.

The new energy prediction has an important influence on the safe and stable operation of the power system, but due to the accuracy, time scale and resolution of numerical meteorological data and other factors (such as a prediction method and the like), the current new energy prediction precision is low, and the application requirement is difficult to meet. In this context, new energy prediction error analysis modeling will help to describe the uncertainty of new energy prediction more accurately. In addition to providing point prediction information of new energy power to the scheduling department, probability distribution information of predicted values can be provided so that the scheduling department takes it into consideration in scheduling decisions. Research on operation risk assessment and risk decision problems of the power system containing new energy resources is increasing day by day, and the problems also depend on the grasp of the prediction deviation information of the new energy resources.

Disclosure of Invention

In view of the above problems, the invention analyzes in depth the problems existing in new energy prediction and the composition and distribution of new energy prediction errors, builds a model that combines the characteristics of northwest new energy with the autocorrelation and cross-correlation between new energy prediction data and actual output data, and provides a new energy prediction error modeling method based on a naive Bayes classifier. Aimed at new energy consumption, the method improves the reliability of new energy prediction and can serve as a basis for power system scheduling decisions.

To achieve this purpose, the invention adopts the following technical scheme:

The new energy prediction error modeling method based on naive Bayes classification comprises the following steps:

acquiring actual new energy output data and new energy prediction data, performing data cleaning on both, and removing abnormal values to obtain new energy prediction error data;

based on a kernel density estimation method, performing probability density distribution fitting on the actual new energy output data, the new energy prediction data and the new energy prediction error data to obtain probability density distribution curves of the three types of data; discretizing the three types of new energy data by using a self-organizing map (SOM) neural network, wherein the actual output data, the prediction data and the prediction error data are taken as input, and the number of clusters and the classification boundaries of the three types of data are obtained through competitive training of the neurons;

adopting a model training method based on cross validation and a naive Bayes classifier model, constructing a new energy prediction error classification model from the discretized actual output data, prediction data and prediction error data; constructing a mapping relation from the actual new energy output data, the new energy prediction data and the generated energy data to the current-day new energy prediction error evaluation;

and carrying out a case study according to the actual new energy output data and the new energy prediction data, combining new energy data cleaning with the naive Bayes-based error classification model, and performing probability evaluation of the new energy prediction error on a new test set.

As a further improvement of the invention, the step of acquiring the actual new energy output data and the new energy prediction data, performing data cleaning on both, and removing abnormal values to obtain the new energy prediction error data comprises the following steps:

acquiring annual actual output data and new energy prediction data of new energy;

removing abnormal values in the new energy power generation data, wherein an abnormal value is a data sample whose magnitude deviates far from the normal fluctuation range of the data;

for data samples similar to load data, decomposing the data samples into trend terms, seasonal terms and residual terms by adopting STL decomposition of a time series; the STL decomposition formula is as follows:

$$Y = T + S + R$$

in the formula, Y represents an original sequence, T represents a trend item, S represents a season item, and R represents a residual error item;

analyzing the residual term by fitting the prediction model S-ARIMA, wherein the ARIMA autoregressive expression is as follows:

$$y_t = c + \varphi_1 y_{t-1} + \dots + \varphi_p y_{t-p} + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q}$$

wherein p is the autoregressive order, q is the moving average order, and d represents the differencing order;

adopting a hybrid anomaly identification model based on a Bagging ensemble learning framework, with the isolation forest as the base classifier: training the base classifier multiple times, each time on randomly drawn training samples, and taking the average of the results of all learners as the final result, from which the new energy prediction error data are obtained.

As a further improvement of the present invention, in the hybrid anomaly identification model based on the Bagging ensemble learning framework, an expression of the Bagging ensemble model is as follows:

wherein N is the number of constructed base models, g (x, alpha) is a single classification model, and alpha is a model parameter;
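The expression itself does not appear in this text. A form consistent with the definitions given here, and with the averaging of learner results described for the ensemble, would be the following. The symbol $G(x)$ for the ensemble output and the index $n$ on the per-model parameters $\alpha_n$ are notation assumed for illustration only:

```latex
G(x) = \frac{1}{N} \sum_{n=1}^{N} g(x, \alpha_n)
```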

constructing an isolation forest model as the base classifier, wherein the basic idea of the isolation forest is similar to multi-dimensional hyperplane segmentation: starting from a sample set S, a random hyperplane is selected to split the data set into two subspaces, each subspace is then split by another randomly selected hyperplane, and the process repeats until each subspace contains only one sample point; each sample point then corresponds to a number of splits.

As a further improvement of the invention, the construction of the isolation forest model is divided into two stages, training and integration, including:

randomly extracting psi points from an original sample set S to form a root node subsample set;

randomly selecting a dimension omega, and randomly selecting a cutting point p in the dimension data range;

generating a hyperplane by using a cutting point p, dividing a training sample into two subspaces, and dividing the samples smaller than p and larger than p under the selected dimensionality omega in the subspaces into two types to respectively form left and right branches of the node;

and repeating the step of randomly selecting the dimension omega and randomly selecting a cutting point p in the dimension data range, and continuously generating new leaf nodes until only one sample point is contained under the branch of the new child node.
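The tree-building steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the subsample size `psi=32`, the number of trees, the depth cap and the toy data are all hypothetical choices made for the example.

```python
import random

def build_tree(points, rng, depth=0, max_depth=12):
    """Split points with random axis-aligned cuts until each branch holds one point."""
    if len(points) <= 1 or depth >= max_depth:
        return {"size": len(points)}               # leaf node
    dim = rng.randrange(len(points[0]))            # random dimension (omega)
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:
        return {"size": len(points)}               # identical values cannot be split
    cut = rng.uniform(lo, hi)                      # random cutting point (p)
    left = [p for p in points if p[dim] < cut]
    right = [p for p in points if p[dim] >= cut]
    return {"dim": dim, "cut": cut,
            "left": build_tree(left, rng, depth + 1, max_depth),
            "right": build_tree(right, rng, depth + 1, max_depth)}

def path_length(tree, x, depth=0):
    """Number of splits passed before x lands in a leaf."""
    if "dim" not in tree:
        return depth
    branch = "left" if x[tree["dim"]] < tree["cut"] else "right"
    return path_length(tree[branch], x, depth + 1)

def mean_path_length(data, x, n_trees=100, psi=32, seed=0):
    """Average path length of x over trees grown on random subsamples of size psi."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_trees):
        sample = rng.sample(data, min(psi, len(data)))
        total += path_length(build_tree(sample, rng), x)
    return total / n_trees

# Toy data (hypothetical): a dense cluster around the origin plus one far-away point.
gen = random.Random(1)
data = [(gen.gauss(0, 1), gen.gauss(0, 1)) for _ in range(200)] + [(8.0, 8.0)]
outlier_len = mean_path_length(data, (8.0, 8.0))
inlier_len = mean_path_length(data, (0.0, 0.0))
```

A point that is isolated after few splits, that is, one with a short average path, is a candidate abnormal value; here the far-away point is expected to have a shorter average path than a point in the middle of the cluster.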

As a further improvement of the invention, the step of performing probability density distribution fitting based on the kernel density estimation method and discretizing the three types of new energy data by using the self-organizing map (SOM) neural network comprises the following steps:

the kernel density estimation is that a Gaussian kernel function is placed at each data sample position, and then all kernel functions are summed to obtain a smooth probability density function;

selecting a kernel density estimation bandwidth h, which has a large influence on an obtained estimation result; and performing kernel density estimation by using a Gaussian kernel function, wherein an empirical estimation formula of the optimal bandwidth h is as follows:

obtaining probability density distribution curves by the kernel density estimation method from the actual new energy output data, the new energy prediction data and the new energy prediction error data after abnormal data cleaning;

clustering the new energy data by adopting a self-organizing map (SOM) neural network; the self-organizing map neural network generates a low-dimensional discrete map by learning the data in the input space; a competitive learning strategy is applied, and the network is gradually optimized through competition among neurons; a neighborhood function is used to maintain the topology of the input space, so that adjacent samples in the input space are mapped to adjacent output neurons; the input space has dimension D, an input pattern is $x = \{x_i : i = 1, 2, \dots, D\}$, and the connection weights between input unit i and neuron j of the computation layer are $\omega = \{\omega_{i,j} : j = 1, 2, \dots, N;\ i = 1, 2, \dots, D\}$, where N is the total number of neurons;

and discretizing the new energy output data, the new energy prediction data and the new energy prediction error data according to the self-organizing mapping neural network clustering result.
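The kernel density estimation step described above can be sketched as follows. The empirical bandwidth formula is not reproduced in this text, so the example assumes the common rule of thumb for Gaussian kernels, $h = 1.06\,\hat\sigma\,n^{-1/5}$; that choice, the function names and the sample values are assumptions made for illustration.

```python
import math
import statistics

def rule_of_thumb_bandwidth(samples):
    """Assumed bandwidth rule h = 1.06 * sigma * n^(-1/5) for a Gaussian kernel."""
    sigma = statistics.stdev(samples)
    return 1.06 * sigma * len(samples) ** (-0.2)

def gaussian_kde(samples, h):
    """Return f(x) = 1/(n*h) * sum_i K((x - x_i)/h), with K the standard normal density."""
    n = len(samples)
    norm = 1.0 / (n * h * math.sqrt(2.0 * math.pi))

    def density(x):
        # One Gaussian bump per sample, summed into a smooth curve.
        return norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in samples)

    return density

# Hypothetical bimodal sample standing in for cleaned new energy data.
samples = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
h = rule_of_thumb_bandwidth(samples)
f = gaussian_kde(samples, h)

# Numerical check that the estimate integrates to about 1 over a wide interval.
area = sum(f(-10.0 + 0.01 * k) * 0.01 for k in range(2401))
```

A smaller `h` follows individual samples more closely and a larger `h` smooths the two modes together, which is why the bandwidth has a large influence on the estimate.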

As a further improvement of the invention, the self-organizing map neural network training process is as follows:

initializing;

each neuron calculates its discriminant function value for each input pattern, and the neuron with the smallest discriminant function value is declared the winner, wherein the discriminant function of each neuron j is:

the winning neuron $I(x)$ determines the spatial position of the topological neighborhood of excited neurons; after the activated node $I(x)$ is determined, its neighboring nodes are also updated; the calculation formula of the update degree is as follows:

wherein $S_{i,j}$ represents the distance between neurons i and j, and $\sigma(t)$ decays over time;

appropriately adjusting the connection weights of the relevant excitatory neurons such that the winning neuron has an enhanced response to subsequent applications of similar input patterns;

continuing iteration until the feature mapping tends to be stable; after the iteration is finished, the neuron activated by each sample is the corresponding category;

and further obtaining a result of discretizing the actual output data of the new energy, the prediction data of the new energy and the prediction error data of the new energy according to the clustering result of the self-organizing mapping neural network.
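The training loop above can be sketched for a one-dimensional chain of neurons. The discriminant and neighborhood formulas are not reproduced in this text, so the example assumes the usual choices: the squared Euclidean distance $d_j(x) = \sum_i (x_i - \omega_{i,j})^2$ as the discriminant and a Gaussian neighborhood $\exp(-S^2 / 2\sigma(t)^2)$ as the update degree. The decay schedules, the evenly spaced initialization, the function names and the toy data are likewise assumptions.

```python
import math
import random

def winner(weights, x):
    """Index of the neuron with the smallest assumed discriminant sum_i (x_i - w_ij)^2."""
    dists = [sum((xi - wi) ** 2 for xi, wi in zip(x, w)) for w in weights]
    return dists.index(min(dists))

def train_som(data, n_neurons, epochs=100, lr0=0.5, seed=0):
    """Train a 1-D chain of neurons on a list of equal-length tuples."""
    rng = random.Random(seed)
    dim = len(data[0])
    lo = [min(x[i] for x in data) for i in range(dim)]
    hi = [max(x[i] for x in data) for i in range(dim)]
    # Initialization: spread the neurons evenly between the data minimum and maximum.
    weights = [[lo[i] + (hi[i] - lo[i]) * (j + 0.5) / n_neurons for i in range(dim)]
               for j in range(n_neurons)]
    sigma0 = n_neurons / 2.0
    for t in range(epochs):
        lr = lr0 * math.exp(-2.0 * t / epochs)          # learning rate decays
        sigma = sigma0 * math.exp(-4.0 * t / epochs)    # neighborhood width decays
        for x in rng.sample(data, len(data)):           # one shuffled pass per epoch
            win = winner(weights, x)
            for j, w in enumerate(weights):
                s = abs(j - win)                        # lattice distance to the winner
                h = math.exp(-(s * s) / (2.0 * sigma * sigma))
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])      # pull neuron toward the sample
    return weights

# Hypothetical 1-D data with three well-separated groups.
data = [(1.0,), (1.2,), (0.8,), (5.0,), (5.2,), (4.8,), (9.0,), (9.1,), (8.9,)]
weights = train_som(data, n_neurons=3)
labels = [winner(weights, x) for x in data]             # discrete class of each sample
```

After training, the index of the neuron activated by a sample serves as its discrete class, which is the discretization used by the later classification step.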

As a further improvement of the invention, the step of adopting the model training method based on cross validation and the naive Bayes classifier model to construct the new energy prediction error classification model from the discretized data, and of constructing the mapping relation from the current-day actual new energy output data, the new energy prediction data and the generated energy data to the new energy prediction error evaluation by combining the characteristics of new energy with the autocorrelation and cross-correlation between new energy prediction data and actual output data, comprises the following steps:

modeling the new energy prediction error data by adopting a naive Bayes classifier; because the input of the naive Bayes classifier must be discrete, the three types of data are discretized according to the clustering result of the actual new energy output, the clustering result of the new energy prediction data and the class division of the prediction error; the prediction data of the previous day is taken as the first input $x_1$, the actual output of the previous two days as the second input $x_2$, the current-day prediction as the input $x_3$, and the average generated energy of the previous two days as $x_4$; the prediction error is taken as the label y, and the naive Bayes classifier is trained;

constructing a basic naive Bayes probability model:

$$p(C \mid F_1, \dots, F_n)$$

rewriting the model according to the Bayes formula gives:

$$p(C \mid F_1, \dots, F_n) = \frac{p(C)\, p(F_1, \dots, F_n \mid C)}{p(F_1, \dots, F_n)}$$

wherein $p(C)$ is the prior probability and $p(C \mid F_1, \dots, F_n)$ is the posterior probability; the sample features are $F_1 = f_1, \dots, F_n = f_n$;

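the joint distribution model:

$$\begin{aligned}
p(C \mid F_1, \dots, F_n) &\propto p(C)\, p(F_1, \dots, F_n \mid C) \\
&\propto p(C)\, p(F_1 \mid C)\, p(F_2, \dots, F_n \mid C, F_1) \\
&\propto p(C)\, p(F_1 \mid C)\, p(F_2 \mid C, F_1) \cdots p(F_n \mid C, F_1, \dots, F_{n-1})
\end{aligned}$$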

according to the definition of the naive Bayes classifier, the features are assumed to be mutually independent given the class, so that for $i \ne j$:

$$p(F_i \mid C, F_j) = p(F_i \mid C)$$

the conditional probability distribution of the class variable C is then expressed as:

$$p(C \mid F_1, \dots, F_n) = \frac{1}{Z}\, p(C) \prod_{i=1}^{n} p(F_i \mid C)$$

wherein Z depends only on $F_1, \dots, F_n$ and is a constant when the feature variables are known;

constructing a classifier from the probability model, wherein the naive Bayes classifier comprises the model and a corresponding decision rule; the decision rule adopts the maximum a posteriori principle and selects the label with the largest probability, and the naive Bayes classifier is defined as follows:

$$\operatorname{classify}(f_1, \dots, f_n) = \arg\max_{c}\ p(C = c) \prod_{i=1}^{n} p(F_i = f_i \mid C = c)$$
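A naive Bayes classifier over discrete features, with the maximum a posteriori decision rule, can be sketched as follows. This is an illustration under assumptions: the Laplace smoothing parameter `alpha`, the function names, and the toy levels and labels are not from the source, and the sketch expects every feature value seen at prediction time to have appeared in training.

```python
from collections import Counter, defaultdict

def train_nb(X, y, alpha=1.0):
    """Estimate p(C) and p(F_i | C) from discrete features (Laplace smoothing assumed)."""
    classes = sorted(set(y))
    n_features = len(X[0])
    values = [sorted({row[i] for row in X}) for i in range(n_features)]
    class_count = Counter(y)
    prior = {c: class_count[c] / len(y) for c in classes}
    counts = defaultdict(int)                    # (feature index, value, class) -> count
    for row, c in zip(X, y):
        for i, v in enumerate(row):
            counts[(i, v, c)] += 1
    cond = {}
    for c in classes:
        for i in range(n_features):
            denom = class_count[c] + alpha * len(values[i])
            for v in values[i]:
                cond[(i, v, c)] = (counts[(i, v, c)] + alpha) / denom
    return classes, prior, cond

def posterior(model, row):
    """Return p(C | F_1..F_n) for every class, normalized by Z."""
    classes, prior, cond = model
    score = {}
    for c in classes:
        p = prior[c]
        for i, v in enumerate(row):
            p *= cond[(i, v, c)]                 # conditional independence assumption
        score[c] = p
    z = sum(score.values())                      # normalizing constant Z
    return {c: s / z for c, s in score.items()}

def classify(model, row):
    """Maximum a posteriori decision: the label with the largest posterior."""
    post = posterior(model, row)
    return max(post, key=post.get)

# Hypothetical discretized samples (x1, x2, x3, x4) with prediction error classes.
X = [(0, 0, 0, 0), (0, 0, 1, 0), (1, 1, 1, 1), (2, 2, 2, 2), (2, 1, 2, 2), (1, 1, 0, 1)]
y = ["low", "low", "mid", "high", "high", "mid"]
model = train_nb(X, y)
post = posterior(model, (2, 2, 2, 2))
label = classify(model, (2, 2, 2, 2))
```

Returning the full posterior rather than only the winning label matches the stated goal of probability evaluation: each error class receives a probability for the given inputs.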

As a further improvement of the invention, the step of carrying out a case study according to the actual new energy output data and the new energy prediction data, combining new energy data cleaning with the naive Bayes-based error classification model, and performing probability evaluation of the new energy prediction error on a new test set comprises:

obtaining probability density distribution based on the actual output data of the new energy and the characteristics of the predicted data of the new energy;

clustering the data according to the probability density distribution by using the self-organizing map neural network clustering method, discretizing the data according to the clustering result, and comparing the clustering result with the probability density distribution of the data;

obtaining unbalanced distribution of three types of data from the new energy prediction error probability distribution diagram;

and comprehensively analyzing the evaluation indexes of the new energy prediction error classification result, and modeling the new energy prediction error on the basis of the new energy data characteristics and their distribution to obtain a probability information model of the new energy prediction error.

As a further improvement of the invention, the step of clustering the data according to the probability density distribution by using the self-organizing map neural network clustering method, discretizing the data according to the clustering result, and comparing the clustering result with the probability density distribution of the data comprises the following steps:

the self-organizing map neural network divides the actual new energy output data into 5 classes:

the first level is from 0 to 15000 and corresponds to the initial climbing stage of the probability distribution curve;

the second level is from 15000 to 25000 and corresponds to the first peak of the probability distribution curve;

the third level is from 25000 to 36000 and corresponds to the slowly descending stage after the peak of the probability distribution curve;

the fourth level is from 36000 to 48000 and corresponds to the second peak of the probability distribution curve;

the last level is from 48000 to 74000 and corresponds to the tail of the probability distribution curve;

the self-organizing map neural network divides the new energy prediction data into 5 classes:

the third level is from 25000 to 37000 and corresponds to the slowly descending stage after the peak of the probability distribution curve;

the fourth level is from 36000 to 49000 and corresponds to the second peak of the probability distribution curve;

the last level is from 48000 to 100000 and corresponds to the tail of the probability distribution curve.
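Once the level boundaries are known, discretizing a value is a lookup. The sketch below uses the four inner boundaries listed above for the actual output data; the function name and the convention that a value lying exactly on a boundary belongs to the upper level are assumptions, since the text does not say which side a boundary value falls on.

```python
import bisect

# Upper boundaries of levels 1 to 4 for the actual output data, taken from the text.
ACTUAL_OUTPUT_BOUNDS = [15000, 25000, 36000, 48000]

def output_level(value, bounds=ACTUAL_OUTPUT_BOUNDS):
    """Map a value to its level 1..5; a value on a boundary goes to the upper level."""
    return bisect.bisect_right(bounds, value) + 1

levels = [output_level(v) for v in (0, 20000, 36000, 50000)]
```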

A new energy prediction error modeling system based on naive Bayes classification comprises:

the data cleaning module is used for acquiring actual new energy output data and new energy prediction data, cleaning the actual new energy output data and the new energy prediction data, removing abnormal values and obtaining new energy prediction error data;

the discretization module is used for performing probability density distribution fitting on the actual new energy output data, the new energy prediction data and the new energy prediction error data based on a kernel density estimation method to obtain probability density distribution curves of the three types of data; and for discretizing the three types of new energy data by using a self-organizing map (SOM) neural network, wherein the actual output data, the prediction data and the prediction error data are taken as input, and the number of clusters and the classification boundaries of the three types of data are obtained through competitive training of the neurons;

the relation mapping module is used for constructing a new energy prediction error classification model with a naive Bayes classifier model from the discretized actual output data, prediction data and prediction error data, based on a cross-validation model training method; and for constructing a mapping relation from the actual new energy output data, the new energy prediction data and the generated energy data to the current-day new energy prediction error evaluation;

and the probability evaluation module is used for carrying out a case study according to the actual new energy output data and the new energy prediction data, combining new energy data cleaning with the naive Bayes-based error classification model, and performing probability evaluation of the new energy prediction error on a new test set.

Compared with the prior art, the invention has the following beneficial effects:

The invention discloses a new energy prediction error modeling method based on a naive Bayes classifier. Starting from power system scheduling decisions and new energy consumption, and addressing the problems of current new energy prediction, the method builds on the characteristics of northwest new energy and on the autocorrelation and cross-correlation between new energy prediction data and actual output data. It addresses the difficulty and low accuracy of prediction caused by the strong randomness and large volatility of new energy, proposes for the first time a new energy prediction error modeling method based on a naive Bayes classifier, realizes probability information modeling of new energy prediction errors, promotes new energy consumption, and supports power system scheduling. The specific advantages are as follows:

(1) Based on the specific characteristics of new energy data, the invention provides a level division and discretization method for actual new energy output data and new energy prediction data, and selects new energy data from northwest China for a case study. The case results show that the clustering method based on the self-organizing map neural network describes the probability density distributions of the actual output data and the prediction data well, highlighting the climbing, peak and descending stages of the probability density curve, and effectively improves the level division accuracy of the new energy data.

(2) The invention provides a naive Bayes new energy prediction error classification model trained with cross validation. By modeling the characteristics of northwest new energy and the autocorrelation and cross-correlation between new energy prediction data and actual output data, a mapping relation from the actual new energy output data, the new energy prediction data and the generated energy data to the current-day new energy prediction error evaluation is constructed, which greatly improves the modeling accuracy of the new energy prediction error.

(3) The method comprehensively analyzes the data characteristics and distribution of northwest new energy and models the new energy prediction error, which improves the reliability of new energy prediction, provides a basis for power system scheduling and planning, and promotes new energy consumption and the safe and stable operation of the power grid.

Drawings

The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.

FIG. 1 is a flow chart of a new energy prediction error modeling method based on naive Bayes classification;

FIG. 2 is a schematic illustration of the kernel density estimation method;

FIG. 3 is a probability density distribution diagram of new energy actual output data;

FIG. 4 is a probability density distribution diagram of new energy prediction data;

FIG. 5 is a new energy prediction error probability density distribution diagram;

FIG. 6 is a diagram of a self-organizing map neural network architecture;

FIG. 7 is a new energy prediction error classification confusion matrix diagram;

FIG. 8 is a diagram of a new energy prediction error modeling system based on naive Bayes classification.

Detailed Description

In order to make those skilled in the art better understand the technical solutions of the present invention, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, shall fall within the protection scope of the present invention.

It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

The invention aims to design a modeling method for new energy prediction errors. Starting from power system scheduling decisions and new energy consumption, and addressing the problems of current new energy prediction, it provides a new energy prediction error modeling method based on a naive Bayes classifier that combines the characteristics of northwest new energy with the autocorrelation and cross-correlation between new energy prediction data and actual output data. The method addresses the difficulty and low accuracy of prediction caused by the strong randomness and large volatility of new energy; it proposes for the first time a new energy prediction error modeling method based on a naive Bayes classifier, realizes probability information modeling of new energy prediction errors, promotes new energy consumption, and supports power system scheduling.

As shown in fig. 1, the present invention provides a new energy prediction error modeling method based on naive bayes classification, which includes:

adopting a model training method based on cross validation and a naive Bayes classifier model, constructing a new energy prediction error classification model from the discretized actual output data, prediction data and prediction error data; constructing a mapping relation from the actual new energy output data, the new energy prediction data and the generated energy data to the current-day new energy prediction error evaluation;

and carrying out a case study according to the actual new energy output data and the new energy prediction data, combining new energy data cleaning with the naive Bayes-based error classification model, and performing probability evaluation of the new energy prediction error on a new test set.
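The cross-validation training mentioned above can be sketched as a k-fold loop. The fold count, the use of contiguous folds (which keeps time-ordered samples together), the function names and the placeholder majority-label learner are all assumptions; in practice the `fit` and `predict` callables would wrap the naive Bayes classifier.

```python
from collections import Counter

def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(X, y, k, fit, predict):
    """Average held-out accuracy over k folds."""
    accuracies = []
    for held_out in k_fold_indices(len(X), k):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        hits = sum(predict(model, X[i]) == y[i] for i in held_out)
        accuracies.append(hits / len(held_out))
    return sum(accuracies) / k

# Placeholder learner (hypothetical): always predicts the most frequent training label.
def fit_majority(X_train, y_train):
    return Counter(y_train).most_common(1)[0][0]

def predict_majority(model, row):
    return model

X = [(i % 3,) for i in range(10)]
y = ["small"] * 7 + ["large"] * 3
folds = k_fold_indices(10, 3)
score = cross_validate(X, y, 3, fit_majority, predict_majority)
```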

According to the actual output characteristics of new energy, the distribution of new energy predictions and the distribution of new energy prediction errors, and based on the autocorrelation and cross-correlation among the data, the invention provides a new energy prediction error modeling method based on a naive Bayes classifier. The specific embodiment takes northwest new energy as an example and, with reference to the accompanying drawings, is implemented according to the following steps:

Step 1, acquiring actual new energy output data and new energy prediction data, deeply cleaning the new energy data, and detecting abnormal values.

Step 1 specifically comprises acquiring the new energy data and completing data cleaning.

Step 1.1, acquiring the full-year 2021 actual output data and prediction data of new energy in northwest China.

Step 1.2, due to the volatility, randomness and uncertainty of new energy output, grid connection of large-scale new energy power generation brings potential safety hazards to the power system and technical difficulties to power system scheduling. An abnormal value in the new energy power generation data is a data sample whose magnitude deviates far from the normal fluctuation range of the data. New energy power generation is still in the construction and development stage, and owing to damage and maintenance of recording equipment, abnormal behavior of data entry personnel, network intrusion attacks and the like, abnormal values commonly exist in new energy power generation data, wind and solar measurement data and numerical weather forecast data. Moreover, new energy output is directly affected by many weather factors such as wind speed and wind direction, so it is highly random and uncertain, and abnormal data are far harder to identify than in regular data such as load data.

Step 1.3, for data samples like load data, STL decomposition of time series can be adopted to decompose the data samples into trend terms, seasonal terms and residual terms. The STL decomposition formula is as follows:

$$Y = T + S + R \tag{1}$$

in the formula, Y represents an original sequence, T represents a trend term, S represents a season term, and R represents a residual term.
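The additive structure of formula (1) can be illustrated with a simplified decomposition. The sketch below is a classical moving-average decomposition used as a stand-in, not STL itself (STL estimates the trend and seasonal terms with LOESS smoothing); the function name, the period of 4 and the synthetic series are assumptions.

```python
def additive_decompose(y, period):
    """Split y into trend T, seasonal S and residual R so that y = T + S + R."""
    n = len(y)
    half = period // 2
    trend = [None] * n
    for t in range(half, n - half):
        window = y[t - half:t + half + 1]
        if period % 2 == 0:
            # Even period: the two end points of the window get half weight.
            total = sum(window) - 0.5 * (window[0] + window[-1])
        else:
            total = sum(window)
        trend[t] = total / period
    # Edges: reuse the nearest available trend value.
    first, last = half, n - half - 1
    for t in range(n):
        if trend[t] is None:
            trend[t] = trend[first] if t < first else trend[last]
    # Seasonal term: mean detrended value per position in the cycle, centered on zero.
    detrended = [y[t] - trend[t] for t in range(n)]
    means = []
    for k in range(period):
        vals = detrended[k::period]
        means.append(sum(vals) / len(vals))
    center = sum(means) / period
    seasonal = [means[t % period] - center for t in range(n)]
    residual = [y[t] - trend[t] - seasonal[t] for t in range(n)]
    return trend, seasonal, residual

# Synthetic series (hypothetical): linear trend, period-4 pattern, one spike at t = 13.
pattern = [3.0, -1.0, -2.0, 0.0]
y = [10.0 + 0.5 * t + pattern[t % 4] for t in range(24)]
y[13] += 20.0
T, S, R = additive_decompose(y, period=4)
largest = max(range(len(R)), key=lambda t: abs(R[t]))
```

Because the trend and seasonal terms absorb the regular behavior, the injected spike shows up as the largest residual, which is the quantity the subsequent residual analysis inspects.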

The residual term is analyzed by fitting the traditional prediction model S-ARIMA, and the residual is judged by a traditional threshold-based anomaly detection model: a residual with a large value is likely to be an abnormal value. The ARIMA autoregressive expression is as follows:

$$y_t = c + \varphi_1 y_{t-1} + \dots + \varphi_p y_{t-p} + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q} \tag{2}$$

In the formula, p is the autoregressive order, q is the moving average order, and d represents the differencing order.
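Formula (2) and the threshold check on the residual can be sketched as follows. The sketch assumes the coefficients are already known rather than fitted, omits differencing and the seasonal part of S-ARIMA, and uses a robust threshold of three scaled median absolute deviations; these simplifications, the function names and the synthetic series are assumptions.

```python
import random
import statistics

def arma_innovations(y, c, phi, theta):
    """One-step-ahead errors eps_t = y_t - y_hat_t for an ARMA(p, q) model."""
    p, q = len(phi), len(theta)
    eps = [0.0] * len(y)                       # warm-up entries stay at zero
    for t in range(max(p, q), len(y)):
        y_hat = c
        y_hat += sum(phi[i] * y[t - 1 - i] for i in range(p))        # AR terms
        y_hat += sum(theta[j] * eps[t - 1 - j] for j in range(q))    # MA terms
        eps[t] = y[t] - y_hat
    return eps

def flag_outliers(eps, k=3.0):
    """Indices whose error lies more than k robust standard deviations from the median."""
    med = statistics.median(eps)
    mad = statistics.median([abs(e - med) for e in eps])
    scale = 1.4826 * mad                       # MAD scaled to match a normal sigma
    return [t for t, e in enumerate(eps) if abs(e - med) > k * scale]

# Synthetic AR(1) series (hypothetical) with a spike injected at t = 30.
gen = random.Random(0)
y = [5.0]
for _ in range(59):
    y.append(2.0 + 0.6 * y[-1] + gen.gauss(0.0, 0.1))
y[30] += 5.0
eps = arma_innovations(y, c=2.0, phi=[0.6], theta=[])
flagged = flag_outliers(eps)
```

The fixed threshold here also illustrates the weakness discussed next: a tight threshold flags ordinary large fluctuations, while a loose one lets true anomalies through.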

For new energy generation power data, which are strongly random and fluctuate widely, a traditional single detection model performs poorly, with many missed detections and false alarms. An anomaly model defined by a threshold easily flags large but legitimate fluctuations in new energy generation as abnormal, whereas if the threshold is set larger, genuine anomalies are easily missed and treated as normal data. Unsupervised (label-free) detection methods are fast and efficient, but their detection accuracy is lower than that of supervised learning methods and hybrid models, and they readily produce false alarms and missed detections on highly volatile, strongly random new energy generation power data. Supervised learning methods offer high anomaly detection accuracy, and on new energy generation power data they generally outperform traditional unsupervised methods.

Step 1.4. A hybrid anomaly identification model based on the Bagging ensemble learning framework is adopted, with the isolation forest (isolated forest) as the base classifier; Bagging integration effectively improves classification accuracy. The base classifier is trained multiple times, each time on a training sample generated by random extraction, and the average of the learners' results is taken as the final result. This simple averaging strategy can greatly improve on the classification and prediction performance of a single decision tree learner: the more sensitive the base classifier is to the data, the larger the differences between the classifiers and the better the final effect of Bagging integration. During the training of each base classifier, the Bagging framework also randomly selects features as input, which reduces the correlation between base classifiers and further benefits the ensemble. The expression of the Bagging ensemble model is as follows:

$f(x) = \frac{1}{N} \sum_{i=1}^{N} g(x, \alpha_i)$ (3)

where N is the number of constructed base models, g(x, α) is a single classification model, and α is the model parameter.
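The averaging over N base models can be sketched as below. This is a generic illustration under stated assumptions: `bagging_fit`, `bagging_predict` and the toy base learner `fit_mean_model` are hypothetical names, and the base learner is a placeholder rather than the isolation forest used in the method.

```python
import random


def bagging_fit(train, fit_base, n_models=10, seed=0):
    """Train n_models base models, each on a bootstrap resample of train."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        sample = [rng.choice(train) for _ in train]  # draw with replacement
        models.append(fit_base(sample))
    return models


def bagging_predict(models, x):
    """Simple average of the base-model outputs g(x, alpha_i)."""
    return sum(g(x) for g in models) / len(models)


def fit_mean_model(sample):
    """Toy base learner (assumption): always predicts the sample mean."""
    m = sum(sample) / len(sample)
    return lambda x: m
```

Each call to `fit_base` sees a different resample, so the base models differ from one another, and `bagging_predict` smooths those differences by averaging.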

An isolation forest model is constructed as the base classifier. The basic idea of the isolation forest is similar to multi-dimensional hyperplane segmentation: starting from a sample set S, a random hyperplane is selected to split the data set into two subspaces; hyperplanes are then selected at random to split each of the subspaces, and the step is repeated until every subspace contains only one sample point. Each sample point then has a split count, which is the number of hyperplanes needed to isolate it. Normal samples in high-density regions have larger split counts, while abnormal points at the boundary of the sample have smaller split counts.

The construction of the isolation forest is divided into two stages, training and integration. The training process of a single tree (an isolation tree) includes the following steps:

(3) generate a hyperplane at the cut point p, dividing the training samples into two subspaces: under the selected dimension ω, samples smaller than p and samples larger than p form the left and right branches of the node, respectively;

(4) repeat steps (2) and (3) to keep generating new child nodes until a branch of a new child node contains only one sample point.
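The random splitting and the resulting split count can be sketched as follows. This is a minimal illustration, not the full isolation forest: as assumptions, it omits subsampling and score normalisation, stops a branch when the chosen dimension has no spread, and uses hypothetical names (`build_itree`, `path_length`, `mean_path_length`).

```python
import random


def build_itree(points, rng, depth=0, max_depth=50):
    """Recursively split points with random axis-aligned cuts."""
    if len(points) <= 1 or depth >= max_depth:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))      # randomly selected dimension (omega)
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:                             # no spread here: stop (simplification)
        return {"size": len(points)}
    cut = rng.uniform(lo, hi)                # random cut point p
    left = [p for p in points if p[dim] < cut]
    right = [p for p in points if p[dim] >= cut]
    return {"dim": dim, "cut": cut,
            "left": build_itree(left, rng, depth + 1, max_depth),
            "right": build_itree(right, rng, depth + 1, max_depth)}


def path_length(tree, x, depth=0):
    """Number of splits passed before x reaches a leaf."""
    if "cut" not in tree:
        return depth
    branch = "left" if x[tree["dim"]] < tree["cut"] else "right"
    return path_length(tree[branch], x, depth + 1)


def mean_path_length(points, x, n_trees=100, seed=0):
    """Average split count of x over n_trees random trees (integration stage)."""
    rng = random.Random(seed)
    total = sum(path_length(build_itree(points, rng), x) for _ in range(n_trees))
    return total / n_trees
```

A point far from a dense cluster is usually separated by the first few cuts, so its average split count is smaller than that of a point inside the cluster.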

Step 2. Based on the kernel density estimation method and the new energy data compiled in step 1, probability density distribution fitting is performed on the actual output data, the prediction data and the prediction error data of the new energy to obtain the probability density distribution curves of the three types of data. The three types of new energy data are then discretized with a self-organizing map (SOM) neural network: the actual output, predicted values and prediction error data are taken as input, the number of clusters and the classification boundaries of the three types of data are obtained through competitive neuron training, and the data are discretized accordingly.

The step 2 specifically comprises the following steps:

obtaining the probability density distribution curves of the new energy data by the kernel density estimation method, using the historical new energy output data after abnormal data cleaning, the new energy prediction data and the calculated new energy prediction error data;

and clustering the new energy data by adopting a self-organizing mapping neural network (SOM).

Discretizing the new energy output data, the new energy prediction data and the new energy prediction error data according to the self-organizing mapping neural network clustering result.

The basic principle of kernel density estimation is shown in fig. 2: a Gaussian kernel function is placed at the position of each data sample, and all the kernel functions are summed to obtain a smooth probability density function. Kernel density estimation is a nonparametric method and does not require the distribution type of the probability density function to be assumed in advance. Because new energy generation is strongly random and volatile, new energy data have no fixed probability distribution shape; a nonparametric method therefore adapts better to changes in new energy power and gives probability results closer to reality.

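The choice of the kernel density estimation bandwidth h has a large impact on the estimation result. When a Gaussian kernel function is used, the empirical estimate of the optimal bandwidth h is commonly taken as Silverman's rule of thumb:

$h = \left( \frac{4\hat{\sigma}^5}{3n} \right)^{1/5} \approx 1.06\,\hat{\sigma}\,n^{-1/5}$ (4)

where $\hat{\sigma}$ is the sample standard deviation and n is the number of samples.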
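The summation of Gaussian kernels can be sketched in a few lines. The bandwidth default assumes Silverman's rule of thumb, which is a common empirical choice for Gaussian kernels; the function names `silverman_bandwidth` and `gaussian_kde` are hypothetical.

```python
import math
from statistics import stdev


def silverman_bandwidth(samples):
    """Rule-of-thumb bandwidth h = 1.06 * sigma * n^(-1/5) (assumed rule)."""
    return 1.06 * stdev(samples) * len(samples) ** (-0.2)


def gaussian_kde(samples, h=None):
    """Return a density function: the average of Gaussian kernels
    centred on each sample, each with bandwidth h."""
    if h is None:
        h = silverman_bandwidth(samples)
    norm = 1.0 / (len(samples) * h * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in samples)

    return density
```

A smaller h follows individual samples more closely and a larger h smooths the curve, which is why the bandwidth choice has such a large effect on whether a double-peak shape remains visible.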

The probability density distribution curves of the historical new energy actual output data, the new energy prediction data and the calculated new energy prediction error data, obtained by the kernel density estimation method from the data after abnormal data cleaning, are shown in fig. 3, fig. 4 and fig. 5.

The architecture of the self-organizing map neural network is shown in fig. 6. The self-organizing map is a dimension reduction algorithm that learns the data in the input space to generate a low-dimensional, discrete map. It applies a competitive learning strategy, gradually optimizing the network through competition among neurons, and uses a neighborhood function to maintain the topology of the input space, so that adjacent samples in the input space are mapped to adjacent output neurons. Assuming the input space has D dimensions, the input pattern is $x = \{x_i : i = 1, 2, \dots, D\}$, and the connection weights between input unit i and neuron j in the computation layer are $\omega = \{\omega_{i,j} : j = 1, 2, \dots, N;\ i = 1, 2, \dots, D\}$, where N is the total number of neurons.

The self-organizing mapping neural network training process is as follows:

(1) Initialization. There are three common initialization methods: random initialization, suited to cases with little or no prior knowledge of the input data; initialization from initial samples, which has the advantage that the network nodes resemble the topology of the input data from the outset; and linear initialization (PCA), which lets the network extend along the directions of largest variance of the input data.

(2) Competition. For each input pattern, every neuron computes its discriminant function value, and the neuron with the smallest value is declared the winner. The discriminant function of neuron j is the squared Euclidean distance between the input pattern and its weight vector:

$d_j(x) = \sum_{i=1}^{D} (x_i - \omega_{i,j})^2$ (5)

(3) Cooperation. The winning neuron I(x) determines the spatial position of the topological neighborhood of excited neurons. Once the winning node I(x) is determined, the nodes adjacent to it are updated as well. The degree of update is given by a Gaussian neighborhood function:

$T_{j,I(x)} = \exp\left( -\frac{S_{j,I(x)}^2}{2\sigma(t)^2} \right)$ (6)

where $S_{i,j}$ represents the distance between neurons i and j, and σ(t) decays over time. The farther a node is from the winner, the smaller its degree of update.

(4) Adaptation. The connection weights of the excited neurons are adjusted so that the winning neuron responds more strongly to subsequent similar input patterns.

(5) Iteration. Return to step (2) until the feature map stabilizes. After the iteration ends, the neuron activated by each sample is its corresponding class.
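The training loop above can be sketched for a one-dimensional chain of neurons. This is a minimal illustration under stated assumptions: sample-based initialization, squared Euclidean distance as the discriminant, a Gaussian neighborhood, and exponential decay of both the learning rate and σ. The names `train_som` and `best_matching_unit` and all default parameter values are hypothetical.

```python
import math
import random


def best_matching_unit(weights, x):
    """Competition: index of the neuron closest to x (squared Euclidean distance)."""
    return min(range(len(weights)),
               key=lambda j: sum((xi - wi) ** 2 for xi, wi in zip(x, weights[j])))


def train_som(data, n_neurons=5, epochs=50, lr0=0.5, sigma0=2.0, seed=0):
    """Train a 1-D chain of neurons on D-dimensional inputs."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Initialization from samples drawn out of the data
    weights = [list(rng.choice(data)) for _ in range(n_neurons)]
    for epoch in range(epochs):
        decay = math.exp(-epoch / epochs)          # both rates shrink over time
        lr, sigma = lr0 * decay, sigma0 * decay
        for x in rng.sample(data, len(data)):      # visit samples in random order
            winner = best_matching_unit(weights, x)
            for j in range(n_neurons):
                # Cooperation: neighbors of the winner get a smaller update
                t = math.exp(-((j - winner) ** 2) / (2 * sigma ** 2))
                # Adaptation: move the weight vector toward the input
                for i in range(dim):
                    weights[j][i] += lr * t * (x[i] - weights[j][i])
    return weights
```

After training, calling `best_matching_unit(weights, x)` on a sample returns the index of its activated neuron, which serves as the discrete class of that sample.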

The results of discretizing the actual output data of the new energy, the prediction data of the new energy and the prediction error data of the new energy according to the clustering result of the self-organizing mapping neural network are shown in table 1, table 2 and table 3:

TABLE 1 SOM clustering results of actual outputs of new energy

TABLE 2 SOM clustering results of new energy prediction data

TABLE 3 New energy prediction error SOM clustering results

Step 3. Using a model training method based on cross validation and a naive Bayes classifier model, a new energy prediction error classification model is constructed from the discretized new energy actual output data, new energy prediction data and new energy prediction error data. Using the discretized actual output data, prediction error data and generated energy data, and taking into account the characteristics of new energy in the northwest region as well as the autocorrelation and cross-correlation between the new energy prediction and actual output data, a mapping is established from the current actual output data, prediction data and generated energy data to the evaluation of the new energy prediction error.

Step 3.1. A naive Bayes classifier is used to model the northwest new energy prediction error data. Because the input of naive Bayes is discrete data, the three types of data are discretized according to the actual output clustering result, the prediction data clustering result and the prediction error classification result. The prediction data of the previous day are taken as the first input $x_1$, the actual output of the previous two days as the second input $x_2$, the current-day prediction as the third input $x_3$, and the average generated energy of the previous two days as the fourth input $x_4$; the prediction error is taken as the label y to train the naive Bayes classifier.

Constructing a basic naive Bayes probability model:

$p(C \mid F_1, \dots, F_n)$ (7)

Rewriting the model according to Bayes' formula gives:

$p(C \mid F_1, \dots, F_n) = \frac{p(C)\, p(F_1, \dots, F_n \mid C)}{p(F_1, \dots, F_n)}$ (8)

where p(C) is the prior probability and $p(C \mid F_1, \dots, F_n)$ is the posterior probability. When none of the features of the sample to be predicted is known, the probability that the sample belongs to a given category is judged to be p(C); once the features $F_1 = f_1, \dots, F_n = f_n$ are known, p(C) is multiplied by the factor

$\frac{p(F_1, \dots, F_n \mid C)}{p(F_1, \dots, F_n)}$

to obtain the conditional probability that the sample belongs to that category. A factor larger than 1 raises the probability, and a factor smaller than 1 lowers it.

The denominator does not depend on C, and the feature values are given, so the denominator can be treated as a constant. The numerator $p(C)\, p(F_1, \dots, F_n \mid C)$ is therefore equivalent to the joint distribution model:

$p(C \mid F_1, \dots, F_n) \propto p(C)\, p(F_1, \dots, F_n \mid C)$

$\propto p(C)\, p(F_1 \mid C)\, p(F_2, \dots, F_n \mid C, F_1)$

$\propto p(C)\, p(F_1 \mid C)\, p(F_2 \mid C, F_1) \cdots p(F_n \mid C, F_1, \dots, F_{n-1})$ (9)

According to the definition of the naive Bayes classifier, every feature is assumed to be conditionally independent of every other feature given the class, that is, for i ≠ j:

$p(F_i \mid C, F_j) = p(F_i \mid C)$ (10)

Step 3.2. A K-fold cross validation model training method is adopted, with the following operation steps:

1. divide the whole data set into K parts;

2. in turn, take one part as the test set, train the model on the other K - 1 parts (four parts when K = 5), and then calculate the evaluation index RMSE_i of the model on that test set;

3. average the K values RMSE_i to obtain the final RMSE.

K =5 was selected in the example test.
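The three operation steps can be sketched as follows. The helper names `k_fold_indices` and `cross_validate`, and the `fit` and `predict` callables, are hypothetical; as a simplifying assumption the folds are contiguous and unshuffled.

```python
import math


def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(xs, ys, fit, predict, k=5):
    """Train on k-1 folds, test on the held-out fold, and average RMSE_i."""
    rmses = []
    for test_idx in k_fold_indices(len(xs), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        sq_err = [(predict(model, xs[i]) - ys[i]) ** 2 for i in test_idx]
        rmses.append(math.sqrt(sum(sq_err) / len(sq_err)))
    return sum(rmses) / k
```

Every sample is used for testing exactly once, so the averaged RMSE reflects performance on data the model did not see during the corresponding training run.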

The 'naive' assumption greatly reduces the amount of calculation. When the number of features is large, the probabilities on the right-hand side of equation (9) cannot be estimated directly: in practical engineering problems there are often very many features, each taking many values, so estimating these probabilities from statistics is almost impossible. Naive Bayes therefore assumes that the features are conditionally independent given the class. This makes the method simple and easy to compute, at the cost of some classification accuracy. Under the independence assumption, the conditional probability distribution of the class variable C can be expressed as:

$p(C \mid F_1, \dots, F_n) = \frac{1}{Z}\, p(C) \prod_{i=1}^{n} p(F_i \mid C)$

where Z depends only on $F_1, \dots, F_n$ and is a constant when the feature variables are known.

A classifier is constructed from the probability model: the naive Bayes classifier comprises the model and a corresponding decision rule. The maximum a posteriori (MAP) decision rule is adopted, that is, the label with the highest probability is selected. The naive Bayes classifier is defined as follows:

$\hat{c} = \underset{c}{\arg\max}\; p(C = c) \prod_{i=1}^{n} p(F_i = f_i \mid C = c)$
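A naive Bayes classifier with the MAP decision rule can be sketched for discrete features as follows. The Laplace smoothing constant `alpha` is an assumption added to avoid zero probabilities and is not mentioned in the text; the names `fit_naive_bayes` and `predict_map` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def fit_naive_bayes(X, y, alpha=1.0):
    """Count class frequencies and per-class feature value frequencies.

    alpha > 0 is a Laplace smoothing constant (assumption, not from the text).
    """
    n_features = len(X[0])
    values = [sorted({row[i] for row in X}) for i in range(n_features)]
    cond = defaultdict(Counter)          # (feature index, class) -> value counts
    for row, c in zip(X, y):
        for i, v in enumerate(row):
            cond[(i, c)][v] += 1
    return {"class_count": Counter(y), "values": values,
            "cond": cond, "alpha": alpha, "n": len(y)}


def predict_map(model, row):
    """Return the class maximizing log p(c) + sum_i log p(f_i | c)."""
    best, best_score = None, -math.inf
    for c, n_c in model["class_count"].items():
        score = math.log(n_c / model["n"])                 # prior p(C = c)
        for i, v in enumerate(row):
            num = model["cond"][(i, c)][v] + model["alpha"]
            den = n_c + model["alpha"] * len(model["values"][i])
            score += math.log(num / den)                   # p(F_i = f_i | C = c)
        if score > best_score:
            best, best_score = c, score
    return best
```

Working in log space replaces the product of probabilities with a sum, which avoids numerical underflow when there are many features; the constant Z is dropped because it does not affect which class attains the maximum.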

Step 4. Using the actual new energy output data and new energy prediction data of the northwest region for 2021, a case analysis is carried out that combines the new energy data cleaning of step 1 with the naive Bayes error classification model of steps 2 and 3, and the new energy prediction error is evaluated probabilistically on a new test set to demonstrate the effectiveness of the method.

Step 4 is as follows: the actual new energy output data and new energy prediction data of the northwest region for 2021 are used as the basis for the case analysis, to verify the feasibility of the naive Bayes new energy prediction error modeling method. A comprehensive evaluation index analysis is also carried out on the new energy prediction error classification result.

Step 4.1. First, the characteristics and probability density distributions of the northwest actual new energy output data and new energy prediction data are studied. As can be seen from fig. 3 and fig. 4, the actual output has an obvious double-peak aggregated distribution with an overall range of 0 to 60000 MW/h, and the prediction data also have an obvious double-peak aggregated distribution with an overall range of 0 to 80000 MW/h. The probability density curves of the actual output and the prediction data can be divided into five parts: an initial climbing stage, the two peaks, a slow descending stage after the first peak, and a final descending stage.

Step 4.2. Based on the probability density distribution, the data are clustered with the self-organizing map neural network, the clustering result is used to discretize the data, and the clustering result is compared with the probability density distribution of the data.

The self-organizing map neural network divides the actual new energy output data into 5 levels. The first level, approximately 0 to 15000, corresponds well to the initial climbing stage of the probability distribution curve; the second level, approximately 15000 to 25000, corresponds to the first peak; the third level, approximately 25000 to 36000, corresponds to the slow descending stage after the first peak; the fourth level, approximately 36000 to 48000, corresponds to the second peak; and the last level, approximately 48000 to 74000, describes the tail of the curve well.

The self-organizing map neural network likewise divides the new energy prediction data into 5 levels. The first level, approximately 0 to 15000, corresponds well to the initial climbing stage of the probability distribution curve; the second level, approximately 15000 to 25000, corresponds to the first peak; the third level, approximately 25000 to 37000, corresponds to the slow descending stage after the first peak; the fourth level, approximately 36000 to 49000, corresponds to the second peak; and the last level, approximately 48000 to 100000, describes the tail of the curve well.
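Once the level boundaries are known, discretizing a value reduces to a boundary lookup. The sketch below uses the approximate boundaries quoted for the actual output data; the name `to_level` and the convention that a value equal to a boundary falls into the higher level are assumptions.

```python
from bisect import bisect_right

# Approximate upper boundaries of levels 1 to 4 for actual output, from the text
ACTUAL_OUTPUT_BOUNDS = [15000, 25000, 36000, 48000]


def to_level(value, bounds=ACTUAL_OUTPUT_BOUNDS):
    """Map a continuous value to a level index in 1..len(bounds)+1.

    A value equal to a boundary is assigned to the higher level (assumption).
    """
    return bisect_right(bounds, value) + 1
```

For example, an actual output of 30000 falls between 25000 and 36000 and is therefore assigned to level 3, the slow descending stage after the first peak.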

Step 4.3. As can be seen from the new energy prediction error probability distribution diagram, the prediction error data fall into three classes with an unbalanced distribution. Most machine learning classification models perform poorly on unbalanced data. Moreover, the majority class spans a wide range, from about -2400 to 2200 MW/h, which offers little practical guidance. The majority class is therefore further subdivided into 4 classes according to the number of samples, so that the prediction error data are divided into 6 classes in total, as shown in the following table:

TABLE 4 discretization method of new energy prediction error

Step 4.4. A comprehensive evaluation index analysis is carried out on the new energy prediction error classification result. The results of the naive Bayes classification of the new energy prediction errors on the test set are shown in fig. 7, which gives the confusion matrix of the prediction error classification on the test set. In the figure, the vertical axis represents the true labels of the samples, the horizontal axis represents the labels predicted by the model, and the shade of each color block represents the number of samples: the darker the color, the larger the number. By the nature of the confusion matrix, the sparser the off-diagonal entries, the higher the prediction accuracy. In this example, the test set contains 7008 samples, of which 1497 are misclassified and 5511 are correctly classified, giving an accuracy of about 79%. The method comprehensively analyzes the data characteristics and distribution of new energy in the northwest region and models the new energy prediction error as probability information. It thereby improves the reliability of new energy prediction, promotes new energy consumption and the safe and stable operation of the power grid, supports power system scheduling decisions, and contributes to the 'dual carbon' goal.
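The accuracy follows directly from the confusion matrix: it is the share of samples on the main diagonal. The sketch below shows the calculation for an arbitrary square matrix and checks the reported totals; the function name `accuracy_from_confusion` is hypothetical, and the full 6-class matrix of fig. 7 is not reproduced here.

```python
def accuracy_from_confusion(matrix):
    """Share of samples on the main diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


# Reported totals for the test set: 5511 correct out of 7008 samples
reported_accuracy = 5511 / 7008
```

With the reported totals, 5511 / 7008 is approximately 0.786, which rounds to the 79% stated above.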

Example of the implementation

Based on the above new energy prediction error modeling method, real full-year new energy data for the northwest region of China in 2021 are selected for calculation and analysis, to verify the feasibility of the naive Bayes new energy prediction error modeling method, and a comprehensive evaluation index analysis is carried out on the classification results. The data comprise the actual new energy output data and the new energy prediction data; the raw data are uncleaned, and the time resolution is 15 min.

First, the characteristics of the new energy data are studied, and a level division and discretization method for the actual output data and prediction data is proposed. The example results show that the clustering method based on the self-organizing map neural network describes the probability density distributions of the actual output data and prediction data well, reflecting the climbing, peak and descending stages of the probability density curve, and effectively improves the grading precision of the new energy data.

Second, the invention provides a naive Bayes new energy prediction error classification model trained with cross validation. By modeling the characteristics of northwest new energy and the autocorrelation and cross-correlation between the new energy prediction and actual output data, and by constructing the mapping from the actual output data, prediction data and generated energy data to the evaluation of the new energy prediction error, the modeling precision of the new energy prediction error is greatly improved: of the 7008 samples in the test set, 1497 are misclassified and 5511 are correctly classified, so the classification accuracy reaches about 79%.

Compared with the traditional new energy prediction error processing method, the new energy prediction error modeling method based on naive Bayes can remarkably improve the classification accuracy of new energy prediction errors, improve the reliability of new energy prediction, provide basis for power system scheduling planning, and promote new energy consumption and safe and stable operation of a power grid. In conclusion, the invention provides a new energy prediction error modeling method based on a naive Bayes classifier for the first time, so that probability information modeling of the new energy prediction error is realized, new energy consumption is promoted, and guarantee is provided for power system scheduling.

The result shows that the new energy prediction error modeling method based on naive Bayes can remarkably improve the classification precision of the new energy prediction error, improve the new energy prediction reliability, provide basis for power system scheduling planning, and promote new energy consumption and safe and stable operation of a power grid.

As shown in fig. 8, the present invention further provides a new energy prediction error modeling system based on naive bayes classification, which includes:

a data cleaning module, configured to acquire actual output data and prediction data of the new energy, clean both data sets to remove abnormal values, and obtain new energy prediction error data;

a relation mapping module, configured to construct a new energy prediction error classification model from the discretized actual output data, prediction data and prediction error data of the new energy, using a naive Bayes classifier model and a model training method based on cross validation, and to construct a mapping from the current actual output data, prediction data and generated energy data to the evaluation of the new energy prediction error;

Finally, it should be noted that: although the present invention has been described in detail with reference to the above embodiments, it should be understood by those skilled in the art that: modifications and equivalents may be made to the embodiments of the invention without departing from the spirit and scope of the invention, which is to be covered by the claims.

Claims

1. The new energy prediction error modeling method based on naive Bayes classification is characterized by comprising the following steps:

constructing a new energy prediction error classification model by using a naive Bayesian classifier model and using the actual output data of the new energy, the prediction data of the new energy and the prediction error data of the new energy which are subjected to discrete processing based on a cross validation model training method; constructing a mapping relation from the actual output data of the new energy, the prediction data of the new energy and the generated energy data to the prediction error evaluation of the new energy at the present day;

2. The new energy prediction error modeling method based on naive Bayes classification as claimed in claim 1, wherein said obtaining actual output data of new energy and new energy prediction data, performing data cleaning on actual output data of new energy and new energy prediction data, removing abnormal value, obtaining new energy prediction error data; the method comprises the following steps:

acquiring annual actual output data and new energy forecast data of new energy;

removing abnormal values in the new energy power generation data, wherein the abnormal values refer to data samples with sizes deviating from the normal fluctuation range of the data;

Y＝T+S+R

residual error items are analyzed by using a prediction model S-ARIMA autoregressive fitting, and the expression of the ARIMA autoregressive fitting is as follows:

and (3) adopting a hybrid anomaly identification model based on a Bagging ensemble learning framework, using a basic model isolated forest as a base classifier for the hybrid anomaly identification model, training the basic base classifier model for multiple times, randomly extracting a training sample for each time, and finally taking the average value of the results of all learners as a final result to serve as new energy source prediction error data.

3. The naive Bayesian classification-based new energy prediction error modeling method according to claim 2, wherein in the Bagging ensemble learning framework-based mixed anomaly identification model, an expression of the Bagging ensemble model is as follows:

constructing an isolated forest model as a base classifier, wherein the basic idea of the isolated forest is similar to multi-dimensional hyperplane segmentation, initially possessing a sample set S, selecting a random hyperplane to segment a data set to generate two subspaces, then randomly selecting the hyperplane to segment the two subspaces, and repeating until each subspace only contains one sample point; so far, each sample point corresponds to a division number.

4. The naive Bayes classification based new energy prediction error modeling method as claimed in claim 3, wherein the construction of the isolated forest model is divided into two stages, i.e. training and integration, comprising:

generating a hyperplane by using a cutting point p, dividing a training sample into two subspaces, and dividing the samples smaller than p and larger than p under the selected dimension omega into two types to respectively form left and right branches of the node;

5. The new energy prediction error modeling method based on naive Bayes classification as claimed in claim 1, wherein said kernel density estimation method is used for performing probability density distribution fitting on actual output data of new energy, predicted data of new energy and predicted error data of new energy to obtain probability density distribution curves of three kinds of data; discretizing three types of new energy data by using a self-organizing mapping neural network (SOM), taking actual output data of new energy, new energy prediction data and new energy prediction error data as input, training by competitive neurons to obtain the clustering number and classification boundary of the three types of data, and discretizing the data; the method comprises the following steps:

selecting a kernel density estimation bandwidth h, which has a large influence on an obtained estimation result; using a gaussian kernel function to perform kernel density estimation, an empirical estimation formula of the optimal bandwidth h is as follows:

clustering new energy data by adopting a self-organizing mapping neural network (SOM); the self-organizing mapping neural network generates a low-dimensional discrete mapping by learning data in an input space; a competitive learning strategy is applied, and the network is gradually optimized by means of mutual competition among neurons; a neighbor relation function is used to maintain the topology of the input space; adjacent samples in the input space are mapped to adjacent output neurons; the input space has D dimensions, the input pattern is $x = \{x_i : i = 1, 2, \dots, D\}$, and the connection weights between input unit i and neuron j in the computation layer are $\omega = \{\omega_{i,j} : j = 1, 2, \dots, N;\ i = 1, 2, \dots, D\}$, where N is the total number of neurons;

6. The naive bayes classification based new energy prediction error modeling method according to claim 5, wherein the self-organizing map neural network training process is as follows:

initializing;

and then discretizing the actual output data of the new energy, the prediction data of the new energy and the prediction error data of the new energy according to the clustering result of the self-organizing mapping neural network.

7. The new energy prediction error modeling method based on naive Bayes classification as claimed in claim 1, wherein, in the model training method based on cross validation, a naive Bayes classifier model is used to construct the new energy prediction error classification model from the discretized actual output data, prediction data and prediction error data of the new energy; and a mapping relation from the current new energy actual output data, new energy prediction data and generated energy data to the new energy prediction error evaluation is constructed by using the discretized new energy actual output data, new energy prediction error data and new energy generated energy data, in combination with the characteristics of the new energy itself and the autocorrelation and cross-correlation between the new energy prediction data and the actual output data; the method comprising the following steps:

constructing a basic naive Bayes probability model:

$$p(C \mid F_1, \ldots, F_n)$$

A joint distribution model:

$$p(C \mid F_1, \ldots, F_n) \propto p(C)\, p(F_1, \ldots, F_n \mid C)$$

$$\propto p(C)\, p(F_1 \mid C)\, p(F_2, \ldots, F_n \mid C, F_1)$$

$$\propto p(C)\, p(F_1 \mid C)\, p(F_2 \mid C, F_1) \cdots p(F_n \mid C, F_1, \ldots, F_{n-1})$$

under the conditional independence assumption, for $i \neq j$:

$$p(F_i \mid C, F_j) = p(F_i \mid C)$$

the conditional probability distribution of the variable C is expressed as:
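The expression itself does not appear in this text. Combining the chain-rule factorization with the independence assumption stated in this claim gives p(C | F_1, ..., F_n) ∝ p(C) * p(F_1 | C) * ... * p(F_n | C), normalized over the values of C. The Python sketch below is one possible reading of that model for integer-coded (discretized) features, evaluated with k-fold cross validation. The function names, the Laplace smoothing constant `alpha`, the choice of k = 5 and the synthetic data are assumptions, not details taken from the source.

```python
import numpy as np

def fit_naive_bayes(X, y, n_feature_values, n_classes, alpha=1.0):
    """Estimate log p(C) and log p(F_i | C) with Laplace smoothing (assumed)."""
    class_counts = np.bincount(y, minlength=n_classes)
    log_prior = np.log((class_counts + alpha) / (len(y) + alpha * n_classes))
    log_likelihood = []
    for i, n_values in enumerate(n_feature_values):
        counts = np.full((n_classes, n_values), alpha)
        np.add.at(counts, (y, X[:, i]), 1.0)  # count (class, feature value) pairs
        log_likelihood.append(np.log(counts / counts.sum(axis=1, keepdims=True)))
    return log_prior, log_likelihood

def predict_proba(X, log_prior, log_likelihood):
    """Return p(C | F_1, ..., F_n) for each row of X."""
    log_post = np.tile(log_prior, (len(X), 1))
    for i, table in enumerate(log_likelihood):
        log_post += table[:, X[:, i]].T       # add log p(F_i | C)
    log_post -= log_post.max(axis=1, keepdims=True)  # numerical stability
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)    # normalize over C

def cross_validate(X, y, n_feature_values, n_classes, k=5, seed=0):
    """Return the classification accuracy on each of k held-out folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = []
    for j in range(k):
        test = folds[j]
        train = np.concatenate([folds[m] for m in range(k) if m != j])
        model = fit_naive_bayes(X[train], y[train], n_feature_values, n_classes)
        pred = predict_proba(X[test], *model).argmax(axis=1)
        scores.append(float(np.mean(pred == y[test])))
    return scores

# Hypothetical discretized data: two features with 5 classes each, binary error class.
rng = np.random.default_rng(1)
actual_cls = rng.integers(0, 5, size=1000)
forecast_cls = rng.integers(0, 5, size=1000)
error_cls = np.where(rng.random(1000) < 0.8,
                     (forecast_cls >= 3).astype(int),
                     rng.integers(0, 2, size=1000))
X = np.column_stack([actual_cls, forecast_cls])

scores = cross_validate(X, error_cls, [5, 5], 2)
model = fit_naive_bayes(X, error_cls, [5, 5], 2)
proba = predict_proba(X[:3], *model)
```

Each row of `proba` is a probability distribution over the error classes, which is the form of output a probability evaluation of the prediction error requires.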

8. The new energy prediction error modeling method based on naive Bayes classification as claimed in claim 1, wherein carrying out the example analysis by combining new energy data cleaning with the naive Bayes based error classification model, and performing probability evaluation of the new energy prediction error on a new test set of new energy actual output data and new energy prediction data, comprises:

acquiring three types of data with unbalanced distribution from the new energy prediction error probability distribution diagram;

9. The new energy prediction error modeling method based on naive Bayes classification as claimed in claim 1, wherein the data are clustered by using the self-organizing mapping neural network clustering method according to the probability density distribution, the data are discretized according to the clustering result, and the clustering result is compared with the probability density distribution of the data; the method comprising the following steps:

clustering the data by using the self-organizing mapping neural network clustering method according to the probability density distribution, discretizing the data according to the clustering result, and comparing the clustering result with the probability density distribution of the data; the self-organizing mapping neural network divides the actual output data of the new energy into 5 classes;

the first stage, from 0 to 15000, corresponds to the first climbing stage in the probability distribution curve;

the second stage, from 15000 to 25000, corresponds to the first peak in the probability distribution curve;

the last stage, from 48000 to 74000, corresponds to the tail of the probability distribution curve;

the last stage, from 48000 to 100000, corresponds to the tail of the probability distribution curve.
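As a small illustration of how such stage boundaries turn continuous values into discrete stages, the sketch below uses only the boundaries that appear in this claim text (15000, 25000 and 48000). The boundaries between 25000 and 48000 are not given here, so that range is left as a single bin and the result has four stages rather than five. The sample values are hypothetical.

```python
import numpy as np

# Boundaries taken from the claim text; the ones between 25000 and 48000 are not given.
edges = np.array([15000.0, 25000.0, 48000.0])

# Hypothetical output values to be discretized.
power = np.array([3200.0, 15000.0, 21000.0, 36000.0, 52000.0, 99000.0])

# np.digitize returns i when edges[i-1] <= value < edges[i]:
# 0 for values below 15000, 1 for 15000 to 25000, 2 for 25000 to 48000, 3 for 48000 and above.
stage = np.digitize(power, edges)
```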

10. A new energy prediction error modeling system based on naive Bayes classification, characterized by comprising: