CN113052386A

CN113052386A - Distributed photovoltaic daily generated energy prediction method and device based on random forest algorithm

Info

Publication number: CN113052386A
Application number: CN202110333708.9A
Authority: CN
Inventors: 艾宇飞; 来广志; 谢祥颖; 解鸿斌; 马晓光; 蔡世霞; 周专; 单雨; 王少婷; 刘润彪; 唐洋
Original assignee: State Grid Tianjin Electric Power Co Ltd; State Grid Xinjiang Electric Power Co Ltd; State Grid E Commerce Co Ltd
Current assignee: State Grid Tianjin Electric Power Co Ltd; State Grid Xinjiang Electric Power Co Ltd; State Grid E Commerce Co Ltd
Priority date: 2021-03-29
Filing date: 2021-03-29
Publication date: 2021-06-29
Anticipated expiration: 2041-03-29
Also published as: CN113052386B

Abstract

The method constructs a plurality of regression trees from an original sample set consisting of a plurality of power generation samples of a photovoltaic power station, thereby obtaining a power generation prediction model composed of the regression trees; analyzes the meteorological features of a natural day to be predicted with each regression tree in the model to obtain a plurality of theoretical power generation amounts for that day; and averages the theoretical power generation amounts to obtain the predicted power generation of the photovoltaic power station on the natural day to be predicted. Because the power generation prediction model composed of a plurality of regression trees comprehensively analyzes the various meteorological features of the natural day to be predicted, the method does not need a preset functional form and does not need to account for multicollinearity among the independent variables; it better reflects the influence of changing weather conditions on power generation and can maintain high prediction accuracy under any meteorological conditions.

Description

Distributed photovoltaic daily generated energy prediction method and device based on random forest algorithm

Technical Field

The invention relates to the technical field of photovoltaic power generation, in particular to a distributed photovoltaic daily generated energy prediction method and device based on a random forest algorithm.

Background

At present, China's energy structure is still dominated by coal. China is the world's largest emitter of sulfur dioxide and carbon dioxide and faces enormous international pressure over its environmental responsibilities. Energy conservation and emission reduction still have a long way to go, and the problem urgently needs to be solved. Therefore, with the energy structure in a period of strategic transformation, China has proposed building a clean, low-carbon, safe and efficient energy system. An energy transition centered on new energy is now advancing rapidly, and clean energy represented by distributed photovoltaics is of strategic importance for optimizing the energy structure, promoting energy conservation and emission reduction, and achieving sustainable economic development.

With strong support from national policy and the active participation of market entities, China's installed photovoltaic capacity has grown rapidly for ten consecutive years, with newly installed capacity remaining above 40 GW. By the end of 2019, cumulative installed photovoltaic capacity nationwide had reached 204.30 GW (204.30 million kW), a year-on-year increase of 17.3%, of which distributed photovoltaics accounted for 62.63 GW (62.63 million kW), a year-on-year increase of 24.2%.

However, the output of new energy sources such as wind and solar is random and volatile, so the safe operation and effective consumption of grid-connected new energy have long been a worldwide problem and a research focus. In particular, large-scale grid connection of distributed photovoltaics can strongly affect the voltage and current of the power grid, the power flow of distribution lines, reactive power compensation, and so on. It is therefore very important to accurately predict the power generation of photovoltaic systems, especially distributed ones. However, most existing distributed photovoltaic power stations are located in remote areas, mountainous areas, rooftops, agricultural greenhouses, fish ponds and similar places; they are scattered and small, and are constrained by local climate, geography and other factors, all of which raises the requirements and difficulty of predicting their power generation.

At present, common methods for predicting the power generation of a photovoltaic power station include statistical analysis methods and artificial intelligence algorithms. Statistical analysis methods apply techniques such as regression analysis and time series analysis to data such as the power generation of the photovoltaic power station and solar irradiance in order to predict the station's likely future power generation. Artificial intelligence algorithms mainly include clustering, Markov chains, neural networks and support vector machines, and offer advantages such as high prediction accuracy and fast response.

To maintain high prediction accuracy, existing prediction methods can only predict power generation under specific meteorological conditions and cannot predict it accurately under arbitrary meteorological conditions.

Disclosure of Invention

Based on the problems in the prior art, the invention provides a distributed photovoltaic daily generated energy prediction method and device based on a random forest algorithm, so that the daily generated energy of a photovoltaic power station under various meteorological conditions can be predicted more accurately.

A first aspect of the application provides a distributed photovoltaic daily generated energy prediction method based on a random forest algorithm, which comprises the following steps:

Model construction process:

obtaining an original sample set consisting of a plurality of power generation amount samples of the photovoltaic power station; wherein each of the power generation amount samples corresponds to a natural day; the power generation capacity sample consists of meteorological features corresponding to natural days and actual power generation capacity of the photovoltaic power station; the meteorological features comprise a plurality of input variables representing meteorological conditions corresponding to natural days;

performing random sampling with replacement on the original sample set N times to obtain N training sample sets; wherein N is a preset positive integer;

for each training sample set, constructing a regression tree by using the training sample set; the power generation amount prediction model of the photovoltaic power station is composed of N regression trees constructed by N training sample sets;

wherein, the constructing a regression tree by using the training sample set comprises:

establishing a root node of a regression tree, and distributing the training sample set to the root node;

judging whether the current height of the regression tree is smaller than a preset height threshold value or not;

if the current height of the regression tree is smaller than the height threshold, determining each leaf node of the regression tree as a node to be split, and determining a plurality of corresponding characteristic variables for each node to be split; wherein each of the characteristic variables consists of one or more input variables in the meteorological characteristic;

for each node to be split, determining a cut point of each characteristic variable corresponding to the node to be split, and calculating a Gini index of each characteristic variable of the node to be split by using a sample set corresponding to the node to be split and the cut points of each characteristic variable;

selecting a characteristic variable with the minimum Gini index corresponding to each node to be split as a splitting variable of the node to be split;

for each node to be split, splitting a sample set corresponding to the node to be split into two sample sets by taking a splitting point of the splitting variable as a basis, and distributing the two sample sets obtained by splitting to two pre-created child nodes of the node to be split;

returning to the step of judging whether the current height of the regression tree is smaller than a preset height threshold value;

if the current height of the regression tree is not smaller than the height threshold value, outputting the regression tree to complete the process of constructing the regression tree;

the power generation amount prediction process:

acquiring meteorological features of a natural day to be predicted;

for each regression tree contained in the power generation prediction model of the photovoltaic power station, analyzing the meteorological features of the natural day to be predicted by using the regression tree to obtain one theoretical power generation amount of the natural day to be predicted;

and averaging the N theoretical power generation amounts to obtain the predicted power generation of the photovoltaic power station on the natural day to be predicted.
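The construction and prediction steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the claimed implementation: the tree is grown recursively rather than level by level, every midpoint between adjacent feature values is tried as a candidate cut point, and the sum of squared errors is swapped in as the split score in place of the Gini index named in the text. All function names and parameters (`build_tree`, `build_forest`, `n_candidates` and so on) are hypothetical.

```python
import random
from statistics import mean

def sse(ys):
    """Sum of squared deviations from the mean (stand-in for the split score)."""
    if not ys:
        return 0.0
    m = mean(ys)
    return sum((y - m) ** 2 for y in ys)

def best_split(rows, n_candidates, rng):
    """Return (score, feature index, cut point) with the lowest score, or None."""
    n_features = len(rows[0][0])
    features = rng.sample(range(n_features), min(n_candidates, n_features))
    best = None
    for f in features:
        values = sorted({x[f] for x, _ in rows})
        for lo, hi in zip(values, values[1:]):
            cut = (lo + hi) / 2  # candidate cut point between adjacent values
            left = [y for x, y in rows if x[f] <= cut]
            right = [y for x, y in rows if x[f] > cut]
            score = sse(left) + sse(right)
            if best is None or score < best[0]:
                best = (score, f, cut)
    return best

def build_tree(rows, max_height, n_candidates, rng, height=0):
    """Split until the height threshold is reached or no split is possible."""
    split = best_split(rows, n_candidates, rng) if height < max_height else None
    if split is None:
        return {"value": mean(y for _, y in rows)}  # leaf: mean generation
    _, f, cut = split
    left = [r for r in rows if r[0][f] <= cut]
    right = [r for r in rows if r[0][f] > cut]
    return {"feature": f, "cut": cut,
            "left": build_tree(left, max_height, n_candidates, rng, height + 1),
            "right": build_tree(right, max_height, n_candidates, rng, height + 1)}

def predict_tree(tree, x):
    """Walk from the root to a leaf and return that leaf's value."""
    while "value" not in tree:
        tree = tree["left"] if x[tree["feature"]] <= tree["cut"] else tree["right"]
    return tree["value"]

def build_forest(rows, n_trees, max_height, n_candidates, seed=0):
    """Build n_trees trees, each from a sample drawn with replacement."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]
        forest.append(build_tree(sample, max_height, n_candidates, rng))
    return forest

def predict_forest(forest, x):
    """Average the theoretical generation predicted by every tree."""
    return mean(predict_tree(t, x) for t in forest)
```

Each row is assumed to be a pair `(features, actual_generation)` with numeric features, so discrete variables such as season would need a numeric encoding first.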

The second aspect of the application provides a distributed photovoltaic daily generated energy prediction device based on a random forest algorithm, including:

an obtaining unit for obtaining an original sample set composed of a plurality of power generation amount samples of the photovoltaic power station; wherein each of the power generation amount samples corresponds to a natural day; the power generation capacity sample consists of meteorological features corresponding to natural days and actual power generation capacity of the photovoltaic power station; the meteorological features comprise a plurality of input variables representing meteorological conditions corresponding to natural days;

the sampling unit is used for performing random sampling with replacement on the original sample set N times to obtain N training sample sets; wherein N is a preset positive integer;

the construction unit is used for constructing a regression tree by utilizing the training sample set aiming at each training sample set; the power generation amount prediction model of the photovoltaic power station is composed of N regression trees constructed by N training sample sets;

when the construction unit constructs a regression tree by using the training sample set, the following steps are specifically executed:

the obtaining unit is used for obtaining the meteorological features of the natural day to be predicted;

the analysis unit is used for analyzing, for each regression tree contained in the power generation prediction model of the photovoltaic power station, the meteorological features of the natural day to be predicted by using the regression tree to obtain one theoretical power generation amount of the natural day to be predicted;

and the calculation unit is used for averaging the N theoretical power generation amounts to obtain the predicted power generation of the photovoltaic power station on the natural day to be predicted.

The application provides a distributed photovoltaic daily power generation prediction method and device based on a random forest algorithm. The method comprises: obtaining an original sample set consisting of a plurality of power generation samples of a photovoltaic power station, wherein each power generation sample corresponds to a natural day and consists of the meteorological features of that natural day and the actual power generation of the photovoltaic power station, and the meteorological features comprise a plurality of input variables representing the meteorological conditions of that natural day; performing random sampling with replacement on the original sample set N times to obtain N training sample sets, wherein N is a preset positive integer; and, for each training sample set, constructing a regression tree by using the training sample set, the N regression trees constructed from the N training sample sets forming the power generation prediction model of the photovoltaic power station. Constructing a regression tree by using a training sample set comprises: establishing a root node of the regression tree and assigning the training sample set to the root node; judging whether the current height of the regression tree is smaller than a preset height threshold; if the current height is smaller than the height threshold, determining each current leaf node of the regression tree as a node to be split and determining a plurality of corresponding characteristic variables for each node to be split, each characteristic variable consisting of one or more input variables in the meteorological features; for each node to be split, determining a cut point of each corresponding characteristic variable and calculating the Gini index of each characteristic variable of the node by using the sample set corresponding to the node and the cut point of each characteristic variable; selecting, for each node to be split, the characteristic variable with the minimum Gini index as the splitting variable of the node; for each node to be split, splitting the corresponding sample set into two sample sets on the basis of the cut point of the splitting variable and assigning the two sample sets to two pre-created child nodes of the node; returning to the step of judging whether the current height of the regression tree is smaller than the preset height threshold; and, if the current height is not smaller than the height threshold, outputting the regression tree to complete its construction. The method further comprises: acquiring the meteorological features of a natural day to be predicted; for each regression tree contained in the power generation prediction model, analyzing the meteorological features of the natural day to be predicted by using the regression tree to obtain one theoretical power generation amount of that day; and averaging the N theoretical power generation amounts to obtain the predicted power generation of the photovoltaic power station on the natural day to be predicted. Because the power generation prediction model composed of a plurality of regression trees comprehensively analyzes the various meteorological features of the natural day to be predicted, the method does not need a preset functional form and does not need to account for multicollinearity among the independent variables; it better reflects the influence of changing weather conditions on power generation and can maintain high prediction accuracy under any meteorological conditions.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.

Fig. 1 is a flowchart of a process for constructing a power generation amount prediction model according to an embodiment of the present application;

Fig. 2 is a flowchart of a method for constructing a regression tree according to an embodiment of the present application;

Fig. 3 is a schematic diagram of a regression tree according to an embodiment of the present application;

Fig. 4 is a flowchart of a distributed photovoltaic daily power generation amount prediction method based on a random forest algorithm according to an embodiment of the present application;

Fig. 5 is a schematic structural diagram of a distributed photovoltaic daily power generation amount prediction device based on a random forest algorithm according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

At present there are many photovoltaic power generation prediction methods. By time scale they can be divided into ultra-short-term, short-term and medium-to-long-term prediction; by prediction technique they can be divided into statistical analysis methods, artificial intelligence algorithms and so on. Statistical analysis methods mainly include regression analysis and time series analysis, for example the autoregressive moving average (ARMA) model; they predict the likely future power generation of a photovoltaic power station by statistically analyzing data such as its power generation and solar irradiance. Artificial intelligence algorithms mainly include clustering, Markov chains, neural networks and support vector machines, and offer advantages such as high prediction accuracy and fast response. The BP (back propagation) neural network and the support vector machine are common artificial intelligence prediction methods, and many other intelligent prediction models are improvements and optimizations that combine them with other algorithms.

Among the existing prediction methods, some photovoltaic power generation prediction models have higher prediction accuracy, but the models are only suitable for output prediction in sunny days, and the prediction accuracy of other weather types is lower and the application range is small.

Some photovoltaic power generation prediction models require the data to be clustered by factors such as weather type and season and divided into sub-models before prediction; the process is cumbersome and inconvenient in practice.

In addition, some prediction models have low accuracy or rely on a single model, so their prediction results are not ideal.

In order to solve the problems in the prior art, the method comprises two parts, wherein the first part is to construct a power generation amount prediction model by utilizing actual power generation amounts of photovoltaic power stations on a plurality of past natural days and meteorological conditions of all natural days, and the second part is to analyze the meteorological conditions of natural days to be predicted (generally, one to two natural days in the future) by utilizing the constructed power generation amount prediction model so as to predict the power generation amount of the natural days to be predicted.

First, referring to fig. 1, in the power generation amount prediction method provided in the embodiment of the present application, a process of constructing a power generation amount prediction model may include the following steps:

S101, obtaining an original sample set consisting of a plurality of power generation amount samples of the photovoltaic power station.

Wherein each power generation amount sample corresponds to a natural day; the power generation sample consists of meteorological features corresponding to natural days and actual power generation of the photovoltaic power station; the meteorological features include a plurality of input variables that characterize meteorological conditions corresponding to a natural day.

Optionally, to ensure that the original sample set covers as many meteorological conditions as possible and thereby improve the accuracy of the finally constructed power generation amount prediction model, step S101 may extract one power generation amount sample from the data of each natural day in the past year.

Specifically, the input variables characterizing meteorological conditions in one power generation amount sample may include the irradiance, weather condition, maximum temperature, average temperature, average humidity, average wind speed, visibility, cloud cover, cumulative rainfall, average rainfall and season of the day, among others.

The season and the weather condition are discrete variables: the season variable can take the values spring, summer, autumn and winter, and the weather condition variable can take values such as sunny, cloudy, rainy and snowy. The other input variables are continuous variables, that is, their values are real numbers within a certain range.

The power generation amount samples can be constructed from the original power generation data and original meteorological data recorded by the photovoltaic power station over the past year. Because the original data may contain gaps and anomalies, the original data can first be preprocessed, and the preprocessed data then used to construct the corresponding power generation amount samples.

That is, step S101 may include:

Obtaining original power generation data and original meteorological data of the photovoltaic power station within a preset historical time period.

The original power generation data comprise the actual power generation amount of the photovoltaic power station in each natural day in a preset historical time period; the raw meteorological data comprises meteorological data of each natural day of the photovoltaic power station within a preset historical time period.

And carrying out data preprocessing on the original power generation data and the original meteorological data to obtain an original sample set.

The data preprocessing includes data cleaning, abnormal value detection, missing value completion, standardization, normalization, binarization of quantitative features, and the like. Data cleaning, standardization, normalization and binarization of quantitative features are existing techniques in the field of data processing and can be implemented with reference to the relevant literature; they are not detailed herein.

The processes of outlier detection and missing value completion are described below, respectively:

(1) Abnormal value detection:

Abnormal value detection mainly means identifying, among the actual power generation amounts of each natural day recorded in the original power generation data, those that are abnormal values. During operation of the power station, the data acquisition equipment that measures the station's actual power generation may fail, so that the value collected on that day is wrong; that is, the collected value differs considerably from the station's true power generation and is therefore an abnormal value. Abnormal value detection finds these abnormal actual power generation amounts and deletes the data of the natural days to which they belong, so that no power generation amount sample is generated for an abnormal natural day (that is, a natural day to which an abnormal actual power generation amount belongs) when samples are generated subsequently.

Optionally, abnormal values may be detected by combining a theoretical power generation amount deviation method with an irradiance power generation amount slope method.

Detection by the theoretical power generation amount deviation method proceeds as follows:

First, for any past natural day, the theoretical power generation amount of that natural day is calculated using the following formula:

E_p = H_A × (P_AZ / E_S) × K₁

In the above formula, E_p is the theoretical power generation (in kWh) of the photovoltaic power station on that natural day; P_AZ is the installed module capacity (in kWp) of the photovoltaic power station, which is determined by the structure of the station and is a constant for a particular station; H_A is the total solar irradiation on the horizontal plane (in kWh/m²) in the area of the photovoltaic power station on that natural day; E_S is the irradiance under standard conditions, a constant that can typically be set to 1000 W/m² (that is, 1 kW/m²); and K₁ is a preset system correction factor, which can generally be set between 75% and 85%.

Then, the deviation between the actual power generation amount of that natural day in the original power generation data and the theoretical power generation amount E_p is calculated. If the deviation is within plus or minus 40% of the theoretical power generation amount, the actual power generation amount is a normal value; if the deviation is outside plus or minus 40% of the theoretical power generation amount, the actual power generation amount is an abnormal value.
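As an illustration of the deviation check, the following minimal sketch assumes the conventional form E_p = H_A × P_AZ / E_S × K₁ with E_S expressed as 1 kW/m², so that the result is in kWh. The function names and the default K₁ of 0.8 (inside the 75% to 85% range given above) are assumptions made for the example.

```python
def theoretical_generation_kwh(h_a, p_az, k1=0.8, e_s=1.0):
    """E_p = H_A * P_AZ / E_S * K1.

    h_a: daily horizontal irradiation in kWh/m^2
    p_az: installed module capacity in kWp
    e_s: standard irradiance in kW/m^2 (1000 W/m^2 = 1 kW/m^2)
    k1: system correction factor (assumed default, 0.75 to 0.85 in the text)
    """
    return h_a * p_az / e_s * k1

def is_abnormal_by_deviation(actual_kwh, h_a, p_az, k1=0.8, tolerance=0.40):
    """Flag the day when the actual value deviates by more than 40% of E_p."""
    e_p = theoretical_generation_kwh(h_a, p_az, k1)
    return abs(actual_kwh - e_p) > tolerance * e_p
```

For a hypothetical 100 kWp station on a day with 5 kWh/m² of irradiation, E_p is about 400 kWh, so an actual reading of 300 kWh would be kept and a reading of 200 kWh would be flagged.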

Detection of abnormal values by the irradiance power generation amount slope method proceeds as follows:

First, the following formula is fitted using the actual power generation amount Y of each natural day in the original power generation data and the irradiance X of each natural day in the original meteorological data:

Y = K₂ × X + b

where Y represents the actual daily power generation (kWh) of the photovoltaic power station, X represents the irradiance of that natural day, K₂ is the slope and b is the intercept; the values of the slope and the intercept are determined by the fitting process.

It can be seen that the above formula is equivalent to a straight line in the power generation amount-irradiance coordinate system, after the above formula is obtained by fitting, for each past natural day, a coordinate point can be determined in the power generation amount-irradiance coordinate system according to the irradiance and the actual power generation amount of the natural day, then the distance between the coordinate point and the straight line represented by the above formula is calculated, if the distance is greater than a preset distance threshold, the actual power generation amount of the natural day is determined as an abnormal value, otherwise, if the distance is less than or equal to the distance threshold, the actual power generation amount of the natural day is determined as a normal value.
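The slope method described above can be sketched as follows. The sketch assumes an ordinary least-squares fit, which the text does not name explicitly, and uses the perpendicular point-to-line distance; the function names and the threshold value in the usage note are hypothetical.

```python
from math import hypot

def fit_line(xs, ys):
    """Least-squares fit of Y = k2 * X + b; returns (k2, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    k2 = sxy / sxx
    return k2, my - k2 * mx

def distance_to_line(x, y, k2, b):
    """Perpendicular distance from the point (x, y) to the line k2*x - y + b = 0."""
    return abs(k2 * x - y + b) / hypot(k2, 1.0)

def abnormal_days(xs, ys, threshold):
    """Indices of days whose point lies farther from the fitted line than threshold."""
    k2, b = fit_line(xs, ys)
    return [i for i, (x, y) in enumerate(zip(xs, ys))
            if distance_to_line(x, y, k2, b) > threshold]
```

With ten days whose generation is ten times the irradiance, except for one day reading zero, `abnormal_days(xs, ys, 2.0)` returns only the index of that day. The perpendicular distance depends on the units of both axes, so the threshold has to be chosen for the scaling actually used.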

Optionally, the theoretical power generation amount deviation method and the irradiance power generation amount slope method may be applied together: if either method identifies an actual power generation amount as abnormal, it is determined to be an abnormal value; only when both methods judge it to be normal is it regarded as a normal value.

(2) Missing value completion:

Because of communication faults or other faults at the photovoltaic power station, some data may be null; a clustering-based filling method is adopted to fill in the missing data.

Optionally, assuming that the actual power generation amount of a certain natural day (denoted Dn) is missing, the similarity between the meteorological data of each other natural day (that is, each natural day whose actual power generation amount is not missing) and the meteorological data of Dn may be calculated, for example as the Euclidean distance between the two. The 6 natural days with the highest similarity are then selected (when similarity is characterized by Euclidean distance, the 6 natural days with the smallest distance), and the average of their actual power generation amounts is taken as the actual power generation amount of Dn, which completes the missing value for Dn.

The specific calculation method of the euclidean distance may refer to the related prior art, and is not described in detail herein.
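A minimal sketch of this completion step is shown below. It assumes that each day's meteorological data has already been converted to a numeric vector on comparable scales (for example by the normalization step mentioned earlier); the function name and the data layout are hypothetical.

```python
from math import dist

def fill_missing_generation(target_weather, known_days, k=6):
    """Average the actual generation of the k days most similar to the target.

    target_weather: numeric weather vector of the day whose value is missing
    known_days: list of (weather_vector, actual_generation_kwh) pairs
    """
    # Smaller Euclidean distance means higher similarity.
    nearest = sorted(known_days, key=lambda day: dist(day[0], target_weather))[:k]
    return sum(generation for _, generation in nearest) / len(nearest)
```

If fewer than `k` complete days are available, the sketch simply averages over all of them.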

In step S101, the generated energy sample of each natural day constructed by using the original power generation data and the original weather data can be represented as:

(Y, X1, X2, X3, ..., X10, X11)_i

wherein i denotes that the power generation amount sample corresponds to the ith natural day in the past year, Y denotes the actual power generation amount of the ith natural day, and X1 to X11 are input variables, which in turn denote the irradiance (X1), weather conditions (X2), maximum temperature (X3), average temperature (X4), average humidity (X5), average wind speed (X6), visibility (X7), cloud cover (X8), cumulative rainfall (X9), average rainfall (X10), and season (X11) of the natural day. The X1 to X11 are the meteorological features of this power generation amount sample.

S102, performing random sampling with replacement on the original sample set N times to obtain N training sample sets.

Wherein N is a preset positive integer.

The process of performing a random sampling with a put back on the original sample set to obtain a training sample set comprises the following steps:

based on a Bootstrap (a method for estimating population through samples in statistics), a power generation amount sample is extracted from an original sample set, whether the extracted power generation amount sample exists in a training sample set which needs to be constructed currently is judged, if the extracted power generation amount sample exists in the training sample set, the power generation amount sample is put back to the original sample set, then the step of extracting the power generation amount sample from the original sample set is returned, if the extracted power generation amount sample does not exist in the training sample set, the power generation amount sample is added to the training sample set, the power generation amount sample is put back to the original sample set, then the step of extracting the power generation amount sample from the original sample set is returned, and the like is carried out until the training sample set contains M power generation amount samples. M is the number of samples in the preset training sample set, and may be set according to practical situations, for example, M may be set equal to 100.

By executing the above random sampling process with replacement N times, N training sample sets can be constructed from the original sample set, and each training sample set comprises M power generation amount samples belonging to the original sample set.
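The sampling loop of step S102 can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function names, the use of list indices to detect duplicates, and the fixed seed are assumptions made for the example.

```python
import random

def draw_training_set(original, m, rng):
    """Build one training sample set of m distinct samples, as described for S102."""
    if m > len(original):
        raise ValueError("m cannot exceed the number of samples in the original set")
    chosen = set()  # indices of samples already in the training sample set
    while len(chosen) < m:
        idx = rng.randrange(len(original))  # draw; the sample stays in the original set
        if idx not in chosen:               # a repeated draw is discarded and redrawn
            chosen.add(idx)
    return [original[i] for i in sorted(chosen)]

def draw_training_sets(original, n, m, seed=0):
    """Repeat the draw n times to obtain n training sample sets."""
    rng = random.Random(seed)  # assumption: seeded only to make the example repeatable
    return [draw_training_set(original, m, rng) for _ in range(n)]

original = list(range(365))  # stand-in for one year of daily samples
training_sets = draw_training_sets(original, n=5, m=100)
```

Because repeated draws are rejected, each training sample set here holds M distinct samples, which differs from a textbook bootstrap sample in which duplicates are kept.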

S103, aiming at each training sample set, a regression tree is constructed by utilizing the training sample set.

After the regression trees corresponding to each training sample set are constructed, the N regression trees form a random forest regression model, and the random forest regression model is the power generation amount prediction model to be constructed.

Optionally, after the N regression trees are constructed, the following test procedure may be further performed to test the average accuracy of each regression tree.

S104, for each regression tree, selecting from the original sample set a power generation amount sample that does not belong to the training sample set used for constructing the regression tree as a test sample of the regression tree.

For example, for a constructed regression tree j (j indicates which of the constructed regression trees this is, and ranges from 1 to N), the training sample set used for constructing the regression tree is recorded as set j, and step S104 selects the power generation amount samples in the original sample set that do not belong to set j as the test samples of regression tree j.

Alternatively, one or more test samples may be selected.

S105, for each regression tree, analyzing the meteorological features contained in the test sample by using the regression tree to obtain the theoretical power generation amount of the test sample of the regression tree.

In step S105, please refer to the following description of step S402 for a process of analyzing the meteorological features of the test sample by using the regression tree.

S106, for each regression tree, calculating the average accuracy of the regression tree according to the deviation between the theoretical power generation amount and the actual power generation amount of the test sample of the regression tree.

For each regression tree, if only one test sample is selected in step S104, the deviation between the theoretical power generation amount and the actual power generation amount can be directly divided by the actual power generation amount of the test sample, and the obtained ratio is the average accuracy of the regression tree.

If a plurality of test samples are selected for a regression tree, the deviation between the theoretical power generation amount and the actual power generation amount of each test sample is divided by the actual power generation amount of each test sample to obtain the ratio of the test samples, and then the ratios of all the test samples selected for the regression tree are averaged to obtain the average accuracy of the regression tree.
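The test procedure of steps S104 to S106 can be sketched as below. The function names are hypothetical, samples are represented by day identifiers for brevity, and taking the absolute value of the deviation is an assumption, since the text does not say whether the deviation is signed.

```python
def held_out_samples(original, training):
    """S104: samples of the original set that were not used to build the tree."""
    used = set(training)
    return [s for s in original if s not in used]

def average_accuracy(theoretical, actual):
    """S106: mean over the test samples of |theoretical - actual| / actual."""
    if not actual or len(theoretical) != len(actual):
        raise ValueError("need two non-empty lists of equal length")
    # Assumption: deviation is taken as an absolute value.
    ratios = [abs(t - a) / a for t, a in zip(theoretical, actual)]
    return sum(ratios) / len(ratios)

held_out = held_out_samples([1, 2, 3, 4, 5], [2, 4])
score = average_accuracy([110.0, 90.0], [100.0, 100.0])
```

With a single test sample the mean reduces to the single ratio, matching the one-sample case described above.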

It can be understood that, in step S103, each training sample set is utilized one by one to construct a regression tree corresponding to each training sample set, for example, for the training sample sets 1 to N, the training sample set 1 is utilized to construct a regression tree 1, the training sample set 2 is utilized to construct a regression tree 2, the training sample set 3 is utilized to construct a regression tree 3, and so on until the training sample set N is utilized to construct a regression tree N, and finally the regression tree 1 to the regression tree N constitute the power generation amount prediction model.

In this application, each regression tree may be constructed by using a CART (Classification and regression trees) decision algorithm, please refer to fig. 2, and a process of constructing a corresponding regression tree by using a training sample set may include the following steps:

S201, establishing a root node of the regression tree, and distributing the training sample set to the root node.

S202, judging whether the current height of the regression tree is smaller than a preset height threshold value.

If the current height of the regression tree is less than the height threshold, S203 is performed.

If the current height of the regression tree is greater than or equal to the height threshold, S207 is executed.

The height threshold may be set in advance according to actual conditions, and for example, the height threshold may be set to 10.

The height of the regression tree may be understood as the total number of levels contained in the regression tree. Taking the regression tree shown in fig. 3 as an example, node 1 (i.e., the root node of the regression tree) constitutes the first level, node 2 and node 3 constitute the second level, and nodes 4 to 7 constitute the third level; the regression tree therefore contains 3 levels, and its current height is accordingly 3.

Obviously, when there is only one root node in the regression tree, the current height of this regression tree is equal to 1.

S203, determining each current leaf node of the regression tree as a node to be split, and determining a plurality of corresponding characteristic variables for each node to be split.

Wherein each characteristic variable consists of one or more input variables in the meteorological characteristic.

Optionally, each input variable may be defined as a characteristic variable, and at the same time, some of the input variables are combined to obtain another few characteristic variables.

As an example, in the present application, a number of characteristic variables may be defined as shown in table 1:

TABLE 1

Characteristic variable	Input variables involved	Characteristic variable	Input variables involved
Z1	X1	Z10	X10
Z2	X2	Z11	X11
Z3	X3	Z12	X3-X4
Z4	X4	Z13	X9-X10
Z5	X5	Z14	X6÷X8
Z6	X6	Z15	X1÷X3
Z7	X7
Z8	X8
Z9	X9

In table 1, Z1 to Z15 represent 15 characteristic variables defined in the present application, and for the meanings of X1 to X11, reference is made to the aforementioned examples of the power generation amount samples. It can be seen that the characteristic variables Z1 to Z11 are equivalent to the corresponding input variables X1 to X11, i.e. the characteristic variable Z1 is equivalent to the input variable X1, i.e. to the irradiance for the natural day, and the characteristic variable Z3 is equivalent to the input variable X3, i.e. to the maximum temperature for the natural day.

The characteristic variables Z12 to Z15 are calculated from the input variables involved according to the formulae shown in table 1, for example, the characteristic variable Z12, equal to X3 minus X4 for the natural day, i.e. the maximum temperature minus the average temperature for the natural day.

For example, assuming that the maximum temperature X3 of a power generation sample is equal to 38 ℃ and the average temperature X4 is equal to 27 ℃, the characteristic variable Z12 of the power generation sample can be calculated to be equal to the difference between the two, i.e., 11 ℃.

It should be noted that the feature variables listed in table 1 are all feature variables that may be used in a specific embodiment, and when step S203 is executed, for a certain specific node to be split, a part of the feature variables in table 1 may be selected as the feature variables corresponding to the node, and it is not necessary to designate all the feature variables as the feature variables corresponding to the node to be split.

For example, for a certain node to be split, only the characteristic variables Z1, Z4, Z5 and Z13 in table 1 may be specified as the characteristic variables corresponding to the node to be split.

Generally, the selecting process may be performed randomly, for example, K (a preset positive integer, which is smaller than the total number of feature variables) feature variables are randomly selected from all predefined feature variables as the feature variables corresponding to the node to be split.
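The derivation of Z1 to Z15 from X1 to X11 and the random choice of K characteristic variables for a node can be sketched as follows. The dictionary layout, function names and sample values are assumptions for illustration; the sketch also assumes that X8 and X3 are nonzero, because the text does not say how a zero denominator is handled.

```python
import random

def characteristic_variables(x):
    """Map input variables 'X1'..'X11' to characteristic variables 'Z1'..'Z15' (Table 1)."""
    z = {f"Z{i}": x[f"X{i}"] for i in range(1, 12)}  # Z1..Z11 equal X1..X11
    z["Z12"] = x["X3"] - x["X4"]    # maximum temperature minus average temperature
    z["Z13"] = x["X9"] - x["X10"]   # cumulative rainfall minus average rainfall
    z["Z14"] = x["X6"] / x["X8"]    # assumes cloud cover X8 is nonzero
    z["Z15"] = x["X1"] / x["X3"]    # assumes maximum temperature X3 is nonzero
    return z

def pick_characteristic_variables(names, k, rng):
    """Randomly choose k of the predefined characteristic variables for one node."""
    return rng.sample(list(names), k)

# Hypothetical sample; X3 = 38 and X4 = 27 reproduce the Z12 example in the text.
x = {"X1": 5.2, "X2": "sunny", "X3": 38.0, "X4": 27.0, "X5": 40.0, "X6": 3.0,
     "X7": 10.0, "X8": 2.0, "X9": 0.0, "X10": 0.0, "X11": "summer"}
z = characteristic_variables(x)
picked = pick_characteristic_variables(z, k=4, rng=random.Random(0))
```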

A leaf node is a node in a regression tree that has no children. Taking fig. 3 as an example, in the regression tree shown in fig. 3, the nodes 4 to 7 are leaf nodes, and if the nodes 2 to 7 in the regression tree shown in fig. 3 are deleted, only the node 1 (root node) is reserved, and at this time, the node 1 has no child node, then the node 1 is a leaf node.

S204, aiming at each node to be split, determining a cut point of each characteristic variable corresponding to the node to be split, and calculating the Gini index of each characteristic variable of the node to be split by using a sample set corresponding to the node to be split and the cut points of each characteristic variable.

It can be seen from the foregoing steps that some characteristic variables are identical to input variables. The input variables are of two types, discrete and continuous, so the characteristic variables can likewise be divided into discrete and continuous variables. It should be noted that only continuous input variables can be combined into a new characteristic variable, and the combined characteristic variables (such as Z12 and Z13) are also continuous characteristic variables.

For a continuous characteristic variable, the cut point may be determined as follows:

1.1, selecting a plurality of imaginary splitting points in the value range of the characteristic variable according to a certain value rule, for example, starting from the lower limit of the value range of the characteristic variable, selecting the imaginary splitting points at equal intervals according to a certain value interval until the upper limit of the value range is reached.

For example, when the characteristic variable is the maximum temperature, its value range is -20 ℃ to 40 ℃, and the value interval is 10 ℃, the equally spaced values -10 ℃, 0 ℃, 10 ℃, 20 ℃ and 30 ℃ are designated as imaginary splitting points, denoted Q1 to Q5 in sequence.

1.2, for each imaginary splitting point, splitting the power generation amount samples in the sample set assigned to the node to be split by comparing the characteristic variable with the imaginary splitting point: the power generation amount samples whose characteristic variable is greater than or equal to the imaginary splitting point form a first set (denoted S1), and the power generation amount samples whose characteristic variable is smaller than the imaginary splitting point form a second set (denoted S2).

For example, for the characteristic variable of the highest temperature, if the sample set of the nodes to be split is split based on the imaginary splitting point 10 ℃, the power generation amount sample with the highest temperature greater than or equal to 10 ℃ in the sample set can be formed into S1, and the power generation amount sample with the highest temperature less than 10 ℃ can be formed into S2.

It can be seen that, for the same feature variable, with different imaginary splitting points, S1 and S2 after splitting of the sample set S of the node to be split are different.
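Steps 1.1 and 1.2 can be sketched as follows. The function names and the dictionary-based sample layout are assumptions for illustration; the comparison follows step 1.2 as written, with values greater than or equal to the imaginary splitting point going to S1.

```python
def candidate_points(lower, upper, step):
    """1.1: equally spaced imaginary splitting points strictly between lower and upper."""
    points = []
    value = lower + step
    while value < upper:
        points.append(value)
        value += step
    return points

def partition(samples, feature, point):
    """1.2: S1 holds samples with feature >= point, S2 those with feature < point."""
    s1 = [s for s in samples if s[feature] >= point]
    s2 = [s for s in samples if s[feature] < point]
    return s1, s2

points = candidate_points(-20, 40, 10)  # Q1..Q5 of the maximum temperature example
samples = [{"max_temp": t, "y": y} for t, y in [(4, 55.0), (10, 80.0), (31, 140.0)]]
s1, s2 = partition(samples, "max_temp", 10)
```

Integer bounds are used here so that the spacing is exact; with floating-point steps the last point may need a tolerance.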

1.3, for each imaginary splitting point, calculating the Gini coefficients Gini(S1) and Gini(S2) of the two sets S1 and S2 obtained by splitting at this imaginary splitting point. The calculation formula is as follows:

$$\mathrm{Gini}(S1) = 1 - \sum_{k=1}^{K} P_k^2$$

The calculation formula of Gini(S2) is consistent with that of Gini(S1), that is:

$$\mathrm{Gini}(S2) = 1 - \sum_{k=1}^{K} P_k^2$$

Taking the calculation formula of Gini(S1) as an example, P_k is the proportion of the kth type of power generation amount samples among all power generation amount samples in the set S1. In the present application, K non-overlapping power generation amount intervals can be preset (K here denotes the number of intervals), and each power generation amount interval corresponds to one sample type. On this basis, if the actual power generation amount Y contained in a power generation amount sample belongs to the kth interval, that sample is a kth-type power generation amount sample. The calculation formula of Gini(S1) essentially computes the proportion P_k of each type of power generation amount sample in the set S1, sums the squares of these proportions, and subtracts the sum from 1; the result is the Gini coefficient of S1.

Similarly, the calculation formula of Gini(S2) essentially computes the proportion P_k of each type of power generation amount sample in the set S2, sums the squares of these proportions, and subtracts the sum from 1; the result is the Gini coefficient of S2.
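The Gini coefficient of a set, computed as 1 minus the sum of squared type proportions over preset power generation amount intervals, can be sketched as below. The interval boundaries, the half-open interval convention, and the value 0.0 returned for an empty set are assumptions for illustration.

```python
import bisect
from collections import Counter

def gini_coefficient(ys, edges):
    """1 minus the sum of squared proportions of interval types among the values ys.

    edges are ascending interior boundaries, so len(edges) + 1 intervals exist;
    a value equal to a boundary falls in the upper interval (assumption).
    """
    if not ys:
        return 0.0  # assumption: an empty set is treated as pure
    counts = Counter(bisect.bisect_right(edges, y) for y in ys)
    total = len(ys)
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

# Three hypothetical intervals: below 100, 100 up to 200, and 200 or more.
edges = [100, 200]
mixed = gini_coefficient([50, 60, 150, 250], edges)  # proportions 1/2, 1/4, 1/4
pure = gini_coefficient([120, 150, 199], edges)      # a single type
```

A set whose samples all fall in one interval has coefficient 0, and the coefficient grows as the samples spread over more intervals.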

1.4, for an imaginary splitting point Qi, the Gini index Gini(Qi) of this imaginary splitting point can be calculated according to the following formula:

$$\mathrm{Gini}(Q_i) = \frac{|S1|}{|S|}\,\mathrm{Gini}(S1) + \frac{|S2|}{|S|}\,\mathrm{Gini}(S2)$$

In the above formula, |S| represents the number of power generation amount samples included in the sample set S assigned to the node to be split; similarly, |S1| and |S2| represent the numbers of power generation amount samples included in the sets S1 and S2, respectively, and Gini(S1) and Gini(S2) are the Gini coefficients, calculated in 1.3, of the sets S1 and S2 split at the imaginary splitting point Qi.

It can be seen that, for a characteristic variable, a plurality of imaginary splitting points can be determined, the sample set can be split into two sets at each imaginary splitting point, and the Gini index of that imaginary splitting point can then be calculated; that is, each imaginary splitting point finally has a corresponding Gini index.

1.5, determining the imaginary splitting point with the minimum Gini index among the plurality of imaginary splitting points as the cut point of the characteristic variable.

Taking the maximum temperature characteristic variable as an example, for the imaginary splitting points Q1 to Q5, if the Gini index of Q4 is the smallest, Q4, i.e., 20 ℃, is determined as the cut point of the maximum temperature characteristic variable.
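Steps 1.4 and 1.5 can be sketched as follows. The size-weighted sum used for the Gini index, |S1|/|S| times Gini(S1) plus |S2|/|S| times Gini(S2), is the standard CART form and is assumed here to be the formula the text refers to. Samples are hypothetical (value, type) pairs, where the type is the index of the power generation amount interval.

```python
from collections import Counter

def gini(labels):
    """Gini coefficient of a list of sample types; 0.0 for an empty list (assumption)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_index(samples, point):
    """1.4: size-weighted Gini coefficients of S1 (value >= point) and S2 (value < point)."""
    s1 = [label for value, label in samples if value >= point]
    s2 = [label for value, label in samples if value < point]
    n = len(samples)
    return len(s1) / n * gini(s1) + len(s2) / n * gini(s2)

def best_cut_point(samples, points):
    """1.5: the imaginary splitting point with the smallest Gini index."""
    return min(points, key=lambda q: gini_index(samples, q))

# Hypothetical (maximum temperature, type) pairs; types separate cleanly at 20.
samples = [(-5, 0), (5, 0), (15, 0), (25, 1), (35, 1)]
cut = best_cut_point(samples, [-10, 0, 10, 20, 30])
```

In this data the split at 20 produces two pure sets, so its Gini index is 0 and it is chosen, mirroring the Q4 example in the text.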

For the discrete feature variable, the way of determining the cut point may be:

2.1, firstly, arranging all the alternative values of the characteristic variables in sequence to form an alternative value sequence.

For example, for the characteristic variable of weather condition, the alternative values are sunny, partly cloudy, overcast, rainy and snowy; that is, the weather condition of a given natural day is one of these five values. Arranged in order, they form the alternative value sequence: (sunny, partly cloudy, overcast, rainy, snowy).

Similarly, the alternative values of the characteristic variable of season are spring, summer, autumn and winter, and the alternative value sequence is: (spring, summer, autumn, winter).

2.2, determining each alternative value in the alternative value sequence, except the first and the last, as an imaginary splitting point.

Taking the weather condition as an example, the determined imaginary splitting points are partly cloudy, overcast and rainy, denoted Q1 to Q3 in sequence.

2.3, for each imaginary splitting point, dividing the sample set of the node to be split into a first set S1 and a second set S2 according to whether the characteristic variable of each power generation amount sample lies to the left or to the right of the imaginary splitting point in the alternative value sequence.

Still taking the weather condition as an example, for the imaginary splitting point "partly cloudy", among all power generation amount samples in the sample set S of the node to be split, those whose weather condition lies to the left of the imaginary splitting point or coincides with it may be classified into the set S1, and those whose weather condition lies to the right of the imaginary splitting point may be classified into the set S2. That is, the power generation amount samples whose weather condition is sunny or partly cloudy are divided into the set S1, and those whose weather condition is overcast, rainy or snowy are divided into the set S2.

2.4, for each imaginary splitting point, calculating the Gini coefficients Gini(S1) and Gini(S2) of the two sets S1 and S2 obtained by splitting at this imaginary splitting point.

2.5, for an imaginary splitting point Qi, calculating the Gini index Gini(Qi) of the imaginary splitting point according to the same formula as in 1.4.

2.6, determining the imaginary splitting point with the minimum Gini index among the plurality of imaginary splitting points as the cut point of the characteristic variable.

Steps 2.4 to 2.6 are consistent with steps 1.3 to 1.5 above and are not described again.
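Steps 2.1 to 2.3 for a discrete characteristic variable can be sketched as follows. The five-value weather ordering is an assumption inferred from the three interior splitting points named in the example, and the function names are hypothetical.

```python
# Assumed alternative value sequence for the weather condition (2.1).
WEATHER = ["sunny", "partly cloudy", "overcast", "rainy", "snowy"]

def discrete_candidates(sequence):
    """2.2: every alternative value except the first and the last."""
    return list(sequence[1:-1])

def partition_discrete(samples, feature, sequence, point):
    """2.3: S1 holds values at or left of the point in the sequence, S2 the rest."""
    cut = sequence.index(point)
    s1 = [s for s in samples if sequence.index(s[feature]) <= cut]
    s2 = [s for s in samples if sequence.index(s[feature]) > cut]
    return s1, s2

samples = [{"weather": w, "y": y} for w, y in
           [("sunny", 150.0), ("partly cloudy", 120.0), ("overcast", 70.0), ("snowy", 20.0)]]
candidates = discrete_candidates(WEATHER)
s1, s2 = partition_discrete(samples, "weather", WEATHER, "partly cloudy")
```

Once S1 and S2 are formed, the Gini coefficients and the Gini index are computed exactly as for a continuous characteristic variable.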

For a characteristic variable, the Gini index of each of its imaginary splitting points has already been calculated when its cut point was determined. Therefore, in step S204, calculating the Gini index of each characteristic variable of the node to be split by using the sample set corresponding to the node to be split and the cut point of each characteristic variable may essentially be:

and determining the Gini index of the cut point of the characteristic variable as the Gini index of the characteristic variable.

S205, aiming at each node to be split, selecting the characteristic variable with the minimum Gini index corresponding to the node to be split as the splitting variable of the node to be split.

For example, if step S203 determines three characteristic variables, namely weather condition, highest temperature and average temperature, for a node to be split, step S204 then calculates the Gini index of the weather condition, the Gini index of the highest temperature and the Gini index of the average temperature respectively. If the Gini index of the highest temperature is found to be the smallest, the highest temperature is designated as the splitting variable of the node to be split.

S206, regarding each node to be split, splitting a sample set corresponding to the node to be split into two sample sets by taking the splitting point of the splitting variable as a basis, and distributing the two sample sets obtained by splitting to two sub-nodes of the node to be split which are created in advance.

The splitting process in step S206 is consistent with 1.2 (when the splitting variable is a continuous variable) or 2.3 (when the splitting variable is a discrete variable) described above.

Continuing the example of step S205, if the cut point of the splitting variable, i.e., the highest temperature, is 10 ℃, then all the power generation amount samples in the sample set of the node to be split can be split as follows:

dividing the power generation amount sample with the highest temperature less than or equal to 10 ℃ into a set S1;

the power generation samples with the maximum temperature greater than 10 ℃ are divided into a set S2.

In general, when two split sample sets are assigned to child nodes, S1 may be assigned to the child node on the left side, and S2 may be assigned to the child node on the right side.

After the execution of step S206 is completed, the process returns to step S202.

S207, outputting the regression tree to complete the process of constructing the regression tree.
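The loop of steps S201 to S207 can be sketched as below, restricted to continuous characteristic variables for brevity. Several points are assumptions rather than statements of the source: the `Node` class, the `cls` field holding each sample's power generation amount interval type, the size-weighted Gini index, the candidate cut points supplied per variable, and the rule that a node with fewer than two samples is left unsplit. Samples with a value less than or equal to the cut point go to the left child, following the example given for step S206.

```python
import random
from collections import Counter

class Node:
    def __init__(self, samples):
        self.samples = samples  # sample set assigned to this node
        self.feature = None     # splitting variable (None while the node is a leaf)
        self.point = None       # cut point of the splitting variable
        self.left = None
        self.right = None

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values()) if n else 0.0

def gini_index(samples, feature, point):
    left = [s["cls"] for s in samples if s[feature] <= point]
    right = [s["cls"] for s in samples if s[feature] > point]
    n = len(samples)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def split_node(node, candidates, k, rng):
    """S203 to S206 for one node: pick k variables, find the best pair, split."""
    features = rng.sample(sorted(candidates), k)
    pairs = [(f, q) for f in features for q in candidates[f]]
    node.feature, node.point = min(pairs, key=lambda fq: gini_index(node.samples, *fq))
    node.left = Node([s for s in node.samples if s[node.feature] <= node.point])
    node.right = Node([s for s in node.samples if s[node.feature] > node.point])

def build_tree(samples, candidates, k, height_threshold, seed=0):
    rng = random.Random(seed)
    root = Node(samples)                  # S201
    leaves, height = [root], 1
    while height < height_threshold:      # S202
        children = []
        for node in leaves:               # S203: each current leaf is a node to be split
            if len(node.samples) < 2:     # assumption: too few samples, keep as a leaf
                continue
            split_node(node, candidates, k, rng)
            children += [node.left, node.right]
        if not children:
            break
        leaves, height = children, height + 1
    return root                           # S207

samples = [
    {"max_temp": 5, "irradiance": 2.0, "cls": 0, "y": 40.0},
    {"max_temp": 8, "irradiance": 2.5, "cls": 0, "y": 45.0},
    {"max_temp": 25, "irradiance": 5.0, "cls": 1, "y": 120.0},
    {"max_temp": 30, "irradiance": 6.0, "cls": 1, "y": 130.0},
]
candidates = {"max_temp": [0, 10, 20], "irradiance": [3.0, 4.0]}
tree = build_tree(samples, candidates, k=2, height_threshold=2)
```

The tree grows one level per pass, so the height check of S202 is evaluated once per level rather than once per node.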

The process of constructing the regression tree will be described below by taking the regression tree shown in fig. 3 as an example.

Assuming that the height threshold is set to 3, a root node, i.e., node 1 shown in FIG. 3, is first established and a training sample set is assigned to node 1.

At this time, the height of the regression tree is 1 (only one root node and only one level is included), so that the node 1 is determined as the node to be split, by performing the processes of the foregoing steps S203 to S205, a splitting variable (i.e., the splitting variable 1 shown in fig. 3) and a corresponding splitting point (i.e., the splitting point 1 shown in fig. 3) are determined for the node 1, and based on the splitting variable 1 and the splitting point 1, the training sample set allocated to the node 1 is split into sample sets S1 and S2, where S1 is allocated to the left child node, i.e., the node 2 of fig. 3, and S2 is allocated to the right node 3, so that one cycle in the above method is completed.

At this time, the height of the regression tree is determined to be 2, which is smaller than the height threshold 3, so each current leaf node, i.e., node 2 and node 3, is determined as a node to be split. The process described in steps S203 to S206 of determining the splitting variable and the cut point and splitting the sample set is performed on node 2 and node 3. As a result, the set S1 allocated to node 2 in the previous cycle is split into two sample sets (denoted S3 and S4), which are allocated to node 4 and node 5 respectively; the set S2 allocated to node 3 is split into sample sets S5 and S6, which are allocated to node 6 and node 7 respectively.

After the above allocation is completed, the height of the regression tree (which includes three levels and has a height of 3) is found to be equal to the height threshold, and then the building process is ended, and the built regression tree shown in fig. 3 is output.

By executing the method shown in fig. 2 on each training sample set, a regression tree can be constructed for each training sample set, and finally, N regression trees can be constructed for N training sample sets, and all regression trees form the power generation amount prediction model of the present application.

In summary, the method for constructing the regression tree shown in fig. 2 can be summarized as follows:

A training set of daily power generation amount samples of the photovoltaic power station is divided into two subsets in a binary recursive manner according to the Gini index, thereby generating the left and right subtrees. When a node needs to be split further, it is divided again by using the Gini index, and so on.

When a node is split, the Gini index of each variable is calculated, the variable with the smallest Gini index is selected as the variable to split on, the regression tree is constructed recursively according to this rule, and finally a classification rule is generated.

While the regression tree is being constructed, k independent variables are randomly selected as candidate branch variables at each branch node of each tree.

In the random forest algorithm, there are two ways to generate characteristic variables. One is random combination of input variables (Forest-RC), in which several continuous input variables are combined by an operation to obtain a characteristic variable. The other is random selection of input variables (Forest-RI), in which an input variable is directly designated as a characteristic variable. In the random forest algorithm, characteristic variables are selected by Forest-RI: in the process of generating each subtree, generally not all characteristic variables take part in node splitting; instead, a random subset of characteristic variables is extracted and assigned by the system, which completes the random node splitting.

In the whole construction process, each regression tree recursively branches from top to bottom and grows continuously, and the N value of the number of the trees is set to serve as a termination condition for the growth of the regression tree.

In the process of constructing the regression tree by the random forest method, hundreds or even thousands of regression trees can be generated, and the parameter N is the number of trees in the random forest regression, so that the size of the forest is determined. Theoretical research shows that the generalization error of the random forest regression model gradually converges with the increase of N, so that a large enough N value is selected to ensure that the error of a training set tends to be stable when the model is constructed.

Thus, the generated N regression trees form a regression model of the random forest.

Based on the above power generation amount prediction model, an embodiment of the present application provides a method for power generation amount prediction, please refer to fig. 4, which may include the following steps:

S401, acquiring meteorological features of the natural day to be predicted.

The natural day to be predicted may be a day in the future; assuming that the current date is March 1, the natural day to be predicted may be March 2, March 3, and the like.

The meteorological features of the natural day to be predicted also include irradiance, weather conditions, maximum temperature, average humidity, average wind speed, visibility, cloud cover, accumulated rainfall, average rainfall, season, etc. of the natural day to be predicted.

The future meteorological features can be predicted by various existing meteorological prediction technologies.

S402, analyzing meteorological features of the natural day to be predicted by using the regression trees aiming at each regression tree contained in the power generation amount prediction model of the photovoltaic power station to obtain a theoretical power generation amount of the natural day to be predicted.

It can be seen that, in step S402, each regression tree of the power generation amount prediction model analyzes the meteorological features of the natural day to be predicted, and finally, each regression tree outputs a theoretical power generation amount.

Specifically, analyzing the meteorological features of the natural day to be predicted by using the regression tree to obtain a theoretical power generation amount of the natural day to be predicted, which may include:

determining a root node of the regression tree as a current node;

determining the value of the meteorological features of the natural day to be predicted on the splitting variable of the current node;

determining a node corresponding to the meteorological features of the natural day to be predicted in two child nodes of the current node according to whether the value of the meteorological features of the natural day to be predicted on the splitting variable of the current node is larger than the splitting point of the splitting variable of the current node;

determining the node corresponding to the meteorological features of the natural day to be predicted as a current node, and returning to execute the step of determining the value of the meteorological features of the natural day to be predicted on the splitting variable of the current node until the current node is a leaf node of the regression tree;

calculating the average value of the actual power generation amount of all the power generation amount samples in the sample set corresponding to the current node, and determining the calculation result as a theoretical power generation amount of the natural day to be predicted.

Still taking the regression tree shown in fig. 3 as an example, after obtaining the meteorological feature of a natural day to be predicted, the value of the meteorological feature on the splitting variable of the node 1 may be determined, and if the splitting variable of the node 1 is the highest temperature, it is necessary to determine what the highest temperature of the natural day to be predicted is.

Assuming that the highest temperature of the natural day to be predicted is 30 ℃ and the cut point of node 1 is 10 ℃, the value of the meteorological features of the natural day to be predicted on the splitting variable is greater than the cut point of node 1. In the aforementioned process of constructing the regression tree, the power generation amount samples whose splitting variable is greater than the cut point were distributed to the right child node, i.e., node 3 of fig. 3, so node 3 is determined as the current node.

The above process is then repeated for node 3, i.e. it is determined whether the meteorological features of the natural day to be predicted correspond to node 6 (i.e. the left child node of node 3) or node 7 (i.e. the right child node of node 3) based on the values of the meteorological features of the natural day to be predicted on the split variables of node 3 and the cut points of node 3.

It is assumed that the meteorological features of the natural day to be predicted correspond to the node 6. Since the node 6 is a leaf node, as can be seen from the above-mentioned construction process of the regression tree, in the constructed regression tree, each leaf node is assigned a sample set which includes a plurality of power generation amount samples classified according to the above respective splitting variables and the splitting points, so that an arithmetic mean of the actual power generation amounts of all the power generation amount samples in the sample set of the node 6 can be calculated, and the calculation result is determined as the theoretical power generation amount of the natural day to be predicted, which is output by the regression tree.

In summary, the analysis of a meteorological feature by the regression tree is essentially to classify the meteorological feature successively according to the assigned splitting variables and splitting points in each node, finally determine which leaf node of the regression tree the meteorological feature belongs to, and then calculate the corresponding theoretical power generation by using the actual power generation amount of all power generation amount samples contained in the leaf node to which the meteorological feature belongs.
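The traversal of step S402 can be sketched as follows for continuous splitting variables. The nested-dictionary tree, the variable names and all numeric values are hypothetical and chosen only to mirror the three-level tree of fig. 3; a value greater than the cut point leads to the right child, as in the example above.

```python
def predict(tree, day):
    """Walk from the root to a leaf, then average the leaf's actual generation amounts."""
    node = tree
    while "feature" in node:  # internal nodes carry a splitting variable and cut point
        side = "right" if day[node["feature"]] > node["point"] else "left"
        node = node[side]
    ys = node["y"]            # actual generation amounts of the samples in the leaf
    return sum(ys) / len(ys)

tree = {  # node 1
    "feature": "max_temp", "point": 10,
    "left": {  # node 2
        "feature": "avg_humidity", "point": 50,
        "left": {"y": [60.0, 70.0]},   # node 4
        "right": {"y": [30.0, 40.0]},  # node 5
    },
    "right": {  # node 3
        "feature": "irradiance", "point": 4.0,
        "left": {"y": [100.0, 110.0, 120.0]},  # node 6
        "right": {"y": [150.0, 170.0]},        # node 7
    },
}
day = {"max_temp": 30, "irradiance": 3.5, "avg_humidity": 40}
theoretical = predict(tree, day)
```

Here the day reaches node 3 and then node 6, so the theoretical power generation amount is the mean of the three values stored in node 6.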

It should be understood that the process of analyzing a meteorological feature by using a regression tree in step S402 is not only suitable for analyzing the meteorological feature of the natural day to be predicted, but also suitable for analyzing the meteorological feature of the test sample in step S105, and can also output a theoretical power generation amount corresponding to the test sample, and the theoretical power generation amount obtained by analyzing the test sample can be used for calculating the average accuracy of the regression tree.

S403, carrying out mean value calculation on the N theoretical power generation amounts to obtain the predicted power generation amount of the photovoltaic power station on the natural day to be predicted.

An optional implementation manner of step S403 is:

and directly calculating the arithmetic mean value of the N theoretical generated energies, and determining the obtained calculation result as the predicted generated energy of the photovoltaic power station on the natural day to be predicted.

Another optional implementation manner of step S403 is:

if the corresponding average accuracy is calculated for each regression tree in the generated energy prediction model when the generated energy prediction model is constructed, the average accuracy of all regression trees contained in the generated energy prediction model can be normalized to obtain the accuracy coefficient of each regression tree.

For a specific normalization method, reference may be made to related prior art, and details are not described here.

A weighted average of the N theoretical power generation amounts is then calculated using the accuracy coefficients of the regression trees to obtain the predicted power generation amount of the photovoltaic power station on the natural day to be predicted.
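Both implementations of step S403 can be sketched as below. Because the text leaves the normalization method to the prior art, the weighted version simply takes the accuracy coefficients as given and divides by their sum; how the coefficients are derived from the average accuracy of each tree is not specified here, and the function names are hypothetical.

```python
def mean_prediction(theoretical):
    """Arithmetic mean of the N theoretical power generation amounts."""
    if not theoretical:
        raise ValueError("at least one theoretical value is required")
    return sum(theoretical) / len(theoretical)

def weighted_prediction(theoretical, coefficients):
    """Weighted mean using one non-negative accuracy coefficient per regression tree."""
    total = sum(coefficients)
    if len(theoretical) != len(coefficients) or total <= 0:
        raise ValueError("need one coefficient per tree and a positive coefficient sum")
    # Assumption: dividing by the coefficient sum stands in for the unspecified normalization.
    return sum(t * c for t, c in zip(theoretical, coefficients)) / total

plain = mean_prediction([100.0, 110.0, 120.0])
weighted = weighted_prediction([100.0, 120.0], [1.0, 3.0])
```

With equal coefficients the weighted result coincides with the arithmetic mean, so the first implementation is a special case of the second.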

The present application provides a method for predicting the power generation amount of a photovoltaic power station. An original sample set consisting of a plurality of power generation amount samples of the photovoltaic power station is obtained, wherein each power generation amount sample corresponds to one natural day and consists of the meteorological features of that natural day and the actual power generation amount of the photovoltaic power station, and the meteorological features comprise a plurality of input variables representing the meteorological conditions of that natural day. Random sampling with replacement is performed on the original sample set N times to obtain N training sample sets, where N is a preset positive integer. For each training sample set, a regression tree is constructed by using that training sample set, and the N regression trees constructed from the N training sample sets form the power generation amount prediction model of the photovoltaic power station. The meteorological features of the natural day to be predicted are acquired. For each regression tree contained in the power generation amount prediction model, the meteorological features of the natural day to be predicted are analyzed by using that regression tree to obtain one theoretical power generation amount of the natural day to be predicted. The mean of the N theoretical power generation amounts is then calculated to obtain the predicted power generation amount of the photovoltaic power station on the natural day to be predicted.
The method utilizes the generated energy prediction model composed of a plurality of regression trees to comprehensively analyze various meteorological characteristics of the natural day to be predicted, does not need to preset a function form, does not need to consider multiple linear correlations among independent variables, can better reflect the influence of various weather environment changes on the generated energy, and can ensure higher prediction precision under any meteorological conditions.
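The sampling step summarized above (N rounds of random sampling with replacement) can be sketched as follows. The function name `bootstrap_sample_sets` and the `seed` parameter are hypothetical, and drawing each training sample set with the same size as the original sample set is an assumption, since the text does not state the size of a training sample set.

```python
import random


def bootstrap_sample_sets(original_samples, n_trees, seed=None):
    """Draw n_trees training sample sets by sampling with replacement.

    Assumption: each training set has as many samples as the original set.
    """
    rng = random.Random(seed)
    size = len(original_samples)
    training_sets = []
    for _ in range(n_trees):
        # The same sample may be drawn more than once within a round.
        indices = [rng.randrange(size) for _ in range(size)]
        training_sets.append([original_samples[i] for i in indices])
    return training_sets
```

Because sampling is done with replacement, some samples are typically left out of each training sample set; those are the samples that can later serve as test samples for the corresponding regression tree.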

The invention provides a photovoltaic power station power generation amount prediction method based on a random forest model (namely, a model formed by combining a plurality of regression trees). The random forest model is a machine learning algorithm: it does not require a preset functional form, does not need to account for multicollinearity among the independent variables, and can automatically detect the interactions among features and their importance. The prediction method is therefore simple and feasible, takes detailed input factors, better reflects the influence of weather changes, and can ensure higher prediction precision.

Based on the method for predicting the power generation amount of the photovoltaic power station provided by the above embodiment, the embodiment of the present application further provides a device for predicting the daily power generation amount of a distributed photovoltaic power station based on a random forest algorithm. Referring to Fig. 5, the device may include the following units:

An obtaining unit 501, configured to obtain an original sample set composed of multiple power generation amount samples of a photovoltaic power station; wherein each power generation amount sample corresponds to a natural day; the power generation amount sample consists of the meteorological features corresponding to the natural day and the actual power generation amount of the photovoltaic power station; the meteorological features include a plurality of input variables that characterize the meteorological conditions corresponding to the natural day.

A sampling unit 502, configured to perform random sampling with replacement N times on the original sample set to obtain N training sample sets; wherein N is a preset positive integer.

A constructing unit 503, configured to construct, for each training sample set, a regression tree by using that training sample set; wherein the N regression trees constructed from the N training sample sets form the power generation amount prediction model of the photovoltaic power station.

When the building unit 503 builds a regression tree by using the training sample set, the following steps are specifically executed:

establishing a root node of the regression tree, and distributing a training sample set to the root node;

if the current height of the regression tree is smaller than the height threshold, determining each current leaf node of the regression tree as a node to be split, and determining a plurality of corresponding characteristic variables for each node to be split; each characteristic variable consists of one or more input variables in meteorological characteristics;

determining a splitting point of each characteristic variable corresponding to each node to be split, and calculating the Gini index of each characteristic variable of the node to be split by using the sample set corresponding to the node to be split and the splitting point of each characteristic variable;

for each node to be split, splitting the sample set corresponding to the node to be split into two sample sets according to the splitting point of the splitting variable, and assigning the two resulting sample sets to two pre-created child nodes of the node to be split.

The obtaining unit 501 is further configured to obtain the meteorological features of a natural day to be predicted.

The analysis unit 504 is configured to analyze, for each regression tree included in the power generation amount prediction model of the photovoltaic power station, the meteorological features of the natural day to be predicted by using the regression tree, and obtain a theoretical power generation amount of the natural day to be predicted.

The calculating unit 505 is configured to calculate the mean of the N theoretical power generation amounts to obtain the predicted power generation amount of the photovoltaic power station on the natural day to be predicted.
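The work of the analysis unit and the calculating unit can be sketched as follows. This is an illustrative sketch only: the nested-dictionary tree layout (keys `feature`, `point`, `left`, `right`, `value`) is hypothetical, and storing a ready-made output value at each leaf is an assumption. The direction of the comparison (a value greater than the splitting point goes to one child, otherwise to the other) follows the traversal described in this application.

```python
def tree_predict(tree, features):
    """Walk one regression tree from the root to a leaf and return its value."""
    node = tree  # the root node is the first current node
    while "feature" in node:  # internal nodes carry a splitting variable
        if features[node["feature"]] > node["point"]:
            node = node["right"]
        else:
            node = node["left"]
    return node["value"]  # assumed: each leaf stores its theoretical output


def forest_predict(trees, features):
    """Mean of the N theoretical power generation amounts."""
    outputs = [tree_predict(t, features) for t in trees]
    return sum(outputs) / len(outputs)
```

For example, with one tree that splits on feature 0 at 5.0 and one single-leaf tree, `forest_predict` returns the plain mean of the two leaf values reached by the input.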

Optionally, when the obtaining unit 501 obtains an original sample set composed of a plurality of power generation amount samples of the photovoltaic power station, the following steps are specifically performed:

acquiring original power generation data and original meteorological data of a photovoltaic power station in a preset historical time period; the original power generation data comprise the actual power generation amount of the photovoltaic power station in each natural day in a preset historical time period; the original meteorological data comprise meteorological data of each natural day of the photovoltaic power station in a preset historical time period;

performing data preprocessing on original power generation data and original meteorological data to obtain an original sample set; the data preprocessing comprises data cleaning, abnormal value detection and missing value completion.
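The preprocessing step can be sketched as follows. The text names the three stages (data cleaning, abnormal value detection, missing value completion) but not how each is carried out, so every concrete rule here is an assumption: cleaning drops days with missing or negative generation, abnormal values are flagged by a z-score limit on daily generation, and missing meteorological values are filled with the mean of that variable. The function name `preprocess` and the parameter `z_limit` are hypothetical.

```python
from statistics import mean, pstdev


def preprocess(generation_by_day, weather_by_day, z_limit=3.0):
    """Build (features, actual_generation) samples, one per natural day."""
    # Data cleaning (assumed rule): keep days present in both sources
    # with a non-missing, non-negative actual power generation amount.
    days = sorted(
        d for d, g in generation_by_day.items()
        if d in weather_by_day and g is not None and g >= 0
    )
    if not days:
        return [], []
    # Abnormal value detection (assumed rule): z-score on daily generation.
    values = [generation_by_day[d] for d in days]
    mu, sigma = mean(values), pstdev(values)
    days = [d for d in days
            if sigma == 0 or abs(generation_by_day[d] - mu) <= z_limit * sigma]
    # Missing value completion (assumed rule): per-variable mean.
    variables = sorted({k for d in days for k in weather_by_day[d]})
    fill = {}
    for v in variables:
        known = [weather_by_day[d][v] for d in days
                 if weather_by_day[d].get(v) is not None]
        fill[v] = mean(known) if known else 0.0
    samples = []
    for d in days:
        row = weather_by_day[d]
        feats = [row[v] if row.get(v) is not None else fill[v] for v in variables]
        samples.append((feats, generation_by_day[d]))
    return variables, samples
```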

Optionally, the apparatus further comprises a testing unit 506 for:

for each regression tree, selecting from the original sample set the power generation amount samples that do not belong to the training sample set used to construct that regression tree, as the test samples of the regression tree;

for each regression tree, analyzing meteorological features contained in a test sample of the regression tree by using the regression tree to obtain the theoretical power generation amount of the test sample of the regression tree;

and, for each regression tree, calculating the average accuracy of the regression tree according to the deviation between the theoretical power generation amount and the actual power generation amount of its test samples.
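The testing steps above can be sketched as follows. The text says the average accuracy is derived from the deviation between theoretical and actual power generation amounts without giving the formula in this passage, so the accuracy used here (one minus the relative absolute deviation, floored at zero, averaged over the test samples) is an assumption. The names `out_of_bag_indices` and `average_accuracy` are hypothetical, and `predict` stands for any callable that maps meteorological features to a theoretical power generation amount.

```python
def out_of_bag_indices(n_original, training_indices):
    """Indices of original samples not drawn into a tree's training set."""
    used = set(training_indices)
    return [i for i in range(n_original) if i not in used]


def average_accuracy(predict, original_samples, training_indices):
    """Average accuracy of one regression tree on its test samples.

    Assumed accuracy per sample: max(0, 1 - |theoretical - actual| / |actual|).
    Returns None when the tree has no usable test sample.
    """
    scores = []
    for i in out_of_bag_indices(len(original_samples), training_indices):
        features, actual = original_samples[i]
        if actual == 0:
            continue  # relative deviation is undefined for zero generation
        deviation = abs(predict(features) - actual) / abs(actual)
        scores.append(max(0.0, 1.0 - deviation))
    return sum(scores) / len(scores) if scores else None
```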

Optionally, when the calculating unit 505 performs mean calculation on the N theoretical power generation amounts to obtain the predicted power generation amount of the photovoltaic power station on the natural day to be predicted, the following steps are specifically performed:

normalizing the average accuracy of all regression trees contained in the power generation prediction model to obtain an accuracy coefficient of each regression tree;

and carrying out a weighted average of the N theoretical power generation amounts by using the accuracy coefficients of the regression trees to obtain the predicted power generation amount of the photovoltaic power station on the natural day to be predicted.

Optionally, when the analysis unit 504 analyzes the meteorological features of the natural day to be predicted by using the regression tree to obtain a theoretical power generation amount of the natural day to be predicted, the following steps are specifically performed:

determining a root node of the regression tree as a current node;

The specific working principle of the device for predicting the power generation amount of the photovoltaic power station provided by the embodiment of the application can refer to the method for predicting the power generation amount of the photovoltaic power station provided by any embodiment of the application and related steps in the method for constructing the power generation amount prediction model, and details are not described here.

The application provides a distributed photovoltaic daily generated energy prediction device based on a random forest algorithm. The obtaining unit 501 obtains an original sample set consisting of a plurality of power generation amount samples of a photovoltaic power station, wherein each power generation amount sample corresponds to a natural day and consists of the meteorological features of that natural day and the actual power generation amount of the photovoltaic power station, and the meteorological features comprise a plurality of input variables representing the meteorological conditions of the natural day. The sampling unit 502 performs random sampling with replacement N times on the original sample set to obtain N training sample sets, where N is a preset positive integer. The constructing unit 503 constructs a regression tree from each training sample set, and the N regression trees constructed from the N training sample sets form the power generation amount prediction model of the photovoltaic power station.

The constructing unit 503 constructs a regression tree from a training sample set as follows: establishing a root node of the regression tree, and assigning the training sample set to the root node; judging whether the current height of the regression tree is smaller than a preset height threshold; if the current height of the regression tree is smaller than the height threshold, determining each current leaf node of the regression tree as a node to be split, and determining a plurality of corresponding characteristic variables for each node to be split, each characteristic variable consisting of one or more input variables of the meteorological features; determining a splitting point of each characteristic variable corresponding to each node to be split, and calculating the Gini index of each characteristic variable of the node to be split by using the sample set corresponding to the node to be split and the splitting point of each characteristic variable; for each node to be split, selecting the characteristic variable with the minimum Gini index as the splitting variable of the node to be split; for each node to be split, splitting the sample set corresponding to the node to be split into two sample sets according to the splitting point of the splitting variable, and assigning the two resulting sample sets to two pre-created child nodes of the node to be split; returning to the step of judging whether the current height of the regression tree is smaller than the preset height threshold; and, if the current height of the regression tree is not smaller than the height threshold, outputting the regression tree, which completes the construction of the regression tree.

The obtaining unit 501 obtains the meteorological features of a natural day to be predicted; the analysis unit 504 analyzes, for each regression tree included in the power generation amount prediction model of the photovoltaic power station, the meteorological features of the natural day to be predicted by using the regression tree to obtain one theoretical power generation amount of the natural day to be predicted; and the calculating unit 505 calculates the mean of the N theoretical power generation amounts to obtain the predicted power generation amount of the photovoltaic power station on the natural day to be predicted. The method uses a power generation amount prediction model composed of a plurality of regression trees to comprehensively analyze the various meteorological features of the natural day to be predicted. It does not require a preset functional form, does not need to account for multicollinearity among the independent variables, better reflects the influence of changing weather conditions on the power generation amount, and can ensure higher prediction precision under any meteorological conditions.
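The level-by-level construction loop summarized above can be sketched as follows. Several choices are assumptions made only for illustration and are labeled in the code: the Gini index formula is not given in this passage, so the weighted variance of the actual power generation amounts is swapped in as the split criterion; the splitting point of a variable is taken as the median of its values in the node; each characteristic variable is a single input variable; and the candidate variables of a node are drawn at random. All function names and the dictionary node layout are hypothetical.

```python
import random
from statistics import mean, median, pvariance


def make_node(samples):
    """A node holds its sample set and the mean actual generation as its value."""
    return {"samples": samples, "value": mean(y for _, y in samples)}


def impurity(samples):
    """Stand-in for the Gini index (assumption): variance of the targets."""
    return pvariance([y for _, y in samples])


def best_split(samples, feature_ids):
    """Pick the candidate variable whose split gives the lowest criterion."""
    best = None
    for f in feature_ids:
        point = median(x[f] for x, _ in samples)  # assumed split-point rule
        left = [s for s in samples if s[0][f] <= point]
        right = [s for s in samples if s[0][f] > point]
        if not left or not right:
            continue  # this variable cannot separate the node's samples
        score = (len(left) * impurity(left)
                 + len(right) * impurity(right)) / len(samples)
        if best is None or score < best[0]:
            best = (score, f, point, left, right)
    return best


def build_tree(samples, height_threshold, n_candidates, seed=None):
    """Grow a regression tree one level at a time up to height_threshold."""
    rng = random.Random(seed)
    n_features = len(samples[0][0])
    root = make_node(samples)
    leaves, height = [root], 0
    while height < height_threshold and leaves:
        next_leaves = []
        for node in leaves:  # every current leaf is a node to be split
            k = min(n_candidates, n_features)
            split = best_split(node["samples"], rng.sample(range(n_features), k))
            if split is None:
                continue
            _, f, point, left, right = split
            node.update(feature=f, point=point,
                        left=make_node(left), right=make_node(right))
            next_leaves += [node["left"], node["right"]]
        leaves = next_leaves
        height += 1
    return root
```

Growing all current leaves before increasing the height mirrors the loop in the text, where the height check is repeated after each round of splitting rather than after each individual node.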

Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

It should be noted that the terms "first", "second", and the like in the present invention are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.

Those skilled in the art can make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A distributed photovoltaic daily generated energy prediction method based on a random forest algorithm is characterized by comprising the following steps:

a model construction process:

the power generation amount prediction process:

acquiring meteorological features of a natural day to be predicted;

2. The method of claim 1, wherein the obtaining an original sample set consisting of a plurality of power generation amount samples of a photovoltaic power station comprises:

acquiring original power generation data and original meteorological data of a photovoltaic power station in a preset historical time period; wherein the original power generation data comprises the actual power generation amount of the photovoltaic power station on each natural day within the preset historical time period; the original meteorological data comprises meteorological data of each natural day of the photovoltaic power station in the preset historical time period;

performing data preprocessing on the original power generation data and the original meteorological data to obtain an original sample set; wherein the data preprocessing comprises data cleaning, abnormal value detection and missing value completion.

3. The method of claim 1, wherein after constructing a regression tree using the set of training samples, further comprising:

for each regression tree, selecting from the original sample set the power generation amount samples that do not belong to the training sample set used to construct that regression tree, as the test samples of the regression tree;

and, for each regression tree, calculating the average accuracy of the regression tree according to the deviation between the theoretical power generation amount and the actual power generation amount of its test samples.

4. The method of claim 3, wherein the calculating the average of the N theoretical power generations to obtain the predicted power generation amount of the photovoltaic power station on the natural day to be predicted comprises:

and carrying out a weighted average of the N theoretical power generation amounts by using the accuracy coefficients of the regression trees to obtain the predicted power generation amount of the photovoltaic power station on the natural day to be predicted.

5. The method of claim 1, wherein analyzing the meteorological features of the natural day to be predicted by using the regression tree to obtain a theoretical power generation amount of the natural day to be predicted comprises:

determining a root node of the regression tree as a current node;

determining a node corresponding to the meteorological feature of the natural day to be predicted in the two child nodes of the current node according to whether the value of the meteorological feature of the natural day to be predicted on the splitting variable of the current node is larger than the splitting point of the splitting variable of the current node;

6. A distributed photovoltaic daily generated energy prediction device based on a random forest algorithm, characterized by comprising:

7. The apparatus according to claim 6, characterized in that the obtaining unit, when obtaining an original sample set consisting of a plurality of samples of the power generation of the photovoltaic power plant, performs in particular:

8. The apparatus of claim 6, further comprising a testing unit configured to:

9. The device according to claim 8, wherein the calculating unit performs an average calculation of the N theoretical power generation amounts, and when obtaining the predicted power generation amount of the photovoltaic power station on the natural day to be predicted, specifically performs:

10. The apparatus according to claim 6, wherein the analysis unit specifically performs, when analyzing the meteorological features of the natural day to be predicted using the regression tree to obtain a theoretical power generation amount of the natural day to be predicted:

determining a root node of the regression tree as a current node;