CN116245019A

CN116245019A - Load prediction method, system, device and storage medium based on Bagging sampling and improved random forest algorithm

Info

Publication number: CN116245019A
Application number: CN202310077201.0A
Authority: CN
Inventors: 李亚飞; 刘乙; 钱科军; 郑众; 谢鹰; 张显楚; 宋杰; 陈嘉栋
Original assignee: Nari Technology Co Ltd; NARI Nanjing Control System Co Ltd; State Grid Electric Power Research Institute; Suzhou Power Supply Co of State Grid Jiangsu Electric Power Co Ltd
Current assignee: Nari Technology Co Ltd; NARI Nanjing Control System Co Ltd; State Grid Electric Power Research Institute; Suzhou Power Supply Co of State Grid Jiangsu Electric Power Co Ltd
Priority date: 2023-01-30
Filing date: 2023-01-30
Publication date: 2023-06-09

Abstract

The invention discloses a load prediction method, system, device and storage medium based on Bagging sampling and an improved random forest algorithm, belonging to the technical field of smart grids and intelligent power consumption. The method comprises: acquiring a historical load value; and inputting the historical load value into a constructed random forest model to obtain a prediction result. CART decision trees are constructed on the basis of the Bagging sampling method, so that abnormal and redundant information is handled more accurately and comprehensively during data processing and only the effective data is passed to the next operation, which reduces the amount of calculation. The random forest model built from the combined algorithms has the advantages of each algorithm, making short-term load prediction for the power system more intelligent and effectively improving its accuracy.

Description

Load prediction method, system, device and storage medium based on Bagging sampling and improved random forest algorithm

Technical Field

The invention relates to a load prediction method, a system, a device and a storage medium based on Bagging sampling and an improved random forest algorithm, and belongs to the technical field of intelligent power grids and intelligent power utilization.

Background

Power load prediction is one of the indispensable tasks of an electric power department. Accurate load prediction ensures the efficient and safe operation of the power system, allows maintenance plans to be arranged safely and the start-up and shutdown of generator sets to be controlled efficiently and accurately, and reduces unnecessary faults and accidents. It improves social and economic benefits while keeping the power generation cost to a minimum, and so helps ensure the normal functioning of society.

Scholars and experts at home and abroad have carried out a great deal of research on the theory and methods of short-term power load prediction, and many high-performing model algorithms have been applied in this field, so that short-term load prediction has entered a period of rapid development. Short-term load prediction methods are generally divided into two main types: traditional classical prediction methods and modern intelligent prediction methods. Traditional classical methods are simple in principle but highly limited, and often have low precision and large errors. With the development of artificial intelligence, modern intelligent prediction methods offer very strong data processing capability and greatly improve the accuracy of power load prediction, but their strong modelling capability is often accompanied by a large amount of calculation.

In the prior art, power system load prediction uses, in addition to the traditional time series, regression analysis and trend extrapolation methods, intelligent prediction methods such as artificial neural network algorithms, wavelet analysis and fuzzy theory. In general, however, a single algorithm is used: the workload is large, the calculation is complex, the prediction accuracy is low, and misjudgments of many kinds occur.

Disclosure of Invention

The invention aims to provide a load prediction method, a system, a device and a storage medium based on Bagging sampling and an improved random forest algorithm, which solve the problems of low prediction accuracy, large calculated amount and the like in the prior art.

In order to achieve the above purpose, the invention is realized by adopting the following technical scheme:

In a first aspect, the present invention provides a load prediction method based on Bagging sampling and an improved random forest algorithm, including:

acquiring a historical load value;

inputting the historical load value into the constructed random forest model to obtain a prediction result;

the random forest model is constructed by the following method:

acquiring an original sample set, and generating a plurality of training sets from the original sample set by random sampling without replacement using the Bagging algorithm;

training each training set to obtain a corresponding CART decision tree;

all CART decision trees are assembled together to form a random forest model.

With reference to the first aspect, further, the historical load values include the 96-point load values and environmental data of the day before and of seven days before the day to be predicted.

With reference to the first aspect, further, generating a plurality of training sets from the original sample set by random sampling without replacement using the Bagging algorithm includes:

randomly drawing N training samples (d1, d2, ..., dN) from the original sample set by the Bootstrap method, and repeating the draw N times to obtain N training sets that are independent of one another.

With reference to the first aspect, further, the training each training set to obtain a corresponding CART decision tree includes:

dividing the training set into two subsets by using the CART algorithm, and continuing to divide recursively so that each generated non-leaf node has two branches, wherein the nodes are split according to the minimum Gini index principle, and the Gini index of each node split is expressed as:

$$\mathrm{Gini}_{split}(D) = \frac{|D_1|}{|D|}\,\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Gini}(D_2)$$

where $D$ is the set before the split, $D_1$ and $D_2$ are the two subsets produced by the split, $\mathrm{Gini}(D_1)$ is the Gini index of $D_1$, $\mathrm{Gini}(D_2)$ is the Gini index of $D_2$, and $\mathrm{Gini}_{split}(D)$ is the Gini index of $D$ after the split.

In combination with the first aspect, further, in the constructed random forest model, the historical load values are tested and classified through a plurality of CART decision trees, and the final classification is obtained according to the preset proportion, so that a prediction result is obtained.

With reference to the first aspect, in a process of constructing the random forest model, the method further includes the step of setting parameters:

setting, in the random forest model, the feature evaluation criterion, the maximum number of weak learners, the maximum number of features, the maximum depth of the decision trees, the minimum number of samples required to split an internal node, and the minimum number of samples at a leaf node.

In a second aspect, the present invention further provides a load prediction system based on Bagging sampling and an improved random forest algorithm, including:

a data acquisition module, configured to acquire a historical load value;

a load prediction module, configured to input the historical load value into the constructed random forest model to obtain a prediction result;

a model building unit, configured to build the random forest model by the following method:

training each training set to obtain a corresponding CART decision tree;

all CART decision trees are assembled together to form a random forest model.

In a third aspect, the invention also provides a load prediction device based on Bagging sampling and improved random forest algorithm, which comprises a processor and a storage medium;

the storage medium is used for storing instructions;

the processor is operative according to the instructions to perform the steps of the method according to any one of the first aspects.

In a fourth aspect, the present invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method of any of the first aspects.

Compared with the prior art, the invention has the following beneficial effects:

In the load prediction method, system, device and storage medium based on Bagging sampling and an improved random forest algorithm provided by the invention, CART decision trees are constructed on the basis of the Bagging sampling method, so that abnormal and redundant information is handled more accurately and comprehensively during data processing and only the effective data is extracted for the next operation, which reduces the amount of calculation. A random forest algorithm is then applied: samples are repeatedly drawn from the training sample set built from historical data to obtain new training sample sets, and a different CART decision tree is trained on each of these sets. The resulting trees form a random forest, and the classification of a new input is determined by combining the results of all the CART decision trees. The decision trees are independent of one another, which gives good flexibility. The prediction model built from the combined algorithms has the advantages of each algorithm, making short-term load prediction for the power system more intelligent and effectively improving its accuracy.

Drawings

FIG. 1 is a flow chart of a load prediction method based on Bagging sampling and an improved random forest algorithm provided by an embodiment of the invention;

FIG. 2 is a schematic diagram of a Bagging sampling method provided by an embodiment of the present invention;

FIG. 3 is a flowchart of a Bagging sampling method provided by an embodiment of the present invention;

FIG. 4 is a block diagram of a random forest algorithm provided by an embodiment of the present invention;

FIG. 5 is an example diagram of short-term load prediction for a power system according to an embodiment of the present invention.

Detailed Description

The present invention will be further described with reference to the accompanying drawings, and the following examples are only for more clearly illustrating the technical aspects of the present invention, and are not to be construed as limiting the scope of the present invention.

Example 1

As shown in fig. 1, the load prediction method based on Bagging sampling and improved random forest algorithm provided by the embodiment of the invention includes:

The method performs short-term load prediction for a power system. Its inputs are the 96-point load values of the day before and of seven days before the day to be predicted, together with weather data for the day, such as maximum temperature, minimum temperature, average temperature, relative humidity and rainfall. On this basis, CART decision trees are constructed using the Bagging sampling method and a random forest algorithm is applied. The four steps shown in the figure are described below:

S1, generating the training sets by random sampling without replacement using the Bagging algorithm. A schematic diagram of the Bagging sampling method is shown in FIG. 2 and its flowchart in FIG. 3. The specific method is as follows:

randomly drawing N training samples (d1, d2, ..., dN) from the original sample set by the Bootstrap method, and repeating the draw N times to obtain N training sets that are independent of one another;

training one model on each training set, so that the N training sets yield N decision tree models;

for a classification problem, determining the result for each input variable x by voting over the results of the N decision trees.
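The sampling and voting steps of S1 can be sketched as follows. This is a minimal illustration, not the patented implementation. The function names, the `replace` flag and the subset fraction used when sampling without replacement are assumptions introduced here. The flag is exposed because the text describes sampling without replacement while naming the Bootstrap method, which conventionally draws with replacement.

```python
import random
from collections import Counter


def bagging_training_sets(samples, n_sets, replace=True, seed=0):
    """Draw n_sets training sets from `samples`.

    replace=True is the conventional Bootstrap draw: each set has
    len(samples) items and duplicates are allowed.
    replace=False draws a random subset without replacement; the subset
    fraction (about 63%) is an assumption, not a value from the text.
    """
    rng = random.Random(seed)
    size = len(samples)
    sets = []
    for _ in range(n_sets):
        if replace:
            drawn = [samples[rng.randrange(size)] for _ in range(size)]
        else:
            drawn = rng.sample(samples, max(1, round(0.632 * size)))
        sets.append(drawn)
    return sets


def majority_vote(predictions):
    """Return the label predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]
```

Each returned training set would be used to train one decision tree, and `majority_vote` would combine the per-tree labels for one input.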

S2, generating the meta decision tree classifiers of the random forest by constructing a CART decision tree from each training set, with the following specific steps:

The original sample set is divided into two subsets using the CART algorithm, so that each non-leaf node has two branches. When a node is split, the split is chosen according to the minimum Gini index rule. The Gini index of a probability distribution is:

$$\mathrm{Gini}(p) = \sum_{k=1}^{K} p_k (1 - p_k) = 1 - \sum_{k=1}^{K} p_k^2$$

where $K$ is the total number of classes of the data and $p_k$ is the probability that a sample in the node belongs to class $k$.

The Gini index of a sample set $D$ is calculated as:

$$\mathrm{Gini}(D) = 1 - \sum_{k=1}^{K} \left( \frac{|C_k|}{|D|} \right)^2$$

where $C_k$ is the subset of samples in $D$ that belong to the $k$-th class.

The Gini index of a node split is:

$$\mathrm{Gini}_{split}(D) = \frac{|D_1|}{|D|}\,\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Gini}(D_2)$$

where $D_1$ and $D_2$ are the two subsets into which $D$ is split.

The CART algorithm is a binary recursive partitioning method. It can process data containing missing values or outliers, and the binary tree it generates is simple, easy to understand and relatively accurate. It can handle both discrete and continuous variables and is widely applicable.
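The minimum-Gini splitting rule of S2 can be illustrated with a short sketch. It assumes the standard CART definitions, namely the Gini index of a label set and the size-weighted Gini index of a binary split, and scans thresholds on a single continuous feature. The function names are hypothetical and the code is not taken from the patent.

```python
from collections import Counter


def gini(labels):
    """Gini(D) = 1 - sum_k (|C_k| / |D|)^2 for a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_split(left, right):
    """Size-weighted Gini index of a binary split of D into D1 and D2."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_threshold(values, labels):
    """Scan candidate thresholds on one continuous feature and return
    (threshold, split Gini index) for the smallest split Gini index."""
    best = (None, float("inf"))
    for t in sorted(set(values))[:-1]:  # the largest value would leave D2 empty
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = gini_split(left, right)
        if score < best[1]:
            best = (t, score)
    return best
```

A full CART tree would apply `best_threshold` to every candidate feature at a node, split on the best one, and recurse on the two subsets.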

S3, setting parameters and constructing the random forest model and algorithm. A structure diagram of the random forest algorithm is shown in FIG. 4. The specific steps of constructing the random forest model and algorithm are as follows:

drawing N samples from the original samples by the Bootstrap method to form a training set, and repeating this N times to obtain N mutually independent training sets; training on the N training sets to obtain N CART decision trees; and setting aside the samples that were not drawn to form m items of out-of-bag data;

at each node of each CART decision tree, randomly selecting m attributes as candidate split attributes, choosing the best attribute among them according to its information content, and splitting on it; this is repeated until the leaf nodes are reached;

assembling the trained CART decision trees into a random forest model. For classification, the test sample is classified by the plurality of CART decision trees and the final class is obtained according to a preset proportion. For regression, the tree outputs are averaged to obtain the final prediction result.
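The three mechanisms described in S3 (out-of-bag data, random attribute selection at a node, and combining the tree outputs) can be sketched in a few lines. All function names are hypothetical. The Bootstrap draw here is with replacement, which is the conventional reading and an assumption with respect to the text.

```python
import random
from collections import Counter
from statistics import mean


def bootstrap_with_oob(n_samples, rng):
    """Return (in-bag indices, out-of-bag indices) for one tree."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    chosen = set(in_bag)
    oob = [i for i in range(n_samples) if i not in chosen]
    return in_bag, oob


def random_feature_subset(n_features, m, rng):
    """Pick m candidate attributes for one node split."""
    return sorted(rng.sample(range(n_features), m))


def aggregate(tree_outputs, task="regression"):
    """Combine per-tree outputs: mean for regression, vote for classification."""
    if task == "regression":
        return mean(tree_outputs)
    return Counter(tree_outputs).most_common(1)[0][0]
```

The out-of-bag indices are the samples a given tree never saw during training, so they can serve as a validation set for that tree.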

In addition, the specific steps of setting the parameters in S3 are as follows:

Six hyperparameters are selected in the experiment: the feature evaluation criterion, the maximum number of weak learners, the maximum number of features, the maximum depth of the decision trees, the minimum number of samples required to split an internal node, and the minimum number of samples at a leaf node.

The feature evaluation criterion is the index used to measure the quality of a split; typically the mean squared error or the mean absolute error is chosen. The maximum number of weak learners is the total number of trees in the forest. The maximum number of features is the number of attributes considered when searching for the best split at a node during training. The maximum depth of the decision trees is the greatest depth to which each tree may be split. The minimum number of samples required to split an internal node is the smallest number of samples an internal node must hold before it can be split. The minimum number of samples at a leaf node is the smallest number of samples that must be present at a leaf node.

The initial hyperparameter values are: feature evaluation criterion, mean absolute error; maximum number of weak learners, 800; maximum depth of the decision trees, 60; minimum number of samples required to split an internal node, 4; minimum number of samples at a leaf node, 4.
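For reference, the initial values above can be written as a plain configuration dictionary. The key names follow the naming used by scikit-learn's random forest estimators. That choice is an assumption made here for illustration, since the text does not name a library.

```python
# Initial hyperparameter values stated in the text. The key names are
# scikit-learn-style and are an assumption, not taken from the patent.
INITIAL_PARAMS = {
    "criterion": "absolute_error",  # feature evaluation criterion: mean absolute error
    "n_estimators": 800,            # maximum number of weak learners (trees)
    "max_depth": 60,                # maximum depth of each decision tree
    "min_samples_split": 4,         # minimum samples required to split an internal node
    "min_samples_leaf": 4,          # minimum samples at a leaf node
}
# The text gives no initial value for the maximum number of features,
# so that hyperparameter is left out rather than guessed.
```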

S4, inputting the 96-point load values and obtaining the prediction result by averaging the predicted values of the plurality of trees. An example diagram of short-term load prediction for the power system is shown in FIG. 5. The specific steps are as follows:

In the experiment, the inputs are the 96-point load values of the day before and of seven days before the day to be predicted, weather data for the day (maximum temperature, minimum temperature, average temperature, relative humidity, rainfall and the like), and a 1-dimensional date variable marking working day or weekend, giving 198 input dimensions in total. The output is the 96-dimensional vector of the 96 load points of the day to be predicted. A total of 250 days of data was used; when building the random forest model, 90% served as the training set and 10% as the test set.
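The dimension count can be checked directly: 96 + 96 load points, 5 weather values and 1 date flag give 198 inputs. The sketch below assembles such a vector and splits 250 days 90/10. It assumes exactly five weather values, which is inferred from the 198 total, and a chronological split, which the text does not specify. The function names are hypothetical.

```python
def build_feature_vector(load_prev_day, load_prev_week, weather, is_weekend):
    """Concatenate the inputs into one 198-dimensional vector.

    weather is assumed to hold five values: maximum, minimum and average
    temperature, relative humidity and rainfall.
    """
    if len(load_prev_day) != 96 or len(load_prev_week) != 96:
        raise ValueError("each load curve must have 96 points")
    if len(weather) != 5:
        raise ValueError("expected 5 weather values")
    return list(load_prev_day) + list(load_prev_week) + list(weather) + [int(is_weekend)]


def train_test_split_days(n_days, train_fraction=0.9):
    """Split day indices into training and test parts, in time order."""
    n_train = round(n_days * train_fraction)
    days = list(range(n_days))
    return days[:n_train], days[n_train:]
```

With 250 days and a 90% training fraction, this gives 225 training days and 25 test days.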

The parameters of the estimator are then optimized by grid search with cross-validation to obtain the optimal learning algorithm. The possible values of the parameters are combined, and all possible combinations are listed to form a grid. Each combination is then used for SVM (support vector machine) training, and its performance is evaluated by cross-validation. After the fitting function has tried all parameter combinations, it returns a suitable classifier that is automatically set to the optimal parameter combination. The optimal parameters found in this way are: feature evaluation criterion, mean absolute error; maximum number of weak learners, 1800; maximum depth of the decision trees, 20; minimum number of samples required to split an internal node, 2; minimum number of samples at a leaf node, 2. Finally, the prediction result is obtained by running the model.
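The grid search with cross-validation described above can be sketched without any machine learning library. The code enumerates every parameter combination, scores each one by its mean error over k folds, and keeps the best. The function names are hypothetical, and `fit_and_score` is a caller-supplied placeholder, an assumption here, that would train a model and return its validation error.

```python
from itertools import product
from statistics import mean


def parameter_grid(candidates):
    """Yield every combination of the candidate values as a dict."""
    names = list(candidates)
    for values in product(*(candidates[n] for n in names)):
        yield dict(zip(names, values))


def k_fold_indices(n_samples, k):
    """Yield (train_idx, valid_idx) pairs for k contiguous folds."""
    fold = n_samples // k
    for i in range(k):
        start = i * fold
        stop = n_samples if i == k - 1 else start + fold
        valid = list(range(start, stop))
        train = [j for j in range(n_samples) if j < start or j >= stop]
        yield train, valid


def grid_search(candidates, n_samples, k, fit_and_score):
    """Return (best_params, best_error) by mean cross-validated error.

    fit_and_score(params, train_idx, valid_idx) must train a model with
    `params` and return its validation error, e.g. mean absolute error.
    """
    best_params, best_error = None, float("inf")
    for params in parameter_grid(candidates):
        error = mean(fit_and_score(params, tr, va)
                     for tr, va in k_fold_indices(n_samples, k))
        if error < best_error:
            best_params, best_error = params, error
    return best_params, best_error
```

The number of model fits grows as the product of the candidate list lengths times k, which is why the grid is usually kept small.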

The theoretical basis of the random forest algorithm in step S3 is as follows:

(1) Margin function:

A random forest is composed of a series of decision trees $h(X, \theta_k)$, and its margin function is:

$$K(X, Y) = \mathrm{av}_k\, I\big(h(X, \theta_k) = Y\big) - \max_{j \neq Y} \mathrm{av}_k\, I\big(h(X, \theta_k) = j\big)$$

where $X$ is the input vector, which can belong to at most $J$ different classes, and $j$ is one of these $J$ classes; $Y$ is the correct class; $\mathrm{av}_k(\cdot)$ is the averaging function over the trees; and $I(\cdot)$ is the indicator function.

(2) Generalization error:

The generalization error of the forest is then:

$$PE^{*} = P_{X,Y}\big(K(X, Y) < 0\big)$$

where $P_{X,Y}$ is the probability over the distribution of $(X, Y)$, which gives the classification error rate for inputs $X$.

$K(X, Y) < 0$ means that the test input $X$ is misclassified, so the generalization error is the probability that the model misclassifies an input. The generalization error therefore reflects how well the random forest classifies test samples: the smaller the generalization error, the smaller the expected error of the model and the better the classification result.
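The function K(X, Y) from (1) and an empirical estimate of the generalization error can be computed from the per-tree votes. The sketch assumes Breiman's standard definition, the share of trees voting for the true class minus the largest share given to any other class. The function names are hypothetical.

```python
from collections import Counter


def margin(tree_votes, true_label):
    """K(X, Y): share of trees voting for the true class minus the
    largest share received by any other class."""
    n = len(tree_votes)
    counts = Counter(tree_votes)
    correct = counts.get(true_label, 0) / n
    wrong = max((c / n for label, c in counts.items() if label != true_label),
                default=0.0)
    return correct - wrong


def generalization_error(all_votes, true_labels):
    """Empirical estimate of PE*: fraction of samples with K(X, Y) < 0."""
    margins = [margin(v, y) for v, y in zip(all_votes, true_labels)]
    return sum(m < 0 for m in margins) / len(margins)
```

A positive margin means the forest votes for the correct class. The larger the margin, the more confident that vote is.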

(3) Strength:

The classification performance of a random forest is affected by the classification performance of its meta-classifiers; the strength of the random forest is the aggregate measure of the classification performance of the meta-classifiers. If all meta-classifiers perform well, the random forest will also classify well. The strength of the random forest is:

$$s = E_{X,Y}\big(K(X, Y)\big)$$

where $E_{X,Y}$ is the expectation.

(4) Generalization error upper bound:

$$PE^{*} \le \frac{\bar{\rho}\,(1 - s^{2})}{s^{2}}$$

where $s$ is the average strength of the decision trees and $\bar{\rho}$ is the average correlation coefficient between the decision trees.

The smaller the upper bound of the generalization error, the better the generalization. The above expression shows that the upper bound of the generalization error depends on two characteristics of the decision trees: their average strength and their average correlation coefficient. The larger the average strength of the decision trees and the smaller the average correlation coefficient, the better the generalization capability of the random forest. The prediction accuracy of a random forest can therefore be improved by increasing the strength of the meta-classifier decision trees and decreasing their average correlation coefficient.
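The effect of strength and correlation on the bound can be checked numerically. The sketch assumes the bound takes Breiman's standard form, the average correlation times (1 - s^2) / s^2, with s the mean of K(X, Y) over the samples. The function names are hypothetical.

```python
from statistics import mean


def strength(margins):
    """Empirical strength s: the mean of K(X, Y) over the samples."""
    return mean(margins)


def generalization_error_bound(mean_correlation, s):
    """Upper bound rho_bar * (1 - s^2) / s^2, defined for s > 0."""
    if s <= 0:
        raise ValueError("the bound requires positive strength")
    return mean_correlation * (1.0 - s * s) / (s * s)
```

Raising `s` or lowering `mean_correlation` both reduce the returned bound, which matches the qualitative statement above.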

Example 2

The embodiment of the invention provides a load prediction system based on Bagging sampling and an improved random forest algorithm, which comprises the following components:

a data acquisition module, configured to acquire a historical load value;

training each training set to obtain a corresponding CART decision tree;

all CART decision trees are assembled together to form a random forest model.

Example 3

The load prediction device based on Bagging sampling and improved random forest algorithm provided by the embodiment of the invention comprises a processor and a storage medium;

the storage medium is used for storing instructions;

the processor is configured to operate according to the instructions to perform the steps of the following method:

acquiring a historical load value;

the random forest model is constructed by the following method:

training each training set to obtain a corresponding CART decision tree;

all CART decision trees are assembled together to form a random forest model.

Example 4

The embodiment of the invention provides a computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the following method:

acquiring a historical load value;

the random forest model is constructed by the following method:

training each training set to obtain a corresponding CART decision tree;

all CART decision trees are assembled together to form a random forest model.

It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The foregoing is merely a preferred embodiment of the present invention, and it should be noted that modifications and variations could be made by those skilled in the art without departing from the technical principles of the present invention, and such modifications and variations should also be regarded as being within the scope of the invention.

Claims

1. A load prediction method based on Bagging sampling and an improved random forest algorithm is characterized by comprising the following steps:

acquiring a historical load value;

the random forest model is constructed by the following method:

training each training set to obtain a corresponding CART decision tree;

all CART decision trees are assembled together to form a random forest model.

2. The load prediction method based on Bagging sampling and an improved random forest algorithm according to claim 1, wherein the historical load values include the 96-point load values and environmental data of the day before and of seven days before the day to be predicted.

3. The load prediction method based on Bagging sampling and an improved random forest algorithm according to claim 1, wherein generating a plurality of training sets from the original sample set by random sampling without replacement using the Bagging algorithm comprises:

4. The load prediction method based on Bagging sampling and an improved random forest algorithm according to claim 1, wherein training on each training set to obtain a corresponding CART decision tree comprises:

5. The load prediction method based on Bagging sampling and improved random forest algorithm according to claim 1, wherein in the established random forest model, historical load values are tested and classified through a plurality of CART decision trees, and final classification is obtained according to a preset proportion, so that a prediction result is obtained.

6. The load prediction method based on Bagging sampling and an improved random forest algorithm according to claim 1, wherein the process of constructing the random forest model further comprises a step of setting parameters:

7. A load prediction system based on Bagging sampling and an improved random forest algorithm, comprising:

a data acquisition module, configured to acquire a historical load value;

training each training set to obtain a corresponding CART decision tree;

all CART decision trees are assembled together to form a random forest model.

8. A load prediction device based on Bagging sampling and an improved random forest algorithm, characterized by comprising a processor and a storage medium;

the storage medium is used for storing instructions;

the processor being operative according to the instructions to perform the steps of the method according to any one of claims 1 to 6.

9. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the steps of the method according to any one of claims 1 to 6.