CN104881706B

CN104881706B - Power system short-term load forecasting method based on big data technology

Info

Publication number: CN104881706B
Application number: CN201410851910.0A
Authority: CN
Inventors: 张沛
Original assignee: Tianjin Hongyuan Huineng Technology Co Ltd
Current assignee: Tianjin Hongyuan Huineng Technology Co Ltd
Priority date: 2014-12-31
Filing date: 2014-12-31
Publication date: 2018-05-25
Anticipated expiration: 2034-12-31
Also published as: CN104881706A

Abstract

The present invention provides a power system short-term load forecasting method based on big data technology. It uses data mining to forecast load at the user level and sums the user forecasts to form the system load. The method comprises the following steps: load curve cluster analysis, which groups load curves with similar shape features into one class; establishing key influence factors, which reduces the classification rules and simplifies the prediction model; establishing classification rules with the CART decision tree algorithm to obtain the agglomerative hierarchical clustering analysis result; classifying the day to be predicted; training the prediction models and predicting, where the support vector machine model corresponding to the classification result of the day to be predicted is selected to complete the prediction; and computing the system load, a step completed on a Hadoop big data computing platform. The invention provides a load forecasting framework at the user level and uses data mining to discover users' electricity consumption patterns, improving the accuracy of load forecasting.

Description

Power system short-term load prediction method based on big data technology

Technical Field

The invention relates to the technical field of power system engineering, in particular to a power system short-term load prediction method based on a big data technology.

Background

Short-term load prediction results for a power system inform the formulation of its dispatching operations and production plans, and accurate short-term load prediction helps improve the safety and stability of the system and reduce the power generation cost. With the large-scale integration of distributed energy resources (solar energy, wind energy, energy storage, and the like) into the power system, the pattern of load variation has become harder to capture, and this uncertainty increases the difficulty of load prediction. A prediction method that can better capture the pattern of load variation is therefore needed.

Users are the most basic components of the power grid and are also the source of grid load fluctuation. However, current load prediction methods target system-level load prediction and, at the finest granularity, bus-level prediction. It is therefore necessary to develop a load prediction framework at the user level and to use data mining to discover users' electricity consumption patterns, so as to improve the accuracy of load prediction.

Disclosure of Invention

The invention provides a power system short-term load prediction method based on a big data technology, which effectively solves the problem of low load prediction precision caused by complex power utilization rules of users.

In order to achieve the purpose, the invention adopts the technical scheme that: a power system short-term load prediction method based on big data technology comprises the following steps:

(1) Cluster analysis of load curves: performing agglomerative hierarchical clustering, with one day as the unit, on the historical load data of the year before the day to be predicted, and grouping load curves with similar shape characteristics into one class;

(2) Establishing key influence factors: calculating grey correlation analysis results from the historical load and weather data, and sorting the results to obtain the key influence factors affecting the load;

(3) Establishing classification rules: taking the hierarchical clustering result and the key influence factors as input, and building a decision tree with the CART algorithm to obtain the agglomerative hierarchical clustering analysis result;

(4) Classifying the day to be predicted: inputting the key factor feature vector of the day to be predicted into the decision tree to obtain the classification result of the day to be predicted;

(5) Training the prediction models and predicting: selecting the historical load data of the corresponding class to train support vector machine models, and selecting the corresponding support vector machine model to complete the prediction according to the classification result of the day to be predicted obtained in step (4);

(6) Calculating the system load: repeating the above steps for all users in the target power grid, then accumulating all user loads and adding the network loss load to obtain the system-level load of the whole power grid.

Further, the step (1) specifically comprises the following steps:

The clustering algorithm adopted is an improved agglomerative hierarchical clustering algorithm in which the difference in each dimension of the Euclidean distance is normalized by the maximum value, as shown in the following formula:

$$d_{12} = \sqrt{\sum_{k=1}^{n} \left( \frac{x_{1k} - x_{2k}}{x_{\max}} \right)^{2}}$$

where each day is one load sequence; $n$ indicates that the load sequence is an $n$-dimensional vector, usually 96-dimensional; $d_{12}$ is the spatial distance between load sequence 1 and load sequence 2; $x_{1k}$ is the $k$-th dimension of the first load sequence and $x_{2k}$ is the $k$-th dimension of the second load sequence; and $x_{\max}$ is the maximum value in the $k$-th dimension over all load sequences.

The historical load data of the year before the day to be predicted is used as the historical data set. The hierarchical clustering algorithm with the improved Euclidean distance is applied: the load data of the n points of each day form a vector, and the normalized Euclidean distance between the vectors is calculated, so that the vectors are gradually merged from scattered independent samples into several classes with similar trends.
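The distance described above can be illustrated with a minimal Python sketch. The function names, the handling of a zero maximum, and the 3-point sample curves are assumptions made for illustration; they are not taken from the patent.

```python
import math


def column_maxima(sequences):
    """Maximum of each dimension over all daily load sequences."""
    return [max(column) for column in zip(*sequences)]


def max_normalized_distance(seq1, seq2, col_max):
    """Euclidean distance whose per-dimension differences are divided by
    the maximum of that dimension before squaring."""
    if not (len(seq1) == len(seq2) == len(col_max)):
        raise ValueError("all inputs must have the same length")
    total = 0.0
    for a, b, m in zip(seq1, seq2, col_max):
        scale = m if m != 0 else 1.0  # assumption: a dimension whose maximum is 0 is left unscaled
        total += ((a - b) / scale) ** 2
    return math.sqrt(total)


# Hypothetical 3-point daily curves (real curves would have 96 points).
days = [[10.0, 20.0, 40.0], [12.0, 18.0, 40.0], [5.0, 10.0, 20.0]]
col_max = column_maxima(days)
d12 = max_normalized_distance(days[0], days[1], col_max)
```

Dividing by the per-dimension maximum keeps sampling points with large absolute loads from dominating the distance, so the comparison depends more on curve shape than on magnitude.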

Further, the step (2) specifically comprises the following steps:

A grey correlation analysis algorithm is used to calculate the grey correlation degree of each factor. The historical load data, meteorological data, and day type data of the year before the day to be predicted are taken as the analysis samples; the load value is set as the mother sequence, and the weather factors and the day type are set as the subsequences. The correlation between each subsequence and the mother sequence is analyzed with the grey correlation analysis algorithm, and the grey correlation degrees of each day in the year are then averaged to obtain the grey correlation degree of each influence factor. The grey correlation degrees are sorted, and the 4 factors with the largest values are selected as the key influence factors affecting the load. The specific steps are as follows:

(a) Determining a normalized attribute matrix;

The historical load data values form the mother sequence $Y = \{y_1, y_2, \ldots, y_p\}^T$, and each influence factor forms a subsequence $X_i = \{x_{1i}, x_{2i}, \ldots, x_{pi}\}^T$; the matrix A can then be obtained as follows:

where p represents p samples, q represents q influence factors to be analyzed, x represents a factor sequence, and y represents the load sequence.

(b) The A matrix is then normalized by the mean value as follows:

where $x_i(t)$ is the value of the $i$-th factor at time $t$, $d_i$ is the averaging operator of each sequence, and the average is taken over the elements of each column. The mother sequence Y is normalized in the same way, and its averaging operator is denoted D;

(c) The A matrix is normalized as follows:

(d) Calculating the correlation coefficient

The correlation coefficient $\xi_i(t)$ between factor $X_i$ and the load sequence $Y$ at index $t$ has the geometric meaning of the relative difference between curve $X_i$ and curve $Y$ at time $t$. It is calculated as follows:

$$\xi_i(t) = \frac{\Delta_{\min} + \rho\,\Delta_{\max}}{\left| m_k(t) - e_i(t) \right| + \rho\,\Delta_{\max}}$$

where $\Delta_{\max}$ is the maximum value of $|m_k(t) - e_i(t)|$, $\Delta_{\min}$ is the minimum value of $|m_k(t) - e_i(t)|$, and $|m_k(t) - e_i(t)|$ is the value at time $t$; $\rho$ is the resolution coefficient, which increases the difference between the correlation coefficients, is generally chosen between 0 and 1, and is usually taken as $\rho = 0.5$;

(e) Determining key contributor rankings

On the basis of the above correlation coefficients, the degree of association between factor $X_i$ and the load $Y$ can be calculated as:

$$r_i = \frac{1}{p} \sum_{t=1}^{p} \xi_i(t)$$

In general, the grey correlation value $r_i$ lies between 0 and 1. The closer the value is to 1, the greater the degree of linear correlation between the variables X and Y; the closer the absolute value of $r_i$ is to 0, the weaker the linear correlation between X and Y. $0 < r_i < 1$ indicates that X and Y are correlated, but nonlinearly; $|r_i| \ge 0.6$ is regarded as highly correlated; $0.2 \le |r_i| < 0.6$ is regarded as moderately correlated; and $|r_i| < 0.2$ is regarded as an extremely weak correlation that can be ignored.
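Steps (a) to (e) can be sketched in Python as follows. This is a minimal illustration that assumes the widely used form of grey correlation (grey relational) analysis: each sequence is divided by its own mean, the absolute differences to the normalized load sequence play the role of $|m_k(t) - e_i(t)|$, and the degree of association is the mean of the coefficients. The function names and the sample values are hypothetical.

```python
def mean_normalize(seq):
    """Divide a sequence by its own mean (the mean-value normalization step)."""
    mean = sum(seq) / len(seq)
    if mean == 0:
        raise ValueError("cannot mean-normalize a sequence whose mean is zero")
    return [value / mean for value in seq]


def grey_correlation_degrees(load, factors, rho=0.5):
    """Return {factor name: degree of association with the load sequence}."""
    y = mean_normalize(load)
    deltas = {}
    for name, seq in factors.items():
        x = mean_normalize(seq)
        deltas[name] = [abs(a - b) for a, b in zip(y, x)]
    all_deltas = [d for ds in deltas.values() for d in ds]
    d_min, d_max = min(all_deltas), max(all_deltas)
    if d_max == 0:  # every factor coincides with the load after normalization
        return {name: 1.0 for name in deltas}
    degrees = {}
    for name, ds in deltas.items():
        xi = [(d_min + rho * d_max) / (d + rho * d_max) for d in ds]
        degrees[name] = sum(xi) / len(xi)
    return degrees


# Hypothetical samples: one factor proportional to the load, one unrelated.
load = [100.0, 120.0, 140.0, 160.0]
factors = {"max_temp": [10.0, 12.0, 14.0, 16.0], "wind": [5.0, 1.0, 4.0, 2.0]}
degrees = grey_correlation_degrees(load, factors)
ranking = sorted(degrees, key=degrees.get, reverse=True)
```

Sorting the factor names by their degree gives the ranking from which the leading factors are taken as key influence factors.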

Further, the step (3) specifically includes the following steps:

The algorithm adopted is the CART decision tree algorithm. At each node other than the leaf nodes, the key influence factor with the minimum Gini index is selected and the historical load data set of the current node is divided into two subsets, until the final classification result matches the clustering result of step 1. This process learns the coupling between the historical load and key influence factor data and the clustering result, and expresses the classification rules clearly and completely.

Further, the step (5) specifically includes the following steps:

For the classification result of step 1, training samples are constructed from the load data of each class and the corresponding key factor data, and multiple support vector machine models are trained. The RBF kernel function is selected as the kernel function of the support vector machine, and a grid optimization method, i.e. an exhaustive search, is used for parameter optimization;

according to the classification result of the day to be predicted obtained in step 4, the corresponding support vector machine model is selected to complete the prediction.

Further, the step (6) is completed on a Hadoop big data computing platform.

The advantages and positive effects of the invention are as follows: the invention provides a power system short-term load forecasting method based on big data technology that uses data mining to discover users' electricity consumption patterns. This improves load forecasting accuracy, improves the safety and stability of the power system, and can reduce the power generation cost.

Drawings

FIG. 1 is a schematic structural framework of the present invention;

FIG. 2 is a flow chart of the algorithm of the present invention;

FIG. 3 is the hierarchical clustering tree for user #1;

FIG. 4 shows the load curves of the 6 classes for user #1;

FIG. 5 is a big data computing platform framework diagram;

FIG. 6 is a diagram illustrating the effect of a method for predicting the short-term load of an electric power system based on big data technology.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.

As shown in fig. 2, a method for predicting short-term load of an electric power system based on big data technology includes the following steps:

S1, inputting historical load data;

S2, clustering the input historical load data with an improved hierarchical clustering algorithm;

The trend of the load curve is closely related to the day type, weather factors, and the like. Through cluster analysis of the curves, load curves with similar shape characteristics can be grouped into one class.

The cluster analysis algorithm adopted by the invention is an improved agglomerative hierarchical clustering algorithm in which the difference in each dimension of the Euclidean distance is normalized by the maximum value, as shown in the following formula:

$$d_{12} = \sqrt{\sum_{k=1}^{n} \left( \frac{x_{1k} - x_{2k}}{x_{\max}} \right)^{2}}$$

where each day is one load sequence; $n$ indicates that the load sequence is an $n$-dimensional vector (usually 96-dimensional); $d_{12}$ is the spatial distance between load sequence 1 and load sequence 2; $x_{1k}$ is the $k$-th dimension of the first load sequence and $x_{2k}$ is the $k$-th dimension of the second load sequence; and $x_{\max}$ is the maximum value in the $k$-th dimension over all load sequences.

The historical load data of the year before the day to be predicted is taken as the historical data set. The hierarchical clustering algorithm with the normalized Euclidean distance is applied: the load data of the n points of each day form a vector, and the normalized Euclidean distance between the vectors is calculated, so that the vectors are gradually merged from scattered independent samples into several classes with similar trends.
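The merging process can be sketched as a naive agglomerative clustering loop in Python. The patent does not state the linkage criterion, so average linkage is assumed here; the function names and the sample curves are hypothetical, and the quadratic search is meant for illustration rather than for a full year of data across many users.

```python
import math


def normalized_distance(a, b, col_max):
    """Euclidean distance with each difference divided by the dimension maximum."""
    return math.sqrt(sum(((x - y) / m) ** 2 for x, y, m in zip(a, b, col_max)))


def agglomerative_clusters(days, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    Returns a list of clusters, each a list of indices into `days`.
    """
    col_max = [max(column) or 1.0 for column in zip(*days)]  # avoid dividing by 0
    clusters = [[i] for i in range(len(days))]

    def linkage(c1, c2):
        # Average linkage (an assumption): mean pairwise distance between members.
        dists = [normalized_distance(days[i], days[j], col_max) for i in c1 for j in c2]
        return sum(dists) / len(dists)

    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical 3-point daily curves forming two visibly different shapes.
days = [[1.0, 2.0, 3.0], [1.1, 2.1, 3.1], [10.0, 20.0, 30.0], [11.0, 19.0, 31.0]]
result = agglomerative_clusters(days, 2)
```

Each sample starts as its own cluster, which mirrors the bottom-up merging from scattered independent samples described above.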

S3, in parallel, finding the key influence factors with a grey correlation analysis algorithm from the input historical load data, weather information, and historical day type data;

The invention uses a grey correlation analysis algorithm to calculate the grey correlation degree of each factor (such as daily maximum air temperature, daily average air temperature, average humidity, and day type (day of the week)). The historical load data, meteorological data, and day type data of the year before the day to be predicted are taken as the analysis samples; the load value is set as the mother sequence, and the weather factors and the day type are set as the subsequences. The correlation between each subsequence and the mother sequence is analyzed with the grey correlation analysis algorithm, and the grey correlation degrees of each day in the year are then averaged to obtain the grey correlation degree of each influence factor. The grey correlation degrees are sorted, and the 4 factors with the largest values are selected as the key influence factors affecting the load.

S4, generating a decision tree with the CART algorithm on the basis of S2 and S3;

At each node (except the leaf nodes), the key influence factor with the minimum Gini index is selected and the historical load data set of the current node is divided into two subsets, until the final classification result is consistent with the clustering result of S2. This process learns the coupling between the historical load and key influence factor data and the clustering result, and expresses the classification rules clearly and completely.

S5, forming N historical data sample sets according to the N classes of the clustering result of S2;

S6, storing the classification rules represented by the decision tree generated in S4;

S7, training the corresponding N × 96 support vector machine models on the N historical data sample sets (96 is the number of sampling points of the daily load data, so each sampling point has its own support vector machine model);

For the classification result of step (1), training samples are constructed from the load data of each class and the corresponding key factor data, and multiple support vector machine models are trained. The RBF kernel function is selected as the kernel function of the support vector machine, so the parameters that must be determined are the kernel function parameter δ², the insensitivity coefficient ε, and the penalty parameter c. These parameters are optimized with a grid optimization method, i.e. an exhaustive search.
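The exhaustive grid optimization can be sketched in Python as follows. The sketch is independent of any particular support vector machine library: `evaluate` is a hypothetical callable that would train a model with the given parameters and return its validation error, and the toy error function and candidate values below are placeholders, not values from the patent.

```python
import itertools


def grid_search(evaluate, c_values, epsilon_values, delta2_values):
    """Try every (c, epsilon, delta2) combination and keep the lowest error."""
    best_params, best_error = None, float("inf")
    for c, eps, delta2 in itertools.product(c_values, epsilon_values, delta2_values):
        error = evaluate(c, eps, delta2)
        if error < best_error:
            best_params, best_error = (c, eps, delta2), error
    return best_params, best_error


def toy_validation_error(c, eps, delta2):
    """Stand-in for training a model and measuring its validation error."""
    return (c - 10) ** 2 + (eps - 0.1) ** 2 + (delta2 - 1.0) ** 2


params, error = grid_search(
    toy_validation_error,
    c_values=[1, 10, 100],
    epsilon_values=[0.01, 0.1, 0.5],
    delta2_values=[0.1, 1.0, 10.0],
)
```

The cost grows with the product of the three candidate lists and is paid once per model, which matters when N × 96 models are trained for each user.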

S8, forming the key influence factor vector of the day to be predicted from the key influence factor information of that day;

S9, inputting the vector from S8 into the classification rules from S6 to obtain the class of the day to be predicted;

S10, selecting the corresponding models from S7 for prediction according to the class of the day to be predicted obtained in S9;

S11, performing the above operations for all users in the target power grid;

S12, summing the prediction results of all users;

S13, outputting the summed prediction result, i.e. the system load.
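Steps S8 to S13 can be sketched in Python as follows. The classifier and the per-point models are hypothetical stand-ins for the stored decision tree and the trained support vector machine models, and the optional network loss term follows the summary of step (6); none of the names or values come from the patent.

```python
def forecast_user(day_features, classify, models_by_class):
    """S9-S10: classify the day, then run that class's per-point models."""
    label = classify(day_features)
    return [model(day_features) for model in models_by_class[label]]


def system_load(user_forecasts, network_loss=None):
    """S12-S13: sum the user forecasts point by point, optionally adding losses."""
    totals = [sum(points) for points in zip(*user_forecasts)]
    if network_loss is not None:
        totals = [t + loss for t, loss in zip(totals, network_loss)]
    return totals


def constant(value):
    """Hypothetical stand-in for a trained per-point model."""
    return lambda features: value


# Two hypothetical users, 3 points per day instead of 96.
models = {0: [constant(1.0), constant(2.0), constant(3.0)],
          1: [constant(0.5), constant(1.0), constant(1.5)]}
weekend_rule = lambda f: 1 if f["day_type"] >= 6 else 0  # stand-in for a stored decision tree
always_class_0 = lambda f: 0

day = {"day_type": 6}  # S8: key influence factor vector of the day to be predicted
forecasts = [forecast_user(day, weekend_rule, models),
             forecast_user(day, always_class_0, models)]
total = system_load(forecasts, network_loss=[0.1, 0.1, 0.1])
```

Because each user has its own classification rules and models, the per-user forecasts are independent of one another and can be computed in parallel before the final summation.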

Take the city-level system load forecast of a city in Zhejiang province as an example. The city has 13 substations at the 220 kV level and 1.2 million users in total. The city also has an electricity consumption information acquisition system, from which 96 equally spaced load sampling points per day can be obtained for each user.

Step 1 (corresponding to S1 and S2): user #1 is selected. The load data of this user for the 1 year before the day to be predicted is in the following format:

Each row of data in the table is a 96-dimensional data sample, and the normalized Euclidean distance between every two vectors is calculated with the following formula:

$$d_{12} = \sqrt{\sum_{k=1}^{n} \left( \frac{x_{1k} - x_{2k}}{x_{\max}} \right)^{2}}$$

The samples with the shortest distance are merged according to the calculation result. As shown in FIG. 3, the bottom layer of the tree consists of the 365 samples of historical load data, which are merged step by step from bottom to top and finally grouped into 6 classes. The six colors represent the 6 classes, and FIG. 4 shows the load curves of the six classes.

Step 2 (corresponding to S3): the key influence factors affecting load change are found with the grey correlation analysis algorithm from the historical load data, the historical weather factor data, and the day type data. The data format is as follows:

The first 7 columns of the data table are taken as subsequences and the 8th column as the mother sequence, and the calculation uses the following grey correlation analysis algorithm:

The grey correlation analysis method is applied to the determination of the key factors influencing load change through the following steps:

(a) Determining a normalized attribute matrix;

The historical load data values form the mother sequence $Y = \{y_1, y_2, \ldots, y_p\}^T$, and each influence factor forms a subsequence $X_i = \{x_{1i}, x_{2i}, \ldots, x_{pi}\}^T$; the matrix A can then be obtained as follows:

(b) The A matrix is then normalized by the mean value as follows:

where $x_i(t)$ is the value of the $i$-th factor at time $t$, $d_i$ is the averaging operator of each sequence, and the average is taken over the elements of each column. The mother sequence Y is normalized in the same way, and its averaging operator is denoted D.

(c) The A matrix is normalized as follows:

(d) Calculating the correlation coefficient

The correlation coefficient $\xi_i(t)$ between factor $X_i$ and the load sequence $Y$ at index $t$ has the geometric meaning of the relative difference between curve $X_i$ and curve $Y$ at time $t$. It is calculated as follows:

$$\xi_i(t) = \frac{\Delta_{\min} + \rho\,\Delta_{\max}}{\left| m_k(t) - e_i(t) \right| + \rho\,\Delta_{\max}}$$

where $\Delta_{\max}$ is the maximum value of $|m_k(t) - e_i(t)|$, $\Delta_{\min}$ is the minimum value of $|m_k(t) - e_i(t)|$, and $|m_k(t) - e_i(t)|$ is the value at time $t$. $\rho$ is the resolution coefficient, which increases the difference between the correlation coefficients, is generally chosen between 0 and 1, and is usually taken as $\rho = 0.5$.

(e) Determining key influencing factor ranking

In general, the grey correlation value $r_i$ lies between 0 and 1. The closer the value is to 1, the greater the degree of linear correlation between the variables X and Y; the closer the absolute value of $r_i$ is to 0, the weaker the linear correlation between X and Y. $0 < r_i < 1$ indicates that X and Y are correlated, but nonlinearly; $|r_i| \ge 0.6$ is regarded as highly correlated; $0.2 \le |r_i| < 0.6$ is regarded as moderately correlated; and $|r_i| < 0.2$ is regarded as an extremely weak correlation that can be ignored.

Through the above calculation, the grey correlation degrees of the 6 influence factors of user #1 are obtained as follows:

The average precipitation and the average wind speed are judged to be extremely weakly correlated factors, and the remaining factors (maximum air temperature, mean temperature, average humidity, and day type) are the key influence factors affecting the load trend.

Step 3 (corresponding to S4 and S6): a decision tree is generated with the CART algorithm from the clustering result of user #1's 2012 load data and the key influence factors. The data format is as follows:

Date (user #1)	Maximum air temperature	Mean temperature	Average humidity	Type of day	Clustering result
2012/01/01	7.7	3.65	75.2	7	1
2012/01/02	7.9	2.89	68.6	1	4
…	…	…	…	…	…
2012/12/31	6.6	0.09	61.3	1	5

Description of the data format: the 4 factor columns in the table are the 4 key influence factors affecting the load trend of user #1 found in implementation step 2, and the clustering result indicates which of the 6 classes the historical load curve of that day was assigned to in implementation step 1.

The last column of the table is taken as the class of the final leaf nodes of the decision tree, and the four key influence factors are taken as the candidate set for splitting the decision tree nodes. At each node, the algorithm determines which value of which attribute gives the optimal split, as follows:

The CART decision tree is a binary recursive partitioning technique that divides the current sample set into two subsets at each node (except the leaf nodes). Unlike methods based on information gain from information theory, the attribute selection measure used by the CART algorithm is the Gini index. The Gini index measures the impurity of the training sample set D. Assuming that the data set D contains m classes, the Gini index is calculated as:

$$\mathrm{Gini}(D) = 1 - \sum_{j=1}^{m} p_j^{2}$$

where $p_j$ is the frequency of occurrence of class $j$ elements. The Gini index considers a binary partition of each attribute. Assuming that a binary partition on attribute A divides D into $D_1$ and $D_2$, the Gini index of the sample set D under this partition is:

$$\mathrm{Gini}_A(D) = \frac{|D_1|}{|D|}\,\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Gini}(D_2)$$

For each attribute, every possible binary partition is considered, and the partition that yields the smallest Gini index is selected as the split for that attribute. From the above formula, the smaller $\mathrm{Gini}_A(D)$ is, the better the partition on attribute A. Under this rule the tree is split from top to bottom until the whole decision tree has been grown. The decision tree is then stored as the classification rule.
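The split selection at a single node can be sketched in Python as follows. The sketch assumes a numeric attribute split by a threshold of the form "value ≤ t", which is one common way to form the binary partition; the function names and the sample values are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_j ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_split(values, labels, threshold):
    """Weighted Gini index after splitting on value <= threshold."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_threshold(values, labels):
    """Threshold with the smallest weighted Gini index, or None if no split exists."""
    candidates = sorted(set(values))[:-1]  # splitting at the maximum leaves one side empty
    if not candidates:
        return None
    return min(candidates, key=lambda t: gini_split(values, labels, t))


# Hypothetical maximum temperatures and cluster labels for four days.
temps = [5.0, 7.0, 20.0, 25.0]
clusters = [1, 1, 2, 2]
threshold = best_threshold(temps, clusters)
```

A full tree would repeat this search over every candidate attribute at each node and recurse into the two resulting subsets.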

Step 4 (corresponding to S5, S7):

Implementation step 4 can be carried out at the same time as implementation step 2; it mainly completes the training of the prediction models. According to the clustering result of implementation step 1, the historical data of each class is organized into a sample set in the following format:

Description of the data format: the table holds the data of the first class. Because one support vector machine model is required per data dimension, 96 support vector machine models must be trained and stored for this class.
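The organization of the sample sets can be sketched in Python as follows: one training set per combination of class and sampling point, which is what leads to N × 96 models. The data layout, the function name, and the sample values are assumptions made for illustration.

```python
from collections import defaultdict


def build_training_sets(days):
    """Group samples by (cluster label, sampling point).

    `days` is an iterable of (features, load_curve, cluster_label) tuples.
    Returns {label: {point_index: [(features, load), ...]}}.
    """
    sets = defaultdict(lambda: defaultdict(list))
    for features, curve, label in days:
        for point, load in enumerate(curve):
            sets[label][point].append((features, load))
    return sets


# Hypothetical days with 2 sampling points each; features follow the order
# (maximum temperature, mean temperature, average humidity, day type).
days = [
    ([7.7, 3.6, 75.0, 7], [100.0, 110.0], 1),
    ([7.9, 2.9, 69.0, 1], [90.0, 95.0], 4),
    ([6.6, 0.1, 61.0, 1], [80.0, 85.0], 1),
]
training_sets = build_training_sets(days)
n_models = sum(len(points) for points in training_sets.values())
```

Each inner list would be handed to one regression model, so the number of models equals the number of classes times the number of sampling points per day.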

Step 5 (corresponding to S8, S9 and S10): the load values at the 96 points of April 29, 2013 are selected as the prediction target. The weather forecast information and day type information for that day are as follows:

Date (user #1)	Maximum air temperature	Mean temperature	Average humidity	Type of day
2013/04/29	21℃	12.75℃	54.3%	6 (Saturday)

This vector is input into the classification rule established in implementation step 3, which gives class 2 as the classification result. The support vector machine models of class 2 established in implementation step 4 are then selected to predict the load at the 96 points, and the result is output and stored.

Step 6 (corresponding to S11, S12 and S13) is performed:

The implementation of this step is completed on a Hadoop big data computing platform, which is an open-source data platform. The core designs of the Hadoop framework are HDFS and MapReduce; FIG. 5 is a framework diagram of the big data platform. HDFS provides storage for massive data, and MapReduce provides parallel computation over that data. The big data platform used here contains 4 servers, each configured with two E5-2630V2 CPUs and 500 GB of storage space. The operations of steps 1 to 5 are performed for the 1.2 million users of the city in Zhejiang province, and the calculation time is shown in the following table:

The load prediction results of the 1.2 million users are accumulated to obtain the final system load. The predicted results are shown in FIG. 6.
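The accumulation can be expressed in the map and reduce style that MapReduce uses. The following Python sketch runs the map, shuffle, and reduce phases in a single process to show the data flow only; it does not use the Hadoop API, and the function names and sample forecasts are hypothetical.

```python
from collections import defaultdict


def map_forecast(user_id, forecast):
    """Map phase: emit one (sampling point, load) pair per forecast value."""
    for point, load in enumerate(forecast):
        yield point, load


def reduce_loads(point, loads):
    """Reduce phase: sum all user loads that share a sampling point."""
    return point, sum(loads)


def run_local_job(user_forecasts):
    """Run map, shuffle (group by key), and reduce in a single process."""
    grouped = defaultdict(list)
    for user_id, forecast in user_forecasts.items():
        for point, load in map_forecast(user_id, forecast):
            grouped[point].append(load)
    return [reduce_loads(point, grouped[point])[1] for point in sorted(grouped)]


# Hypothetical forecasts with 2 points per day instead of 96.
system = run_local_job({"u1": [1.0, 2.0], "u2": [3.0, 4.0], "u3": [0.5, 0.5]})
```

Keying the intermediate pairs by sampling point means each reducer only needs the values for its own points, which is what allows the summation over many users to be distributed.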

The traditional method has a maximum relative error of 3.36%, a minimum relative error of 0.51%, and an average relative error of 1.68%; the method of the invention has a maximum relative error of 1.35%, a minimum relative error of 0.07%, and an average relative error of 1.68%.
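Error statistics of this kind can be computed as in the following Python sketch. The patent does not give its error formula, so the point-wise relative error |predicted − actual| / |actual| is assumed; the function name and the sample values are hypothetical and are not the patent's data.

```python
def relative_error_stats(actual, predicted):
    """Return (max, min, mean) of the point-wise relative error in percent."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [abs(p - a) / abs(a) * 100.0 for a, p in zip(actual, predicted)]
    return max(errors), min(errors), sum(errors) / len(errors)


# Hypothetical system load values at three sampling points.
actual = [100.0, 200.0, 400.0]
predicted = [102.0, 199.0, 400.0]
worst, best, mean = relative_error_stats(actual, predicted)
```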

The embodiments of the present invention have been described in detail, but the description is only for the preferred embodiments of the present invention and should not be construed as limiting the scope of the present invention. All equivalent changes and modifications made within the scope of the present invention should be covered by the present patent.

Claims

1. A method for predicting short-term load of a power system based on big data technology comprises the following steps:

(1) Cluster analysis of load curves: performing agglomerative hierarchical clustering, with one day as the unit, on the historical load data of the year before the day to be predicted, and grouping load curves with similar shape characteristics into one class;

(2) Establishing key influence factors: calculating grey correlation analysis results from the historical load and weather data, and sorting the results to obtain the key influence factors affecting the load;

(3) Establishing classification rules: taking the hierarchical clustering result and the key influence factors as input, and building a decision tree with the CART algorithm to obtain the agglomerative hierarchical clustering analysis result;

2. The method for predicting the short-term load of the power system based on the big data technology as claimed in claim 1, wherein: the step (1) specifically comprises the following steps:

wherein each day is one load sequence; $n$ indicates that the load sequence is an $n$-dimensional vector, usually 96-dimensional; $d_{12}$ is the spatial distance between load sequence 1 and load sequence 2; $x_{1k}$ is the $k$-th dimension of the first load sequence and $x_{2k}$ is the $k$-th dimension of the second load sequence; and $x_{\max}$ is the maximum value in the $k$-th dimension over all load sequences;

the historical load data of the year before the day to be predicted is used as the historical data set, the hierarchical clustering algorithm with the improved Euclidean distance is applied, the load data of the n points of each day form a vector, and the normalized Euclidean distance between the vectors is calculated, so that the vectors are gradually merged from scattered independent samples into several classes with similar trends.

3. The method for predicting the short-term load of the power system based on the big data technology as claimed in claim 1, wherein: the step (2) specifically comprises the following steps:

calculating the grey correlation degree of each factor by adopting a grey correlation analysis algorithm, taking historical load data, meteorological data and a day type data set of a year before the forecast date as analysis samples, setting a mother sequence as a load value, and setting weather factors and day types as a plurality of subsequences; analyzing the correlation between each subsequence and the parent sequence by adopting a gray correlation analysis algorithm, and finally averaging the gray correlation degrees of each day in one year to obtain the gray correlation degree of each influence factor; the grey relevance degrees are sorted, the first 4 with larger values are selected as key influence factors influencing the load, and the specific steps are as follows:

(a) Determining a normalized attribute matrix;

the historical load data values form the mother sequence $Y = \{y_1, y_2, \ldots, y_p\}^T$, and each influence factor forms a subsequence $X_i = \{x_{1i}, x_{2i}, \ldots, x_{pi}\}^T$; the matrix A can then be obtained as follows:

in the formula, p represents p samples, q represents q influencing factors to be analyzed, x represents a factor sequence, and y represents a load sequence;

(b) The A matrix is then normalized by the mean value as follows:

in the formula, $x_i(t)$ is the value of the $i$-th factor at time $t$, $d_i$ is the averaging operator of each sequence, and the average is taken over the elements of each column; the mother sequence Y is normalized in the same way, and its averaging operator is denoted D;

(c) The A matrix is normalized as follows:

(d) Calculating the correlation coefficient

in the formula, $\Delta_{\max}$ is the maximum value of $|m_k(t) - e_i(t)|$, $\Delta_{\min}$ is the minimum value of $|m_k(t) - e_i(t)|$, and $|m_k(t) - e_i(t)|$ is the value at time $t$; $\rho$ is the resolution coefficient, which increases the difference between the correlation coefficients, is generally chosen between 0 and 1, and is usually taken as $\rho = 0.5$;

(e) Determining key influencing factor ranking

in general, the grey correlation value $r_i$ lies between 0 and 1; the closer the value is to 1, the greater the degree of linear correlation between the variables X and Y; the closer the absolute value of $r_i$ is to 0, the weaker the linear correlation between X and Y; $0 < r_i < 1$ indicates that X and Y are correlated, but nonlinearly; $|r_i| \ge 0.6$ is regarded as highly correlated; $0.2 \le |r_i| < 0.6$ is regarded as moderately correlated; and $|r_i| < 0.2$ is regarded as an extremely weak correlation that can be ignored.

4. The method for predicting the short-term load of the power system based on the big data technology as claimed in claim 1, wherein: the step (3) specifically comprises the following steps:

the adopted algorithm is a CART decision tree algorithm, the key influence factor with the minimum Gini index is selected at each node except leaf nodes, the historical load data set of the current node is divided into two subsets until the final classification result is matched with the clustering result in the step 1, the process finishes learning the coupling relation between the historical load and the key influence factor data and the clustering result, and the classification rule can be clearly and perfectly represented.

5. The method for predicting the short-term load of the power system based on the big data technology as claimed in claim 1, wherein: the step (5) specifically comprises the following steps:

aiming at the classification result in the step 1, constructing a training sample by using the load data of each class and the corresponding key factor data, and training a plurality of support vector machine models; the kernel function of the support vector machine selects RBF kernel function, and the optimization method of the parameters adopts a grid optimization method, namely an exhaustion method;

according to the classification result of the day to be predicted obtained in step 4, selecting the corresponding support vector machine model to complete the prediction.

6. The method for predicting the short-term load of the power system based on the big data technology as claimed in claim 1, wherein: step (6) is completed on a Hadoop big data computing platform.