CN105469219A

CN105469219A - Method for processing power load data based on decision tree

Info

Publication number: CN105469219A
Application number: CN201511021630.8A
Authority: CN
Inventors: 沈培锋; 余昆; 宁艺飞; 陈星莺; 嵇文路; 周冬旭; 王春宁; 罗兴
Original assignee: State Grid Corp of China SGCC; State Grid Jiangsu Electric Power Co Ltd; Hohai University HHU; Nanjing Power Supply Co of Jiangsu Electric Power Co
Current assignee: State Grid Corp of China SGCC; State Grid Jiangsu Electric Power Co Ltd; Hohai University HHU; Nanjing Power Supply Co of Jiangsu Electric Power Co
Priority date: 2015-12-31
Filing date: 2015-12-31
Publication date: 2016-04-06

Abstract

The invention provides a method for processing power load data based on a decision tree. Missing attribute values are completed using the sample similarity principle, and the completed data are applied to power load forecasting, which improves both the accuracy of the historical load data and the precision of the load forecasting results. The feasibility and accuracy of the method are verified through a simulation example, and the method has practical value.

Description

A method for processing power load data based on a decision tree

Technical field

The present invention proposes a method for processing power load data based on a decision tree, and belongs to the field of power grid load forecasting.

Background art

Power load forecasting is a very important task in a power dispatching system. Load forecasts are made from historical load data and various other related influencing factors, so forecasting accuracy depends to a large extent on the accuracy of the historical data.

Existing power load forecasting methods use data mining techniques. Data mining assumes that all attribute values are known and determined. In many cases, however, especially in large enterprises that collect hundreds of millions of data records every day, some attribute values of some samples are missing. This happens because the attribute is not relevant to the sample, because the value was not recorded when the sample was collected, or because of human error when the data were entered into the database. If samples with missing values are simply removed from the information system, resources are wasted and valuable information hidden in those samples may be lost, obscuring the rules that data mining seeks. On the other hand, handling missing attribute values incorrectly introduces new noise, causes data mining to produce wrong results, and affects the analysis. Real-world data are often incomplete, inconsistent, or noisy; data preprocessing improves data quality and thereby the effectiveness and accuracy of the data mining process. High-quality decisions come from high-quality data. How to handle missing data correctly is therefore a very important problem in data mining preprocessing, a key step in the whole process of data mining and knowledge discovery, and crucial to the final analysis results.

The "divide and conquer" approach of decision trees was developed and refined by J. R. Quinlan of the University of Sydney, Australia. In 1986 he published a paper in the journal Machine Learning describing the ID3 algorithm, which is based on information entropy theory and was the earliest and most influential decision tree algorithm of its time. ID3 uses information gain as the criterion for selecting the test attribute. However, the information gain measure is biased toward attributes with many values, and an attribute with more values is not necessarily the best one, so the algorithm has a certain bias. In addition, ID3 can only handle discrete-valued attributes and does not consider missing values in the training set, so it needed further improvement. The C4.5 algorithm improves on ID3: it can handle continuous-valued attributes as well as discrete ones. C4.5 uses the information gain ratio as the criterion for selecting the test attribute, computed as follows:

Let S be a set of s data samples, and suppose the class attribute can take n different values, corresponding to n different classes C_i, i ∈ {1, 2, 3, ..., n}. Let s_i be the number of samples in class C_i. The amount of information needed to classify a given data object is then:

I(s_1, s_2, \ldots, s_n) = -\sum_{i=1}^{n} p_i \log_2 p_i \qquad (1)

In the formula, p_i is the probability that any data object belongs to class C_i and can be calculated as s_i / s; I(s_1, s_2, ..., s_n) is the information content of the sample set, i.e. the expected information of the sample attribute.
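Formula (1) can be sketched in a few lines of Python. This is a minimal illustration rather than the patent's own implementation (the source uses MATLAB); the function name `information` and the sample labels are hypothetical.

```python
import math
from collections import Counter


def information(labels):
    """Expected information I(s_1, ..., s_n) of a list of class labels, formula (1)."""
    total = len(labels)
    counts = Counter(labels)  # s_i for every class that actually occurs
    # p_i = s_i / s; classes with no samples contribute nothing to the sum
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical labels: two samples in class 1 and two in class 2
print(information([1, 1, 2, 2]))
```

A set split evenly between two classes carries one bit of information, while a set whose samples all share one class carries none.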

Suppose attribute A has m different values a_1, a_2, ..., a_m. Attribute A then partitions S into m subsets S_1, S_2, ..., S_m, where S_j contains the samples of S for which A takes the value a_j. If A is selected as the test attribute, let s_ij be the number of samples in subset S_j that belong to class C_i. The information entropy of the partition by A is then:

E(A) = \sum_{j=1}^{m} p_j \, I(s_{1j}, s_{2j}, \ldots, s_{nj}) \qquad (2)

In the formula, E(A) is the information entropy of the subsets and p_j is the weight of the j-th subset, i.e. the number of samples for which attribute A takes the value a_j divided by the total number of samples in S. For a given subset S_j, the information value is:

I(s_{1j}, s_{2j}, \ldots, s_{nj}) = -\sum_{i=1}^{n} p_{ij} \log_2 p_{ij} \qquad (3)

In the formula, p_ij = s_ij / |S_j| is the probability that any data sample in subset S_j belongs to class C_i. The information gain Gain(A) obtained by using attribute A to split the sample set at the current branch node is:

\mathrm{Gain}(A) = I(s_1, s_2, \ldots, s_n) - E(A) \qquad (4)

The information gain ratio is computed as:

\mathrm{GainRatio}(A) = \frac{\mathrm{Gain}(A)}{I(A)} \qquad (5)

As can be seen, the information gain ratio used by C4.5 represents the proportion of useful information produced by a branch: the larger the value, the more useful information the branch contains. Although C4.5 improves on ID3, its method for completing missing attribute values is still not adequate.
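Formulas (1) to (5) can be combined into one short Python sketch. The text does not define I(A); the sketch assumes it is the split information used in the standard C4.5 algorithm, that is, formula (1) applied to the distribution of the values of A. The function names and the sample data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def information(labels):
    """Formula (1) / (3): expected information of a list of labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def gain_ratio(values, labels):
    """Gain ratio of attribute A; values[k] is A's value and labels[k] the class of sample k."""
    total = len(labels)
    subsets = defaultdict(list)  # S_j: class labels of the samples with A = a_j
    for value, label in zip(values, labels):
        subsets[value].append(label)
    # Formula (2): entropy of the partition, weighted by p_j = |S_j| / |S|
    e_a = sum(len(sub) / total * information(sub) for sub in subsets.values())
    gain = information(labels) - e_a   # formula (4)
    split_info = information(values)   # ASSUMPTION: I(A) is the C4.5 split information
    return gain / split_info if split_info > 0 else 0.0  # formula (5)


# Hypothetical data: the attribute separates the two classes perfectly
print(gain_ratio(["weekday", "weekday", "weekend", "weekend"], [2, 2, 1, 1]))
```

An attribute that separates the classes perfectly, as in the example, yields a gain ratio of 1, while an attribute that is independent of the class yields 0.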

Summary of the invention

Object of the invention: the present invention proposes a method for processing power load data based on a decision tree that improves the accuracy of historical load data.

Technical solution: the present invention proposes a method for processing power load data based on a decision tree, comprising the following steps:

1) group the samples in training set T whose value for a given attribute is determined into a determined-value sample set;

2) calculate the similarity between each missing-value sample in training set T and the determined-value samples;

3) fill in the missing attribute value of the missing-value sample with the attribute value of the determined-value sample that has the greatest similarity to it.

Preferably, the similarity is:

D(s'_i, s_j) = \frac{|A_{ij}|}{|A|} + \delta_{ij}

\delta_{ij} = \begin{cases} 1, & d(s_i) = d(s_j) \\ 0, & d(s_i) \neq d(s_j) \end{cases}

In the formula, s_j is the j-th sample in the determined-value sample set, s'_i is the i-th sample in the missing-value sample set, and D(s'_i, s_j) is the similarity between s_j and s'_i; A denotes the set of all attributes in the training set; A_ij = {a ∈ A | a_i = a_j} denotes the set of attributes on which s_i and s_j have identical, determined values; |A| and |A_ij| denote the number of elements in the respective sets; and δ_ij is a weight coefficient.

Beneficial effects: the present invention completes missing attribute values using the sample similarity principle and applies the result to power load forecasting. This improves both the accuracy of the historical load data and the precision of the load forecasting results. A simulation example verifies the feasibility and accuracy of the method, which has practical value.

Embodiment

The present invention is further illustrated below with reference to a specific embodiment. It should be understood that the embodiment is intended only to illustrate the invention and not to limit its scope; after reading the present disclosure, modifications by those skilled in the art to various equivalent forms of the invention all fall within the scope defined by the appended claims.

The present invention uses the sample similarity principle to complete missing attribute values. Missing data are corrected according to the similarity between samples with known values and samples with missing values, which improves the accuracy of the raw data and thus the precision of power load forecasting.

Let A be an attribute of training set T with values a_1, a_2, ..., a_m. Define s as a determined-value sample and s' as a missing-value sample. From T, define the subset T': T' = {s ∈ T | a_x ≠ unknown (x = 1, 2, ..., m)}, i.e. T' is the set of all samples whose value for attribute a_x is determined. The similarity between a missing-value sample s' in training set T and a determined-value sample s in subset T' is then:

D(s'_i, s_j) = \frac{|A_{ij}|}{|A|} + \delta_{ij}

\delta_{ij} = \begin{cases} 1, & d(s_i) = d(s_j) \\ 0, & d(s_i) \neq d(s_j) \end{cases} \qquad (6)

In the formula, D(s'_i, s_j) is the similarity between s'_i and sample s_j; A denotes the set of all attributes in the training set; A_ij = {a ∈ A | a_i = a_j} denotes the set of attributes on which s_i and s_j have identical, determined values; |A| and |A_ij| denote the number of elements in the respective sets; and δ_ij is a weight coefficient.
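As a small worked illustration of formula (6), with hypothetical numbers that do not come from Table 1: suppose the training set has |A| = 4 attributes and the two samples agree on 2 determined attribute values. Depending on whether the condition d(s_i) = d(s_j) holds, the similarity is:

```latex
D(s'_i, s_j) = \frac{2}{4} + 1 = 1.5 \quad \text{if } d(s_i) = d(s_j),
\qquad
D(s'_i, s_j) = \frac{2}{4} + 0 = 0.5 \quad \text{if } d(s_i) \neq d(s_j)
```

Because |A_ij| / |A| never exceeds 1, a sample that satisfies the δ condition always scores at least as high as one that does not, whatever the number of matching attributes.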

The attribute value of the sample s_j in subset T' that has the greatest similarity to s'_i is used as the attribute value of s'_i, which completes the missing value; at the same time, s'_i is deleted from the other nodes of the decision tree. This is repeated until all missing values have been completed.
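The completion procedure can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patent's implementation: samples are represented as dictionaries, a missing value is marked with `None`, d(s) is taken to be the value of the decision (target) attribute, and all names and data are hypothetical. The removal of s'_i from other decision tree nodes is not modeled.

```python
MISSING = None  # ASSUMPTION: marker for a missing value ("?" in Table 1)


def similarity(missing_sample, known_sample, attributes, decision):
    """Formula (6): share of matching determined attributes plus the weight delta."""
    matching = sum(
        1 for a in attributes
        if missing_sample[a] is not MISSING and missing_sample[a] == known_sample[a]
    )
    # ASSUMPTION: d(s) is the value of the decision (target) attribute
    delta = 1 if missing_sample[decision] == known_sample[decision] else 0
    return matching / len(attributes) + delta


def fill_missing(samples, attribute, attributes, decision):
    """Return copies of the samples with missing values of `attribute` completed."""
    known = [s for s in samples if s[attribute] is not MISSING]  # subset T'
    filled = []
    for sample in samples:
        sample = dict(sample)  # copy, so the input is left unchanged
        if sample[attribute] is MISSING and known:
            best = max(known, key=lambda k: similarity(sample, k, attributes, decision))
            sample[attribute] = best[attribute]
        filled.append(sample)
    return filled


# Hypothetical records, not taken from Table 1
samples = [
    {"temperature": 8, "humidity": 60, "day_type": 5, "load_class": 2},
    {"temperature": 8, "humidity": 75, "day_type": 6, "load_class": 1},
    {"temperature": 8, "humidity": MISSING, "day_type": 5, "load_class": 2},
]
conditions = ["temperature", "humidity", "day_type"]
filled = fill_missing(samples, "humidity", conditions, "load_class")
print(filled[2]["humidity"])
```

In this example the third record matches the first on two of three condition attributes and on the load class, so it takes its humidity value from the first record.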

The completion principle described above is suitable only when few values are missing. When a database contains little data and many missing values, the method may bias the analysis results. However, if a database holding a large amount of data has many missing attribute values, the data have lost their value for study, and in practice this situation generally does not arise in data acquisition.

Finally, an example is given. Table 1 shows the historical load data of Jiangsu Province from March 1 to March 14, 2013. Missing attribute values are completed using the sample similarity principle described above, a decision tree is then built with the C4.5 algorithm, and the future power load is forecast. In Table 1, "?" indicates that the data item is missing.

Table 1: historical load data

First, the target attribute and the condition attributes are determined. Since the table contains only temperature, relative humidity, day type, and load data, the temperature, relative humidity, and day type can, based on experience, be taken as condition attributes and the load as the target attribute.

Because the day-type attribute is not continuous data, the decision tree algorithm cannot recognize its values directly; they must be converted into values the decision tree can recognize before they can be used. Here the numbers 1, 2, 3, 4, 5, 6, 7 replace Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, and Sunday respectively, which converts the attribute into a form the decision tree can recognize.
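If the dates of the records are available, this encoding can be obtained directly from Python's standard library, since `date.isoweekday()` uses the same convention (Monday is 1, Sunday is 7). The helper name `day_type_code` is hypothetical, and the sketch is only one possible way to perform the conversion.

```python
from datetime import date


def day_type_code(day):
    """Encode a calendar date as 1 (Monday) through 7 (Sunday)."""
    return day.isoweekday()


# First and last day of the historical period in the example
print(day_type_code(date(2013, 3, 1)), day_type_code(date(2013, 3, 14)))
```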

Second, the table shows that temperature, humidity, and load are continuous attributes. Temperature and relative humidity can be used in the algorithm directly, because C4.5 can handle continuous attribute values. The load, however, is the target attribute, which the algorithm cannot process directly, so the load data must be discretized. The present invention divides the load evenly into four classes. All load data in the example lie in the interval [42833, 54542], so the interval is divided into four parts, i.e. four types: [42833, 45760], [45760, 48687], [48687, 51614], [51614, 54542], which the invention replaces with 1, 2, 3, 4 respectively.
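The discretization can be sketched with the standard `bisect` module, using the three interior boundaries of the four sub-intervals listed above. The text leaves open which class a value lying exactly on a shared boundary belongs to; the sketch assumes it goes to the upper class. The function name `load_class` is hypothetical.

```python
from bisect import bisect_right

LOW, HIGH = 42833, 54542
BOUNDARIES = [45760, 48687, 51614]  # interior boundaries of the four sub-intervals


def load_class(load):
    """Map a load value to class 1-4 according to the four sub-intervals."""
    if not LOW <= load <= HIGH:
        raise ValueError(f"load {load} lies outside [{LOW}, {HIGH}]")
    # ASSUMPTION: a value exactly on a shared boundary is assigned to the upper class
    return bisect_right(BOUNDARIES, load) + 1


print([load_class(x) for x in (42833, 45760, 50000, 54542)])
```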

Finally, following the above formulas and method, the C4.5 decision tree algorithm is programmed and simulated in MATLAB. The processed data are fed into the program, a decision tree is obtained from the results, and rules are derived from the decision tree. These rules can then be used to forecast the province's load from March 15 to March 28, 2013.

Claims

1. A method for processing power load data based on a decision tree, characterized in that it comprises the following steps:

2. The method for processing power load data based on a decision tree according to claim 1, characterized in that the similarity is:

D(s'_i, s_j) = \frac{|A_{ij}|}{|A|} + \delta_{ij}

\delta_{ij} = \begin{cases} 1, & d(s_i) = d(s_j) \\ 0, & d(s_i) \neq d(s_j) \end{cases}