CN102982489A

CN102982489A - Power customer online grouping method based on mass measurement data

Info

Publication number: CN102982489A
Application number: CN2012104847126A
Authority: CN
Inventors: 刘涛; 杨劲锋; 阙华坤; 肖勇; 孙卫明; 陈启冠; 王和栋; 张良均
Original assignee: Electric Power Research Institute of Guangdong Power Grid Co Ltd
Current assignee: Electric Power Research Institute of Guangdong Power Grid Co Ltd
Priority date: 2012-11-23
Filing date: 2012-11-23
Publication date: 2013-03-20

Abstract

The invention discloses an online power customer grouping method based on massive metering data. The method comprises: Step 1, extracting historical sample data of power customers; Step 2, preprocessing the extracted sample data; Step 3, performing an initial customer grouping on the historical sample data; Step 4, extracting, in real time from the metering automation system, the information of online power customers and the real-time consumption data that reflects their electricity usage characteristics, and preprocessing that data; and Step 5, taking the preprocessed online customer data and, on the basis of the customer groups already generated, using the generated cluster center points to group the newly added online customer data online in real time. The method enables dynamic grouping of all power customers.

Description

An online power customer grouping method based on massive metering data

Technical field

The present invention relates to a method for grouping power customers in the electric power industry, and specifically to an online power customer grouping method based on massive metering data.

Background technology

Power marketing frequently runs into questions such as the following. How can consumption guidance for peak shifting, fault outages, and scheduled outages, and emergency service during outages, be done well? How can peak shifting be applied to customers in a scientific and flexible way? How can peak-shifting and scheduled-outage information be delivered in time to sensitive customers, high-value customers, and customers with high credit ratings? All of these questions call for an effective method of grouping electricity customers.

Traditional power customer grouping mainly uses static methods, typically working from customers' static attribute data or monthly energy data. The data volume is small, and once the grouping has been computed it seldom changes. In practice, however, customers' consumption behavior changes over time, so static grouping cannot meet the need to provide customers with personalized marketing services. Power supply units urgently need a new method that groups all power customers dynamically.

The main data used by the present invention comes from the metering automation system, which uses acquisition terminals and a communication network to obtain customers' real-time power load data. The volume of metering data is large and is difficult to process with traditional data mining algorithms, while mining a small sample and extrapolating to the full data often leads to wrong conclusions and is laborious and time-consuming to carry out. A cloud computing environment is therefore chosen to give the power system flexibility and portability and to reduce cost.

Summary of the invention

The purpose of the present invention is to provide an online power customer grouping method based on massive metering data that can group all power customers dynamically.

This purpose is achieved by the following technical solution: an online power customer grouping method based on massive metering data, comprising the following steps:

Step 1: extract historical sample data of power customers;

Step 2: preprocess the extracted sample data;

Step 3: perform an initial customer grouping on the historical sample data;

Step 4: extract, in real time from the metering automation system, online power customer information and the real-time consumption data that reflects the customers' electricity usage characteristics, and preprocess it;

Step 5: take the preprocessed online customer data and, on the basis of the customer groups generated in Step 3, use the generated cluster center points to group the newly added online customer data online in real time.

As an improvement, the present invention further comprises Step 6: evaluating the performance of the online real-time grouping obtained in Step 5.

In the present invention, the historical sample data in Step 1 comprises customer profile information, customer energy information, and customer load information;

The data preprocessing in Step 2 comprises missing value handling, outlier handling, computation of the grouping indices, and data normalization;

S2.1: missing value handling

Values are found to be missing in the raw metering data; to keep the modeling data valid, these missing values must be filled in;

S2.2: outlier handling

Data that falls outside the index threshold range is corrected using data from days of the same type combined with an interpolation algorithm;

S2.3: computing the grouping indices

Because load fluctuation largely characterizes a customer's electricity usage, indices that reflect load variation over a given period are computed from the energy and load figures:

Load factor = average load / peak load

Peak-to-total ratio = peak energy / total energy

Flat-to-total ratio = flat energy / total energy

Valley-to-total ratio = valley energy / total energy

Where:

Peak load = max(L_i), i = 1, 2, ..., 96, where L_i is the load value sampled every 15 minutes; the peak, valley, and flat energy are the energy consumed during the city's peak, valley, and flat consumption periods respectively;

The peak period is the time of highest demand, when consumption is relatively concentrated; the valley period is the opposite. The peak period is 8 hours: 9:00 to 12:00 and 17:00 to 22:00; the flat period is 7 hours: 8:00 to 9:00, 12:00 to 17:00, and 22:00 to 23:00; the valley period is 9 hours: 23:00 to 8:00 the next day;

S2.4: data normalization

To eliminate the differences in scale between the grouping indices, the data is normalized; the main methods available are min-max normalization, zero-mean normalization, and decimal scaling;

Step 3 comprises the following substeps:

S3.1: sample data standardization

Data standardization means converting the preprocessed sample data into vector data comprising each large customer's load factor, peak-to-total ratio, flat-to-total ratio, and valley-to-total ratio;

Where d1 is the load factor, d2 is the peak-to-total ratio, d3 is the flat-to-total ratio, and d4 is the valley-to-total ratio;

The vector data is stored in a distributed file system. During standardization, the MapReduce scheduler splits the sample data file into several data blocks according to its size, and Map tasks are started according to the number of data blocks to perform the standardization conversion in parallel;

S3.2: distributed storage

The distributed file system uses a master/slave architecture; an HDFS cluster consists of one Namenode and a number of Datanodes; the Namenode is a central server that manages the file system namespace and client access to files; there is generally one Datanode per node in the cluster, and it manages the storage attached to the node it runs on; HDFS exposes the file system namespace, and users can store data on it in the form of files; internally, a file is split into one or more data blocks, and these blocks are stored on a set of Datanodes; the Namenode executes file system namespace operations such as opening, closing, and renaming files or directories; it also determines the mapping of data blocks to specific Datanodes; the Datanodes serve read and write requests from file system clients and, under the unified scheduling of the Namenode, create, delete, and replicate data blocks;

S3.3: cluster center initialization

Customer grouping mainly uses clustering algorithms that can be computed in a distributed manner; the K-means algorithm is used as the example below;

The clustering algorithm first creates empty, numbered clusters, then randomly selects K objects from the full sample data set as the K-means cluster centers; each cluster center point represents its cluster;

S3.4: iteratively computing the optimal cluster centers

New cluster centers are computed iteratively until the distance between every sample and its center point is minimized;

S3.5: outputting the grouping data

The previous step yields the cluster centers through repeated iteration, along with the cluster center to which each sample belongs, so the results can be output directly;

In Step 4, after the initial grouping of the historical sample data, customer information and real-time consumption data reflecting customers' electricity usage characteristics are extracted periodically from the metering automation system as the application requires, and the real-time consumption data is preprocessed by the method described in Step 2;

Step 5 comprises the following substeps:

On the basis of the customer groups generated in Step 3, the K generated cluster center points are used, and the Canopy algorithm groups the newly added data online in real time. The specific steps are as follows:

S5.1: generate K Canopy clusters from the existing cluster centers; the initial center of each cluster is the corresponding existing cluster center;

S5.2: specify suitable T1 and T2 parameters, and place all new data into the Canopy clusters for clustering;

The Canopy algorithm first requires two input thresholds, T1 and T2, with T1 > T2. The algorithm maintains a set of Canopies, which is initially empty. The first point read is placed in the set as a Canopy. Each subsequent point is then read and its distance to each Canopy in the set is computed: if the distance is less than T1, the point is assigned to that Canopy, and if the distance is less than T2, the point cannot be placed in the set as a new Canopy;

S5.3: following the Canopy algorithm, compute the distance D from each new sample to every center point. When D < T1, the sample is placed in the corresponding cluster; when D < T2, the sample is deleted from the new sample set; if the distances D1 to DK are all greater than T1, the point itself becomes a new center point and forms a new customer cluster. The computation repeats until the new sample set is empty.

Real-time clustering of electricity customers' metering data is an important part of customer behavior analysis, and many related models can be built on it (such as classification and regression, time series forecasting, association analysis, and anomaly discovery). Running in a cloud computing environment, the online clustering model for the massive data of the metering automation system takes the real-time metering data and the importance of the parameters as input, subdivides the metered customer data into multiple categories, and reports each category's usage characteristics, share, and distribution. Based on these outputs, each class of customers can be treated differently.

The present invention builds on cloud computing technology. Using distributed storage, distributed indexing, and distributed parallel computing, it can organize, store, index, and manage massive data effectively, and it provides functions such as querying and analyzing massive data through standardized application or service interfaces.

Description of drawings

Fig. 1 is the system flowchart of the online grouping method of the present invention;

Fig. 2 is the flowchart of k-means clustering based on distributed computing in the online grouping method of the present invention;

Fig. 3 shows the MapReduce parallel implementation of k-means clustering in the online grouping method of the present invention;

Fig. 4 is a diagram of online real-time grouping with Canopy in the online grouping method of the present invention.

Embodiment

As shown in Figs. 1 to 4, an online power customer grouping method based on massive metering data comprises the following steps:

Step 1: extract historical sample data of power customers;

The present invention clusters large customers according to their electricity usage characteristics, so data that reflects those characteristics must be extracted from the metering automation system and the marketing management system. In addition to customer profile data, customer energy data and customer load data are therefore also extracted. Specifically:

Customer profile information: metering point number, electricity usage category, industry category, voltage level, etc.

Customer energy information: total energy consumption, peak energy consumption, flat energy consumption, valley energy consumption, etc.

Customer load information: current, voltage, power factor, active power, load readings at 15-minute intervals, etc.

Step 2: preprocess the extracted sample data;

Step 2 comprises the following substeps:

Data preprocessing mainly comprises missing value handling, outlier handling, computation of the grouping indices, and so on.

S2.1: missing value handling

Values are found to be missing in the raw metering data, particularly during extraction of the real-time load data. To keep the modeling data valid, these missing values must be filled in. The rule is mainly to fill them from the data of days of the same type, combined with an interpolation algorithm.

S2.2: outlier handling

Data that falls outside the index threshold range is corrected using data from days of the same type combined with an interpolation algorithm.
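The filling and correction rule of S2.1 and S2.2 can be sketched as follows. This is only an illustration under stated assumptions, not the patented implementation: the function name `repair_curve`, the preference for the same-slot value from a same-type day, and the fallback to linear interpolation between the nearest usable neighbors are all assumptions, because the text says only that same-type-day data is combined with an interpolation algorithm.

```python
from typing import List, Optional


def repair_curve(curve: List[Optional[float]],
                 same_type_day: List[Optional[float]],
                 low: float, high: float) -> List[float]:
    """Fill missing (None) or out-of-threshold samples in one load curve."""

    def usable(v: Optional[float]) -> bool:
        return v is not None and low <= v <= high

    # Discard missing values and values outside the threshold range.
    repaired = [v if usable(v) else None for v in curve]

    # Assumed first choice: the same time slot on a same-type day.
    for i, v in enumerate(repaired):
        if v is None and usable(same_type_day[i]):
            repaired[i] = same_type_day[i]

    # Assumed fallback: linear interpolation between nearest known samples.
    known = [i for i, v in enumerate(repaired) if v is not None]
    if not known:
        raise ValueError("no usable samples to interpolate from")
    for i in range(len(repaired)):
        if repaired[i] is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            repaired[i] = repaired[right]      # extend the first known value
        elif right is None:
            repaired[i] = repaired[left]       # extend the last known value
        else:
            w = (i - left) / (right - left)
            repaired[i] = repaired[left] + w * (repaired[right] - repaired[left])
    return repaired
```

For example, `repair_curve([1.0, None, 3.0, 999.0, 5.0], [None, None, None, 4.2, None], 0.0, 100.0)` interpolates the missing second sample and replaces the out-of-range fourth sample with the same-type-day value.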

S2.3: computing the grouping indices

Because load fluctuation largely characterizes a customer's electricity usage, indices that reflect load variation over a given period (for example, the daily average over the previous month) are computed from the energy and load figures:

Load factor = average load / peak load

Peak-to-total ratio = peak energy / total energy

Flat-to-total ratio = flat energy / total energy

Valley-to-total ratio = valley energy / total energy

Where:

Peak load = max(L_i), i = 1, 2, ..., 96, where L_i is the load value sampled every 15 minutes.

The peak, valley, and flat energy are the energy consumed during the city's peak, valley, and flat consumption periods respectively.

The peak period is the time of highest demand, when consumption is relatively concentrated; the valley period is the opposite. The peak period is 8 hours: 9:00 to 12:00 and 17:00 to 22:00; the flat period is 7 hours: 8:00 to 9:00, 12:00 to 17:00, and 22:00 to 23:00; the valley period is 9 hours: 23:00 to 8:00 the next day.
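The four indices can be computed from one day's 96 load samples roughly as shown below. This is a sketch under assumptions that the text does not state: the name `grouping_indices` is hypothetical, sample i is taken to cover the 15 minutes starting i x 15 minutes after midnight, and the energy of each interval is approximated as load x 0.25 h (in practice the peak, flat, and valley energy may come directly from the meter).

```python
from typing import Dict, List

# Tariff periods from the text, expressed as hours of the day.
PEAK_HOURS = set(range(9, 12)) | set(range(17, 22))   # 9:00-12:00, 17:00-22:00
FLAT_HOURS = {8, 22} | set(range(12, 17))             # 8:00-9:00, 12:00-17:00, 22:00-23:00
# All remaining hours (23:00 to 8:00 the next day) are valley hours.


def grouping_indices(loads: List[float]) -> Dict[str, float]:
    """Compute load factor and peak/flat/valley-to-total ratios for one day."""
    if len(loads) != 96:
        raise ValueError("expected 96 samples, one per 15 minutes")
    peak_load = max(loads)
    energy = {"peak": 0.0, "flat": 0.0, "valley": 0.0}
    for i, load in enumerate(loads):
        hour = i // 4                      # assumption: sample 0 starts at 0:00
        if hour in PEAK_HOURS:
            period = "peak"
        elif hour in FLAT_HOURS:
            period = "flat"
        else:
            period = "valley"
        energy[period] += load * 0.25      # assumption: energy = load x 0.25 h
    total = sum(energy.values())
    if peak_load <= 0 or total <= 0:
        raise ValueError("load curve must contain positive load")
    return {
        "load_factor": (sum(loads) / len(loads)) / peak_load,
        "peak_ratio": energy["peak"] / total,
        "flat_ratio": energy["flat"] / total,
        "valley_ratio": energy["valley"] / total,
    }
```

With a perfectly flat load curve the load factor is 1 and the three ratios reduce to the share of hours in each period: 8/24, 7/24, and 9/24.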

S2.4: data normalization

To eliminate the differences in scale between the grouping indices, the data is normalized so that every index falls within a unified range. The main methods available are min-max normalization, zero-mean normalization, and decimal scaling. Taking min-max normalization as the example below, every value is normalized into the range [0, 1].
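Min-max normalization maps each value x of an index to (x - min) / (max - min), where min and max are taken over all customers for that index. A minimal sketch follows; the function name `min_max_normalize` is hypothetical, and mapping a constant column to 0.0 is an assumption made here to avoid division by zero.

```python
from typing import List


def min_max_normalize(values: List[float]) -> List[float]:
    """Scale one index column into [0, 1] with the min-max method."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column carries no information, so map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

For example, the load factors 2.0, 4.0, and 6.0 become 0.0, 0.5, and 1.0. Each of the four indices would be normalized separately in the same way.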

Step 3 comprises the following substeps:

S3.1: sample data standardization

Data standardization means converting the preprocessed sample data into vector data comprising each large customer's load factor, peak-to-total ratio, flat-to-total ratio, and valley-to-total ratio.

Where d1 is the load factor, d2 is the peak-to-total ratio, d3 is the flat-to-total ratio, and d4 is the valley-to-total ratio.

The vector data is stored in a distributed file system. During standardization, the MapReduce scheduler splits the sample data file into several data blocks according to its size, and Map tasks are started according to the number of data blocks to perform the standardization conversion in parallel; see Fig. 2.

S3.2: distributed storage

The distributed file system uses a master/slave architecture. An HDFS cluster consists of one Namenode and a number of Datanodes. The Namenode is a central server that manages the file system namespace and client access to files. There is generally one Datanode per node in the cluster, and it manages the storage attached to the node it runs on. HDFS exposes the file system namespace, and users can store data on it in the form of files. Internally, a file is split into one or more data blocks, and these blocks are stored on a set of Datanodes. The Namenode executes file system namespace operations such as opening, closing, and renaming files or directories. It also determines the mapping of data blocks to specific Datanodes. The Datanodes serve read and write requests from file system clients and, under the unified scheduling of the Namenode, create, delete, and replicate data blocks.

S3.3: cluster center initialization

Customer grouping mainly uses clustering algorithms that can be computed in a distributed manner; the K-means algorithm is used as the example below.

The clustering algorithm first creates empty, numbered clusters, then randomly selects K objects from the full sample data set as the K-means cluster centers; each cluster center point represents its cluster.

S3.4: iteratively computing the optimal cluster centers

New cluster centers are computed iteratively until the distance between every sample and its center point is minimized.

This is further divided into the following steps:

First step: assign each sample to a cluster center. The distance from each sample to every center point is computed, and the sample is assigned to the nearest cluster center. This mainly uses the Map parallel method, which splits the sample data directly and computes each split independently in parallel. Because computing one sample requires no other samples, the work can run in parallel.

The input to the Map method is all sample data to be clustered plus the cluster centers from the previous iteration (or the initial clustering). Input records are <key, value> pairs of the form <line number, record line>. Each Map function reads the cluster center description file, finds the nearest class center for each input sample, and labels the sample with its new class. The intermediate output is <key, value> pairs of the form <cluster class ID, record attribute vector>.
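The Map step described above can be sketched in plain Python, without a Hadoop cluster. The names `kmeans_map`, `records`, and `centers` are hypothetical, and Euclidean distance is an assumption, since the text does not name the distance measure.

```python
import math
from typing import Dict, Iterable, Iterator, List, Tuple

Vector = List[float]


def kmeans_map(records: Iterable[Tuple[int, Vector]],
               centers: Dict[int, Vector]) -> Iterator[Tuple[int, Vector]]:
    """Emit <cluster class ID, record attribute vector> for each input record.

    `records` holds <line number, vector> pairs; `centers` plays the role of
    the cluster center description file read by every Map function.
    """
    for _line_number, vector in records:
        # Assumption: Euclidean distance decides which center is nearest.
        nearest = min(centers, key=lambda cid: math.dist(vector, centers[cid]))
        yield nearest, vector
```

Because each record is handled independently of all others, any split of `records` can be processed by a separate Map task.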

Second step: recompute the center point of each cluster. For each cluster above, its center position is computed and used as the new cluster center. This mainly uses the Reduce parallel method. The Reduce method gathers the sample data distributed across the machines according to the cluster each sample belongs to; that is, the samples of one cluster are sent to the same machine, which computes that cluster's new center point.

The center point is computed as:

\vec{C}_{new} = \frac{1}{n} \sum_{i=1}^{n} \vec{p}_i

In the formula, each component of the new center point \vec{C}_{new} is the arithmetic mean of that component over all sample data \vec{p}_i in the cluster, where n is the number of samples in the cluster.

The Reduce function computes the new cluster centers from the intermediate results of the Map function, for use by the next MapReduce job. Its input is <key, value> pairs of the form <cluster class ID, {set of record attribute vectors}>. All records with the same key (that is, the same class ID) go to one Reduce task, which accumulates the number of points sharing that key and the sum of each component across their records, then takes the mean of each component to produce the new cluster center description file. The output is <key, value> pairs of the form <cluster class ID, mean vector>.
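The Reduce step can be sketched in the same spirit. The function name `kmeans_reduce` is hypothetical, and the grouping by key that a MapReduce framework performs during its shuffle phase is done here with ordinary dictionaries.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Vector = List[float]


def kmeans_reduce(pairs: Iterable[Tuple[int, Vector]]) -> Dict[int, Vector]:
    """Turn <cluster class ID, vector> pairs into <cluster class ID, mean vector>."""
    sums: Dict[int, Vector] = {}
    counts: Dict[int, int] = defaultdict(int)
    for cluster_id, vector in pairs:
        if cluster_id not in sums:
            sums[cluster_id] = [0.0] * len(vector)
        for d, x in enumerate(vector):
            sums[cluster_id][d] += x       # accumulate each component
        counts[cluster_id] += 1            # accumulate the point count
    # The mean of each component gives the new center of each cluster.
    return {cid: [s / counts[cid] for s in total] for cid, total in sums.items()}
```

The returned dictionary corresponds to the new cluster center description file that the next round of Map tasks would read.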

Third step: compare the new and old cluster center positions to determine whether the clustering has converged. If it has, continue to the next step; otherwise, repeat S3.3. This process also uses the Reduce method: the new center positions and the old center positions are sent to the same Reduce task for the computation.

To judge convergence, compute the distance between the cluster centers obtained in the round just completed and the cluster centers from before that round began. If the distance is less than a given threshold, the algorithm is considered to have converged and ends. Otherwise, the previous centers are replaced with this round's centers and a new round of computation is started.
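A convergence test of this kind might look like the sketch below. The name `has_converged` is hypothetical, and two details are assumptions because the text speaks only of "the distance": Euclidean distance is used, and every center must move less than the threshold.

```python
import math
from typing import Dict, List

Vector = List[float]


def has_converged(old: Dict[int, Vector], new: Dict[int, Vector],
                  threshold: float) -> bool:
    """Return True when no cluster center moved by `threshold` or more."""
    # Assumption: a cluster missing from `new` (no samples assigned) keeps
    # its old center, so it counts as not having moved.
    return all(math.dist(center, new.get(cid, center)) < threshold
               for cid, center in old.items())
```

If the function returns False, the new centers replace the old ones and another Map and Reduce round is started.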

Fig. 3 is a schematic diagram of how the MapReduce implementation of the k-means clustering algorithm processes the data. Before the Reduce tasks begin, the intermediate results held locally on the nodes that ran the Map tasks can be grouped and sorted by key, which improves the efficiency of the Reduce tasks.

S3.5: outputting the grouping data

The previous step yields the cluster centers through repeated iteration, along with the cluster center to which each sample belongs, so the results can be output directly.

After the initial grouping of the historical sample data, customer information and real-time consumption data reflecting customers' electricity usage characteristics can be extracted periodically from the metering automation system as the application requires, for example to delete customers who have left, add new customers, and update customers' data. This data should be preprocessed by the method described in Step 2.

Step 5 comprises the following substeps:

On the basis of the customer groups generated in Step 3, the K generated cluster center points are used, and the Canopy algorithm groups the newly added data online in real time, as shown in Fig. 4.

The Canopy algorithm first requires two input thresholds, T1 and T2, with T1 > T2. The algorithm maintains a set of Canopies, which is initially empty. The first point read is placed in the set as a Canopy. Each subsequent point is then read and its distance to each Canopy in the set is computed: if the distance is less than T1, the point is assigned to that Canopy (a point may be assigned to more than one Canopy), and if the distance is less than T2, the point cannot be placed in the set as a new Canopy.
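The basic Canopy procedure just described can be sketched as follows. The function name `canopy` and the use of Euclidean distance are assumptions. The sketch also reads the rule literally: a point whose distance to a Canopy lies between T2 and T1 joins that Canopy and may still start a Canopy of its own.

```python
import math
from typing import List, Tuple

Vector = List[float]


def canopy(points: List[Vector], t1: float,
           t2: float) -> List[Tuple[Vector, List[Vector]]]:
    """Return a list of (canopy center, member points) pairs."""
    if not t1 > t2:
        raise ValueError("T1 must be greater than T2")
    canopies: List[Tuple[Vector, List[Vector]]] = []   # initially empty set
    for point in points:
        may_start_canopy = True
        for center, members in canopies:
            d = math.dist(point, center)   # assumption: Euclidean distance
            if d < t1:
                members.append(point)      # a point may join several canopies
            if d < t2:
                may_start_canopy = False   # too close to an existing canopy
        if may_start_canopy:
            canopies.append((point, [point]))
    return canopies
```

With one-dimensional points 0.0, 0.5, 2.0, and 10.0 and thresholds T1 = 3 and T2 = 1, the canopy centers are 0.0, 2.0, and 10.0; the point 2.0 belongs to the first canopy and also starts the second.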

S5.3: following the Canopy algorithm, compute the distance D from each new sample to every center point. When D < T1, the sample is placed in the corresponding cluster; when D < T2, the sample is deleted from the new sample set; if the distances D1 to DK are all greater than T1, the point itself becomes a new center point and forms a new cluster (customer group). The computation repeats until the new sample set is empty;

S5.4: the new center points are added to the original K-means center points and serve as the center points for the next round of online clustering;

S5.5: after a long period, the center points computed in this way become inaccurate, so Step 5 must be applied again to recompute all of the data in full; using the existing center points as the initial cluster centers speeds up convergence.
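The online assignment of S5.3 and S5.4 can be sketched as below. This is a simplified reading, not the patented procedure: the name `assign_online` is hypothetical, each new sample is processed once and then dropped from the new sample set, a sample within T1 of several centers goes only to the nearest one, and T2 is omitted because a single pass never revisits a sample.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def assign_online(new_samples: List[Vector], centers: List[Vector],
                  t1: float) -> Tuple[List[Vector], Dict[int, List[Vector]]]:
    """Assign new samples to existing centers, creating centers when needed."""
    if not centers:
        raise ValueError("at least one existing cluster center is required")
    centers = [list(c) for c in centers]           # do not mutate the input
    clusters: Dict[int, List[Vector]] = {i: [] for i in range(len(centers))}
    for sample in new_samples:
        dists = [math.dist(sample, c) for c in centers]
        nearest = min(range(len(centers)), key=dists.__getitem__)
        if dists[nearest] < t1:
            clusters[nearest].append(sample)       # D < T1: join that cluster
        else:
            # No center within T1 (D equal to T1 is treated the same way,
            # an assumption): the sample becomes a new center (S5.4).
            centers.append(list(sample))
            clusters[len(centers) - 1] = [sample]
    return centers, clusters
```

The returned center list, which contains the original K centers followed by any new ones, would serve as the starting point for the next round of online clustering.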

Step 6: evaluate the performance of the online real-time grouping obtained in Step 5.

The study used 2 master nodes, 5 compute and storage nodes, and 1 data acquisition node. The data set contained 74,920,323 records occupying 2.5 GB of disk space. K-means clustering ran for 10 iterations and took about 150 minutes; the online clustering took about 9.39 minutes. The online grouping method proposed here thus ensures that incremental Canopy clustering runs quickly, so newly collected consumption data reflecting customers' electricity usage characteristics can be assigned promptly to the clusters it belongs to. This is the key point of the online clustering proposed by the present invention.

The difficulty in implementing power customer grouping lies in the online processing of massive data, because the data to be clustered is very large. Although substantial computing resources could be used to recluster all the data continuously, improving clustering accuracy while keeping a degree of real-time response, this would clearly be very wasteful. The scheme adopted here, which combines online clustering with K-means, is an inexpensive and fast solution. The algorithm effectively avoids the problems caused by excessive data volume: high hardware and software requirements and high system resource usage.

Claims

1. An online power customer grouping method based on massive metering data, comprising the following steps:

Step 1: extract historical sample data of power customers;

Step 2: preprocess the extracted sample data;

2. The online power customer grouping method based on massive metering data according to claim 1, characterized in that the method further comprises Step 6: evaluating the performance of the online real-time grouping obtained in Step 5.

3. The online power customer grouping method based on massive metering data according to claim 1 or 2, characterized in that the historical sample data of power customers in Step 1 comprises customer profile information, customer energy information, and customer load information;

S2.1: missing value handling

S2.2: outlier handling

S2.3: computing the grouping indices

Load factor = average load / peak load

Peak-to-total ratio = peak energy / total energy

Flat-to-total ratio = flat energy / total energy

Valley-to-total ratio = valley energy / total energy

Where:

S2.4: data normalization

Step 3 comprises the following substeps:

S3.1: sample data standardization

S3.2: distributed storage

S3.3: cluster center initialization

S3.4: iteratively computing the optimal cluster centers

S3.5: outputting the grouping data

Step 5 comprises the following substeps: