CN104850629A

CN104850629A - Analysis method of massive intelligent electricity-consumption data based on improved k-means algorithm

Info

Publication number: CN104850629A
Application number: CN201510263237.3A
Authority: CN
Inventors: 周天和; 卢晓飞; 张元元; 蔡荣
Original assignee: HANGZHOU TIANKUAN TECHNOLOGY Co Ltd
Current assignee: HANGZHOU TIANKUAN TECHNOLOGY Co Ltd
Priority date: 2015-05-21
Filing date: 2015-05-21
Publication date: 2015-08-19

Abstract

The present invention relates to a method for analyzing massive intelligent electricity-consumption data based on an improved k-means algorithm. The method comprises: first, establishing a Map-Reduce parallel processing model for the electricity-consumption information of residential consumers, such as the number of residential consumers, house area, number of family members, daily electricity consumption, peak electricity consumption, and number of household electrical appliances; improving the k-means algorithm by jointly considering two factors, the selection of the initial cluster centers and the selection of the number of clusters, wherein the density of data objects serves as the selection criterion for the initial cluster centers, and the distance between clusters and the degree of dispersion of objects within clusters serve as the main references for selecting the number of clusters; optimizing the initial cluster centers under the Map-Reduce parallel processing model to locate the cluster centers accurately; and performing parallel mining on the data belonging to each cluster to complete the analysis of the electricity-consumption data. Experiments on Hadoop clusters show that the method is stable, efficient and feasible, and has a high speed-up ratio.

Description

A method for analyzing massive intelligent electricity-consumption data based on an improved k-means algorithm

Technical field

The present invention relates to the field of smart grid technology, and in particular to a method for analyzing massive intelligent electricity-consumption data based on an improved k-means algorithm.

Background art

In recent years, faced with strongly growing electricity demand, an increasingly tight power supply, and ever-rising energy-saving and emission-reduction targets, the smart grid, whose basic technical features are informatization, automation and interaction, has become a worldwide research focus. In China, the coverage of power utilities' electricity-consumption information collection has gradually expanded from important dedicated-line and dedicated-transformer users only to many categories of users, including all kinds of dedicated-line and dedicated-transformer users, general industrial and commercial users, and low-voltage residential users. The number of connected acquisition terminals and meters has grown accordingly, and the volume of electricity-consumption data to be collected and processed each day is growing exponentially. The data are correlated, and users' electricity-consumption habits are hidden in them. Exploring effective data mining algorithms that segment users and, for each category of user, quickly and accurately extract valuable information such as consumption behavior and consumption volume, so as to overcome the passivity and lag in work such as guiding users' energy-saving retrofits and to support intelligent business analysis and decision-making, has become a problem that the intelligent electricity-consumption field must solve. Cluster analysis, a widely used data mining technique, can obtain the global distribution characteristics of data with relatively high processing efficiency and is gradually being applied in this field. However, when faced with large-scale massive intelligent electricity-consumption data, traditional clustering methods suffer from heavy computation and low processing efficiency, and cannot meet the demand for efficient mining of such data.

Summary of the invention

The present invention overcomes the above shortcomings. Its object is to provide a method for analyzing massive intelligent electricity-consumption data based on an improved k-means algorithm. The method establishes a Map-Reduce parallel processing model for residential users; takes object density, inter-cluster distance, and the degree of dispersion of objects within clusters as the main references to reasonably select the number of clusters k and the initial cluster centers for the electricity-consumption data to be mined in parallel; and performs parallel mining analysis on massive residential electricity-consumption data under the Map-Reduce model. The method is efficient and feasible, has a good speed-up ratio, and can mine the potentially valuable information in massive intelligent electricity-consumption data accurately and efficiently, providing useful guidance for formulating optimal electricity-consumption strategies.

The present invention achieves the above object through the following technical solution: a method for analyzing massive intelligent electricity-consumption data based on an improved k-means algorithm, comprising the following steps:

(1) establishing a Map-Reduce parallel processing model for residential users, and creating a massive intelligent electricity-consumption data analysis framework with the Map-Reduce parallel processing model as the data object;

(2) using the improved k-means algorithm to select the initial cluster centers and the number of clusters k on the basis of object density, inter-cluster distance, and the degree of dispersion of objects within clusters;

(3) optimizing the initial cluster centers under the Map-Reduce parallel processing model to locate the cluster centers accurately;

(4) performing parallel mining on the data belonging to each cluster to complete the electricity-consumption data analysis.

Preferably, the Map-Reduce parallel processing model of step (1) comprises six kinds of data from the residential users' electricity-consumption information: number of residential users, floor area, number of family members, daily electricity consumption, peak-valley electricity consumption, and number of household appliances.

Preferably, the massive intelligent electricity-consumption data analysis framework of step (1) adopts a master/slave architecture comprising data sources, a cloud computing master server, and cloud computing slave servers. The cloud computing master server receives the massive data from the data sources and defines the dimensions of the data, and the cloud computing slave servers run the parallel mining algorithm on the electricity-consumption data so defined by the master server.

Preferably, the improved k-means algorithm of step (2) is a partitioning clustering algorithm that uses Euclidean distance as the measure of similarity between two samples: k points are randomly selected from the data set as initial cluster centers; each sample in the data set is assigned to a cluster according to its distance to the k initial cluster centers; the mean of all samples assigned to each cluster is computed and each cluster center is updated accordingly; and this is repeated until the squared-error criterion function stabilizes at its minimum.

Preferably, the initial cluster centers in step (2) are selected as follows:

1) compute the distance d(x_i, x_j) between any two objects in the object set M according to the following formula, where t is the number of attributes of each object;

d(x_{i}, x_{j}) = \sqrt{(x_{i1} - x_{j1})^{2} + (x_{i2} - x_{j2})^{2} + \cdots + (x_{it} - x_{jt})^{2}}

2) compute the mean distance MeanDis(M) between all objects in the object set M according to the following formula;

MeanDis(M) = \frac{1}{n(n - 1)} \sum d(x_{i}, x_{j})

3) compute the density Den(x_i) of object x_i according to the following formula;

Den(x_{i}) = \sum_{j = 1}^{n} u(MeanDis(M) - d(x_{i}, x_{j}))

where u(x) = 1 when x ≥ 0, and u(x) = 0 otherwise;

4) from the above, obtain the density set D = {Den(x_1), Den(x_2), ..., Den(x_n)}; select the object with the largest density in D as the 1st initial cluster center O_1 and the object with the second-largest density as the 2nd initial cluster center O_2; and so on, taking the object y_i that satisfies the following condition as the k-th cluster center, until the predetermined number of clusters is reached;

\max(\min(d(y_{i}, O_{1})), \ldots, \min(d(y_{i}, O_{n-1}))).

Preferably, step (2) determines the number of clusters k by using the improved k-means algorithm to find the minimum of δ, computed by the following formula:

The beneficial effects of the present invention are: (1) the analysis based on the improved k-means algorithm is simple and converges quickly, which improves the processing efficiency for massive electricity-consumption data and shortens computing time; (2) the method has a good speed-up ratio and can mine the potentially valuable information in massive intelligent electricity-consumption data accurately and efficiently, providing useful guidance for formulating optimal electricity-consumption strategies.

Brief description of the drawings

Fig. 1 is an architecture diagram of the cloud-computing-based massive intelligent electricity-consumption data analysis;

Fig. 2 is a flow chart of the improved k-means parallel data mining algorithm.

Detailed description

The present invention is further described below with reference to a specific embodiment, but its protection scope is not limited thereto:

Embodiment: a method for analyzing massive intelligent electricity-consumption data based on an improved k-means algorithm, comprising the following steps:

(1) First, a Map-Reduce parallel processing model is established for residential users, covering data in the users' electricity-consumption information such as the number of residential users, floor area, number of family members, daily electricity consumption, peak-valley electricity consumption and number of household appliances; a massive intelligent electricity-consumption data analysis framework is then created with the Map-Reduce parallel processing model as the data object;

The massive intelligent electricity-consumption data analysis framework is shown in Fig. 1. It adopts a master/slave architecture to store power users' massive intelligent electricity-consumption data, and mines the potentially valuable information in those data with parallel mining algorithms. The master server receives massive electricity-consumption data from peripheral systems such as the electricity-consumption information acquisition system, the marketing business application system, the intelligent residential community management system and the enterprise energy management system. It defines the dimensions contained in the data and combines the valuable dimensions into a data dimension model. According to this model, the data management module distributes the electricity-consumption data to the data storage modules of the corresponding cloud computing slave servers, and records information such as the data blocks stored on each slave server and their storage locations.

When a massive data analysis request arrives, the master server receives and analyzes it, selects from the data mining model library one or more parallel mining algorithms suited to the request according to the actual situation, and assigns the tasks obtained by dimension decomposition to the slave servers. On each slave server, the data storage module and the task execution module cooperate to carry out the assigned electricity-consumption data mining task with the parallel mining algorithm chosen by the master server, so that massive electricity-consumption data are analyzed and valuable information is obtained quickly and efficiently.

The improved k-means algorithm is a partitioning clustering algorithm that usually uses Euclidean distance as the measure of similarity between two samples. Its basic idea is as follows: k points are randomly selected from the data set as initial cluster centers; each sample in the data set is assigned to a cluster according to its distance to the k initial cluster centers; the mean of all samples assigned to each cluster is computed and each cluster center is updated accordingly; and this is repeated until the squared-error criterion function stabilizes at its minimum.

Let the object set be M = {x_1, x_2, ..., x_n}, where x_i = (x_{i1}, x_{i2}, ..., x_{it}). The Euclidean distance between sample x_i and sample x_j is computed as follows:

d(x_{i}, x_{j}) = \sqrt{(x_{i1} - x_{j1})^{2} + (x_{i2} - x_{j2})^{2} + \cdots + (x_{it} - x_{jt})^{2}}    (1)

The squared-error criterion function is as follows:

I_{C} = \sum_{i = 1}^{k} \sum_{j = 1}^{t_{i}} \| x_{j} - n_{i} \|^{2}    (2)

In the formula, k is the number of clusters; t_i is the number of samples in the i-th class; x_j is the j-th sample in the i-th class; and n_i is the mean of the samples in the i-th class.
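As an illustration of formulas (1) and (2), the following minimal Python sketch computes the Euclidean distance between two samples and the squared-error criterion for a given partition. The function names `euclidean`, `cluster_mean` and `squared_error` and the sample data are assumptions made for this example and do not come from the patent.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length attribute vectors, as in formula (1)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cluster_mean(samples):
    """Component-wise mean n_i of the samples in one cluster."""
    count = len(samples)
    return tuple(sum(column) / count for column in zip(*samples))

def squared_error(clusters):
    """Squared-error criterion I_C, as in formula (2).

    `clusters` is a list of clusters; each cluster is a list of sample tuples.
    """
    total = 0.0
    for samples in clusters:
        mean = cluster_mean(samples)
        total += sum(euclidean(x, mean) ** 2 for x in samples)
    return total

# Hypothetical two-attribute samples already grouped into two clusters.
clusters = [
    [(0.0, 0.0), (2.0, 0.0)],
    [(10.0, 10.0), (10.0, 12.0)],
]
print(euclidean((0.0, 0.0), (3.0, 4.0)))
print(squared_error(clusters))
```

A smaller value of `squared_error` means the samples lie closer to their cluster means, which is the quantity the iteration drives toward a stable minimum.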

The basic k-means algorithm is simple, converges quickly, scales well and is efficient, but it has drawbacks: the number of clusters is difficult to determine, and inaccurate selection of the initial cluster centers easily traps the clustering result in a local optimum. To address these defects, the present invention proposes, on the basis of the Map-Reduce parallel processing model, a parallel mining algorithm for massive data based on an improved k-means algorithm, which analyzes massive residential user data with the goals of improving execution efficiency and shortening computing time.

(a) Selection of the k-means initial cluster centers

The smaller the Euclidean distance between data objects, the greater their similarity. If many data objects lie in the region around a given data object, and the other objects in that region are close to it, the data density of that object is high. Such an object reflects the distribution characteristics of the data well; using it as a cluster center favors the convergence of the squared-error criterion function and avoids the result being trapped in a local optimum because of the randomness of initial-center selection.

The distance d(x_i, x_j) between two objects is computed according to formula (1).

The mean distance MeanDis(M) between all objects in the object set M is computed according to the following formula.

MeanDis(M) = \frac{1}{n(n - 1)} \sum d(x_{i}, x_{j})    (3)

The density Den(x_i) of object x_i is computed according to the following formula.

Den(x_{i}) = \sum_{j = 1}^{n} u(MeanDis(M) - d(x_{i}, x_{j}))    (4)

where u(x) = 1 when x ≥ 0, and u(x) = 0 otherwise.

The density set is D = {Den(x_1), Den(x_2), ..., Den(x_n)}.

The object with the largest density in D is selected as the 1st initial cluster center O_1, and the object with the second-largest density as the 2nd initial cluster center O_2. And so on: the object y_i that satisfies the condition max(min(d(y_i, O_1)), ..., min(d(y_i, O_{n-1}))) (the y_i are selected in descending order of object density, n at a time) is taken as the k-th cluster center, until the predetermined number of clusters is reached.
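The selection procedure in (a) can be sketched in Python as follows. This is one possible reading rather than the patent's implementation: the mean distance is taken over all ordered pairs i ≠ j, the density count includes the object itself (since d(x_i, x_i) = 0), and from the third center onward the remaining object with the largest minimum distance to the centers already chosen is picked, with ties resolved in descending order of density. The function name `initial_centers` and the sample data are assumptions for this example.

```python
import math

def euclidean(x, y):
    """Euclidean distance, as in formula (1)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def initial_centers(points, k):
    """Return (centers, density) following the density-based rule of section (a)."""
    n = len(points)
    dist = [[euclidean(p, q) for q in points] for p in points]

    # Formula (3): mean distance over all ordered pairs of distinct objects.
    mean_dis = sum(dist[i][j] for i in range(n) for j in range(n) if i != j) / (n * (n - 1))

    # Formula (4): Den(x_i) counts the objects within mean_dis of x_i (u(x) = 1 for x >= 0).
    density = [sum(1 for j in range(n) if mean_dis - dist[i][j] >= 0) for i in range(n)]

    # Objects in descending order of density; sorted() is stable, so ties keep input order.
    order = sorted(range(n), key=lambda i: -density[i])

    # O_1 and O_2 are the densest and second-densest objects.
    chosen = order[:min(k, 2)]

    # Further centers: max-min rule over the remaining objects (assumed reading).
    while len(chosen) < k:
        remaining = [i for i in order if i not in chosen]
        best = max(remaining, key=lambda i: min(dist[i][c] for c in chosen))
        chosen.append(best)
    return [points[i] for i in chosen], density

# Hypothetical one-attribute objects.
points = [(0.0,), (1.0,), (2.0,), (4.0,), (8.0,), (16.0,)]
centers, density = initial_centers(points, 3)
print(density)
print(centers)
```

In this reading, the first two centers can lie close together because both are chosen by density alone; only later centers are pushed away from the ones already selected.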

(b) Selection of the number of clusters

The smaller the dispersion of objects within a cluster and the larger the distance between clusters, the better the clustering. The number of clusters is first assumed to be k, and the center of each cluster is selected by the method in (a). The mean distance between the data objects in the i-th cluster is taken as the degree of dispersion of its data objects and is denoted p_i; the distance between the i-th cluster and the j-th cluster is also computed. The value of k is then reset and (a) is repeated, and the k that minimizes the following expression is selected as the best number of clusters.
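The two quantities that feed the cluster-number criterion can be sketched as below. Because the expression to be minimized is not reproduced in this text, the sketch stops at its inputs: the dispersion p_i of each cluster and the pairwise distances between clusters. Measuring the inter-cluster distance between cluster means is an assumption of this example, as are the function names and the sample data.

```python
import math
from itertools import combinations

def euclidean(x, y):
    """Euclidean distance, as in formula (1)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def dispersion(samples):
    """p_i: mean distance between distinct objects of one cluster (0.0 for a single object)."""
    pairs = list(combinations(samples, 2))
    if not pairs:
        return 0.0
    return sum(euclidean(a, b) for a, b in pairs) / len(pairs)

def cluster_mean(samples):
    """Component-wise mean of the samples in one cluster."""
    return tuple(sum(column) / len(samples) for column in zip(*samples))

def inter_cluster_distances(clusters):
    """Distance between each pair of clusters, measured between cluster means (assumption)."""
    means = [cluster_mean(c) for c in clusters]
    return {(i, j): euclidean(means[i], means[j])
            for i, j in combinations(range(len(means)), 2)}

# Hypothetical clustering result for one candidate value of k (here k = 2).
clusters = [
    [(0.0, 0.0), (0.0, 2.0)],
    [(4.0, 3.0), (4.0, 5.0)],
]
p = [dispersion(c) for c in clusters]
between = inter_cluster_distances(clusters)
print(p)
print(between)
```

Repeating this for each candidate k yields the values from which the criterion would be evaluated.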

In this embodiment, as shown in Fig. 2, the parallel mining of residential electricity-consumption data based on the improved k-means algorithm is divided into nine steps, as follows:

1. The electricity-consumption information data set is stored in a distributed file system in row form, so that the data set to be processed can be split by rows into electricity-consumption data subsets. Task management determines the number of clusters k and initializes the k cluster centers according to the improvements to the k-means algorithm described in this section, and sends them to the nodes that are to run the m Map tasks.

2. The massive intelligent electricity-consumption data subsets in the distributed file system are formatted to produce <key_1, value_1> key-value pairs, specifically <UserID, info>, where UserID is the residential user ID and info is the residential user information, including floor area, number of family members, electricity consumption, and so on.

3. The Map function scans each record <UserID, info> of the input residential electricity-consumption data subset, computes its Euclidean distance to each of the k center points, and records the nearest center point kmin. The Map function then generates and outputs the intermediate result as a <key_2, value_2> pair, defined here as <kmin, info_1>, where kmin identifies the cluster to which the user belongs and info_1 contains information such as the residential user ID, floor area, number of family members and electricity consumption.

4. The partition function Partitioner hashes the intermediate results by kmin and divides them into r different partitions, each of which is assigned to a designated Reduce function.

5. Each node assigned a Reduce task reads the corresponding intermediate results <kmin, list<(info_1<userID_1>), (info_1<userID_2>), ...>> from the m Map tasks and sorts the data by kmin, so that data with the same kmin are gathered together. The node traverses the sorted intermediate data and passes <kmin, list> to the Reduce function, which computes the mean vector of the data sharing the same kmin from the list values and updates the center point of the corresponding cluster kmin.

6. Steps 2 to 5 are repeated until the squared-error criterion function stabilizes at its minimum, and the data of the k clusters are output separately.

7. The data set of the i-th cluster is formatted again to produce key-value pairs <EC, list_1>, where EC is the user's electricity consumption and list_1 is the user's other information. The data set is divided into data subsets, which are handed to the m nodes that run Map tasks.

8. The Map function scans the input data subset and generates the intermediate result <EC, list_1>. The intermediate results produced by each Map task are divided by the partition function Partitioner, according to the EC value, into r partitions p_1, p_2, ..., p_r, and the Map task sorts each partition p_i (1 ≤ i ≤ r) by EC value. The r Reduce tasks each fetch their corresponding intermediate results and then carry out the Reduce-stage processing. To guarantee that the final output is globally ordered by EC value, the Partitioner no longer divides the intermediate results by the hash of EC; instead it divides them by range, so that each Reduce task processes the intermediate results within one EC interval [EC_i, EC_j].

9. After each Reduce task has read the intermediate results of its partition from the m Map tasks, the Map-Reduce internal mechanism merges these data, so that the output of each Reduce task is one contiguous segment of the i-th cluster's data set sorted by electricity consumption. The outputs of all Reduce tasks are concatenated in order, yielding the data set of the i-th cluster ordered by electricity consumption.
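The nine steps above can be imitated in a single Python process to make the data flow concrete. This is a hedged sketch, not Hadoop code: Map tasks, partitions and Reduce tasks are plain lists and dictionaries, iteration stops when the centers no longer change (a simplification of the squared-error criterion), the first component of info is assumed to be the electricity consumption EC, and only (EC, UserID) is carried through the sort. All function names, the partition boundaries and the sample records are assumptions for this example.

```python
import math
from bisect import bisect_right
from collections import defaultdict
from heapq import merge

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def map_assign(record, centers):
    """Step 3: emit <kmin, (user_id, info)> for the nearest center."""
    user_id, info = record
    kmin = min(range(len(centers)), key=lambda c: euclidean(info, centers[c]))
    return kmin, (user_id, info)

def reduce_mean(values):
    """Step 5: mean vector of all info tuples that share one kmin."""
    infos = [info for _, info in values]
    return tuple(sum(column) / len(infos) for column in zip(*infos))

def kmeans_mapreduce(records, centers, r=2, max_iter=100):
    """Steps 2-6: map, hash-partition by kmin, reduce, repeat until stable."""
    centers = list(centers)
    for _ in range(max_iter):
        partitions = [defaultdict(list) for _ in range(r)]
        for record in records:
            kmin, value = map_assign(record, centers)
            partitions[hash(kmin) % r][kmin].append(value)  # step 4
        new_centers = list(centers)
        for part in partitions:  # one Reduce task per partition
            for kmin, values in part.items():
                new_centers[kmin] = reduce_mean(values)
        if new_centers == centers:  # assumed stopping rule
            break
        centers = new_centers
    clusters = defaultdict(list)
    for record in records:
        kmin, value = map_assign(record, centers)
        clusters[kmin].append(value)
    return centers, dict(clusters)

def range_sort(cluster_values, boundaries, m=2):
    """Steps 7-9: range-partition by EC, sort per Map task, merge per Reduce task."""
    r = len(boundaries) + 1
    runs = [[[] for _ in range(r)] for _ in range(m)]  # runs[task][partition]
    for n, (user_id, info) in enumerate(cluster_values):
        ec = info[0]  # assumption: EC is the first component of info
        runs[n % m][bisect_right(boundaries, ec)].append((ec, user_id))
    for task in runs:
        for part in task:
            part.sort()  # each Map task sorts its own partitions
    ordered = []
    for p in range(r):  # partitions cover increasing EC ranges
        ordered.extend(merge(*(runs[t][p] for t in range(m))))
    return ordered

# Hypothetical records: (UserID, (electricity consumption, family members)).
records = [("u1", (1.0, 1.0)), ("u2", (2.0, 1.0)), ("u3", (9.0, 8.0)),
           ("u4", (8.0, 9.0)), ("u5", (1.5, 2.0))]
centers, clusters = kmeans_mapreduce(records, [(1.0, 1.0), (9.0, 8.0)])
ordered = range_sort(clusters[0], boundaries=[1.4])
print(centers)
print(ordered)
```

Because the partitions cover increasing EC ranges, concatenating the Reduce outputs in partition order is enough to obtain a globally sorted result, which is why range partitioning is used in place of hashing in step 8.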

The above describes a specific embodiment of the present invention and the technical principles it employs. Any change made according to the concept of the present invention whose resulting function does not go beyond the spirit covered by the specification and drawings shall fall within the protection scope of the present invention.

Claims

1. A method for analyzing massive intelligent electricity-consumption data based on an improved k-means algorithm, characterized by comprising the following steps: (1) establishing a Map-Reduce parallel processing model for residential users, and creating a massive intelligent electricity-consumption data analysis framework with the Map-Reduce parallel processing model as the data object;

2. The method for analyzing massive intelligent electricity-consumption data based on an improved k-means algorithm according to claim 1, characterized in that: the Map-Reduce parallel processing model of step (1) comprises six kinds of data from the residential users' electricity-consumption information: number of residential users, floor area, number of family members, daily electricity consumption, peak-valley electricity consumption, and number of household appliances.

3. The method for analyzing massive intelligent electricity-consumption data based on an improved k-means algorithm according to claim 1, characterized in that: the massive intelligent electricity-consumption data analysis framework of step (1) adopts a master/slave architecture comprising data sources, a cloud computing master server, and cloud computing slave servers; the cloud computing master server receives the massive data from the data sources and defines the dimensions of the data, and the cloud computing slave servers run the parallel mining algorithm on the electricity-consumption data so defined by the master server.

4. The method for analyzing massive intelligent electricity-consumption data based on an improved k-means algorithm according to claim 1, characterized in that: the improved k-means algorithm of step (2) is a partitioning clustering algorithm that uses Euclidean distance as the measure of similarity between two samples; k points are randomly selected from the data set as initial cluster centers; each sample in the data set is assigned to a cluster according to its distance to the k initial cluster centers; the mean of all samples assigned to each cluster is computed and each cluster center is updated accordingly, until the squared-error criterion function stabilizes at its minimum.

5. The method for analyzing massive intelligent electricity-consumption data based on an improved k-means algorithm according to claim 1, characterized in that: the initial cluster centers in step (2) are selected as follows:

d(x_{i}, x_{j}) = \sqrt{(x_{i1} - x_{j1})^{2} + (x_{i2} - x_{j2})^{2} + \cdots + (x_{it} - x_{jt})^{2}}

MeanDis(M) = \frac{1}{n(n - 1)} \sum d(x_{i}, x_{j})

3) compute the density Den(x_i) of object x_i according to the following formula;

Den(x_{i}) = \sum_{j = 1}^{n} u(MeanDis(M) - d(x_{i}, x_{j}))

where u(x) = 1 when x ≥ 0, and u(x) = 0 otherwise;

\max(\min(d(y_{i}, O_{1})), \ldots, \min(d(y_{i}, O_{n-1}))).

6. The method for analyzing massive intelligent electricity-consumption data based on an improved k-means algorithm according to claim 1, characterized in that: step (2) determines the number of clusters k by using the improved k-means algorithm to find the minimum of δ, computed by the following formula: